Feature Landscape
Domain: Multi-platform job data crawler framework (Boss直聘 / 前程无忧 (QCWY) / 智联招聘 (Zhilian))
Researched: 2026-03-21
Context: Refactoring an existing crawler system, not a greenfield build. Features are evaluated against what exists and what the refactor must deliver.
Current State Baseline
Before categorizing features, this is what the existing system already has (do not re-build):
| Capability | Implementation | Location |
|---|---|---|
| Keyword-driven crawl dispatch | /api/v1/keyword/available pull model | app/api/v1/keyword/ |
| Batch data ingestion pipeline | IngestService + PlatformConfig registry | app/services/ingest/ |
| ClickHouse dedup on insert | batch_dedup_filter(), 1- or 2-column key | app/services/ingest/dedup.py |
| Random delay anti-detection | sleep_random_between(), 10–20s enforced | jobs_spider/boss/boos_api.py |
| Proxy rotation + tunnel proxy | SmartIPManager, HTTPClient | app/core/algorithms/antispider.py, _http_client.py |
| IP anomaly detection | IPAnomalyDetector: 403/429/407, slow response, code=35 | _http_client.py |
| TLS fingerprint spoofing | requests_go with TLS_CHROME_LATEST, random JA3 | _http_client.py |
| Structured logging | loguru rotating daily files, 30-day retention | all modules |
| Scheduled job framework | APScheduler, distributed file/Redis lock | app/core/scheduler.py, locks.py |
| IP upload tracking | IpTrackingMiddleware → IpUploadStats table | app/core/ip_tracking.py |
| Company cleaning pipeline | CompanyCleaner collect→process→store, every 5 min | app/services/company_cleaner.py |
| Frontend monitoring (partial) | monitor.vue: pending queue, today processed, completion rate | web/src/views/cleaning/monitor.vue |
| Unified crawl loop (partial) | spiderJobs/runner/loop.py: page progress reporting | spiderJobs/runner/ |
| BaseFetcher / BaseSearcher | Template method base classes | app/services/crawler/_base.py |
Critical gap: Two parallel crawler frameworks exist (jobs_spider/ + spiderJobs/) with divergent code. The base classes in app/services/crawler/ are partially implemented but not used by jobs_spider/. Dedup logic is duplicated in both frameworks.
Table Stakes
Features the system cannot function correctly without. Missing or broken = data pipeline breaks.
1. Unified Platform Base Classes
Why required: Without a single BasePlatformClient that jobs_spider/, app/services/crawler/, AND spiderJobs/ all import from, every bug fix or anti-detection update must be applied in 3 places — guaranteed divergence.
What this means concretely:
- Single _base.py in a shared package (not copied between directories)
- BaseFetcher, BaseSearcher, ApiResult, parse_response defined once
- Each platform implements _build_params() and _parse(), nothing else
- Sign/auth modules (_boss_sign.py, _zhilian_sign.py, etc.) imported from the shared package
Complexity: Medium — requires restructuring import paths, not rewriting logic.
| Sub-feature | Status | Notes |
|---|---|---|
| Single base class file | Partial: app/services/crawler/_base.py exists but is a copy, not the source of truth | Must become canonical |
| Shared sign modules | Partial: _boss_sign.py etc. exist in app/services/crawler/ | jobs_spider/ uses its own copies |
| spiderJobs/ consolidation | Missing: spiderJobs/core/base.py is yet another copy | Must be eliminated or unified |
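A minimal sketch of the template-method shape described above, assuming a shared base class that owns the fetch flow while each platform supplies only _build_params() and _parse(). The `send` transport callable, the ApiResult fields, and the Boss field names (`query`, `jobList`) are illustrative assumptions, not the contents of the existing _base.py.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class ApiResult:
    """Normalized result returned by every platform client (fields are assumed)."""
    ok: bool
    items: list = field(default_factory=list)
    error: Optional[str] = None


class BaseSearcher(ABC):
    """Template method: fetch() is shared, platforms fill in two hooks."""

    def __init__(self, send: Callable[[dict], dict]):
        # `send` is a hypothetical transport callable: params -> decoded JSON.
        self._send = send

    def fetch(self, keyword: str, page: int) -> ApiResult:
        params = self._build_params(keyword, page)
        try:
            raw = self._send(params)
        except Exception as exc:
            # Transport errors become a failed result instead of crashing the loop.
            return ApiResult(ok=False, error=str(exc))
        return ApiResult(ok=True, items=self._parse(raw))

    @abstractmethod
    def _build_params(self, keyword: str, page: int) -> dict: ...

    @abstractmethod
    def _parse(self, raw: dict) -> list: ...


class BossSearcher(BaseSearcher):
    # Field names below are illustrative, not the real Boss API schema.
    def _build_params(self, keyword: str, page: int) -> dict:
        return {"query": keyword, "page": page}

    def _parse(self, raw: dict) -> list:
        return raw.get("jobList", [])
```

With this shape, an anti-detection or error-handling change lands once in fetch() and every platform subclass picks it up on import.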
2. Keyword-Driven Dispatch with Crawl State Tracking
Why required: Without atomic keyword assignment with state tracking, multiple ECS nodes will crawl the same keyword simultaneously, producing massive duplicates and wasting API quota.
What this means concretely:
- GET /api/v1/keyword/available?reserve=True atomically marks a keyword as in-progress
- Keyword state machine: pending → in_progress → completed / failed
- last_completed_page persisted so interrupted crawls can resume from breakpoint
- Crawlers report status=completed|failed + page count on finish
- The spiderJobs/runner/loop.py already implements report_crawl_complete() and report_page_progress(); this is the model to follow
Complexity: Low — pattern already exists in spiderJobs/runner/, needs to be the standard.
| Sub-feature | Status | Notes |
|---|---|---|
| Atomic keyword assignment | Exists in jobs_spider/boss/boos_api.py | Not in qcwy/zhilian |
| Crawl completion reporting | Exists in spiderJobs/runner/loop.py | Not in jobs_spider/ |
| Breakpoint resume (last_completed_page) | Exists in spiderJobs/runner/loop.py | Not in jobs_spider/ |
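The state machine pending → in_progress → completed / failed can be sketched as a transition table plus a reservation helper. This is only an illustration: the in-memory list stands in for the keyword table, and in the real backend the reservation would have to be a single atomic database update. The names `KeywordStatus`, `transition`, and `reserve_next` are hypothetical.

```python
from enum import Enum
from typing import Optional


class KeywordStatus(str, Enum):
    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"
    FAILED = "failed"


# Allowed transitions within one crawl cycle. Resetting a finished keyword
# for the next cycle is a separate concern and is not modeled here.
ALLOWED = {
    KeywordStatus.PENDING: {KeywordStatus.IN_PROGRESS},
    KeywordStatus.IN_PROGRESS: {KeywordStatus.COMPLETED, KeywordStatus.FAILED},
    KeywordStatus.COMPLETED: set(),
    KeywordStatus.FAILED: set(),
}


def transition(current: KeywordStatus, target: KeywordStatus) -> KeywordStatus:
    """Return the new status, or raise if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target


def reserve_next(keywords: list) -> Optional[dict]:
    """Mark the first pending keyword as in progress and return it.

    A plain list stands in for the keyword table; a real implementation
    must perform the check and the update atomically.
    """
    for kw in keywords:
        if kw["status"] == KeywordStatus.PENDING:
            kw["status"] = transition(kw["status"], KeywordStatus.IN_PROGRESS)
            return kw
    return None
```

Because a reserved keyword is no longer pending, a second node calling reserve_next() receives a different keyword or nothing at all.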
3. Per-Page Batch Upload (Not End-of-Keyword Batch)
Why required: If a crawler accumulates all pages before pushing, a crash at page 9 of 10 loses everything. Uploading per-page means each page's data is safe before moving to the next.
What this means concretely:
- After each page fetch: POST /api/v1/universal/data/batch-store-async with that page's data
- Do not accumulate across pages in memory
- Report page progress to backend (report_page_progress())
- spiderJobs/runner/loop.py does this correctly; jobs_spider/boss/boos_api.py accumulates and pushes at the end
Complexity: Low — structural change to push loop.
| Sub-feature | Status | Notes |
|---|---|---|
| Per-page upload | Missing in jobs_spider/boss/ | spiderJobs/runner/ has it |
| Per-page progress reporting | Missing in jobs_spider/ | spiderJobs/runner/ has it |
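The per-page pattern can be sketched as a loop that uploads and reports each page before requesting the next one. This is not the code in spiderJobs/runner/loop.py: the callables are injected stand-ins for the real fetch, upload, and progress-report calls, and treating an empty page as "keyword exhausted" is an assumption.

```python
from typing import Callable


def crawl_keyword(
    keyword_id: int,
    last_completed_page: int,
    max_pages: int,
    fetch_page: Callable[[int], list],
    upload_batch: Callable[[list], None],
    report_page_progress: Callable[[int, int], None],
) -> int:
    """Upload each page as soon as it is fetched; return the last completed page."""
    completed = last_completed_page
    # Resume right after the last page the backend has recorded.
    for page_no in range(last_completed_page + 1, max_pages + 1):
        rows = fetch_page(page_no)
        if not rows:
            break  # assumed signal that the keyword has no more results
        upload_batch(rows)  # this page is stored before the next fetch starts
        report_page_progress(keyword_id, page_no)
        completed = page_no
    return completed
```

A crash while fetching page 9 then loses at most page 9, since pages 1 to 8 were already uploaded and recorded.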
4. Dedup at Ingestion (ClickHouse Query Before Insert)
Why required: Without dedup, repeated crawls of the same keyword accumulate duplicate job records, corrupting analytics counts. ClickHouse's MergeTree eventual dedup is not sufficient for real-time correctness.
What this means concretely:
- batch_dedup_filter() queries ClickHouse for existing keys before inserting
- Dedup key per platform: Boss = job_id; QCWY = (job_id, update_date_time); Zhilian = (number, first_publish_time)
- Dedup happens in IngestService, not in the crawler; the crawler just pushes raw data
- All three platforms must route through the same IngestService (currently QCWY has separate logic)
Complexity: Low — IngestService + PlatformConfig registry already works; just need all platforms to use it.
| Sub-feature | Status | Notes |
|---|---|---|
| batch_dedup_filter() | Exists and works | Supports 1 or 2 key columns |
| Boss dedup via registry | Exists | Working |
| QCWY dedup via registry | Partial | Separate code path; needs unification |
| Zhilian dedup via registry | Partial | Same issue |
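The per-platform keys listed above can be illustrated with a small filter. This is a sketch, not the existing batch_dedup_filter(): the `existing` set stands in for the result of the ClickHouse lookup, and dropping repeats inside the same batch is an added assumption.

```python
from typing import Iterable

# Dedup key columns per platform, as listed in this section.
DEDUP_KEYS = {
    "boss": ("job_id",),
    "qcwy": ("job_id", "update_date_time"),
    "zhilian": ("number", "first_publish_time"),
}


def dedup_key(platform: str, row: dict) -> tuple:
    """Build the 1- or 2-column key tuple for a row."""
    return tuple(row[col] for col in DEDUP_KEYS[platform])


def filter_new_rows(platform: str, rows: Iterable[dict], existing: set) -> list:
    """Keep rows whose key is neither stored already nor repeated in the batch.

    `existing` is a hypothetical stand-in for keys returned by ClickHouse.
    """
    seen = set(existing)
    fresh = []
    for row in rows:
        key = dedup_key(platform, row)
        if key in seen:
            continue
        seen.add(key)
        fresh.append(row)
    return fresh
```

Keeping the key definition in one registry-style mapping is what lets all three platforms share a single code path.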
5. Retry with Exponential Backoff
Why required: Platform APIs are unstable — rate limits, transient 5xx errors, proxy failures. Without retry, a single flaky response aborts an entire crawl session.
What this means concretely:
- Retry on: network timeout, requests.exceptions.RequestException, HTTP 5xx, platform business errors that are transient (not bans)
- Do NOT retry on: HTTP 403/429 IP ban signals (switch proxy instead), HTTP 401 token expired (refresh token instead)
- Max retries: 3, with exponential backoff plus jitter (base delays of 2s, 4s, 8s)
- After max retries: log error, report status=failed to backend, continue to next keyword (do not crash)
Current state: No structured retry in either jobs_spider/ or app/services/crawler/. Failures cause immediate abort.
Complexity: Medium — needs a retry decorator/wrapper in _http_client.py.
| Sub-feature | Status | Notes |
|---|---|---|
| Retry on transient errors | Missing | Critical gap |
| Exponential backoff | Missing | |
| Distinguish retryable vs ban | Missing | IPAnomalyDetector detects bans but retry is not wired |
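A retry wrapper along these lines could look like the sketch below. The exception classes and the `call_with_retry` name are hypothetical, the 25% jitter range is an assumption, and mapping HTTP responses onto the two exception types is left to the caller.

```python
import random
import time
from typing import Callable


class BanDetected(Exception):
    """Ban signal such as HTTP 403/429: switch proxy, do not retry."""


class TransientError(Exception):
    """Timeout, HTTP 5xx, or transient business error: safe to retry."""


def call_with_retry(
    fn: Callable[[], object],
    max_retries: int = 3,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> object:
    """Call fn; on TransientError wait about 2s, 4s, 8s and try again."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_retries:
                # Out of retries: the caller logs, reports status=failed,
                # and moves on to the next keyword.
                raise
            delay = base_delay * (2 ** attempt)
            # Add up to 25% random jitter so nodes do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.25))
        # BanDetected is deliberately not caught, so it reaches the proxy-switch logic.
```

Injecting `sleep` keeps the wrapper testable without real waiting.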
6. IP/Proxy Management with Ban Detection
Why required: Chinese recruitment platforms aggressively detect and block scraper IPs. Without automated ban detection and proxy switching, crawlers die silently and collect no data.
What this means concretely:
- IPAnomalyDetector checks HTTP 403/429/407, slow response (>5s), business code=35, message containing "IP地址存在异常"
- On ban: switch proxy immediately (not after retry), rebuild requests.Session, log ban event
- SmartIPManager handles: tunnel proxy → proxy pool round-robin → local direct connection with cooldown
- Proxy health state is in-memory per crawler process (no shared state needed; each ECS node manages its own pool)
Current state: EXISTS and works in jobs_spider/boss/boos_api.py. NOT present in qcwy or zhilian crawlers. Must be shared via base class.
Complexity: Low — logic exists, needs to be in shared module not duplicated.
| Sub-feature | Status | Notes |
|---|---|---|
| IPAnomalyDetector | Exists in boss + antispider.py | Duplicated; must consolidate |
| SmartIPManager | Exists in boss | Missing from qcwy/zhilian |
| Session rebuild on ban | Exists in boss | Missing from qcwy/zhilian |
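The ban signals listed above can be expressed as one classification function. This is a sketch of the checks only, not the existing IPAnomalyDetector; the function name and the returned reason strings are assumptions.

```python
from typing import Optional

BAN_STATUS_CODES = {403, 429, 407}
SLOW_RESPONSE_SECONDS = 5.0
BAN_BUSINESS_CODE = 35
BAN_MESSAGE = "IP地址存在异常"  # literal text matched in the platform response


def ban_reason(
    status: int,
    elapsed: float,
    code: Optional[int] = None,
    message: str = "",
) -> Optional[str]:
    """Return a short reason if the response looks like an IP ban, else None."""
    if status in BAN_STATUS_CODES:
        return f"http_{status}"
    if elapsed > SLOW_RESPONSE_SECONDS:
        return "slow_response"
    if code == BAN_BUSINESS_CODE:
        return "business_code_35"
    if BAN_MESSAGE in message:
        return "ban_message"
    return None
```

A non-None result would trigger the proxy switch and session rebuild immediately, and the reason string can go straight into the ban log entry.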
7. Token Lifecycle Management (Boss 直聘)
Why required: Boss直聘 uses MPT tokens (WeChat mini-program login tokens) that expire. Without automatic refresh, crawlers stop collecting data silently.
What this means concretely:
- GET /api/v1/token/tokens returns available active Boss tokens
- If no tokens available: log error, wait, retry — do not crash
- Token management is Boss-specific; QCWY and Zhilian use cookie/session auth
Current state: Exists in jobs_spider/boss/boos_api.py. Needs to be in the shared Boss client module.
Complexity: Low — already implemented, just needs to stay in shared code.
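One possible shape for the rotation logic, shown only as an illustration: `TokenPool` and `fetch_tokens` are hypothetical names, `fetch_tokens` stands in for a call to GET /api/v1/token/tokens, and reporting the inactive token back to the backend is omitted.

```python
from typing import Callable, Optional


class TokenPool:
    """Rotate through active tokens and refetch from the backend when empty."""

    def __init__(self, fetch_tokens: Callable[[], list]):
        # `fetch_tokens` is a hypothetical wrapper around the token endpoint.
        self._fetch_tokens = fetch_tokens
        self._tokens: list = []

    def current(self) -> Optional[str]:
        """Return a usable token, or None if the backend has none right now."""
        if not self._tokens:
            self._tokens = list(self._fetch_tokens())
        return self._tokens[0] if self._tokens else None

    def mark_expired(self, token: str) -> None:
        """Drop an expired token so the next current() call moves on."""
        if token in self._tokens:
            self._tokens.remove(token)
```

When current() returns None, the crawler would log the condition, wait, and ask again rather than exit.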
8. Structured Error Logging per Crawl Session
Why required: Without structured logs, diagnosing why data volume dropped on a given day is impossible. Operators cannot tell if the issue is IP bans, token expiry, keyword exhaustion, or platform API changes.
What this means concretely:
- loguru rotating daily log files per platform, 30-day retention (already exists)
- Each log entry must include: platform, keyword (city+job), page, error_type, proxy_used, http_status
- Log at: session start, each page success/failure, proxy switch, token switch, crawl completion
- Errors forwarded to backend ScheduledTaskRun table for dashboard visibility
Current state: Logging exists but is inconsistent — boss uses loguru, qcwy/zhilian use print(). Must standardize.
Complexity: Low — configuration change + replacing print() with logger.
Differentiators
Features that make the system operationally robust and reduce manual intervention. Not strictly required for data collection, but essential for unattended production operation.
D1. Crawl Health Dashboard (Frontend)
Value: Operators can see at a glance: which platforms are crawling, how many jobs were ingested today, which IPs are active/banned, whether company cleaning is keeping up with job volume.
What this means concretely:
- Expand existing monitor.vue to cover job crawling (not just company cleaning)
- Per-platform: jobs ingested today (from IpUploadStats), keywords completed/pending/failed, active crawler IPs
- Alert indicators: zero ingestion in last 30 min, all keywords exhausted, token pool empty
- Auto-refresh every 30s (already exists for company cleaning panel)
- Backend API: /api/v1/analytics/crawler-health aggregating from IpUploadStats + keyword tables
Complexity: Medium — requires new backend analytics endpoint + frontend component.
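The three alert indicators could be derived from aggregated counts roughly as follows. The function name, argument names, and alert labels are assumptions, and reading "all keywords exhausted" as "nothing pending and nothing in progress" is one possible interpretation.

```python
from datetime import datetime, timedelta
from typing import Optional


def health_alerts(
    now: datetime,
    last_ingest_at: Optional[datetime],
    pending_keywords: int,
    in_progress_keywords: int,
    active_tokens: int,
    silence_window: timedelta = timedelta(minutes=30),
) -> list:
    """Return the list of alert labels that currently apply to one platform."""
    alerts = []
    # No ingestion ever, or nothing within the silence window.
    if last_ingest_at is None or now - last_ingest_at > silence_window:
        alerts.append("zero_ingestion_30min")
    # Nothing left to crawl and nothing being crawled.
    if pending_keywords == 0 and in_progress_keywords == 0:
        alerts.append("keywords_exhausted")
    if active_tokens == 0:
        alerts.append("token_pool_empty")
    return alerts
```

The crawler-health endpoint could return this list per platform so the frontend only has to render it.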
D2. Platform-Specific Anti-Detection Profiles
Value: Each platform has different detection mechanisms. A unified anti-detection config per platform reduces manual tuning when bans increase.
What this means concretely:
- Per-platform config dataclass: min_delay, max_delay, max_pages_per_keyword, user_agent_rotation, cookie_fields_to_preserve
- Boss: must preserve __zp_sseed__, __zp_sname__, __zp_sts__ cookies (anti-bot challenge)
- QCWY: standard cookie session, no special handling
- Zhilian: custom header signing (already in _zhilian_sign.py)
- Config loaded from environment variables, not hardcoded
Current state: Delays are configurable via env vars. Cookie handling and platform-specific behavior is hardcoded in each crawler.
Complexity: Low-Medium — config dataclass refactor.
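A sketch of such a profile, assuming one environment variable per field with a platform prefix. The variable names (for example BOSS_MIN_DELAY), the `load_profile` helper, and the default of 10 pages are assumptions; the 10–20s delay defaults mirror the current enforced range.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Tuple


@dataclass(frozen=True)
class AntiDetectionProfile:
    min_delay: float
    max_delay: float
    max_pages_per_keyword: int
    user_agent_rotation: bool
    cookie_fields_to_preserve: Tuple[str, ...]


def load_profile(platform: str, env: Mapping[str, str] = os.environ) -> AntiDetectionProfile:
    """Build a profile from PLATFORM_* variables (variable names are assumed)."""
    prefix = platform.upper() + "_"
    cookies = env.get(prefix + "PRESERVE_COOKIES", "")
    profile = AntiDetectionProfile(
        min_delay=float(env.get(prefix + "MIN_DELAY", "10")),
        max_delay=float(env.get(prefix + "MAX_DELAY", "20")),
        max_pages_per_keyword=int(env.get(prefix + "MAX_PAGES", "10")),
        user_agent_rotation=env.get(prefix + "UA_ROTATION", "1") == "1",
        # Comma-separated cookie names; empty entries are ignored.
        cookie_fields_to_preserve=tuple(c for c in cookies.split(",") if c),
    )
    if profile.min_delay > profile.max_delay:
        raise ValueError("min_delay must not exceed max_delay")
    return profile
```

For Boss, setting BOSS_PRESERVE_COOKIES to `__zp_sseed__,__zp_sname__,__zp_sts__` would carry the anti-bot cookies through the profile instead of hardcoding them in the crawler.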
D3. Company Inline Crawl During Job Search
Value: Crawling company details in the same session as job search reduces total requests, shares the same authenticated session/proxy, and keeps company data fresh relative to job postings.
What this means concretely:
- After each page of job results: extract unique company IDs from jobs, fetch company details for new ones
- Session-level seen_companies set prevents re-fetching same company in same run
- Falls back gracefully if company fetch fails; job data is not blocked on company data
- Already implemented in spiderJobs/runner/loop.py as _crawl_companies_from_jobs(); this is the reference
Current state: Implemented in spiderJobs/runner/. Missing from jobs_spider/.
Complexity: Low — pattern exists, needs to be the standard in unified loop.
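The inline company crawl can be sketched as below. This is not the body of _crawl_companies_from_jobs(): the `company_id` field name and the `fetch_company` callable are assumptions, and marking a company as seen before fetching (so a failing company is not retried in the same session) is a design choice made for the example.

```python
from typing import Callable, Iterable


def crawl_companies_from_jobs(
    jobs: Iterable[dict],
    seen_companies: set,
    fetch_company: Callable[[str], dict],
) -> list:
    """Fetch details for companies not yet seen in this session."""
    details = []
    for job in jobs:
        company_id = job.get("company_id")  # field name is an assumption
        if not company_id or company_id in seen_companies:
            continue
        # Mark first so a failing company is not refetched during this run.
        seen_companies.add(company_id)
        try:
            details.append(fetch_company(company_id))
        except Exception:
            continue  # job data must never be blocked on company data
    return details
```

The caller passes the same seen_companies set for every page of the session, so each company is requested at most once per run.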
D4. Keyword Progress Persistence and Resume
Value: ECS instances can be terminated mid-crawl (spot instance preemption). Without persisted progress, the entire keyword must restart. With last_completed_page, the crawler resumes from where it stopped.
What this means concretely:
- Backend stores last_completed_page per keyword (already in keyword model)
- Crawler reads start_page = last_completed_page + 1 from keyword response
- On each successful page: report_page_progress(keyword_id, page) updates backend
- On spot instance termination: next invocation continues from last checkpoint
Current state: spiderJobs/runner/loop.py fully implements this. jobs_spider/ does not.
Complexity: Low — spiderJobs/runner/ is the reference implementation.
D5. Scheduler Task Run History + Alerting
Value: When company cleaning fails silently or a scheduled task stops running (due to a lock deadlock), operators need visibility without watching logs.
What this means concretely:
- ScheduledTaskRun table already records task runs with success/failure status
- Email alert when: task fails N consecutive times, or hasn't run in 2x expected interval
- ip_alert_job (every 10 min) already does IP anomaly alerting via SMTP; extend this pattern
- Frontend: scheduler health section in monitor dashboard
Current state: ScheduledTaskRun model + _record_task_run() exists. Email alerting exists for IP anomalies. Not wired for task-level failures.
Complexity: Medium — extend existing alerting pattern.
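The two alert conditions can be checked with a small predicate over recent run history. The function name, the argument layout, and the default of 3 for N are assumptions; the run history would come from ScheduledTaskRun.

```python
from datetime import datetime, timedelta
from typing import Optional, Sequence


def should_alert(
    recent_success: Sequence[bool],
    last_run_at: Optional[datetime],
    expected_interval: timedelta,
    now: datetime,
    max_consecutive_failures: int = 3,
) -> bool:
    """Decide whether a task needs an alert.

    recent_success is ordered oldest to newest; True means the run succeeded.
    """
    # Count failures at the end of the history, stopping at the latest success.
    streak = 0
    for ok in reversed(recent_success):
        if ok:
            break
        streak += 1
    if streak >= max_consecutive_failures:
        return True
    # A task that has never run is treated as overdue.
    if last_run_at is None:
        return True
    return now - last_run_at > 2 * expected_interval
```

For the 5-minute company cleaning job, the second condition fires once more than 10 minutes have passed since the last recorded run.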
D6. Unified Crawl Result Reporting API
Value: Crawlers that report completion status enable automatic keyword queue management — completed keywords can be reset for the next crawl cycle, failed ones flagged for retry.
What this means concretely:
- POST /api/v1/keyword/crawl-complete with {keyword_id, status, pages_completed, jobs_found, error_message}
- Backend updates keyword state: completed → reset to pending after 24h; failed → increment failure count
- Already partially designed in spiderJobs/runner/api_client.py
Current state: report_crawl_complete() exists in spiderJobs/runner/api_client.py. Not wired to keyword state machine in backend.
Complexity: Medium — backend endpoint + keyword state machine.
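The backend side of this report could be handled along the following lines. Plain dicts stand in for the keyword model, the helper names and the `completed_at` field are assumptions, and resetting last_completed_page to 0 on the 24h reset is a guess at what the next cycle needs.

```python
from datetime import datetime, timedelta

RESET_AFTER = timedelta(hours=24)


def apply_crawl_report(keyword: dict, report: dict, now: datetime) -> dict:
    """Apply a crawl-complete payload to a keyword record (dict stand-in)."""
    if report["status"] == "completed":
        keyword["status"] = "completed"
        keyword["completed_at"] = now
        keyword["last_completed_page"] = report["pages_completed"]
    elif report["status"] == "failed":
        keyword["status"] = "failed"
        keyword["failure_count"] = keyword.get("failure_count", 0) + 1
        keyword["error_message"] = report.get("error_message")
    else:
        raise ValueError("unknown status: " + str(report["status"]))
    return keyword


def reset_if_due(keyword: dict, now: datetime) -> dict:
    """Return a completed keyword to pending once 24h have passed."""
    if keyword["status"] == "completed" and now - keyword["completed_at"] >= RESET_AFTER:
        keyword["status"] = "pending"
        keyword["last_completed_page"] = 0  # assumed: the next cycle starts from page 1
    return keyword
```

reset_if_due() could run from the existing scheduler, so keywords re-enter the queue without manual intervention.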
Anti-Features
Things to deliberately NOT build. These appear useful but create more problems than they solve in this context.
AF1. Shared Crawl State via Redis/Database
What it looks like: Storing proxy health, ban status, session cookies in Redis so multiple ECS nodes share state.
Why avoid: Adds Redis as a hard dependency for crawlers. ECS spot instances are ephemeral — shared state becomes stale or corrupted when nodes die mid-operation. Each crawler process managing its own in-memory state is simpler, more reliable, and eliminates inter-node coordination. The existing SmartIPManager in-memory approach is correct.
What to do instead: Each ECS node gets its own proxy pool slice. Node-level state stays in-process. Only keyword assignment (which is already serialized via the backend API) needs cross-node coordination.
AF2. Playwright / Browser Automation as Default
What it looks like: Using headless Chrome via Playwright for all platform requests to bypass detection.
Why avoid: Playwright adds 3–5x resource overhead per request. Boss直聘's WeChat mini-program API and QCWY/Zhilian's mobile APIs are accessible via direct HTTP with correct headers and TLS fingerprints (requests_go + TLS_CHROME_LATEST). Browser automation is only justified when the platform serves JS-rendered content with no underlying API — which is not the case for any of the three target platforms. Playwright is already listed as a dependency but should be a last-resort escape hatch, not the default.
What to do instead: Keep requests_go with TLS_CHROME_LATEST and random JA3 as the standard HTTP client. Use Playwright only for specific cookie acquisition flows if needed.
AF3. Real-time Data Processing / Stream Pipeline
What it looks like: Adding Kafka/Flink/Redis Streams to process job data as it arrives.
Why avoid: The current volume (thousands of jobs per day) does not justify stream infrastructure. ClickHouse handles batch inserts efficiently. The 5-minute company cleaning scheduler + async ingest pipeline covers all real-time requirements. Adding a streaming layer multiplies operational complexity with no data quality benefit at this scale.
What to do instead: Keep the batch-insert-with-dedup pattern. Optimize batch sizes if needed (current single-batch-per-page is correct).
AF4. ML-Based Anti-Detection (Behavioral Fingerprinting)
What it looks like: Training ML models to randomize mouse movement patterns, request timing distributions, or user session behavior to mimic human browsing.
Why avoid: The three target platforms are mobile/mini-program APIs, not web browser interfaces. Behavioral fingerprinting only matters for full browser sessions. The current approach (random delays 10–20s, TLS fingerprint spoofing, IP rotation) addresses the actual detection vectors. ML-based solutions add training/maintenance cost with no benefit against the actual detection mechanisms used.
What to do instead: Tune the existing IPStrategyConfig parameters (delay range, failure thresholds) per platform via environment variables.
AF5. Crawler-Side Data Normalization / Schema Transformation
What it looks like: Having crawlers parse, clean, and normalize job data fields (salary parsing, city name normalization, skill extraction) before pushing to the backend.
Why avoid: Normalization logic changes frequently as platforms update their response schemas. If normalization breaks, raw data is already lost — it was never stored. Storing raw JSON and normalizing downstream (in ClickHouse views or a separate cleaning service) preserves the original data for re-processing. The existing CleaningService covers this correctly.
What to do instead: Crawlers push raw JSON. Backend's IngestService stores raw JSON in json_data column. ClickHouse views and CleaningService handle field extraction and normalization.
AF6. Separate Crawler Orchestration Database
What it looks like: A dedicated database (separate from MySQL/ClickHouse) to manage crawler state, job queues, worker registrations.
Why avoid: The existing keyword table in MySQL + IpUploadStats + ScheduledTaskRun already provide sufficient crawler state management. Adding a third database (e.g., MongoDB for crawler state) fragments operational concerns and adds another system to maintain. The backend API is the correct coordination point for ECS crawler nodes.
What to do instead: Extend the keyword model with the fields needed for state tracking (last_completed_page, status, failure_count). The existing pattern is sufficient.
Feature Dependencies
Unified Base Classes (1) → IP/Proxy Management (6) [proxy logic in shared _http_client.py]
Unified Base Classes (1) → Retry with Backoff (5) [retry wrapper in shared _http_client.py]
Unified Base Classes (1) → Token Lifecycle (7) [token client in shared boss module]
Keyword Dispatch (2) → Per-Page Upload (3) [keyword_id needed for page progress reporting]
Keyword Dispatch (2) → Keyword Progress Resume (D4) [last_completed_page field]
Per-Page Upload (3) → Dedup at Ingestion (4) [smaller batches = faster dedup queries]
IP/Proxy Management (6) → Crawl Health Dashboard (D1) [IpUploadStats feeds dashboard]
Keyword Dispatch (2) → Unified Crawl Reporting API (D6) [keyword state machine]
MVP Recommendation for This Refactor
The refactor's highest-leverage work, in priority order:
Phase 1: Foundation (unblock everything else)
1. Unified base classes: canonicalize _base.py, _http_client.py, and sign modules into a single importable package
2. SmartIPManager + IPAnomalyDetector moved to shared module, used by all three platforms
3. Standardized logging (loguru) in all crawlers; replace print() calls
Phase 2: Data reliability
4. Per-page upload loop (following spiderJobs/runner/loop.py pattern) for all three platforms
5. Retry with backoff in _http_client.py
6. All platforms route through IngestService + PlatformConfig registry for dedup
Phase 3: Operational visibility
7. Crawl completion reporting API + keyword state machine
8. Crawler health dashboard extension (job ingestion metrics, keyword status)
Defer:
- Platform anti-detection profiles (D2): low urgency, current env-var approach is adequate
- Scheduler task alerting (D5): existing SMTP infra works; wire task failures as follow-up
- Company inline crawl (D3): correct pattern exists in spiderJobs/runner/; include if refactoring that module
Sources
- Codebase analysis: .planning/codebase/ARCHITECTURE.md (2026-03-21)
- app/services/crawler/_base.py: BaseFetcher/BaseSearcher template method pattern
- app/services/crawler/_http_client.py: HTTPClient, proxy priority logic, TLS fingerprint
- app/core/algorithms/antispider.py: IPStrategyConfig, IPAnomalyDetector, SmartIPManager
- app/services/ingest/dedup.py: batch_dedup_filter, 1/2 column key support
- app/services/ingest/service.py: IngestService.store_batch() pipeline
- app/services/ingest/registry.py: PlatformConfig registry, DedupFieldSpec
- spiderJobs/runner/loop.py: run_crawl_loop(), per-page upload, breakpoint resume, company inline crawl
- jobs_spider/boss/boos_api.py (lines 1–100): SmartIPManager usage, sleep_random_between(), token management
- app/core/ip_tracking.py: IpTrackingMiddleware, IpUploadStats
- app/core/locks.py: DistributedLock (Redis + file fallback)
- .planning/PROJECT.md: refactor scope, constraints, key decisions