
Feature Landscape

Domain: Multi-platform job data crawler framework (Boss直聘 / 前程无忧 (QCWY) / 智联招聘 (Zhilian))
Researched: 2026-03-21
Context: Refactoring an existing crawler system, not a greenfield build. Features are evaluated against what exists and what the refactor must deliver.


Current State Baseline

Before categorizing features, this is what the existing system already has (do not re-build):

Capability | Implementation | Location
Keyword-driven crawl dispatch | /api/v1/keyword/available pull model | app/api/v1/keyword/
Batch data ingestion pipeline | IngestService + PlatformConfig registry | app/services/ingest/
ClickHouse dedup on insert | batch_dedup_filter() — 1 or 2 column key | app/services/ingest/dedup.py
Random delay anti-detection | sleep_random_between(), 10-20s enforced | jobs_spider/boss/boos_api.py
Proxy rotation + tunnel proxy | SmartIPManager, HTTPClient | app/core/algorithms/antispider.py, _http_client.py
IP anomaly detection | IPAnomalyDetector — 403/429/407, slow response, code=35 | _http_client.py
TLS fingerprint spoofing | requests_go with TLS_CHROME_LATEST, random JA3 | _http_client.py
Structured logging | loguru rotating daily files, 30-day retention | all modules
Scheduled job framework | APScheduler, distributed file/Redis lock | app/core/scheduler.py, locks.py
IP upload tracking | IpTrackingMiddleware → IpUploadStats table | app/core/ip_tracking.py
Company cleaning pipeline | CompanyCleaner collect→process→store, every 5 min | app/services/company_cleaner.py
Frontend monitoring (partial) | monitor.vue — pending queue, today processed, completion rate | web/src/views/cleaning/monitor.vue
Unified crawl loop (partial) | spiderJobs/runner/loop.py — page progress reporting | spiderJobs/runner/
BaseFetcher / BaseSearcher | Template method base classes | app/services/crawler/_base.py

Critical gap: Two parallel crawler frameworks exist (jobs_spider/ + spiderJobs/) with divergent code. The base classes in app/services/crawler/ are partially implemented but not used by jobs_spider/. Dedup logic is duplicated in both frameworks.


Table Stakes

Features the system cannot function correctly without. Missing or broken = data pipeline breaks.

1. Unified Platform Base Classes

Why required: Without a single BasePlatformClient that jobs_spider/, app/services/crawler/, AND spiderJobs/ all import from, every bug fix or anti-detection update must be applied in 3 places — guaranteed divergence.

What this means concretely:

  • Single _base.py in a shared package (not copied between directories)
  • BaseFetcher, BaseSearcher, ApiResult, parse_response defined once
  • Each platform implements _build_params(), _parse() — nothing else
  • Sign/auth modules (_boss_sign.py, _zhilian_sign.py, etc.) imported from the shared package
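
The template-method split in the list above can be sketched as follows. This is a minimal illustration, not the code in app/services/crawler/_base.py: the ApiResult field names, the `transport` callable standing in for the shared HTTP client, and ExampleSearcher with its payload shape are all assumptions made for the example.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class ApiResult:
    """Hypothetical result container; the field names are assumptions."""
    ok: bool
    items: list[dict[str, Any]] = field(default_factory=list)
    error: Optional[str] = None


# Stand-in for the shared HTTP client: takes request params, returns decoded JSON.
Transport = Callable[[dict[str, Any]], dict[str, Any]]


class BaseSearcher(ABC):
    """Owns the request flow; subclasses supply only the two hooks."""

    def __init__(self, transport: Transport) -> None:
        self._transport = transport

    def search(self, keyword: str, page: int) -> ApiResult:
        params = self._build_params(keyword, page)
        try:
            raw = self._transport(params)
        except Exception as exc:  # transport failures become a failed result
            return ApiResult(ok=False, error=str(exc))
        return ApiResult(ok=True, items=self._parse(raw))

    @abstractmethod
    def _build_params(self, keyword: str, page: int) -> dict[str, Any]:
        ...

    @abstractmethod
    def _parse(self, raw: dict[str, Any]) -> list[dict[str, Any]]:
        ...


class ExampleSearcher(BaseSearcher):
    """Illustrative platform subclass; the payload shape is made up."""

    def _build_params(self, keyword: str, page: int) -> dict[str, Any]:
        return {"query": keyword, "page": page}

    def _parse(self, raw: dict[str, Any]) -> list[dict[str, Any]]:
        return raw.get("jobs", [])
```

Because the flow lives in one `search()` method, a fix to error handling or anti-detection would land in a single place and every platform subclass would pick it up on import.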

Complexity: Medium — requires restructuring import paths, not rewriting logic.

Sub-feature | Status | Notes
Single base class file | Partial — app/services/crawler/_base.py exists but is a copy, not the source of truth | Must become canonical
Shared sign modules | Partial — _boss_sign.py etc. exist in app/services/crawler/ | jobs_spider/ uses its own copies
spiderJobs/ consolidation | Missing — spiderJobs/core/base.py is yet another copy | Must be eliminated or unified

2. Keyword-Driven Dispatch with Crawl State Tracking

Why required: Without atomic keyword assignment with state tracking, multiple ECS nodes will crawl the same keyword simultaneously, producing massive duplicates and wasting API quota.

What this means concretely:

  • GET /api/v1/keyword/available?reserve=True atomically marks a keyword as in-progress
  • Keyword state machine: pending → in_progress → completed / failed
  • last_completed_page persisted so interrupted crawls can resume from breakpoint
  • Crawlers report status=completed|failed + page count on finish
  • The spiderJobs/runner/loop.py already implements report_crawl_complete() and report_page_progress() — this is the model to follow
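
The reservation and state machine described above can be illustrated with an in-memory stand-in. This is a sketch only: the KeywordQueue class, its method names, and the use of a thread lock are assumptions for the example. A real backend would perform the reservation inside one database transaction (for example a conditional UPDATE) rather than in process memory.

```python
import threading
from enum import Enum


class KeywordState(str, Enum):
    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"
    FAILED = "failed"


# Allowed transitions, following pending -> in_progress -> completed / failed.
ALLOWED = {
    KeywordState.PENDING: {KeywordState.IN_PROGRESS},
    KeywordState.IN_PROGRESS: {KeywordState.COMPLETED, KeywordState.FAILED},
    KeywordState.COMPLETED: set(),
    KeywordState.FAILED: set(),
}


class KeywordQueue:
    """In-memory stand-in for the keyword table (hypothetical)."""

    def __init__(self, keyword_ids):
        self._lock = threading.Lock()
        self._state = {k: KeywordState.PENDING for k in keyword_ids}

    def reserve(self):
        """Atomically pick one pending keyword and mark it in progress."""
        with self._lock:
            for kid, state in self._state.items():
                if state is KeywordState.PENDING:
                    self._state[kid] = KeywordState.IN_PROGRESS
                    return kid
        return None  # nothing left to crawl

    def transition(self, kid, new_state):
        """Apply a state change, rejecting anything the state machine forbids."""
        with self._lock:
            current = self._state[kid]
            if new_state not in ALLOWED[current]:
                raise ValueError(
                    f"illegal transition {current.value} -> {new_state.value}"
                )
            self._state[kid] = new_state

    def state(self, kid):
        return self._state[kid]
```

Two callers that invoke `reserve()` concurrently can never receive the same keyword, which is the property the dispatch endpoint needs across ECS nodes.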

Complexity: Low — pattern already exists in spiderJobs/runner/, needs to be the standard.

Sub-feature | Status | Notes
Atomic keyword assignment | Exists in jobs_spider/boss/boos_api.py | Not in qcwy/zhilian
Crawl completion reporting | Exists in spiderJobs/runner/loop.py | Not in jobs_spider/
Breakpoint resume (last_completed_page) | Exists in spiderJobs/runner/loop.py | Not in jobs_spider/

3. Per-Page Batch Upload (Not End-of-Keyword Batch)

Why required: If a crawler accumulates all pages before pushing, a crash at page 9 of 10 loses everything. Uploading per-page means each page's data is safe before moving to the next.

What this means concretely:

  • After each page fetch: POST /api/v1/universal/data/batch-store-async with that page's data
  • Do not accumulate across pages in memory
  • Report page progress to backend (report_page_progress())
  • spiderJobs/runner/loop.py does this correctly — jobs_spider/boss/boos_api.py accumulates and pushes at the end
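
A minimal sketch of the per-page loop, assuming injected callables in place of the real HTTP calls. The function name `crawl_keyword` and its parameters are hypothetical; `upload_batch` stands in for the POST to /api/v1/universal/data/batch-store-async, and `report_page_progress` mirrors the name used in spiderJobs/runner/loop.py without claiming its signature.

```python
from typing import Any, Callable

Page = list[dict[str, Any]]


def crawl_keyword(
    keyword_id: int,
    last_completed_page: int,
    max_pages: int,
    fetch_page: Callable[[int], Page],
    upload_batch: Callable[[Page], None],
    report_page_progress: Callable[[int, int], None],
) -> int:
    """Fetch, upload, and checkpoint one page at a time.

    Returns the number of the last page that was uploaded and reported.
    """
    page = last_completed_page
    # Resume from the page after the last checkpoint.
    for page_no in range(last_completed_page + 1, max_pages + 1):
        rows = fetch_page(page_no)
        if not rows:  # empty page: nothing more to crawl
            break
        upload_batch(rows)  # this page is stored before the next fetch
        report_page_progress(keyword_id, page_no)
        page = page_no
    return page
```

If `fetch_page` raises on page 9, pages 1 through 8 have already been uploaded and checkpointed, so a later run that passes the stored `last_completed_page` only repeats page 9 onward.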

Complexity: Low — structural change to push loop.

Sub-feature | Status | Notes
Per-page upload | Missing in jobs_spider/boss/ | spiderJobs/runner/ has it
Per-page progress reporting | Missing in jobs_spider/ | spiderJobs/runner/ has it

4. Dedup at Ingestion (ClickHouse Query Before Insert)

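Why required: Without dedup, repeated crawls of the same keyword accumulate duplicate job records, corrupting analytics counts. Merge-time deduplication in ClickHouse's MergeTree family (for example ReplacingMergeTree) happens only eventually, during background merges, so it is not sufficient for real-time correctness.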

What this means concretely:

  • batch_dedup_filter() queries ClickHouse for existing keys before inserting
  • Dedup key per platform: Boss = job_id; QCWY = (job_id, update_date_time); Zhilian = (number, first_publish_time)
  • Dedup happens in IngestService, not in the crawler — crawler just pushes raw data
  • All three platforms must route through the same IngestService (currently QCWY has separate logic)
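
The per-platform key selection can be sketched as below. This is not the body of batch_dedup_filter(); the dictionary keys ("boss", "qcwy", "zhilian"), the helper names, and the `existing_keys` argument (standing in for the result of the ClickHouse lookup) are assumptions for the example. Only the key columns come from the list above.

```python
from typing import Any, Iterable

# Key columns per platform, as listed above.
DEDUP_KEYS: dict[str, tuple[str, ...]] = {
    "boss": ("job_id",),
    "qcwy": ("job_id", "update_date_time"),
    "zhilian": ("number", "first_publish_time"),
}


def dedup_key(platform: str, row: dict[str, Any]) -> tuple:
    """Build the 1- or 2-column key for one raw record."""
    return tuple(row[col] for col in DEDUP_KEYS[platform])


def filter_new_rows(
    platform: str,
    rows: Iterable[dict[str, Any]],
    existing_keys: Iterable[tuple],
) -> list[dict[str, Any]]:
    """Drop rows whose key is already stored or repeated within the batch."""
    seen = set(existing_keys)
    fresh = []
    for row in rows:
        key = dedup_key(platform, row)
        if key not in seen:
            seen.add(key)  # also guards against duplicates inside one batch
            fresh.append(row)
    return fresh
```

Keeping the key table next to the filter means a crawler never needs to know which columns identify a record; it only pushes raw rows.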

Complexity: Low — IngestService + PlatformConfig registry already works; just need all platforms to use it.

Sub-feature | Status | Notes
batch_dedup_filter() | Exists and works | Supports 1 or 2 key columns
Boss dedup via registry | Exists | Working
QCWY dedup via registry | Partial | Separate code path; needs unification
Zhilian dedup via registry | Partial | Same issue

5. Retry with Exponential Backoff

Why required: Platform APIs are unstable — rate limits, transient 5xx errors, proxy failures. Without retry, a single flaky response aborts an entire crawl session.

What this means concretely:

  • Retry on: network timeout, requests.exceptions.RequestException, HTTP 5xx, platform business errors that are transient (not bans)
  • Do NOT retry on: HTTP 403/429 IP ban signals (switch proxy instead), HTTP 401 token expired (refresh token instead)
  • Max retries: 3 attempts with exponential backoff (2s, 4s, 8s, plus jitter)
  • After max retries: log error, report status=failed to backend, continue to next keyword (do not crash)
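
One way to express the retry policy above is a small wrapper that retries only transient failures. Everything here is a sketch: the three exception classes, `backoff_delay`, and `call_with_retry` are hypothetical names, the jitter width of 0.5s is a placeholder, and "3 attempts" is read as three retries after the first call so that the delays are 2s, 4s, and 8s.

```python
import random
import time


class BanSignal(Exception):
    """Hypothetical: raised for HTTP 403/429; handled by switching proxy."""


class TokenExpired(Exception):
    """Hypothetical: raised for HTTP 401; handled by refreshing the token."""


class TransientError(Exception):
    """Hypothetical: timeouts, HTTP 5xx, transient business errors."""


def backoff_delay(attempt, base=2.0, jitter=0.5, rng=random.random):
    """Delay before retry number `attempt` (1-based): 2s, 4s, 8s plus jitter."""
    return base * 2 ** (attempt - 1) + rng() * jitter


def call_with_retry(func, max_retries=3, sleep=time.sleep, rng=random.random):
    """Run func(); retry only TransientError, at most max_retries times."""
    attempt = 0
    while True:
        try:
            return func()
        except TransientError:
            attempt += 1
            if attempt > max_retries:
                raise  # caller logs, reports status=failed, moves on
            sleep(backoff_delay(attempt, rng=rng))
        # BanSignal and TokenExpired are deliberately not caught here:
        # they propagate so the caller can switch proxy or refresh the token.
```

Injecting `sleep` and `rng` keeps the wrapper testable without real waiting; production code would leave both at their defaults.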

Current state: No structured retry in either jobs_spider/ or app/services/crawler/. Failures cause immediate abort.

Complexity: Medium — needs a retry decorator/wrapper in _http_client.py.

Sub-feature | Status | Notes
Retry on transient errors | Missing | Critical gap
Exponential backoff | Missing |
Distinguish retryable vs ban | Missing | IPAnomalyDetector detects bans but retry is not wired

6. IP/Proxy Management with Ban Detection

Why required: Chinese recruitment platforms aggressively detect and block scraper IPs. Without automated ban detection and proxy switching, crawlers die silently and collect no data.

What this means concretely:

  • IPAnomalyDetector checks HTTP 403/429/407, slow response (>5s), business code=35, message containing "IP地址存在异常"
  • On ban: switch proxy immediately (not after retry), rebuild requests.Session, log ban event
  • SmartIPManager handles: tunnel proxy → proxy pool round-robin → local direct connection with cooldown
  • Proxy health state is in-memory per crawler process (no shared state needed — each ECS node manages its own pool)
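
The detection rules listed above reduce to a single predicate. The sketch below is not the IPAnomalyDetector implementation; the function name, the returned reason strings, and the assumption that the decoded response body exposes `code` and `message` keys are all illustrative.

```python
BAN_STATUS_CODES = {403, 407, 429}
BAN_BUSINESS_CODE = 35
BAN_MESSAGE_FRAGMENT = "IP地址存在异常"  # "IP address is abnormal"
SLOW_RESPONSE_SECONDS = 5.0


def detect_anomaly(status_code, elapsed_seconds, body):
    """Return a short reason string if the response looks like a ban, else None."""
    if status_code in BAN_STATUS_CODES:
        return f"http_{status_code}"
    if elapsed_seconds > SLOW_RESPONSE_SECONDS:
        return "slow_response"
    if isinstance(body, dict):
        # The "code" and "message" key names are assumptions about the payload.
        if body.get("code") == BAN_BUSINESS_CODE:
            return "business_code_35"
        if BAN_MESSAGE_FRAGMENT in str(body.get("message", "")):
            return "ban_message"
    return None
```

A caller would treat any non-None result as the signal to switch proxy and rebuild the session immediately, rather than handing the request to the retry path.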

Current state: EXISTS and works in jobs_spider/boss/boos_api.py. NOT present in qcwy or zhilian crawlers. Must be shared via base class.

Complexity: Low — logic exists, needs to be in shared module not duplicated.

Sub-feature | Status | Notes
IPAnomalyDetector | Exists in boss + antispider.py | Duplicated — must consolidate
SmartIPManager | Exists in boss | Missing from qcwy/zhilian
Session rebuild on ban | Exists in boss | Missing from qcwy/zhilian

7. Token Lifecycle Management (Boss直聘)

Why required: Boss直聘 uses MPT tokens (WeChat mini-program login tokens) that expire. Without automatic refresh, crawlers stop collecting data silently.

What this means concretely:

  • GET /api/v1/token/tokens returns available active Boss tokens
  • On token-expired response (business code indicating auth failure): mark token as inactive, fetch next available token
  • If no tokens available: log error, wait, retry — do not crash
  • Token management is Boss-specific; QCWY and Zhilian use cookie/session auth
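
The rotation behavior above can be sketched with a small pool object. The TokenPool class is hypothetical; `fetch_tokens` is an injected callable standing in for GET /api/v1/token/tokens, and expired tokens are only tracked locally here, whereas the real flow may also mark them inactive on the backend.

```python
class TokenPool:
    """Sketch of client-side token rotation (hypothetical class)."""

    def __init__(self, fetch_tokens):
        self._fetch_tokens = fetch_tokens  # returns a list of active tokens
        self._inactive = set()
        self._current = None

    def current(self):
        """Return a usable token, or None if the pool is empty."""
        if self._current is None:
            for token in self._fetch_tokens():
                if token not in self._inactive:
                    self._current = token
                    break
        return self._current

    def mark_expired(self, token):
        """Record an expired token so it is skipped from now on."""
        self._inactive.add(token)
        if self._current == token:
            self._current = None
```

When `current()` returns None, the caller would log the condition, wait, and ask again later instead of exiting, which matches the "do not crash" requirement.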

Current state: Exists in jobs_spider/boss/boos_api.py. Needs to be in the shared Boss client module.

Complexity: Low — already implemented, just needs to stay in shared code.


8. Structured Error Logging per Crawl Session

Why required: Without structured logs, diagnosing why data volume dropped on a given day is impossible. Operators cannot tell if the issue is IP bans, token expiry, keyword exhaustion, or platform API changes.

What this means concretely:

  • loguru rotating daily log files per platform, 30-day retention (already exists)
  • Each log entry must include: platform, keyword (city+job), page, error_type, proxy_used, http_status
  • Log at: session start, each page success/failure, proxy switch, token switch, crawl completion
  • Errors forwarded to backend ScheduledTaskRun table for dashboard visibility
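
To keep every entry carrying the same fields, a helper can build the log context in one place. The sketch below uses only the standard library so that it runs on its own; the helper names and the JSON-line format are assumptions. With loguru, the dict from `build_log_context` could be attached to a logger via `logger.bind(**context)`.

```python
import json

REQUIRED_FIELDS = (
    "platform", "keyword", "page", "error_type", "proxy_used", "http_status",
)


def build_log_context(**fields):
    """Return a dict with every required field present (None when unknown)."""
    unknown = set(fields) - set(REQUIRED_FIELDS)
    if unknown:
        raise KeyError(f"unexpected log fields: {sorted(unknown)}")
    return {name: fields.get(name) for name in REQUIRED_FIELDS}


def format_log_line(event, **fields):
    """Serialize one event as a JSON line.

    ensure_ascii=False keeps Chinese keywords readable in the log file.
    """
    record = {"event": event, **build_log_context(**fields)}
    return json.dumps(record, ensure_ascii=False)
```

Rejecting unknown field names catches typos such as `http_code` early, so dashboards that filter on `http_status` do not silently miss entries.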

Current state: Logging exists but is inconsistent — boss uses loguru, qcwy/zhilian use print(). Must standardize.

Complexity: Low — configuration change + replacing print() with logger.


Differentiators

Features that make the system operationally robust and reduce manual intervention. Not strictly required for data collection, but essential for unattended production operation.

D1. Crawl Health Dashboard (Frontend)

Value: Operators can see at a glance: which platforms are crawling, how many jobs were ingested today, which IPs are active/banned, whether company cleaning is keeping up with job volume.

What this means concretely:

  • Expand existing monitor.vue to cover job crawling (not just company cleaning)
  • Per-platform: jobs ingested today (from IpUploadStats), keywords completed/pending/failed, active crawler IPs
  • Alert indicators: zero ingestion in last 30 min, all keywords exhausted, token pool empty
  • Auto-refresh every 30s (already exists for company cleaning panel)
  • Backend API: /api/v1/analytics/crawler-health aggregating from IpUploadStats + keyword tables

Complexity: Medium — requires new backend analytics endpoint + frontend component.


D2. Platform-Specific Anti-Detection Profiles

Value: Each platform has different detection mechanisms. A unified anti-detection config per platform reduces manual tuning when bans increase.

What this means concretely:

  • Per-platform config dataclass: min_delay, max_delay, max_pages_per_keyword, user_agent_rotation, cookie_fields_to_preserve
  • Boss: must preserve __zp_sseed__, __zp_sname__, __zp_sts__ cookies (anti-bot challenge)
  • QCWY: standard cookie session, no special handling
  • Zhilian: custom header signing (already in _zhilian_sign.py)
  • Config loaded from environment variables, not hardcoded
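
A per-platform profile loaded from the environment might look like the sketch below. The dataclass fields follow the list above; the class name, the environment variable names (such as BOSS_MIN_DELAY), and all default values are placeholders chosen for the example, not settings taken from the codebase.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class AntiDetectionProfile:
    min_delay: float
    max_delay: float
    max_pages_per_keyword: int
    user_agent_rotation: bool
    cookie_fields_to_preserve: tuple[str, ...]


def load_profile(platform, env=None):
    """Read a profile from variables such as BOSS_MIN_DELAY (hypothetical names)."""
    env = os.environ if env is None else env
    prefix = platform.upper() + "_"
    cookies = env.get(prefix + "COOKIE_FIELDS", "")
    profile = AntiDetectionProfile(
        min_delay=float(env.get(prefix + "MIN_DELAY", "10")),        # placeholder default
        max_delay=float(env.get(prefix + "MAX_DELAY", "20")),        # placeholder default
        max_pages_per_keyword=int(env.get(prefix + "MAX_PAGES", "10")),  # placeholder default
        user_agent_rotation=env.get(prefix + "UA_ROTATION", "1") == "1",
        cookie_fields_to_preserve=tuple(c for c in cookies.split(",") if c),
    )
    if profile.min_delay > profile.max_delay:
        raise ValueError("min_delay must not exceed max_delay")
    return profile
```

For Boss, setting `BOSS_COOKIE_FIELDS=__zp_sseed__,__zp_sname__,__zp_sts__` would carry the three challenge cookies into the profile without hardcoding them in the crawler.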

Current state: Delays are configurable via env vars. Cookie handling and platform-specific behavior are hardcoded in each crawler.

Complexity: Low-Medium — config dataclass refactor.


D3. Company Inline Crawl

Value: Crawling company details in the same session as job search reduces total requests, shares the same authenticated session/proxy, and keeps company data fresh relative to job postings.

What this means concretely:

  • After each page of job results: extract unique company IDs from jobs, fetch company details for new ones
  • Session-level seen_companies set prevents re-fetching same company in same run
  • Falls back gracefully if company fetch fails — job data is not blocked on company data
  • Already implemented in spiderJobs/runner/loop.py as _crawl_companies_from_jobs() — this is the reference
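
The session-level set and the graceful fallback can be sketched as follows. This is not the body of _crawl_companies_from_jobs(); the function name, the `company_id` field, and the injected `fetch_company` callable are assumptions for the example.

```python
def crawl_companies_from_jobs(jobs, seen_companies, fetch_company):
    """Fetch details for companies not yet seen in this run; never raise."""
    details = []
    for job in jobs:
        company_id = job.get("company_id")  # field name is an assumption
        if company_id is None or company_id in seen_companies:
            continue
        # Mark first, so a failing company is not retried within this run.
        seen_companies.add(company_id)
        try:
            details.append(fetch_company(company_id))
        except Exception:
            continue  # job data must not be blocked on company data
    return details
```

Marking a company as seen before the fetch is a design choice: it trades a possible missed company in this run for a bounded number of requests. A later run starts with an empty set and tries again.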

Current state: Implemented in spiderJobs/runner/. Missing from jobs_spider/.

Complexity: Low — pattern exists, needs to be the standard in unified loop.


D4. Keyword Progress Persistence and Resume

Value: ECS instances can be terminated mid-crawl (spot instance preemption). Without persisted progress, the entire keyword must restart. With last_completed_page, the crawler resumes from where it stopped.

What this means concretely:

  • Backend stores last_completed_page per keyword (already in keyword model)
  • Crawler reads start_page = last_completed_page + 1 from keyword response
  • On each successful page: report_page_progress(keyword_id, page) updates backend
  • On spot instance termination: next invocation continues from last checkpoint

Current state: spiderJobs/runner/loop.py fully implements this. jobs_spider/ does not.

Complexity: Low — spiderJobs/runner/ is the reference implementation.


D5. Scheduler Task Run History + Alerting

Value: When company cleaning fails silently or a scheduled task stops running (due to a lock deadlock), operators need visibility without watching logs.

What this means concretely:

  • ScheduledTaskRun table already records task runs with success/failure status
  • Email alert when: task fails N consecutive times, or hasn't run in 2x expected interval
  • ip_alert_job (every 10 min) already does IP anomaly alerting via SMTP — extend this pattern
  • Frontend: scheduler health section in monitor dashboard
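
The two alert conditions above can be written as one predicate over recent task runs. The function name, the `(finished_at, succeeded)` tuple shape, and the default of three consecutive failures for N are assumptions; the real ScheduledTaskRun columns may differ.

```python
from datetime import datetime, timedelta


def should_alert(runs, expected_interval, now, max_consecutive_failures=3):
    """Decide whether a task needs an alert.

    runs: list of (finished_at, succeeded) tuples, oldest first.
    """
    if not runs:
        return True  # nothing recorded at all is treated as overdue
    last_finished, _ = runs[-1]
    if now - last_finished > 2 * expected_interval:
        return True  # has not run in 2x the expected interval
    recent = runs[-max_consecutive_failures:]
    return len(recent) == max_consecutive_failures and all(
        not succeeded for _, succeeded in recent
    )
```

A periodic job in the style of ip_alert_job could evaluate this per task and send mail only when it returns True.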

Current state: ScheduledTaskRun model + _record_task_run() exists. Email alerting exists for IP anomalies. Not wired for task-level failures.

Complexity: Medium — extend existing alerting pattern.


D6. Unified Crawl Result Reporting API

Value: Crawlers that report completion status enable automatic keyword queue management — completed keywords can be reset for the next crawl cycle, failed ones flagged for retry.

What this means concretely:

  • POST /api/v1/keyword/crawl-complete with {keyword_id, status, pages_completed, jobs_found, error_message}
  • Backend updates keyword state: completed → reset to pending after 24h; failed → increment failure count
  • Already partially designed in spiderJobs/runner/api_client.py
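
The backend side of the completion report can be sketched as two small functions. The KeywordRecord dataclass and both function names are hypothetical; only the rules (reset completed keywords to pending after 24h, increment a failure count on failure) come from the list above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

RESET_AFTER = timedelta(hours=24)


@dataclass
class KeywordRecord:
    """Hypothetical slice of the keyword model."""
    status: str = "in_progress"
    failure_count: int = 0
    completed_at: Optional[datetime] = None


def apply_crawl_complete(record, status, now):
    """Apply a crawler's completion report to the keyword record."""
    if status == "completed":
        record.status = "completed"
        record.completed_at = now
    elif status == "failed":
        record.status = "failed"
        record.failure_count += 1
    else:
        raise ValueError(f"unknown status: {status!r}")


def reset_if_due(record, now):
    """Return True if a completed keyword was put back into the queue."""
    if (
        record.status == "completed"
        and record.completed_at is not None
        and now - record.completed_at >= RESET_AFTER
    ):
        record.status = "pending"
        record.completed_at = None
        return True
    return False
```

`reset_if_due` would typically run from a scheduled job, so the crawl-complete endpoint itself stays a simple write.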

Current state: report_crawl_complete() exists in spiderJobs/runner/api_client.py. Not wired to keyword state machine in backend.

Complexity: Medium — backend endpoint + keyword state machine.


Anti-Features

Things to deliberately NOT build. These appear useful but create more problems than they solve in this context.

AF1. Shared Crawl State via Redis/Database

What it looks like: Storing proxy health, ban status, session cookies in Redis so multiple ECS nodes share state.

Why avoid: Adds Redis as a hard dependency for crawlers. ECS spot instances are ephemeral — shared state becomes stale or corrupted when nodes die mid-operation. Each crawler process managing its own in-memory state is simpler, more reliable, and eliminates inter-node coordination. The existing SmartIPManager in-memory approach is correct.

What to do instead: Each ECS node gets its own proxy pool slice. Node-level state stays in-process. Only keyword assignment (which is already serialized via the backend API) needs cross-node coordination.


AF2. Playwright / Browser Automation as Default

What it looks like: Using headless Chrome via Playwright for all platform requests to bypass detection.

Why avoid: Playwright adds 3-5x resource overhead per request. Boss直聘's WeChat mini-program API and QCWY/Zhilian's mobile APIs are accessible via direct HTTP with correct headers and TLS fingerprints (requests_go + TLS_CHROME_LATEST). Browser automation is only justified when the platform serves JS-rendered content with no underlying API — which is not the case for any of the three target platforms. Playwright is already listed as a dependency but should be a last-resort escape hatch, not the default.

What to do instead: Keep requests_go with TLS_CHROME_LATEST and random JA3 as the standard HTTP client. Use Playwright only for specific cookie acquisition flows if needed.


AF3. Real-time Data Processing / Stream Pipeline

What it looks like: Adding Kafka/Flink/Redis Streams to process job data as it arrives.

Why avoid: The current volume (thousands of jobs per day) does not justify stream infrastructure. ClickHouse handles batch inserts efficiently. The 5-minute company cleaning scheduler + async ingest pipeline covers all real-time requirements. Adding a streaming layer multiplies operational complexity with no data quality benefit at this scale.

What to do instead: Keep the batch-insert-with-dedup pattern. Optimize batch sizes if needed (current single-batch-per-page is correct).


AF4. ML-Based Anti-Detection (Behavioral Fingerprinting)

What it looks like: Training ML models to randomize mouse movement patterns, request timing distributions, or user session behavior to mimic human browsing.

Why avoid: The three target platforms are mobile/mini-program APIs, not web browser interfaces. Behavioral fingerprinting only matters for full browser sessions. The current approach (random delays of 10-20s, TLS fingerprint spoofing, IP rotation) addresses the actual detection vectors. ML-based solutions add training/maintenance cost with no benefit against the actual detection mechanisms used.

What to do instead: Tune the existing IPStrategyConfig parameters (delay range, failure thresholds) per platform via environment variables.


AF5. Crawler-Side Data Normalization / Schema Transformation

What it looks like: Having crawlers parse, clean, and normalize job data fields (salary parsing, city name normalization, skill extraction) before pushing to the backend.

Why avoid: Normalization logic changes frequently as platforms update their response schemas. If normalization breaks, raw data is already lost — it was never stored. Storing raw JSON and normalizing downstream (in ClickHouse views or a separate cleaning service) preserves the original data for re-processing. The existing CleaningService covers this correctly.

What to do instead: Crawlers push raw JSON. Backend's IngestService stores raw JSON in json_data column. ClickHouse views and CleaningService handle field extraction and normalization.


AF6. Separate Crawler Orchestration Database

What it looks like: A dedicated database (separate from MySQL/ClickHouse) to manage crawler state, job queues, worker registrations.

Why avoid: The existing keyword table in MySQL + IpUploadStats + ScheduledTaskRun already provide sufficient crawler state management. Adding a third database (e.g., MongoDB for crawler state) fragments operational concerns and adds another system to maintain. The backend API is the correct coordination point for ECS crawler nodes.

What to do instead: Extend the keyword model with the fields needed for state tracking (last_completed_page, status, failure_count). The existing pattern is sufficient.


Feature Dependencies

Unified Base Classes (1) → IP/Proxy Management (6) [proxy logic in shared _http_client.py]
Unified Base Classes (1) → Retry with Backoff (5) [retry wrapper in shared _http_client.py]
Unified Base Classes (1) → Token Lifecycle (7) [token client in shared boss module]
Keyword Dispatch (2) → Per-Page Upload (3) [keyword_id needed for page progress reporting]
Keyword Dispatch (2) → Keyword Progress Resume (D4) [last_completed_page field]
Per-Page Upload (3) → Dedup at Ingestion (4) [smaller batches = faster dedup queries]
IP/Proxy Management (6) → Crawl Health Dashboard (D1) [IpUploadStats feeds dashboard]
Keyword Dispatch (2) → Unified Crawl Reporting API (D6) [keyword state machine]
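
The dependency list above is a small directed graph, so a valid build order can be derived mechanically. The sketch below transcribes the edges using the feature numbers as node names and feeds them to the standard-library `graphlib.TopologicalSorter` (Python 3.9+); the `EDGES` list and `build_order` helper are illustrative names.

```python
from graphlib import TopologicalSorter

# (prerequisite, dependent) pairs transcribed from the list above.
EDGES = [
    ("1", "6"), ("1", "5"), ("1", "7"),
    ("2", "3"), ("2", "D4"), ("3", "4"),
    ("6", "D1"), ("2", "D6"),
]


def build_order(edges):
    """Return the features in an order where prerequisites come first."""
    sorter = TopologicalSorter()
    for prerequisite, dependent in edges:
        # add(node, *predecessors): the dependent waits on its prerequisite.
        sorter.add(dependent, prerequisite)
    return list(sorter.static_order())
```

Any order this returns places features 1 and 2 ahead of everything that depends on them. `static_order()` would raise `graphlib.CycleError` if an edge were ever added that made the graph circular.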

MVP Recommendation for This Refactor

The refactor's highest-leverage work, in priority order:

Phase 1 — Foundation (unblock everything else):

  1. Unified base classes — canonicalize _base.py, _http_client.py, sign modules into a single importable package
  2. SmartIPManager + IPAnomalyDetector moved to shared module, used by all three platforms
  3. Standardized logging (loguru) in all crawlers — replace print() calls

Phase 2 — Data reliability:

  4. Per-page upload loop (following spiderJobs/runner/loop.py pattern) for all three platforms
  5. Retry with backoff in _http_client.py
  6. All platforms route through IngestService + PlatformConfig registry for dedup

Phase 3 — Operational visibility:

  7. Crawl completion reporting API + keyword state machine
  8. Crawler health dashboard extension (job ingestion metrics, keyword status)

Defer:

  • Platform anti-detection profiles (D2): low urgency, current env-var approach is adequate
  • Scheduler task alerting (D5): existing SMTP infra works; wire task failures as follow-up
  • Company inline crawl (D3): correct pattern exists in spiderJobs/runner/; include if refactoring that module

Sources

  • Codebase analysis: .planning/codebase/ARCHITECTURE.md (2026-03-21)
  • app/services/crawler/_base.py — BaseFetcher/BaseSearcher template method pattern
  • app/services/crawler/_http_client.py — HTTPClient, proxy priority logic, TLS fingerprint
  • app/core/algorithms/antispider.py — IPStrategyConfig, IPAnomalyDetector, SmartIPManager
  • app/services/ingest/dedup.py — batch_dedup_filter, 1/2 column key support
  • app/services/ingest/service.py — IngestService.store_batch() pipeline
  • app/services/ingest/registry.py — PlatformConfig registry, DedupFieldSpec
  • spiderJobs/runner/loop.py — run_crawl_loop(), per-page upload, breakpoint resume, company inline crawl
  • jobs_spider/boss/boos_api.py (lines 1-100) — SmartIPManager usage, sleep_random_between(), token management
  • app/core/ip_tracking.py — IpTrackingMiddleware, IpUploadStats
  • app/core/locks.py — DistributedLock (Redis + file fallback)
  • .planning/PROJECT.md — refactor scope, constraints, key decisions