win f3005ef525 docs: add research findings (stack, features, architecture, pitfalls, summary)

2026-03-21 16:36:37 +08:00

Feature Landscape

Domain: Multi-platform job data crawler framework (Boss直聘 / 前程无忧 / 智联招聘)
Researched: 2026-03-21
Context: Refactoring an existing crawler system, not a greenfield build. Features are evaluated against what exists and what the refactor must deliver.

Current State Baseline

Before categorizing features, this is what the existing system already has (do not re-build):

Capability	Implementation	Location
Keyword-driven crawl dispatch	`/api/v1/keyword/available` pull model	`app/api/v1/keyword/`
Batch data ingestion pipeline	`IngestService` + `PlatformConfig` registry	`app/services/ingest/`
ClickHouse dedup on insert	`batch_dedup_filter()` — 1 or 2 column key	`app/services/ingest/dedup.py`
Random delay anti-detection	`sleep_random_between()`, 10–20s enforced	`jobs_spider/boss/boos_api.py`
Proxy rotation + tunnel proxy	`SmartIPManager`, `HTTPClient`	`app/core/algorithms/antispider.py`, `_http_client.py`
IP anomaly detection	`IPAnomalyDetector` — 403/429/407, slow response, code=35	`_http_client.py`
TLS fingerprint spoofing	`requests_go` with `TLS_CHROME_LATEST`, random JA3	`_http_client.py`
Structured logging	`loguru` rotating daily files, 30-day retention	all modules
Scheduled job framework	APScheduler, distributed file/Redis lock	`app/core/scheduler.py`, `locks.py`
IP upload tracking	`IpTrackingMiddleware` → `IpUploadStats` table	`app/core/ip_tracking.py`
Company cleaning pipeline	`CompanyCleaner` collect→process→store, every 5 min	`app/services/company_cleaner.py`
Frontend monitoring (partial)	`monitor.vue` — pending queue, today processed, completion rate	`web/src/views/cleaning/monitor.vue`
Unified crawl loop (partial)	`spiderJobs/runner/loop.py` — page progress reporting	`spiderJobs/runner/`
BaseFetcher / BaseSearcher	Template method base classes	`app/services/crawler/_base.py`

Critical gap: Two parallel crawler frameworks exist (jobs_spider/ + spiderJobs/) with divergent code. The base classes in app/services/crawler/ are partially implemented but not used by jobs_spider/. Dedup logic is duplicated in both frameworks.

Table Stakes

Features the system cannot function correctly without. Missing or broken = data pipeline breaks.

1. Unified Platform Base Classes

Why required: Without a single BasePlatformClient that jobs_spider/, app/services/crawler/, AND spiderJobs/ all import from, every bug fix or anti-detection update must be applied in 3 places — guaranteed divergence.

What this means concretely:

Single _base.py in a shared package (not copied between directories)
BaseFetcher, BaseSearcher, ApiResult, parse_response defined once
Each platform implements _build_params(), _parse() — nothing else
Sign/auth modules (_boss_sign.py, _zhilian_sign.py, etc.) imported from the shared package
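
The template-method split described above can be sketched as follows. This is a minimal illustration, not the contents of the existing _base.py: the ApiResult fields, the transport callable (a stand-in for the shared HTTP client), and DemoSearcher with its parameter and response keys are all assumptions made for the example.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class ApiResult:
    """Normalized result of one platform request (fields are illustrative)."""
    ok: bool
    items: list = field(default_factory=list)
    error: Optional[str] = None


class BaseSearcher(ABC):
    """Template method: search() is fixed, each platform fills in two hooks."""

    def __init__(self, transport: Callable[[dict], dict]):
        # transport stands in for the shared HTTP client (hypothetical signature)
        self._transport = transport

    def search(self, keyword: str, page: int) -> ApiResult:
        params = self._build_params(keyword, page)
        try:
            raw = self._transport(params)
        except Exception as exc:  # a real client would catch narrower errors
            return ApiResult(ok=False, error=str(exc))
        return ApiResult(ok=True, items=self._parse(raw))

    @abstractmethod
    def _build_params(self, keyword: str, page: int) -> dict:
        """Return the platform-specific request parameters."""

    @abstractmethod
    def _parse(self, raw: dict) -> list:
        """Extract job records from the platform-specific response."""


class DemoSearcher(BaseSearcher):
    """Hypothetical platform subclass; keys below are made up."""

    def _build_params(self, keyword: str, page: int) -> dict:
        return {"query": keyword, "page": page}

    def _parse(self, raw: dict) -> list:
        return raw.get("jobs", [])
```

With this shape, a fix to error handling or anti-detection behavior lands once in search() and every platform subclass inherits it.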

Complexity: Medium — requires restructuring import paths, not rewriting logic.

Sub-feature	Status	Notes
Single base class file	Partial — `app/services/crawler/_base.py` exists but is a copy, not the source of truth	Must become canonical
Shared sign modules	Partial — `_boss_sign.py` etc. exist in `app/services/crawler/`	`jobs_spider/` uses its own copies
`spiderJobs/` consolidation	Missing — `spiderJobs/core/base.py` is yet another copy	Must be eliminated or unified

2. Keyword-Driven Dispatch with Crawl State Tracking

Why required: Without atomic keyword assignment with state tracking, multiple ECS nodes will crawl the same keyword simultaneously, producing massive duplicates and wasting API quota.

What this means concretely:

GET /api/v1/keyword/available?reserve=True atomically marks a keyword as in-progress
Keyword state machine: pending → in_progress → completed / failed
last_completed_page persisted so interrupted crawls can resume from breakpoint
Crawlers report status=completed|failed + page count on finish
The spiderJobs/runner/loop.py already implements report_crawl_complete() and report_page_progress() — this is the model to follow
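
The state machine and the atomic reservation above can be illustrated with an in-memory stand-in. This is a sketch under assumptions: KeywordQueue, Keyword, and the lock-based reservation are hypothetical, and the real backend would presumably enforce atomicity in the database (for example with a conditional update inside a transaction) rather than with a process-local lock.

```python
import threading
from dataclasses import dataclass
from typing import Optional

# Allowed transitions: pending -> in_progress -> completed / failed
ALLOWED = {
    "pending": {"in_progress"},
    "in_progress": {"completed", "failed"},
    "completed": set(),
    "failed": set(),
}


@dataclass
class Keyword:
    id: int
    text: str
    status: str = "pending"
    last_completed_page: int = 0


class KeywordQueue:
    """In-memory stand-in for the keyword table (hypothetical)."""

    def __init__(self, keywords):
        self._keywords = {k.id: k for k in keywords}
        self._lock = threading.Lock()

    def reserve(self) -> Optional[Keyword]:
        """Find a pending keyword and mark it in_progress in one critical section."""
        with self._lock:
            for kw in self._keywords.values():
                if kw.status == "pending":
                    kw.status = "in_progress"
                    return kw
        return None

    def finish(self, keyword_id: int, status: str) -> None:
        """Apply a crawler's completed/failed report, rejecting illegal moves."""
        with self._lock:
            kw = self._keywords[keyword_id]
            if status not in ALLOWED[kw.status]:
                raise ValueError(f"illegal transition {kw.status} -> {status}")
            kw.status = status
```

Because the check and the status change happen under one lock, two callers can never receive the same keyword.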

Complexity: Low — pattern already exists in spiderJobs/runner/, needs to be the standard.

Sub-feature	Status	Notes
Atomic keyword assignment	Exists in `jobs_spider/boss/boos_api.py`	Not in qcwy/zhilian
Crawl completion reporting	Exists in `spiderJobs/runner/loop.py`	Not in `jobs_spider/`
Breakpoint resume (last_completed_page)	Exists in `spiderJobs/runner/loop.py`	Not in `jobs_spider/`

3. Per-Page Batch Upload (Not End-of-Keyword Batch)

Why required: If a crawler accumulates all pages before pushing, a crash at page 9 of 10 loses everything. Uploading per-page means each page's data is safe before moving to the next.

What this means concretely:

After each page fetch: POST /api/v1/universal/data/batch-store-async with that page's data
Do not accumulate across pages in memory
Report page progress to backend (report_page_progress())
spiderJobs/runner/loop.py does this correctly — jobs_spider/boss/boos_api.py accumulates and pushes at the end
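
A minimal sketch of the per-page loop described above. The three callables are placeholders for the real fetcher, the batch-store call, and report_page_progress(); their signatures here are assumptions, not the ones in spiderJobs/runner/loop.py.

```python
from typing import Callable


def crawl_keyword(
    fetch_page: Callable[[int], list],
    upload_batch: Callable[[list], None],
    report_page_progress: Callable[[int], None],
    start_page: int,
    max_pages: int,
) -> int:
    """Upload each page as soon as it is fetched; return the pages completed."""
    pages_done = 0
    for page in range(start_page, max_pages + 1):
        jobs = fetch_page(page)
        if not jobs:
            break  # an empty page means the keyword is exhausted
        upload_batch(jobs)          # page data is safe before the next fetch
        report_page_progress(page)  # backend records the checkpoint
        pages_done += 1
    return pages_done
```

If fetch_page raises on page 3, pages 1 and 2 have already been uploaded and reported, so only the failing page is lost.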

Complexity: Low — structural change to push loop.

Sub-feature	Status	Notes
Per-page upload	Missing in `jobs_spider/boss/`	`spiderJobs/runner/` has it
Per-page progress reporting	Missing in `jobs_spider/`	`spiderJobs/runner/` has it

4. Dedup at Ingestion (ClickHouse Query Before Insert)

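Why required: Without dedup, repeated crawls of the same keyword accumulate duplicate job records, corrupting analytics counts. ClickHouse's merge-time deduplication (for example in ReplacingMergeTree tables) is eventual: duplicates remain visible until a background merge runs, so it is not sufficient for real-time correctness.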

What this means concretely:

batch_dedup_filter() queries ClickHouse for existing keys before inserting
Dedup key per platform: Boss = job_id; QCWY = (job_id, update_date_time); Zhilian = (number, first_publish_time)
Dedup happens in IngestService, not in the crawler — crawler just pushes raw data
All three platforms must route through the same IngestService (currently QCWY has separate logic)
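
The per-platform keys above can be expressed as a small lookup plus a filter. This sketch is not the actual batch_dedup_filter() signature: filter_new_records is a hypothetical name, and existing_keys stands in for the set of keys that the real implementation would fetch from ClickHouse before inserting.

```python
from typing import Iterable

# Dedup key columns per platform, as listed above
DEDUP_KEYS = {
    "boss": ("job_id",),
    "qcwy": ("job_id", "update_date_time"),
    "zhilian": ("number", "first_publish_time"),
}


def filter_new_records(platform: str, records: Iterable[dict], existing_keys: set) -> list:
    """Drop records whose key is already stored or repeated within the batch."""
    key_fields = DEDUP_KEYS[platform]
    seen = set(existing_keys)
    fresh = []
    for rec in records:
        key = tuple(rec[name] for name in key_fields)
        if key in seen:
            continue
        seen.add(key)  # also guards against duplicates inside one batch
        fresh.append(rec)
    return fresh
```

Keeping the key definition in one table means adding a platform is a one-line change and crawlers never need to know the key columns.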

Complexity: Low — IngestService + PlatformConfig registry already works; just need all platforms to use it.

Sub-feature	Status	Notes
`batch_dedup_filter()`	Exists and works	Supports 1 or 2 key columns
Boss dedup via registry	Exists	Working
QCWY dedup via registry	Partial	Separate code path; needs unification
Zhilian dedup via registry	Partial	Same issue

5. Retry with Exponential Backoff

Why required: Platform APIs are unstable — rate limits, transient 5xx errors, proxy failures. Without retry, a single flaky response aborts an entire crawl session.

What this means concretely:

Retry on: network timeout, requests.exceptions.RequestException, HTTP 5xx, platform business errors that are transient (not bans)
Do NOT retry on: HTTP 403/429 IP ban signals (switch proxy instead), HTTP 401 token expired (refresh token instead)
Max retries: 3, with exponential backoff between attempts (base delays of 2s, 4s, 8s, plus random jitter)
After max retries: log error, report status=failed to backend, continue to next keyword (do not crash)
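
The retry policy above might look like the following wrapper. It is a sketch under assumptions: TransientError and BanSignal are hypothetical exception types that the HTTP client would raise after classifying a response, the jitter range of 0 to 0.5s is invented for the example, and "3 retries" is read as three retries after the initial attempt.

```python
import random
import time
from typing import Callable


class TransientError(Exception):
    """Timeouts, HTTP 5xx, transient business errors: worth retrying."""


class BanSignal(Exception):
    """HTTP 403/429 or similar: switch proxy instead of retrying."""


def backoff_delays(max_retries: int = 3, base: float = 2.0,
                   jitter: float = 0.5, rng: Callable[[], float] = random.random) -> list:
    """Delays of base * 2**i seconds (2s, 4s, 8s by default) plus random jitter."""
    return [base * (2 ** i) + rng() * jitter for i in range(max_retries)]


def call_with_retry(func: Callable, max_retries: int = 3, base: float = 2.0,
                    sleep: Callable[[float], None] = time.sleep):
    """Retry func on TransientError only; any other exception propagates at once."""
    delays = backoff_delays(max_retries, base)
    for attempt in range(max_retries + 1):
        try:
            return func()
        except TransientError:
            if attempt == max_retries:
                raise  # caller logs, reports status=failed, moves to next keyword
            sleep(delays[attempt])
```

BanSignal is deliberately not caught, so a ban reaches the proxy-switching logic immediately instead of burning retries on a blocked IP.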

Current state: No structured retry in either jobs_spider/ or app/services/crawler/. Failures cause immediate abort.

Complexity: Medium — needs a retry decorator/wrapper in _http_client.py.

Sub-feature	Status	Notes
Retry on transient errors	Missing	Critical gap
Exponential backoff	Missing
Distinguish retryable vs ban	Missing	IPAnomalyDetector detects bans but retry is not wired

6. IP/Proxy Management with Ban Detection

Why required: Chinese recruitment platforms aggressively detect and block scraper IPs. Without automated ban detection and proxy switching, crawlers die silently and collect no data.

What this means concretely:

IPAnomalyDetector checks HTTP 403/429/407, slow response (>5s), business code=35, message containing "IP地址存在异常"
On ban: switch proxy immediately (not after retry), rebuild requests.Session, log ban event
SmartIPManager handles: tunnel proxy → proxy pool round-robin → local direct connection with cooldown
Proxy health state is in-memory per crawler process (no shared state needed — each ECS node manages its own pool)
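
The detection rules listed above can be written as a single classifier. This is an illustrative sketch, not the IPAnomalyDetector code: the function name and the response-body field names code and message are assumptions.

```python
from typing import Optional

BAN_STATUS = {403, 407, 429}
BAN_BUSINESS_CODE = 35
BAN_MESSAGE = "IP地址存在异常"
SLOW_SECONDS = 5.0


def classify_response(status: int, elapsed: float, body: dict) -> Optional[str]:
    """Return a short anomaly label, or None when the response looks healthy."""
    if status in BAN_STATUS:
        return "http_%d" % status
    if body.get("code") == BAN_BUSINESS_CODE:
        return "business_code_35"
    if BAN_MESSAGE in str(body.get("message", "")):
        return "ban_message"
    if elapsed > SLOW_SECONDS:
        return "slow_response"
    return None
```

A non-None label would trigger the proxy switch and session rebuild; returning a label rather than a boolean also gives the log entry its error_type.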

Current state: EXISTS and works in jobs_spider/boss/boos_api.py. NOT present in qcwy or zhilian crawlers. Must be shared via base class.

Complexity: Low — logic exists, needs to be in shared module not duplicated.

Sub-feature	Status	Notes
`IPAnomalyDetector`	Exists in boss + antispider.py	Duplicated — must consolidate
`SmartIPManager`	Exists in boss	Missing from qcwy/zhilian
Session rebuild on ban	Exists in boss	Missing from qcwy/zhilian

7. Token Lifecycle Management (Boss直聘)

Why required: Boss直聘 uses MPT tokens (WeChat mini-program login tokens) that expire. Without automatic refresh, crawlers stop collecting data silently.

What this means concretely:

GET /api/v1/token/tokens returns available active Boss tokens
On token-expired response (business code indicating auth failure): mark token as inactive, fetch next available token
If no tokens available: log error, wait, retry — do not crash
Token management is Boss-specific; QCWY and Zhilian use cookie/session auth
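
A sketch of the rotation behavior described above. TokenPool is a hypothetical helper, and fetch_tokens stands in for a call to GET /api/v1/token/tokens; in the real system the backend, not the crawler, would be the authority on which tokens are inactive.

```python
from collections import deque
from typing import Callable, Optional


class TokenPool:
    """Hands out one active token at a time and skips tokens seen to expire."""

    def __init__(self, fetch_tokens: Callable[[], list]):
        self._fetch_tokens = fetch_tokens  # stand-in for the token API call
        self._active = deque()
        self._expired = set()

    def current(self) -> Optional[str]:
        """Return the token in use, refilling from the backend when empty."""
        if not self._active:
            self._active.extend(
                t for t in self._fetch_tokens() if t not in self._expired
            )
        return self._active[0] if self._active else None

    def mark_expired(self, token: str) -> None:
        """Record an auth failure so the token is not handed out again."""
        self._expired.add(token)
        if token in self._active:
            self._active.remove(token)
```

When current() returns None the caller would log the empty pool, wait, and call again rather than exit.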

Current state: Exists in jobs_spider/boss/boos_api.py. Needs to be in the shared Boss client module.

Complexity: Low — already implemented, just needs to stay in shared code.

8. Structured Error Logging per Crawl Session

Why required: Without structured logs, diagnosing why data volume dropped on a given day is impossible. Operators cannot tell if the issue is IP bans, token expiry, keyword exhaustion, or platform API changes.

What this means concretely:

loguru rotating daily log files per platform, 30-day retention (already exists)
Each log entry must include: platform, keyword (city+job), page, error_type, proxy_used, http_status
Log at: session start, each page success/failure, proxy switch, token switch, crawl completion
Errors forwarded to backend ScheduledTaskRun table for dashboard visibility
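
The required fields can be pinned down with a small context object so that no call site forgets one. The sketch below uses only the standard library and emits JSON; CrawlLogContext and format_event are hypothetical names. With loguru, the same fields could instead be attached through logger.bind(...).

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class CrawlLogContext:
    """Fields every crawl log entry must carry."""
    platform: str
    keyword: str  # city+job
    page: int
    error_type: Optional[str] = None
    proxy_used: Optional[str] = None
    http_status: Optional[int] = None


def format_event(event: str, ctx: CrawlLogContext) -> str:
    """Serialize one log event with a stable key order."""
    payload = {"event": event}
    payload.update(asdict(ctx))
    return json.dumps(payload, ensure_ascii=False, sort_keys=True)
```

Fields that do not apply to an event stay present as null, which keeps downstream filters simple.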

Current state: Logging exists but is inconsistent — boss uses loguru, qcwy/zhilian use print(). Must standardize.

Complexity: Low — configuration change + replacing print() with logger.

Differentiators

Features that make the system operationally robust and reduce manual intervention. Not strictly required for data collection, but essential for unattended production operation.

D1. Crawl Health Dashboard (Frontend)

Value: Operators can see at a glance: which platforms are crawling, how many jobs were ingested today, which IPs are active/banned, whether company cleaning is keeping up with job volume.

What this means concretely:

Expand existing monitor.vue to cover job crawling (not just company cleaning)
Per-platform: jobs ingested today (from IpUploadStats), keywords completed/pending/failed, active crawler IPs
Alert indicators: zero ingestion in last 30 min, all keywords exhausted, token pool empty
Auto-refresh every 30s (already exists for company cleaning panel)
Backend API: /api/v1/analytics/crawler-health aggregating from IpUploadStats + keyword tables

Complexity: Medium — requires new backend analytics endpoint + frontend component.

D2. Platform-Specific Anti-Detection Profiles

Value: Each platform has different detection mechanisms. A unified anti-detection config per platform reduces manual tuning when bans increase.

What this means concretely:

Per-platform config dataclass: min_delay, max_delay, max_pages_per_keyword, user_agent_rotation, cookie_fields_to_preserve
Boss: must preserve __zp_sseed__, __zp_sname__, __zp_sts__ cookies (anti-bot challenge)
QCWY: standard cookie session, no special handling
Zhilian: custom header signing (already in _zhilian_sign.py)
Config loaded from environment variables, not hardcoded
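
One possible shape for the per-platform config dataclass. The environment variable names (for example BOSS_MIN_DELAY), the default of 10 pages, and the load_profile helper are assumptions for illustration; only the field names and the 10 to 20s delay range come from this document.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class AntiDetectionProfile:
    min_delay: float
    max_delay: float
    max_pages_per_keyword: int
    user_agent_rotation: bool
    cookie_fields_to_preserve: tuple


def load_profile(platform: str, env: Mapping[str, str] = os.environ) -> AntiDetectionProfile:
    """Build a profile from <PLATFORM>_* variables (hypothetical naming)."""
    prefix = platform.upper() + "_"

    def get(name: str, default: str) -> str:
        return env.get(prefix + name, default)

    profile = AntiDetectionProfile(
        min_delay=float(get("MIN_DELAY", "10")),
        max_delay=float(get("MAX_DELAY", "20")),
        max_pages_per_keyword=int(get("MAX_PAGES", "10")),  # assumed default
        user_agent_rotation=get("UA_ROTATION", "1").lower() in ("1", "true", "yes"),
        cookie_fields_to_preserve=tuple(
            name.strip() for name in get("PRESERVE_COOKIES", "").split(",") if name.strip()
        ),
    )
    if profile.min_delay > profile.max_delay:
        raise ValueError("min_delay must not exceed max_delay")
    return profile
```

For Boss, setting BOSS_PRESERVE_COOKIES to the three __zp_ cookie names would carry the anti-bot challenge cookies without hardcoding them in the crawler.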

Current state: Delays are configurable via env vars. Cookie handling and platform-specific behavior are hardcoded in each crawler.

Complexity: Low-Medium — config dataclass refactor.

D3. Company Inline Crawl During Job Search

Value: Crawling company details in the same session as job search reduces total requests, shares the same authenticated session/proxy, and keeps company data fresh relative to job postings.

What this means concretely:

After each page of job results: extract unique company IDs from jobs, fetch company details for new ones
Session-level seen_companies set prevents re-fetching same company in same run
Falls back gracefully if company fetch fails — job data is not blocked on company data
Already implemented in spiderJobs/runner/loop.py as _crawl_companies_from_jobs() — this is the reference
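
The inline company step can be sketched as below. This is not the body of _crawl_companies_from_jobs(): the company_id field name and the fetch_company callable are assumptions, and marking a company as seen even when its fetch fails is a design choice made here to avoid hammering a failing endpoint within one run.

```python
from typing import Callable


def crawl_companies_from_jobs(jobs: list, seen_companies: set,
                              fetch_company: Callable[[str], dict]) -> list:
    """Fetch details for companies not yet seen in this session."""
    fetched = []
    for job in jobs:
        company_id = job.get("company_id")  # field name is an assumption
        if company_id is None or company_id in seen_companies:
            continue
        seen_companies.add(company_id)
        try:
            fetched.append(fetch_company(company_id))
        except Exception:
            # Company data is best effort; job data must not be blocked on it.
            continue
    return fetched
```

The caller keeps one seen_companies set for the whole run and passes it in for every page.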

Current state: Implemented in spiderJobs/runner/. Missing from jobs_spider/.

Complexity: Low — pattern exists, needs to be the standard in unified loop.

D4. Keyword Progress Persistence and Resume

Value: ECS instances can be terminated mid-crawl (spot instance preemption). Without persisted progress, the entire keyword must restart. With last_completed_page, the crawler resumes from where it stopped.

What this means concretely:

Backend stores last_completed_page per keyword (already in keyword model)
Crawler reads start_page = last_completed_page + 1 from keyword response
On each successful page: report_page_progress(keyword_id, page) updates backend
On spot instance termination: next invocation continues from last checkpoint
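
The resume rule reduces to a small calculation on the keyword response. In this sketch the response is assumed to be a dict that may omit last_completed_page or carry it as null for a keyword that has never been crawled; the helper names are hypothetical.

```python
def start_page_for(keyword: dict) -> int:
    """First page to fetch: one past the last page the backend confirmed."""
    return int(keyword.get("last_completed_page") or 0) + 1


def pages_to_crawl(keyword: dict, max_pages: int) -> range:
    """Pages still outstanding for this keyword (empty when already done)."""
    return range(start_page_for(keyword), max_pages + 1)
```

For a node terminated after page 8 of 10 was reported, the next invocation receives last_completed_page = 8 and crawls only pages 9 and 10.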

Current state: spiderJobs/runner/loop.py fully implements this. jobs_spider/ does not.

Complexity: Low — spiderJobs/runner/ is the reference implementation.

D5. Scheduler Task Run History + Alerting

Value: When company cleaning fails silently or a scheduled task stops running (due to a lock deadlock), operators need visibility without watching logs.

What this means concretely:

ScheduledTaskRun table already records task runs with success/failure status
Email alert when a task fails N consecutive times or has not run within 2x its expected interval
ip_alert_job (every 10 min) already does IP anomaly alerting via SMTP — extend this pattern
Frontend: scheduler health section in monitor dashboard
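
The two alert conditions can be checked with a pure function over recent run history. This is a sketch: should_alert is a hypothetical name, run_results stands in for success flags read from ScheduledTaskRun (oldest first), and N = 3 is an assumed default since the document leaves N open.

```python
from datetime import datetime, timedelta
from typing import Optional


def should_alert(run_results: list, last_run_at: datetime,
                 expected_interval: timedelta, now: datetime,
                 max_consecutive_failures: int = 3) -> Optional[str]:
    """Return the alert reason, or None when the task looks healthy."""
    recent = run_results[-max_consecutive_failures:]
    if len(recent) == max_consecutive_failures and not any(recent):
        return "consecutive_failures"
    if now - last_run_at > 2 * expected_interval:
        return "stalled"  # e.g. a lock deadlock stopped the schedule
    return None
```

The stalled check is what catches a task that has stopped running altogether, which a failure counter alone would never see.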

Current state: ScheduledTaskRun model + _record_task_run() exists. Email alerting exists for IP anomalies. Not wired for task-level failures.

Complexity: Medium — extend existing alerting pattern.

D6. Unified Crawl Result Reporting API

Value: Crawlers that report completion status enable automatic keyword queue management — completed keywords can be reset for the next crawl cycle, failed ones flagged for retry.

What this means concretely:

POST /api/v1/keyword/crawl-complete with {keyword_id, status, pages_completed, jobs_found, error_message}
Backend updates keyword state: completed → reset to pending after 24h; failed → increment failure count
Already partially designed in spiderJobs/runner/api_client.py
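
The backend side of this endpoint could apply the report as sketched below. KeywordState and both helper functions are hypothetical; only the payload's status field, the failure counter, and the 24h reset come from the description above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

RESET_AFTER = timedelta(hours=24)


@dataclass
class KeywordState:
    status: str = "in_progress"
    failure_count: int = 0
    completed_at: Optional[datetime] = None


def apply_crawl_complete(state: KeywordState, payload: dict, now: datetime) -> KeywordState:
    """Update keyword state from a crawl-complete report."""
    status = payload["status"]
    if status == "completed":
        state.status = "completed"
        state.completed_at = now
    elif status == "failed":
        state.status = "failed"
        state.failure_count += 1
    else:
        raise ValueError("unknown status: %r" % (status,))
    return state


def reset_if_due(state: KeywordState, now: datetime) -> KeywordState:
    """Return a completed keyword to pending once 24h have passed."""
    if (state.status == "completed" and state.completed_at is not None
            and now - state.completed_at >= RESET_AFTER):
        state.status = "pending"
        state.completed_at = None
    return state
```

reset_if_due would run from a periodic job, so crawlers never decide on their own when a keyword becomes eligible again.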

Current state: report_crawl_complete() exists in spiderJobs/runner/api_client.py. Not wired to keyword state machine in backend.

Complexity: Medium — backend endpoint + keyword state machine.

Anti-Features

Things to deliberately NOT build. These appear useful but create more problems than they solve in this context.

AF1. Shared Crawl State via Redis/Database

What it looks like: Storing proxy health, ban status, session cookies in Redis so multiple ECS nodes share state.

Why avoid: Adds Redis as a hard dependency for crawlers. ECS spot instances are ephemeral — shared state becomes stale or corrupted when nodes die mid-operation. Each crawler process managing its own in-memory state is simpler, more reliable, and eliminates inter-node coordination. The existing SmartIPManager in-memory approach is correct.

What to do instead: Each ECS node gets its own proxy pool slice. Node-level state stays in-process. Only keyword assignment (which is already serialized via the backend API) needs cross-node coordination.

AF2. Playwright / Browser Automation as Default

What it looks like: Using headless Chrome via Playwright for all platform requests to bypass detection.

Why avoid: Playwright adds 3–5x resource overhead per request. Boss直聘's WeChat mini-program API and QCWY/Zhilian's mobile APIs are accessible via direct HTTP with correct headers and TLS fingerprints (requests_go + TLS_CHROME_LATEST). Browser automation is only justified when the platform serves JS-rendered content with no underlying API — which is not the case for any of the three target platforms. Playwright is already listed as a dependency but should be a last-resort escape hatch, not the default.

What to do instead: Keep requests_go with TLS_CHROME_LATEST and random JA3 as the standard HTTP client. Use Playwright only for specific cookie acquisition flows if needed.

AF3. Real-time Data Processing / Stream Pipeline

What it looks like: Adding Kafka/Flink/Redis Streams to process job data as it arrives.

Why avoid: The current volume (thousands of jobs per day) does not justify stream infrastructure. ClickHouse handles batch inserts efficiently. The 5-minute company cleaning scheduler + async ingest pipeline covers all real-time requirements. Adding a streaming layer multiplies operational complexity with no data quality benefit at this scale.

What to do instead: Keep the batch-insert-with-dedup pattern. Optimize batch sizes if needed (current single-batch-per-page is correct).

AF4. ML-Based Anti-Detection (Behavioral Fingerprinting)

What it looks like: Training ML models to randomize mouse movement patterns, request timing distributions, or user session behavior to mimic human browsing.

Why avoid: The three target platforms are mobile/mini-program APIs, not web browser interfaces. Behavioral fingerprinting only matters for full browser sessions. The current approach (random delays 10–20s, TLS fingerprint spoofing, IP rotation) addresses the actual detection vectors. ML-based solutions add training/maintenance cost with no benefit against the actual detection mechanisms used.

What to do instead: Tune the existing IPStrategyConfig parameters (delay range, failure thresholds) per platform via environment variables.

AF5. Crawler-Side Data Normalization / Schema Transformation

What it looks like: Having crawlers parse, clean, and normalize job data fields (salary parsing, city name normalization, skill extraction) before pushing to the backend.

Why avoid: Normalization logic changes frequently as platforms update their response schemas. If crawler-side normalization breaks, the raw data cannot be recovered because it was never stored. Storing raw JSON and normalizing downstream (in ClickHouse views or a separate cleaning service) preserves the original data for re-processing. The existing CleaningService covers this correctly.

What to do instead: Crawlers push raw JSON. Backend's IngestService stores raw JSON in json_data column. ClickHouse views and CleaningService handle field extraction and normalization.

AF6. Separate Crawler Orchestration Database

What it looks like: A dedicated database (separate from MySQL/ClickHouse) to manage crawler state, job queues, worker registrations.

Why avoid: The existing keyword table in MySQL + IpUploadStats + ScheduledTaskRun already provide sufficient crawler state management. Adding a third database (e.g., MongoDB for crawler state) fragments operational concerns and adds another system to maintain. The backend API is the correct coordination point for ECS crawler nodes.

What to do instead: Extend the keyword model with the fields needed for state tracking (last_completed_page, status, failure_count). The existing pattern is sufficient.

Feature Dependencies

Unified Base Classes (1) → IP/Proxy Management (6) [proxy logic in shared _http_client.py]
Unified Base Classes (1) → Retry with Backoff (5) [retry wrapper in shared _http_client.py]
Unified Base Classes (1) → Token Lifecycle (7) [token client in shared boss module]
Keyword Dispatch (2) → Per-Page Upload (3) [keyword_id needed for page progress reporting]
Keyword Dispatch (2) → Keyword Progress Resume (D4) [last_completed_page field]
Per-Page Upload (3) → Dedup at Ingestion (4) [smaller batches = faster dedup queries]
IP/Proxy Management (6) → Crawl Health Dashboard (D1) [IpUploadStats feeds dashboard]
Keyword Dispatch (2) → Unified Crawl Reporting API (D6) [keyword state machine]
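
Reading each arrow above as "the left feature must exist before the right one", a valid build order can be derived with the standard library's graphlib (Python 3.9+). The short feature ids are the section numbers used in this document; the mapping is a direct transcription of the eight edges.

```python
from graphlib import TopologicalSorter

# feature -> features it depends on (right side of each arrow -> left side)
DEPENDS_ON = {
    "6": {"1"},   # IP/Proxy Management needs Unified Base Classes
    "5": {"1"},   # Retry with Backoff needs Unified Base Classes
    "7": {"1"},   # Token Lifecycle needs Unified Base Classes
    "3": {"2"},   # Per-Page Upload needs Keyword Dispatch
    "D4": {"2"},  # Keyword Progress Resume needs Keyword Dispatch
    "4": {"3"},   # Dedup at Ingestion benefits from Per-Page Upload
    "D1": {"6"},  # Crawl Health Dashboard needs IP/Proxy Management
    "D6": {"2"},  # Unified Crawl Reporting API needs Keyword Dispatch
}


def build_order() -> list:
    """Return the features in an order that respects every dependency."""
    return list(TopologicalSorter(DEPENDS_ON).static_order())
```

Several orders are valid; what matters is that features 1 and 2 have no prerequisites and therefore come before everything that depends on them.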

MVP Recommendation for This Refactor

The refactor's highest-leverage work, in priority order:

Phase 1: Foundation (unblocks everything else)

1. Unified base classes: canonicalize _base.py, _http_client.py, sign modules into a single importable package
2. SmartIPManager + IPAnomalyDetector moved to shared module, used by all three platforms
3. Standardized logging (loguru) in all crawlers; replace print() calls

Phase 2: Data reliability

4. Per-page upload loop (following spiderJobs/runner/loop.py pattern) for all three platforms
5. Retry with backoff in _http_client.py
6. All platforms route through IngestService + PlatformConfig registry for dedup

Phase 3: Operational visibility

7. Crawl completion reporting API + keyword state machine
8. Crawler health dashboard extension (job ingestion metrics, keyword status)

Defer:

Platform anti-detection profiles (D2): low urgency, current env-var approach is adequate
Scheduler task alerting (D5): existing SMTP infra works; wire task failures as follow-up
Company inline crawl (D3): correct pattern exists in spiderJobs/runner/; include if refactoring that module

Sources

Codebase analysis: .planning/codebase/ARCHITECTURE.md (2026-03-21)
app/services/crawler/_base.py — BaseFetcher/BaseSearcher template method pattern
app/services/crawler/_http_client.py — HTTPClient, proxy priority logic, TLS fingerprint
app/core/algorithms/antispider.py — IPStrategyConfig, IPAnomalyDetector, SmartIPManager
app/services/ingest/dedup.py — batch_dedup_filter, 1/2 column key support
app/services/ingest/service.py — IngestService.store_batch() pipeline
app/services/ingest/registry.py — PlatformConfig registry, DedupFieldSpec
spiderJobs/runner/loop.py — run_crawl_loop(), per-page upload, breakpoint resume, company inline crawl
jobs_spider/boss/boos_api.py (lines 1–100) — SmartIPManager usage, sleep_random_between(), token management
app/core/ip_tracking.py — IpTrackingMiddleware, IpUploadStats
app/core/locks.py — DistributedLock (Redis + file fallback)
.planning/PROJECT.md — refactor scope, constraints, key decisions
