JobData/.planning/research/FEATURES.md

# Feature Landscape

**Domain:** Multi-platform job data crawler framework (Boss直聘 / 前程无忧 / 智联招聘)
**Researched:** 2026-03-21
**Context:** Refactoring existing crawler system — not a greenfield build. Features are evaluated against what exists and what the refactor must deliver.

---

## Current State Baseline

The existing system already provides the capabilities below. Do not rebuild them:

| Capability | Implementation | Location |
|------------|---------------|----------|
| Keyword-driven crawl dispatch | `/api/v1/keyword/available` pull model | `app/api/v1/keyword/` |
| Batch data ingestion pipeline | `IngestService` + `PlatformConfig` registry | `app/services/ingest/` |
| ClickHouse dedup on insert | `batch_dedup_filter()` — 1 or 2 column key | `app/services/ingest/dedup.py` |
| Random delay anti-detection | `sleep_random_between()`, 10–20s enforced | `jobs_spider/boss/boos_api.py` |
| Proxy rotation + tunnel proxy | `SmartIPManager`, `HTTPClient` | `app/core/algorithms/antispider.py`, `_http_client.py` |
| IP anomaly detection | `IPAnomalyDetector` — 403/429/407, slow response, code=35 | `_http_client.py` |
| TLS fingerprint spoofing | `requests_go` with `TLS_CHROME_LATEST`, random JA3 | `_http_client.py` |
| Structured logging | `loguru` rotating daily files, 30-day retention | all modules except the qcwy/zhilian crawlers, which still use `print()` |
| Scheduled job framework | APScheduler, distributed file/Redis lock | `app/core/scheduler.py`, `locks.py` |
| IP upload tracking | `IpTrackingMiddleware` → `IpUploadStats` table | `app/core/ip_tracking.py` |
| Company cleaning pipeline | `CompanyCleaner` collect→process→store, every 5 min | `app/services/company_cleaner.py` |
| Frontend monitoring (partial) | `monitor.vue` — pending queue, today processed, completion rate | `web/src/views/cleaning/monitor.vue` |
| Unified crawl loop (partial) | `spiderJobs/runner/loop.py` — page progress reporting | `spiderJobs/runner/` |
| BaseFetcher / BaseSearcher | Template method base classes | `app/services/crawler/_base.py` |

**Critical gap:** Two parallel crawler frameworks exist (`jobs_spider/` + `spiderJobs/`) with divergent code. The base classes in `app/services/crawler/` are partially implemented but not used by `jobs_spider/`. Dedup logic is duplicated in both frameworks.

---

## Table Stakes

Features the system cannot function correctly without. Missing or broken = data pipeline breaks.

### 1. Unified Platform Base Classes

**Why required:** Without a single `BasePlatformClient` that `jobs_spider/`, `app/services/crawler/`, and `spiderJobs/` all import from, every bug fix or anti-detection update must be applied in three places, and the copies are guaranteed to diverge.

**What this means concretely:**
- Single `_base.py` in a shared package (not copied between directories)
- `BaseFetcher`, `BaseSearcher`, `ApiResult`, `parse_response` defined once
- Each platform implements `_build_params()`, `_parse()` — nothing else
- Sign/auth modules (`_boss_sign.py`, `_zhilian_sign.py`, etc.) imported from the shared package
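
The split of responsibilities above can be sketched as a template method. This is a minimal illustration, not the contents of `_base.py`: the `ApiResult` fields, the `send` hook standing in for the shared HTTP client, and `DemoSearcher` are all assumptions made for the example.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class ApiResult:
    """Parsed outcome of one platform request (fields are illustrative)."""
    ok: bool
    items: list[dict[str, Any]] = field(default_factory=list)
    error: Optional[str] = None


class BaseSearcher(ABC):
    """Template method: the base class owns the request flow."""

    def __init__(self, send: Callable[[dict[str, Any]], dict[str, Any]]):
        # `send` is a hypothetical hook standing in for the shared HTTP client.
        self._send = send

    def search(self, keyword: str, page: int) -> ApiResult:
        params = self._build_params(keyword, page)
        try:
            raw = self._send(params)
        except Exception as exc:  # transport errors become a failed result
            return ApiResult(ok=False, error=str(exc))
        return self._parse(raw)

    @abstractmethod
    def _build_params(self, keyword: str, page: int) -> dict[str, Any]:
        """Platform-specific request parameters."""

    @abstractmethod
    def _parse(self, raw: dict[str, Any]) -> ApiResult:
        """Platform-specific response parsing."""


class DemoSearcher(BaseSearcher):
    """Hypothetical platform subclass: only params and parsing differ."""

    def _build_params(self, keyword: str, page: int) -> dict[str, Any]:
        return {"query": keyword, "page": page}

    def _parse(self, raw: dict[str, Any]) -> ApiResult:
        return ApiResult(ok=True, items=raw.get("jobs", []))
```

With this shape, a fix to error handling in `search()` reaches every platform at once, while each subclass stays limited to the two hooks.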

**Complexity:** Medium — requires restructuring import paths, not rewriting logic.

| Sub-feature | Status | Notes |
|-------------|--------|-------|
| Single base class file | Partial — `app/services/crawler/_base.py` exists but is a copy, not the source of truth | Must become canonical |
| Shared sign modules | Partial — `_boss_sign.py` etc. exist in `app/services/crawler/` | `jobs_spider/` uses its own copies |
| `spiderJobs/` consolidation | Missing — `spiderJobs/core/base.py` is yet another copy | Must be eliminated or unified |

---

### 2. Keyword-Driven Dispatch with Crawl State Tracking

**Why required:** Without atomic keyword assignment with state tracking, multiple ECS nodes will crawl the same keyword simultaneously, producing massive duplicates and wasting API quota.

**What this means concretely:**
- `GET /api/v1/keyword/available?reserve=True` atomically marks a keyword as in-progress
- Keyword state machine: `pending → in_progress → completed / failed`
- `last_completed_page` persisted so interrupted crawls can resume from breakpoint
- Crawlers report `status=completed|failed` + page count on finish
- The `spiderJobs/runner/loop.py` already implements `report_crawl_complete()` and `report_page_progress()` — this is the model to follow
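
A small in-memory model of the state machine above may help when reviewing the backend logic. It is a sketch under assumptions: `KeywordQueue` is a hypothetical stand-in for the keyword table, and a `threading.Lock` plays the role that an atomic database update would play in the real backend.

```python
import threading
from enum import Enum
from typing import Iterable, Optional


class KeywordStatus(str, Enum):
    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"
    FAILED = "failed"


# Allowed transitions, taken from the state machine described above.
ALLOWED = {
    KeywordStatus.PENDING: {KeywordStatus.IN_PROGRESS},
    KeywordStatus.IN_PROGRESS: {KeywordStatus.COMPLETED, KeywordStatus.FAILED},
    KeywordStatus.COMPLETED: set(),
    KeywordStatus.FAILED: set(),
}


def transition(current: KeywordStatus, new: KeywordStatus) -> KeywordStatus:
    """Return the new status, or raise if the move is not allowed."""
    if new not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {new.value}")
    return new


class KeywordQueue:
    """Hypothetical in-memory stand-in for the keyword table."""

    def __init__(self, keywords: Iterable[str]):
        self._lock = threading.Lock()
        self._status = {k: KeywordStatus.PENDING for k in keywords}

    def reserve(self) -> Optional[str]:
        """Atomically hand out one pending keyword, or None if none remain."""
        with self._lock:
            for keyword, status in self._status.items():
                if status is KeywordStatus.PENDING:
                    self._status[keyword] = transition(
                        status, KeywordStatus.IN_PROGRESS
                    )
                    return keyword
        return None

    def finish(self, keyword: str, ok: bool) -> None:
        """Record the crawl outcome reported by the crawler."""
        target = KeywordStatus.COMPLETED if ok else KeywordStatus.FAILED
        with self._lock:
            self._status[keyword] = transition(self._status[keyword], target)

    def status(self, keyword: str) -> KeywordStatus:
        with self._lock:
            return self._status[keyword]
```

Because the check and the status change happen under one lock, two callers can never reserve the same keyword, which is the property the `reserve=True` endpoint needs to provide across ECS nodes.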

**Complexity:** Low — pattern already exists in `spiderJobs/runner/`, needs to be the standard.

| Sub-feature | Status | Notes |
|-------------|--------|-------|
| Atomic keyword assignment | Exists in `jobs_spider/boss/boos_api.py` | Not in qcwy/zhilian |
| Crawl completion reporting | Exists in `spiderJobs/runner/loop.py` | Not in `jobs_spider/` |
| Breakpoint resume (last_completed_page) | Exists in `spiderJobs/runner/loop.py` | Not in `jobs_spider/` |

---

### 3. Per-Page Batch Upload (Not End-of-Keyword Batch)

**Why required:** If a crawler accumulates all pages before pushing, a crash at page 9 of 10 loses everything. Uploading per-page means each page's data is safe before moving to the next.

**What this means concretely:**
- After each page fetch: `POST /api/v1/universal/data/batch-store-async` with that page's data
- Do not accumulate across pages in memory
- Report page progress to backend (`report_page_progress()`)
- `spiderJobs/runner/loop.py` does this correctly — `jobs_spider/boss/boos_api.py` accumulates and pushes at the end
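
The difference between the two approaches is mostly the position of the upload call. The sketch below shows the per-page shape with injected callables; `crawl_keyword` and its parameters are hypothetical names, and the callables stand in for the real fetch, the batch-store request, and `report_page_progress()`.

```python
from typing import Any, Callable

Page = list[dict[str, Any]]


def crawl_keyword(
    fetch_page: Callable[[int], Page],
    upload: Callable[[Page], None],
    report_progress: Callable[[int], None],
    start_page: int = 1,
    max_pages: int = 10,
) -> int:
    """Upload each page as soon as it is fetched; return pages completed."""
    completed = 0
    for page in range(start_page, max_pages + 1):
        jobs = fetch_page(page)
        if not jobs:
            break  # an empty page is treated as the end of results
        upload(jobs)  # the page is stored before the next fetch starts
        report_progress(page)  # checkpoint only after a successful upload
        completed += 1
    return completed
```

If the fetch for page 9 raises, pages 1 to 8 have already been uploaded and checkpointed, so nothing is held only in memory.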

**Complexity:** Low — structural change to push loop.

| Sub-feature | Status | Notes |
|-------------|--------|-------|
| Per-page upload | Missing in `jobs_spider/boss/` | `spiderJobs/runner/` has it |
| Per-page progress reporting | Missing in `jobs_spider/` | `spiderJobs/runner/` has it |

---

### 4. Dedup at Ingestion (ClickHouse Query Before Insert)

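**Why required:** Without dedup, repeated crawls of the same keyword accumulate duplicate job records, corrupting analytics counts. ClickHouse's merge-time deduplication (for example, the `ReplacingMergeTree` engine) is eventual: duplicates are removed only when background merges run, so it is not sufficient for real-time correctness.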

**What this means concretely:**
- `batch_dedup_filter()` queries ClickHouse for existing keys before inserting
- Dedup key per platform: Boss = `job_id`; QCWY = `(job_id, update_date_time)`; Zhilian = `(number, first_publish_time)`
- Dedup happens in `IngestService`, not in the crawler — crawler just pushes raw data
- All three platforms must route through the same `IngestService` (currently QCWY has separate logic)
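
The per-platform keys above can be expressed as data, with one filter shared by all platforms. This is a sketch, not the actual `batch_dedup_filter()`: the registry keys (`boss`, `qcwy`, `zhilian`), the function name, and the in-batch duplicate handling are assumptions, and `existing_keys` stands in for the result of the ClickHouse lookup.

```python
from typing import Any, Iterable

# Dedup key columns per platform, as listed above.
DEDUP_KEYS: dict[str, tuple[str, ...]] = {
    "boss": ("job_id",),
    "qcwy": ("job_id", "update_date_time"),
    "zhilian": ("number", "first_publish_time"),
}


def filter_new_rows(
    platform: str,
    rows: Iterable[dict[str, Any]],
    existing_keys: Iterable[tuple],
) -> list[dict[str, Any]]:
    """Keep only rows whose dedup key is not already stored."""
    columns = DEDUP_KEYS[platform]
    seen = set(existing_keys)
    fresh = []
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key in seen:
            continue
        seen.add(key)  # assumption: also drop duplicates within one batch
        fresh.append(row)
    return fresh
```

Keeping the key definition in one table means adding a platform is a one-line change, and the crawler never needs to know which columns form the key.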

**Complexity:** Low — `IngestService` + `PlatformConfig` registry already works; just need all platforms to use it.

| Sub-feature | Status | Notes |
|-------------|--------|-------|
| `batch_dedup_filter()` | Exists and works | Supports 1 or 2 key columns |
| Boss dedup via registry | Exists | Working |
| QCWY dedup via registry | Partial | Separate code path; needs unification |
| Zhilian dedup via registry | Partial | Same issue |

---

### 5. Retry with Exponential Backoff

**Why required:** Platform APIs are unstable — rate limits, transient 5xx errors, proxy failures. Without retry, a single flaky response aborts an entire crawl session.

**What this means concretely:**
- Retry on: network timeout, `requests.exceptions.RequestException`, HTTP 5xx, platform business errors that are transient (not bans)
- Do NOT retry on: HTTP 403/429 IP ban signals (switch proxy instead), HTTP 401 token expired (refresh token instead)
- Max retries: 3, with exponential backoff (2s, 4s, 8s) plus random jitter
- After max retries: log error, report `status=failed` to backend, continue to next keyword (do not crash)
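
One way to wire these rules together is shown below. It is a sketch under assumptions: the three exception classes are hypothetical names for the error categories in the list, the schedule is read as three retries after the first attempt, and the 0.5s jitter range is an illustrative choice.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Timeouts, HTTP 5xx, transient business errors: worth retrying."""


class BanSignal(Exception):
    """HTTP 403/429: switch proxy instead of retrying."""


class TokenExpired(Exception):
    """HTTP 401: refresh the token instead of retrying."""


def backoff_delay(retry: int, base: float = 2.0, jitter: float = 0.5) -> float:
    """Delay before retry number `retry` (1-based): about 2s, 4s, 8s."""
    return base * 2 ** (retry - 1) + random.uniform(0, jitter)


def call_with_retry(
    fn: Callable[[], T],
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry only TransientError; ban and token errors propagate at once."""
    retries = 0
    while True:
        try:
            return fn()
        except TransientError:
            if retries == max_retries:
                raise  # caller logs, reports status=failed, moves on
            retries += 1
            sleep(backoff_delay(retries))
```

`BanSignal` and `TokenExpired` are deliberately not caught, so the caller can switch proxy or refresh the token without waiting through a backoff. Passing `sleep` as a parameter keeps the wrapper testable without real delays.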

**Current state:** No structured retry in either `jobs_spider/` or `app/services/crawler/`. Failures cause immediate abort.

**Complexity:** Medium — needs a retry decorator/wrapper in `_http_client.py`.

| Sub-feature | Status | Notes |
|-------------|--------|-------|
| Retry on transient errors | Missing | Critical gap |
| Exponential backoff | Missing | |
| Distinguish retryable vs ban | Missing | IPAnomalyDetector detects bans but retry is not wired |

---

### 6. IP/Proxy Management with Ban Detection

**Why required:** Chinese recruitment platforms aggressively detect and block scraper IPs. Without automated ban detection and proxy switching, crawlers die silently and collect no data.

**What this means concretely:**
- `IPAnomalyDetector` checks HTTP 403/429/407, slow response (>5s), business code=35, message containing "IP地址存在异常"
- On ban: switch proxy immediately (not after retry), rebuild `requests.Session`, log ban event
- `SmartIPManager` handles: tunnel proxy → proxy pool round-robin → local direct connection with cooldown
- Proxy health state is in-memory per crawler process (no shared state needed — each ECS node manages its own pool)
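
The detection rules above reduce to a pure function over a response summary. The sketch below is not the actual `IPAnomalyDetector`: `ResponseInfo`, `ban_reason`, and the returned reason strings are hypothetical, and only the thresholds and signals come from the list.

```python
from dataclasses import dataclass
from typing import Optional

BAN_STATUS = {403, 429, 407}
BAN_CODE = 35
BAN_MESSAGE = "IP地址存在异常"
SLOW_SECONDS = 5.0


@dataclass
class ResponseInfo:
    """Minimal summary of one response (illustrative fields)."""
    http_status: int
    elapsed: float  # seconds
    code: Optional[int] = None  # platform business code, if any
    message: str = ""


def ban_reason(resp: ResponseInfo) -> Optional[str]:
    """Return why the response looks like a ban, or None if it looks fine."""
    if resp.http_status in BAN_STATUS:
        return f"http_{resp.http_status}"
    if resp.code == BAN_CODE:
        return "business_code_35"
    if BAN_MESSAGE in resp.message:
        return "ban_message"
    if resp.elapsed > SLOW_SECONDS:
        return "slow_response"
    return None
```

Returning a reason rather than a boolean gives the ban log entry something specific to record before the proxy is switched.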

**Current state:** Exists and works in `jobs_spider/boss/boos_api.py`. Not present in the qcwy or zhilian crawlers. Must be shared via the base class.

**Complexity:** Low — logic exists, needs to be in shared module not duplicated.

| Sub-feature | Status | Notes |
|-------------|--------|-------|
| `IPAnomalyDetector` | Exists in boss + antispider.py | Duplicated — must consolidate |
| `SmartIPManager` | Exists in boss | Missing from qcwy/zhilian |
| Session rebuild on ban | Exists in boss | Missing from qcwy/zhilian |

---

### 7. Token Lifecycle Management (Boss直聘)

**Why required:** Boss直聘 uses MPT tokens (WeChat mini-program login tokens) that expire. Without automatic refresh, crawlers stop collecting data silently.

**What this means concretely:**
- `GET /api/v1/token/tokens` returns available active Boss tokens
- On token-expired response (business code indicating auth failure): mark token as inactive, fetch next available token
- If no tokens available: log error, wait, retry — do not crash
- Token management is Boss-specific; QCWY and Zhilian use cookie/session auth
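
The rotation behavior can be sketched as a small pool object. `TokenPool` and its callbacks are hypothetical: `fetch_tokens` stands in for the call to `/api/v1/token/tokens`, and `on_expired` for whatever marks a token inactive in the backend.

```python
from collections import deque
from typing import Callable, Iterable, Optional


class TokenPool:
    """Hands out one active token at a time and drops expired ones."""

    def __init__(
        self,
        fetch_tokens: Callable[[], Iterable[str]],
        on_expired: Callable[[str], None] = lambda token: None,
    ):
        self._fetch = fetch_tokens
        self._on_expired = on_expired
        self._active = deque()

    def current(self) -> Optional[str]:
        """Return a usable token, refilling from the backend when empty."""
        if not self._active:
            self._active.extend(self._fetch())
        return self._active[0] if self._active else None

    def mark_expired(self, token: str) -> None:
        """Drop a token after an auth-failure response and report it."""
        if token in self._active:
            self._active.remove(token)
        self._on_expired(token)
```

When `current()` returns `None`, the caller is expected to log, wait, and try again rather than exit, matching the "do not crash" rule above.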

**Current state:** Exists in `jobs_spider/boss/boos_api.py`. Needs to be in the shared Boss client module.

**Complexity:** Low — already implemented, just needs to stay in shared code.

---

### 8. Structured Error Logging per Crawl Session

**Why required:** Without structured logs, diagnosing why data volume dropped on a given day is impossible. Operators cannot tell if the issue is IP bans, token expiry, keyword exhaustion, or platform API changes.

**What this means concretely:**
- `loguru` rotating daily log files per platform, 30-day retention (already exists)
- Each log entry must include: platform, keyword (city+job), page, error_type, proxy_used, http_status
- Log at: session start, each page success/failure, proxy switch, token switch, crawl completion
- Errors forwarded to backend `ScheduledTaskRun` table for dashboard visibility
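
Fixing the required fields in one type makes it hard for a crawler to log an incomplete entry. The sketch below uses only the standard library; `CrawlLogContext` and `format_line` are hypothetical names, and the field list comes from the bullets above.

```python
from dataclasses import asdict, dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class CrawlLogContext:
    """Fields every crawl log entry should carry."""
    platform: str
    keyword: str  # city+job
    page: int
    error_type: Optional[str] = None
    proxy_used: Optional[str] = None
    http_status: Optional[int] = None

    def as_fields(self) -> dict[str, Any]:
        return asdict(self)


def format_line(event: str, ctx: CrawlLogContext) -> str:
    """Render one log line as `event key=value ...` in field order."""
    pairs = " ".join(f"{k}={v}" for k, v in ctx.as_fields().items())
    return f"{event} {pairs}"
```

With `loguru`, the same dictionary could be attached through `logger.bind(**ctx.as_fields())`, so the fields appear in each record's `extra` data instead of being formatted by hand.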

**Current state:** Logging exists but is inconsistent — boss uses loguru, qcwy/zhilian use `print()`. Must standardize.

**Complexity:** Low — configuration change + replacing `print()` with `logger`.

---

## Differentiators

Features that make the system operationally robust and reduce manual intervention. Not strictly required for data collection, but essential for unattended production operation.

### D1. Crawl Health Dashboard (Frontend)

**Value:** Operators can see at a glance: which platforms are crawling, how many jobs were ingested today, which IPs are active/banned, whether company cleaning is keeping up with job volume.

**What this means concretely:**
- Expand existing `monitor.vue` to cover job crawling (not just company cleaning)
- Per-platform: jobs ingested today (from `IpUploadStats`), keywords completed/pending/failed, active crawler IPs
- Alert indicators: zero ingestion in last 30 min, all keywords exhausted, token pool empty
- Auto-refresh every 30s (already exists for company cleaning panel)
- Backend API: `/api/v1/analytics/crawler-health` aggregating from `IpUploadStats` + `keyword` tables

**Complexity:** Medium — requires new backend analytics endpoint + frontend component.

---

### D2. Platform-Specific Anti-Detection Profiles

**Value:** Each platform has different detection mechanisms. A unified anti-detection config per platform reduces manual tuning when bans increase.

**What this means concretely:**
- Per-platform config dataclass: `min_delay`, `max_delay`, `max_pages_per_keyword`, `user_agent_rotation`, `cookie_fields_to_preserve`
- Boss: must preserve `__zp_sseed__`, `__zp_sname__`, `__zp_sts__` cookies (anti-bot challenge)
- QCWY: standard cookie session, no special handling
- Zhilian: custom header signing (already in `_zhilian_sign.py`)
- Config loaded from environment variables, not hardcoded
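
A possible shape for the config dataclass is sketched below. The field names come from the list above; the environment variable names (`BOSS_MIN_DELAY` and so on), the `load_profile` helper, and the default of 10 pages are assumptions for illustration. The 10 to 20 second delay default mirrors the range the existing Boss crawler enforces.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class AntiDetectionProfile:
    min_delay: float
    max_delay: float
    max_pages_per_keyword: int
    user_agent_rotation: bool
    cookie_fields_to_preserve: tuple[str, ...]


def load_profile(
    platform: str, env: Mapping[str, str] = os.environ
) -> AntiDetectionProfile:
    """Build a profile from `<PLATFORM>_*` variables (hypothetical names)."""
    prefix = f"{platform.upper()}_"
    cookies = env.get(prefix + "PRESERVE_COOKIES", "")
    profile = AntiDetectionProfile(
        min_delay=float(env.get(prefix + "MIN_DELAY", "10")),
        max_delay=float(env.get(prefix + "MAX_DELAY", "20")),
        max_pages_per_keyword=int(env.get(prefix + "MAX_PAGES", "10")),
        user_agent_rotation=env.get(prefix + "UA_ROTATION", "1") == "1",
        cookie_fields_to_preserve=tuple(c for c in cookies.split(",") if c),
    )
    if profile.min_delay > profile.max_delay:
        raise ValueError(f"{platform}: min_delay exceeds max_delay")
    return profile
```

For Boss, setting `BOSS_PRESERVE_COOKIES=__zp_sseed__,__zp_sname__,__zp_sts__` would carry the anti-bot cookies listed above, while QCWY can leave the variable unset.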

**Current state:** Delays are configurable via env vars. Cookie handling and platform-specific behavior are hardcoded in each crawler.

**Complexity:** Low-Medium — config dataclass refactor.

---

### D3. Company Inline Crawl During Job Search

**Value:** Crawling company details in the same session as job search reduces total requests, shares the same authenticated session/proxy, and keeps company data fresh relative to job postings.

**What this means concretely:**
- After each page of job results: extract unique company IDs from jobs, fetch company details for new ones
- Session-level `seen_companies` set prevents re-fetching same company in same run
- Falls back gracefully if company fetch fails — job data is not blocked on company data
- Already implemented in `spiderJobs/runner/loop.py` as `_crawl_companies_from_jobs()` — this is the reference
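
The behavior described above fits in a short helper. This is a sketch and not the body of `_crawl_companies_from_jobs()`: the function name, the `company_id` field, and the choice to mark a company as seen before fetching it are assumptions.

```python
from typing import Any, Callable, Iterable


def crawl_companies_for_page(
    jobs: Iterable[dict[str, Any]],
    seen_companies: set,
    fetch_company: Callable[[str], dict[str, Any]],
    company_id_field: str = "company_id",
) -> list[dict[str, Any]]:
    """Fetch details for companies not yet seen in this run."""
    fetched = []
    for job in jobs:
        company_id = job.get(company_id_field)
        if not company_id or company_id in seen_companies:
            continue
        # Marked before the fetch, so a failing company is not retried
        # on every later page of the same run.
        seen_companies.add(company_id)
        try:
            fetched.append(fetch_company(company_id))
        except Exception:
            continue  # job data is never blocked on company data
    return fetched
```

The caller keeps one `seen_companies` set for the whole session and passes it in for each page, so a company that appears on several pages is fetched once.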

**Current state:** Implemented in `spiderJobs/runner/`. Missing from `jobs_spider/`.

**Complexity:** Low — pattern exists, needs to be the standard in unified loop.

---

### D4. Keyword Progress Persistence and Resume

**Value:** ECS instances can be terminated mid-crawl (spot instance preemption). Without persisted progress, the entire keyword must restart. With `last_completed_page`, the crawler resumes from where it stopped.

**What this means concretely:**
- Backend stores `last_completed_page` per keyword (already in keyword model)
- Crawler reads `start_page = last_completed_page + 1` from keyword response
- On each successful page: `report_page_progress(keyword_id, page)` updates backend
- On spot instance termination: next invocation continues from last checkpoint
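
The resume rule can be checked with a tiny simulation. `FakeBackend` and `run_once` are hypothetical test doubles, not the real API client; they only model the `last_completed_page` checkpoint and a preemption after a given number of pages.

```python
from typing import Optional


class FakeBackend:
    """In-memory stand-in for the keyword model's checkpoint field."""

    def __init__(self) -> None:
        self.last_completed_page: dict[str, int] = {}

    def start_page(self, keyword_id: str) -> int:
        return self.last_completed_page.get(keyword_id, 0) + 1

    def report_page_progress(self, keyword_id: str, page: int) -> None:
        # Only move the checkpoint forward, never backward.
        previous = self.last_completed_page.get(keyword_id, 0)
        self.last_completed_page[keyword_id] = max(previous, page)


def run_once(
    backend: FakeBackend,
    keyword_id: str,
    total_pages: int,
    die_after: Optional[int] = None,
) -> list[int]:
    """Crawl from the checkpoint; stop early to simulate preemption."""
    crawled = []
    for page in range(backend.start_page(keyword_id), total_pages + 1):
        if die_after is not None and len(crawled) == die_after:
            return crawled  # instance terminated mid-crawl
        crawled.append(page)
        backend.report_page_progress(keyword_id, page)
    return crawled
```

A first run interrupted after three pages followed by a second run covers each page exactly once, which is the property the checkpoint is meant to guarantee.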

**Current state:** `spiderJobs/runner/loop.py` fully implements this. `jobs_spider/` does not.

**Complexity:** Low — `spiderJobs/runner/` is the reference implementation.

---

### D5. Scheduler Task Run History + Alerting

**Value:** When company cleaning fails silently or a scheduled task stops running (due to a lock deadlock), operators need visibility without watching logs.

**What this means concretely:**
- `ScheduledTaskRun` table already records task runs with success/failure status
- Email alert when a task fails N consecutive times, or has not run within 2× its expected interval
- `ip_alert_job` (every 10 min) already does IP anomaly alerting via SMTP — extend this pattern
- Frontend: scheduler health section in monitor dashboard
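
Both alert conditions can be evaluated from a task's recent run history. The sketch below is an assumption-laden illustration: `should_alert` is a hypothetical helper, the threshold of 3 stands in for N, and the inputs would in practice be derived from `ScheduledTaskRun` rows.

```python
from datetime import datetime, timedelta
from typing import Optional, Sequence


def should_alert(
    recent_success: Sequence[bool],
    last_run_at: Optional[datetime],
    expected_interval: timedelta,
    now: datetime,
    max_consecutive_failures: int = 3,
) -> Optional[str]:
    """Return an alert reason, or None when the task looks healthy.

    `recent_success` lists run outcomes oldest first.
    """
    streak = 0
    for ok in reversed(recent_success):
        if ok:
            break
        streak += 1
    if streak >= max_consecutive_failures:
        return "consecutive_failures"
    if last_run_at is None or now - last_run_at > 2 * expected_interval:
        return "stale"
    return None
```

The staleness check is what catches a lock deadlock: a task that never starts records no failures, so only the missing run reveals the problem.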

**Current state:** The `ScheduledTaskRun` model and `_record_task_run()` exist. Email alerting exists for IP anomalies but is not wired for task-level failures.

**Complexity:** Medium — extend existing alerting pattern.

---

### D6. Unified Crawl Result Reporting API

**Value:** Crawlers that report completion status enable automatic keyword queue management — completed keywords can be reset for the next crawl cycle, failed ones flagged for retry.

**What this means concretely:**
- `POST /api/v1/keyword/crawl-complete` with `{keyword_id, status, pages_completed, jobs_found, error_message}`
- Backend updates keyword state: `completed` → reset to `pending` after 24h; `failed` → increment failure count
- Already partially designed in `spiderJobs/runner/api_client.py`
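
The backend side of this endpoint comes down to two small state updates. The sketch below models them on a plain dataclass; `KeywordRecord`, `apply_crawl_complete`, and `reset_if_due` are hypothetical names, and persistence and request validation are left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class KeywordRecord:
    """Illustrative subset of the keyword model."""
    status: str = "in_progress"
    failure_count: int = 0
    completed_at: Optional[datetime] = None


def apply_crawl_complete(
    kw: KeywordRecord, status: str, now: datetime
) -> KeywordRecord:
    """Apply the status reported by a crawler to the keyword record."""
    if status == "completed":
        kw.status = "completed"
        kw.completed_at = now
    elif status == "failed":
        kw.status = "failed"
        kw.failure_count += 1
    else:
        raise ValueError(f"unknown status: {status}")
    return kw


def reset_if_due(
    kw: KeywordRecord, now: datetime, cooldown: timedelta = timedelta(hours=24)
) -> bool:
    """Return a completed keyword to pending once the cooldown has passed."""
    if (
        kw.status == "completed"
        and kw.completed_at is not None
        and now - kw.completed_at >= cooldown
    ):
        kw.status = "pending"
        return True
    return False
```

`reset_if_due` could run from a scheduled job over completed keywords, which keeps the crawl cycle going without manual resets.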

**Current state:** `report_crawl_complete()` exists in `spiderJobs/runner/api_client.py`. Not wired to keyword state machine in backend.

**Complexity:** Medium — backend endpoint + keyword state machine.

---

## Anti-Features

Things to deliberately NOT build. These appear useful but create more problems than they solve in this context.

### AF1. Shared Crawl State via Redis/Database

**What it looks like:** Storing proxy health, ban status, session cookies in Redis so multiple ECS nodes share state.

**Why avoid:** Adds Redis as a hard dependency for crawlers. ECS spot instances are ephemeral — shared state becomes stale or corrupted when nodes die mid-operation. Each crawler process managing its own in-memory state is simpler, more reliable, and eliminates inter-node coordination. The existing `SmartIPManager` in-memory approach is correct.

**What to do instead:** Each ECS node gets its own proxy pool slice. Node-level state stays in-process. Only keyword assignment (which is already serialized via the backend API) needs cross-node coordination.

---

### AF2. Playwright / Browser Automation as Default

**What it looks like:** Using headless Chrome via Playwright for all platform requests to bypass detection.

**Why avoid:** Playwright adds 3–5x resource overhead per request. Boss直聘's WeChat mini-program API and QCWY/Zhilian's mobile APIs are accessible via direct HTTP with correct headers and TLS fingerprints (`requests_go` + `TLS_CHROME_LATEST`). Browser automation is only justified when the platform serves JS-rendered content with no underlying API — which is not the case for any of the three target platforms. Playwright is already listed as a dependency but should be a last-resort escape hatch, not the default.

**What to do instead:** Keep `requests_go` with `TLS_CHROME_LATEST` and random JA3 as the standard HTTP client. Use Playwright only for specific cookie acquisition flows if needed.

---

### AF3. Real-time Data Processing / Stream Pipeline

**What it looks like:** Adding Kafka/Flink/Redis Streams to process job data as it arrives.

**Why avoid:** The current volume (thousands of jobs per day) does not justify stream infrastructure. ClickHouse handles batch inserts efficiently. The 5-minute company cleaning scheduler + async ingest pipeline covers all real-time requirements. Adding a streaming layer multiplies operational complexity with no data quality benefit at this scale.

**What to do instead:** Keep the batch-insert-with-dedup pattern. Optimize batch sizes if needed (current single-batch-per-page is correct).

---

### AF4. ML-Based Anti-Detection (Behavioral Fingerprinting)

**What it looks like:** Training ML models to randomize mouse movement patterns, request timing distributions, or user session behavior to mimic human browsing.

**Why avoid:** The three target platforms are mobile/mini-program APIs, not web browser interfaces. Behavioral fingerprinting only matters for full browser sessions. The current approach (random delays 10–20s, TLS fingerprint spoofing, IP rotation) addresses the actual detection vectors. ML-based solutions add training/maintenance cost with no benefit against the actual detection mechanisms used.

**What to do instead:** Tune the existing `IPStrategyConfig` parameters (delay range, failure thresholds) per platform via environment variables.

---

### AF5. Crawler-Side Data Normalization / Schema Transformation

**What it looks like:** Having crawlers parse, clean, and normalize job data fields (salary parsing, city name normalization, skill extraction) before pushing to the backend.

**Why avoid:** Normalization logic changes frequently as platforms update their response schemas. If crawlers normalize before pushing and the normalization breaks, the raw data is unrecoverable because it was never stored. Storing raw JSON and normalizing downstream (in ClickHouse views or a separate cleaning service) preserves the original data for re-processing. The existing `CleaningService` covers this correctly.

**What to do instead:** Crawlers push raw JSON. Backend's `IngestService` stores raw JSON in `json_data` column. ClickHouse views and `CleaningService` handle field extraction and normalization.

---

### AF6. Separate Crawler Orchestration Database

**What it looks like:** A dedicated database (separate from MySQL/ClickHouse) to manage crawler state, job queues, worker registrations.

**Why avoid:** The existing keyword table in MySQL + `IpUploadStats` + `ScheduledTaskRun` already provide sufficient crawler state management. Adding a third database (e.g., MongoDB for crawler state) fragments operational concerns and adds another system to maintain. The backend API is the correct coordination point for ECS crawler nodes.

**What to do instead:** Extend the keyword model with the fields needed for state tracking (`last_completed_page`, `status`, `failure_count`). The existing pattern is sufficient.

---

## Feature Dependencies

```
Unified Base Classes (1) → IP/Proxy Management (6) [proxy logic in shared _http_client.py]
Unified Base Classes (1) → Retry with Backoff (5) [retry wrapper in shared _http_client.py]
Unified Base Classes (1) → Token Lifecycle (7) [token client in shared boss module]
Keyword Dispatch (2) → Per-Page Upload (3) [keyword_id needed for page progress reporting]
Keyword Dispatch (2) → Keyword Progress Resume (D4) [last_completed_page field]
Per-Page Upload (3) → Dedup at Ingestion (4) [smaller batches = faster dedup queries]
IP/Proxy Management (6) → Crawl Health Dashboard (D1) [IpUploadStats feeds dashboard]
Keyword Dispatch (2) → Unified Crawl Reporting API (D6) [keyword state machine]
```
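
The dependency list can also be encoded as a graph to derive a valid build order mechanically. This sketch uses the standard library's `graphlib`; the short node labels (`"1"`, `"D4"`, and so on) are shorthand for the feature numbers above.

```python
from graphlib import TopologicalSorter

# Each feature maps to the set of features it depends on.
DEPENDS_ON = {
    "6": {"1"},   # IP/Proxy Management needs Unified Base Classes
    "5": {"1"},   # Retry with Backoff
    "7": {"1"},   # Token Lifecycle
    "3": {"2"},   # Per-Page Upload needs Keyword Dispatch
    "D4": {"2"},  # Keyword Progress Resume
    "4": {"3"},   # Dedup at Ingestion
    "D1": {"6"},  # Crawl Health Dashboard
    "D6": {"2"},  # Unified Crawl Reporting API
}

# static_order() yields every feature after all of its prerequisites.
BUILD_ORDER = list(TopologicalSorter(DEPENDS_ON).static_order())
```

Any order produced this way places features 1 and 2 before everything that depends on them. A cycle introduced by a future edit would raise `graphlib.CycleError`.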

---

## MVP Recommendation for This Refactor

The refactor's highest-leverage work, in priority order:

**Phase 1 — Foundation (unblock everything else):**
1. Unified base classes — canonicalize `_base.py`, `_http_client.py`, sign modules into a single importable package
2. `SmartIPManager` + `IPAnomalyDetector` moved to shared module, used by all three platforms
3. Standardized logging (`loguru`) in all crawlers — replace `print()` calls

**Phase 2 — Data reliability:**
4. Per-page upload loop (following `spiderJobs/runner/loop.py` pattern) for all three platforms
5. Retry with backoff in `_http_client.py`
6. All platforms route through `IngestService` + `PlatformConfig` registry for dedup

**Phase 3 — Operational visibility:**
7. Crawl completion reporting API + keyword state machine
8. Crawler health dashboard extension (job ingestion metrics, keyword status)

**Defer:**
- Platform anti-detection profiles (D2): low urgency, current env-var approach is adequate
- Scheduler task alerting (D5): existing SMTP infra works; wire task failures as follow-up
- Company inline crawl (D3): correct pattern exists in `spiderJobs/runner/`; include if refactoring that module

---

## Sources

- Codebase analysis: `.planning/codebase/ARCHITECTURE.md` (2026-03-21)
- `app/services/crawler/_base.py` — BaseFetcher/BaseSearcher template method pattern
- `app/services/crawler/_http_client.py` — HTTPClient, proxy priority logic, TLS fingerprint
- `app/core/algorithms/antispider.py` — IPStrategyConfig, IPAnomalyDetector, SmartIPManager
- `app/services/ingest/dedup.py` — batch_dedup_filter, 1/2 column key support
- `app/services/ingest/service.py` — IngestService.store_batch() pipeline
- `app/services/ingest/registry.py` — PlatformConfig registry, DedupFieldSpec
- `spiderJobs/runner/loop.py` — run_crawl_loop(), per-page upload, breakpoint resume, company inline crawl
- `jobs_spider/boss/boos_api.py` (lines 1–100) — SmartIPManager usage, sleep_random_between(), token management
- `app/core/ip_tracking.py` — IpTrackingMiddleware, IpUploadStats
- `app/core/locks.py` — DistributedLock (Redis + file fallback)
- `.planning/PROJECT.md` — refactor scope, constraints, key decisions