# Domain Pitfalls: Recruitment Data Crawler Refactoring

**Domain:** Chinese recruitment platform crawler refactoring (Boss直聘 / 前程无忧 / 智联招聘)
**Researched:** 2026-03-21
**Confidence:** HIGH (all findings are grounded in direct codebase evidence from this repository)

---

## Critical Pitfalls

Mistakes that cause rewrites, data loss, or crawler shutdown.

---

### Pitfall 1: Breaking the Running Crawler While Refactoring

**What goes wrong:** You refactor `jobs_spider/boss/boos_api.py` or `app/services/crawler/` in place. The new base class has a subtly different interface (parameter order, return type, exception handling). The old code path that was working for production data collection silently stops collecting, or starts pushing malformed data into ClickHouse.

**Why it happens:** This codebase already has three parallel spider implementations (`jobs_spider/`, `spiderJobs/`, `app/services/crawler/`) that evolved independently. The temptation during refactoring is to "just migrate" the live code directly. But the live scripts (`jobs_spider/boss/boos_api.py`) run as standalone ECS processes with a persistent loop, and there is no deployment gate. Any change to the shared module takes effect on the next ECS restart or local re-run, with no rollback.
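
Since a broken run can go unnoticed for hours, one cheap guard is a volume check at the end of each keyword cycle. The sketch below is a hypothetical helper, not code from this repository: the names `is_volume_anomaly`, `baseline`, and `min_ratio`, and the 0.2 threshold, are assumptions chosen for illustration.

```python
def is_volume_anomaly(count: int, baseline: int, min_ratio: float = 0.2) -> bool:
    """Return True when a keyword cycle yields suspiciously few records.

    count     -- records collected in the current cycle
    baseline  -- typical record count for the same keyword (for example 50)
    min_ratio -- assumed threshold: fewer than baseline * min_ratio is anomalous
    """
    if baseline <= 0:
        # No history for this keyword, so there is nothing to compare against.
        return False
    return count < baseline * min_ratio
```

A caller could log an error or raise when this returns `True` instead of reporting the cycle as successful.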

**Consequences:**
- Zero data ingested for the duration of the broken crawler run (hours or days on ECS)
- No test coverage currently means no automated detection of breakage
- ClickHouse dedup logic will not surface "nothing came in" as an error

**Prevention:**
- Always keep the old code path alive and parallel until the new one has ingested at least one full keyword sweep successfully
- Use the feature-flag pattern: add a `USE_NEW_CRAWLER=0/1` env variable; default to old code; validate new code side-by-side before switching default
- Write a data-volume canary check: if a keyword cycle produces 0 records when it historically produces 50+, alert immediately rather than silently succeeding

**Detection (warning signs):**
- `push_job()` returns success but ClickHouse row count for today is lower than yesterday's baseline
- Crawler log shows "抓取完成" ("crawl complete") with 0 records for keywords that reliably returned results before
- `batch_dedup_filter` ignored count is unexpectedly 100% (all records already existed, possibly because empty batches are silently accepted)

**Phase:** Address in the very first refactoring phase, before touching any shared path. Establish the parallel-run validation pattern before any code changes.

---

### Pitfall 2: Copying Sign/Algorithm Code Instead of Sharing It

**What goes wrong:** The sign algorithm (`_compute_checksum`, `_generate_uuid`, `BossSign`) is already duplicated verbatim between `spiderJobs/platforms/boss/sign.py` and `app/services/crawler/_boss_sign.py`. The comment in `_boss_sign.py` says "复制自 spiderJobs/platforms/boss/sign.py" ("copied from spiderJobs/platforms/boss/sign.py"). This copy-paste pattern will be repeated for the new unified base: if the shared core lives in `app/`, the `jobs_spider/` standalone scripts cannot import it without adding the app to `sys.path` at runtime, which is fragile. The solution chosen in the current code, "copy the file with a comment", is not sharing; it is duplication with extra steps.

**Why it happens:** Python's import system makes cross-directory sharing feel difficult when one consumer is an app package and the other is a standalone script. The easy path is copy-paste, which maintains independence but recreates the bug-propagation problem.

**Consequences:**
- A future Boss anti-bot update requires changing the sign algorithm. The fix must be applied in 2+ places. One place is missed. One spider silently generates invalid traceid headers and gets 403s, but since there is no test, this is only discovered when data stops appearing.
- The same problem will recur for the HTTP client, IP manager, and data parser

**Prevention:**
- Extract shared crawler core into a proper installable package (`crawler_core/`) or a well-defined subdirectory with a single source of truth
- The `jobs_spider/` scripts should `sys.path.insert(0, repo_root)` and import from the shared package: one authoritative source
- Alternatively: keep `app/services/crawler/` as the canonical location, and have `jobs_spider/` scripts import directly from `app.services.crawler.*` after setting `PYTHONPATH`
- Do NOT use the "复制自 X" ("copied from X") comment pattern. It documents a failure; it is not a solution.

**Detection:**
- Run `diff spiderJobs/platforms/boss/sign.py app/services/crawler/_boss_sign.py`. If they differ, a divergence has already occurred.
- Any comment containing "复制自" or "copy from" is a warning sign of duplicated code

**Phase:** Must be the foundational decision in Phase 1 before any platform-specific work. The sharing mechanism must be established first; all subsequent platform rewrites depend on it.

---

### Pitfall 3: Anti-Bot State Is Instance-Local, Not Process-Global

**What goes wrong:** The refactored `SmartIPManager`, token cache, and session objects are created fresh per-request when embedded in the FastAPI async service (`CleaningService`, `CompanyCleaner`).
Each call to `CleaningService` instantiates a new `BossService`, which creates a new `BossSign` and a new `HTTPClient` with a new session. The anti-bot state (IP failure counters, cooldown timers, session cookies) that Boss直聘 relies on to not trigger rate-limiting is discarded after each request.

**Why it happens:** The current `CleaningService.__init__` creates `self.boss_service = BossService()` per instance. `CleaningService` itself is instantiated per API request. This works for the standalone script (one long-lived process, one `SmartIPManager`) but the FastAPI service context has a completely different lifecycle.

**Consequences:**
- The standalone scripts maintain session state correctly (SmartIPManager accumulates failure counts, applies cooldowns)
- The embedded service makes "cold" requests every time, resetting the TLS fingerprint session and appearing as new clients, which is exactly what anti-bot systems profile
- The `__zp_sseed__` / `__zp_sname__` / `__zp_sts__` cookie state managed in `boos_api.py` (anti-bot response at code=37) exists only in the single-process script

**Prevention:**
- The shared core (IP manager, session, token cache) must be process-level singletons when used inside FastAPI, not per-request instances
- Use FastAPI's `lifespan` to initialize a single shared crawler context; inject it via dependency
- For the standalone scripts, the existing per-process instance model is correct and should not change

**Detection:**
- Boss returns `code=35` (IP banned) or `code=37` (anti-bot challenge) on the first request from the FastAPI service context
- High 403/429 rate from embedded cleaning service, while standalone scripts on the same IP work fine

**Phase:** Phase addressing the FastAPI embedded crawler service. Do not assume the standalone script's session management model transfers to async service context.

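
The process-level singleton described in the Prevention list can be sketched with a lifespan-style async context manager. This is a minimal illustration under assumptions: `CrawlerContext`, its fields, and `get_crawler` are hypothetical names, and the block deliberately avoids importing FastAPI so it stays self-contained. In a real service the function would be passed as `FastAPI(lifespan=lifespan)` and `get_crawler` would be injected with `Depends(get_crawler)`.

```python
import asyncio
from contextlib import asynccontextmanager
from dataclasses import dataclass, field


@dataclass
class CrawlerContext:
    """Hypothetical container for long-lived anti-bot state."""
    ip_failures: dict[str, int] = field(default_factory=dict)
    cookies: dict[str, str] = field(default_factory=dict)


@asynccontextmanager
async def lifespan(app):
    # Runs once per process: build the shared context at startup.
    app.state.crawler = CrawlerContext()
    try:
        yield
    finally:
        # Drop the shared context at shutdown.
        app.state.crawler = None


def get_crawler(request) -> CrawlerContext:
    # Every request receives the same object, so failure counters and
    # cookies accumulate across requests instead of being reset.
    return request.app.state.crawler
```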

---

### Pitfall 4: ClickHouse Dedup Scans the Full Table

**What goes wrong:** `batch_dedup_filter` in `app/services/ingest/dedup.py` issues `SELECT job_id FROM job_data.boss_job WHERE job_id IN (...)` with no time-bound filter. For `boss_job` after months of operation, this scans every row. The MergeTree `ORDER BY created_at` means `job_id` is not in the primary key, so the scan reads the entire column. At scale (millions of rows), this dedup check becomes the bottleneck and eventually times out, at which point the ingest service falls back to `check_duplicate=False` and silently inserts duplicates.

**Why it happens:** The current schema uses `ORDER BY created_at` for all tables (confirmed in `clickhouse_init.py`). Dedup columns (`job_id`, `number`, etc.) are secondary columns with no index. This was fine when tables were small.

**Consequences:**
- Ingest latency spikes as tables grow, eventually timing out
- Silent duplicate insertion when dedup times out and the batch is inserted anyway
- The `CONCERNS.md` already flags this, but the fix (adding `PREWHERE created_at > now() - INTERVAL N DAY`) cannot be done as a view change; it requires changing the dedup query logic

**Prevention:**
- Add a configurable `days_back` parameter (defaulting to 30) to all dedup queries: `WHERE job_id IN (...)
AND created_at >= now() - INTERVAL 30 DAY`
- This matches the existing `company_cleaner.py` `days_back=90` pattern already in the codebase
- During refactoring, add this as a mandatory parameter to `batch_dedup_filter`, not an optional afterthought
- Consider adding a secondary index (`INDEX idx_job_id job_id TYPE bloom_filter GRANULARITY 4`) in the DDL for tables where dedup is critical

**Detection:**
- ClickHouse query log shows `dedup_filter` queries scanning >1M rows with no `created_at` filter
- Ingest endpoint latency increases week-over-week as table grows
- `duplicate` count in ingest result drops to 0 even though the same crawler ran yesterday (dedup failing silently)

**Phase:** Phase 1 (dedup infrastructure). This is a data correctness issue, not a performance optimization. Fix before the refactored crawlers start pushing higher volumes.

---

## Moderate Pitfalls

### Pitfall 5: Sync `requests` Blocking the Async Event Loop

**What goes wrong:** The refactored crawler services use `requests` (sync) wrapped in `asyncio.to_thread()`. Under concurrent cleaning requests or batch company processing, the thread pool fills up. The `CleaningService` spawns 50 company cleanings in one scheduler tick, and each calls `to_thread(boss_service.get_company_detail_by_id)`. `asyncio.to_thread` runs on the event loop's default executor, whose size defaults to `min(32, os.cpu_count() + 4)` threads, so 20+ concurrent cleaning tasks can saturate the pool, and subsequent requests queue behind them.

**Why it happens:** The crawler libraries (`requests`, `requests_go`) are synchronous. Wrapping them with `to_thread` is the correct pattern, but the thread pool limit is invisible until load hits it.
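
One way to make that limit explicit is to cap how many `to_thread` calls are in flight at once. The following is a minimal sketch, not the project's implementation: `run_capped` and `max_concurrency` are hypothetical names, and `func` stands in for a blocking call such as `boss_service.get_company_detail_by_id`.

```python
import asyncio


async def run_capped(func, items, max_concurrency: int = 8):
    """Run blocking func(item) in worker threads, at most max_concurrency at a time.

    Results are returned in the same order as items.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(item):
        # Wait for a free slot before handing the call to the thread pool.
        async with semaphore:
            return await asyncio.to_thread(func, item)

    return await asyncio.gather(*(worker(item) for item in items))
```

Keeping `max_concurrency` well below the executor size leaves threads available for other `to_thread` callers in the same process.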

**Prevention:**
- Use `asyncio.Semaphore` to cap concurrency of `to_thread` calls within the cleaning scheduler; cap at `(thread_pool_size // 3)` to leave headroom for other async operations
- Alternatively, migrate to `httpx.AsyncClient` for the embedded service context (not the standalone scripts, which can stay synchronous)
- Set `UVICORN_WORKERS` and thread pool size explicitly; document the relationship

**Detection:**
- Event loop lag exceeds 100ms during scheduled cleaning runs
- `asyncio.to_thread` queue depth grows without bound during batch operations

**Phase:** Phase addressing `CompanyCleaner` and `CleaningService` concurrency. Low urgency until company processing batch size increases.

---

### Pitfall 6: ReplacingMergeTree Dedup Is Eventual, Not Immediate

**What goes wrong:** `pending_company` uses `ENGINE = ReplacingMergeTree(version)`. The assumption is that inserting the same `(source, company_id)` twice will result in one row. This is wrong at query time. ClickHouse only deduplicates during background merges, which happen asynchronously. Between insertion and the next merge, both rows exist and will appear in `SELECT`. The `CompanyCleaner.collect_pending_companies()` query that reads `pending_company` may return duplicate company IDs, causing duplicate processing and potentially duplicate entries in MySQL.

**Why it happens:** ReplacingMergeTree is commonly misunderstood as providing immediate deduplication. It does not. The documentation is clear, but the confusion is pervasive.

**Prevention:**
- All queries against `pending_company` must use `SELECT ...
FROM pending_company FINAL` to force merge semantics at query time
- Or: collapse duplicates explicitly with `GROUP BY source, company_id` and `argMax(<column>, version)` for each selected column (an aggregate such as `max(created_at)` cannot appear in a `WHERE` clause). `FINAL` is simpler and correct.
- The insert logic should also guard against inserting the same `(source, company_id)` in the same batch (pre-filter in application code)

**Detection:**
- `collect_pending_companies()` returns the same `company_id` twice in one batch
- MySQL company table gets duplicate entries from two `collect` runs in quick succession

**Phase:** Phase addressing company data pipeline. Must be fixed before increasing `collect_pending_companies` batch size.

---

### Pitfall 7: The Three-Framework Problem Causes Refactoring Scope Creep

**What goes wrong:** The project currently has `jobs_spider/`, `spiderJobs/`, and `app/services/crawler/`: three partially overlapping crawler implementations. During refactoring, there is a strong temptation to "also fix" the `spiderJobs/` framework, or to make the new base class compatible with all three. This expands the scope from "refactor three platforms" to "unify three frameworks," which is a completely different project.

**Why it happens:** `spiderJobs/` contains a cleaner architecture (`core/`, `platforms/`, `runner/`) that looks like the intended target. It is tempting to treat it as the destination rather than additional tech debt.

**Prevention:**
- Make a single explicit decision at the start: `spiderJobs/` is deleted or kept. Do not leave it in a half-migrated state.
- If deleted: do it in the first commit of the refactoring phase. An empty `spiderJobs/` with a `DEPRECATED.md` is fine. A half-migrated `spiderJobs/` that is 60% replaced is not.
- The refactoring target is the new `_base.py`, `_boss_api.py`, `_boss_client.py`, `_boss_sign.py` pattern already started in `app/services/crawler/`. Finish that pattern; don't restart with `spiderJobs/`.

**Detection:**
- A PR diff that touches `spiderJobs/` files during "platform rewrite" work is scope creep
- Story points/time estimate doubles from what was planned

**Phase:** Decision must be made and documented in Phase 1 before any platform work begins.

---

### Pitfall 8: Testing Crawlers by Hitting Real APIs

**What goes wrong:** Unit tests for crawler code make real HTTP requests to Boss直聘, 前程无忧, or 智联招聘. This causes: (a) tests fail in CI with no network access; (b) real request volume from CI triggers rate limiting or bans the test account's IP; (c) test results are non-deterministic (platform API changes break tests unexpectedly); (d) tests cannot run at all without valid tokens.

**Why it happens:** There are currently zero crawler tests. When tests are first written, the path of least resistance is to write integration tests against real endpoints. This "just works" locally but creates all of the above problems.

**Prevention:**
- All unit tests for crawler logic must use `unittest.mock.patch` or `pytest-mock` to mock the HTTP layer
- Test the layers independently: sign algorithm (pure function, no HTTP needed), response parser (provide fixture JSON, no HTTP needed), IP manager state machine (mock `requests.Session`)
- Maintain a `tests/fixtures/` directory with saved real API response JSON (anonymized if needed); test parsers against these fixtures, not live responses
- A single integration test suite (marked `@pytest.mark.integration`) can hit real APIs, but it should be opt-in and never run in CI

**Detection:**
- Test suite requires `API_BASE_URL`, `BOSS_TOKEN`, or network access to pass
- Tests fail with `ConnectionRefusedError` or `TimeoutError` in CI

**Phase:** Phase establishing the test infrastructure.
Write mocked tests first, before implementing the refactored code, following TDD discipline.

---

## Minor Pitfalls

### Pitfall 9: Log File Creation in CWD Breaks When CWD Changes

**What goes wrong:** Both `jobs_spider/boss/boos_api.py` and `jobs_spider/qcwy/qcwy.py` contain `os.makedirs("logs", exist_ok=True)` and `logger.add("logs/log_{time:YYYY-MM-DD}.log", ...)`. These create a `logs/` directory relative to the current working directory when the script is launched. If ECS scripts are run from a different directory (e.g., `/root` instead of `/app`), logs appear in an unexpected location, making debugging harder.

**Prevention:** Use `pathlib.Path(__file__).parent / "logs"` for script-relative log paths.

**Phase:** Minor cleanup, address during standalone script refactoring.

---

### Pitfall 10: Hardcoded Proxy URL Leaks Credentials

**What goes wrong:** `jobs_spider/qcwy/qcwy.py` line 31 contains `PROXY_URL = "http://<user>:<password>@<host>:<port>"` with real values: a live proxy credential (redacted here) hardcoded in source. This is committed to the repo.

**Prevention:** Remove immediately. Require `PROXY_URL` via environment variable with no default. Rotate the credential.

**Phase:** Pre-refactoring security fix. Do not include in any commit that goes to a shared remote.

---

### Pitfall 11: ClickHouse Schema Cannot Be Evolved Without Manual DDL

**What goes wrong:** All six data tables use `CREATE TABLE IF NOT EXISTS`. Adding a new column to the DDL (e.g., adding `province` to `boss_job`) does nothing on a production database where the table already exists. The refactored crawler may produce records with a new field that the ClickHouse schema does not have, causing insert failures or silent field drops.

**Prevention:** For each schema change, write an explicit `ALTER TABLE ... ADD COLUMN IF NOT EXISTS` statement in `clickhouse_init.py`'s upgrade path. Run this on startup using a version-tracking approach. Do not rely on `CREATE TABLE IF NOT EXISTS` for evolution.
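
A version-tracked upgrade path of this kind can be sketched as an ordered list of statements applied on startup. Everything here is an assumption for illustration: `MIGRATIONS`, `apply_migrations`, the `String DEFAULT ''` column type, and the `execute` callable (which stands in for whatever ClickHouse client call the project uses). Where the current version is stored is left to the caller.

```python
from typing import Callable, Iterable

# Hypothetical ordered migration list: (version, SQL statement).
MIGRATIONS: list[tuple[int, str]] = [
    (1, "ALTER TABLE job_data.boss_job "
        "ADD COLUMN IF NOT EXISTS province String DEFAULT ''"),
]


def apply_migrations(
    execute: Callable[[str], None],
    current_version: int,
    migrations: Iterable[tuple[int, str]] = MIGRATIONS,
) -> int:
    """Run every migration newer than current_version, in version order.

    Returns the version reached, which the caller should persist.
    """
    for version, sql in sorted(migrations):
        if version > current_version:
            execute(sql)
            current_version = version
    return current_version
```

Because each statement uses `IF NOT EXISTS`, re-running a migration after a partial failure is harmless.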

**Phase:** Any phase that changes what fields the crawler stores. Medium urgency: only becomes critical when schema changes are needed.

---

## Phase-Specific Warnings

| Phase Topic | Likely Pitfall | Mitigation |
|-------------|----------------|------------|
| Establish shared crawler core | Pitfall 2 (copy-paste instead of share) | Decide the import mechanism first; prove it works with one sign file before touching anything else |
| Rewrite Boss直聘 client | Pitfall 1 (break running crawler) | Keep old `boos_api.py` running; use env flag to switch; validate side-by-side |
| Rewrite Boss直聘 client | Pitfall 3 (anti-bot state lost) | Test the new client as a standalone script with a real keyword before embedding in FastAPI |
| ClickHouse dedup refactoring | Pitfall 4 (full table scan) | Add `days_back` parameter as non-optional; add integration test with time bounds |
| Company pipeline optimization | Pitfall 6 (ReplacingMergeTree eventual) | Add `FINAL` to all `pending_company` SELECT queries before increasing batch size |
| Write tests | Pitfall 8 (real API in tests) | Mock at the `HTTPClient.get/post` level, not at the platform API level |
| Delete duplicate framework | Pitfall 7 (scope creep) | Delete `spiderJobs/` explicitly in first PR; do not migrate it |

---

## Sources

All findings are grounded in direct codebase inspection of this repository:

- `/Users/win/2025/AICoding/JobData/.planning/codebase/CONCERNS.md`: security and tech debt audit
- `/Users/win/2025/AICoding/JobData/app/services/ingest/dedup.py`: dedup implementation
- `/Users/win/2025/AICoding/JobData/app/services/ingest/service.py`: ingest service
- `/Users/win/2025/AICoding/JobData/app/core/clickhouse_init.py`: schema DDL (MergeTree ORDER BY created_at, not dedup columns)
- `/Users/win/2025/AICoding/JobData/app/services/crawler/_boss_sign.py` vs `/Users/win/2025/AICoding/JobData/spiderJobs/platforms/boss/sign.py`: confirmed verbatim duplication
- `/Users/win/2025/AICoding/JobData/app/services/crawler/_base.py`: existing base class
- `/Users/win/2025/AICoding/JobData/app/services/crawler/_http_client.py`: HTTP layer
- `/Users/win/2025/AICoding/JobData/app/services/cleaning.py`: async service wrapping sync crawlers
- `/Users/win/2025/AICoding/JobData/jobs_spider/qcwy/qcwy.py`: hardcoded proxy credential, sleep defaults differ from Boss
- `/Users/win/2025/AICoding/JobData/jobs_spider/boss/boos_api.py`: 2281-line god file, SmartIPManager
- `/Users/win/2025/AICoding/JobData/.planning/PROJECT.md`: refactoring constraints and goals