Domain Pitfalls: Recruitment Data Crawler Refactoring
Domain: Chinese recruitment platform crawler refactoring (Boss直聘 / 前程无忧 / 智联招聘)
Researched: 2026-03-21
Confidence: HIGH. All findings are grounded in direct codebase evidence from this repository.
Critical Pitfalls
Mistakes that cause rewrites, data loss, or crawler shutdown.
Pitfall 1: Breaking the Running Crawler While Refactoring
What goes wrong: You refactor jobs_spider/boss/boos_api.py or app/services/crawler/ in-place. The new base class has a subtly different interface (parameter order, return type, exception handling). The old code path that was working for production data collection silently stops collecting, or starts pushing malformed data into ClickHouse.
Why it happens: This codebase already has three parallel spider implementations (jobs_spider/, spiderJobs/, app/services/crawler/) that evolved independently. The temptation during refactoring is to "just migrate" the live code directly. But the live scripts (jobs_spider/boss/boos_api.py) run as standalone ECS processes with a persistent loop — there is no deployment gate. Any change to the shared module takes effect on the next ECS restart or local re-run, with no rollback.
Consequences:
- Zero data ingested for the duration of the broken crawler run (hours or days on ECS)
- No test coverage currently means no automated detection of breakage
- ClickHouse dedup logic will not surface "nothing came in" as an error
Prevention:
- Always keep the old code path alive and parallel until the new one has ingested at least one full keyword sweep successfully
- Use the feature-flag pattern: add a USE_NEW_CRAWLER=0/1 env variable; default to old code; validate new code side-by-side before switching default
- Write a data-volume canary check: if a keyword cycle produces 0 records when it historically produces 50+, alert immediately rather than silently succeeding
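The flag and the canary from the list above could be sketched roughly as follows. This is a minimal illustration, not code from the repository: the function names, the `baseline` argument, and the `min_ratio` threshold are assumptions, and only the `USE_NEW_CRAWLER` variable name comes from the text.

```python
import os


def use_new_crawler(env=None):
    """Return True only when USE_NEW_CRAWLER is explicitly set to "1".

    Anything else (unset, "0", a typo) keeps the old code path active.
    """
    env = os.environ if env is None else env
    return env.get("USE_NEW_CRAWLER", "0") == "1"


def volume_canary(keyword, record_count, baseline, min_ratio=0.2):
    """Return an alert message when a sweep falls far below its baseline.

    baseline is the historical record count for this keyword (hypothetical
    input; where it is stored is left open). Returns None when the sweep
    looks normal or when there is no history to compare against.
    """
    if baseline <= 0:
        return None
    if record_count < baseline * min_ratio:
        return (
            f"canary: {keyword!r} produced {record_count} records, "
            f"baseline is {baseline}"
        )
    return None
```

Defaulting to the old path on any unexpected flag value means a misconfigured ECS host degrades to known-good behavior rather than to the unvalidated crawler.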
Detection (warning signs):
- push_job() returns success but ClickHouse row count for today is lower than yesterday's baseline
- Crawler log shows "抓取完成" (crawl complete) with 0 records for keywords that reliably returned results before
- batch_dedup_filter ignored count is unexpectedly 100% (all records already existed, possibly because empty batches are silently accepted)
Phase: Address in the very first refactoring phase, before touching any shared path. Establish the parallel-run validation pattern before any code changes.
Pitfall 2: Copying Sign/Algorithm Code Instead of Sharing It
What goes wrong: The sign algorithm (_compute_checksum, _generate_uuid, BossSign) is already duplicated verbatim between spiderJobs/platforms/boss/sign.py and app/services/crawler/_boss_sign.py. The comment in _boss_sign.py says "复制自 spiderJobs/platforms/boss/sign.py" ("copied from spiderJobs/platforms/boss/sign.py"). This copy-paste pattern will be repeated for the new unified base: if the shared core lives in app/, the jobs_spider/ standalone scripts cannot import it without adding the app to sys.path at runtime, which is fragile. The solution chosen in the current code, "copy the file with a comment", is not sharing; it is duplication with extra steps.
Why it happens: Python's import system makes cross-directory sharing feel difficult when one consumer is an app package and the other is a standalone script. The easy path is copy-paste, which maintains independence but recreates the bug-propagation problem.
Consequences:
- A future Boss anti-bot update requires changing the sign algorithm. The fix must be applied in 2+ places. One place is missed. One spider silently generates invalid traceid headers and gets 403s — but since there is no test, this is only discovered when data stops appearing.
- Same problem will recur for HTTP client, IP manager, and data parser
Prevention:
- Extract shared crawler core into a proper installable package (crawler_core/) or a well-defined subdirectory with a single source of truth
- The jobs_spider/ scripts should sys.path.insert(0, repo_root) and import from the shared package: one authoritative source
- Alternatively: keep app/services/crawler/ as the canonical location, and have jobs_spider/ scripts import directly from app.services.crawler.* after setting PYTHONPATH
- Do NOT use the "复制自 X" comment pattern; it is a documentation of failure, not a solution
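If the sys.path route is chosen, the bootstrap can at least be made deterministic by locating the repository root from the script's own location instead of the working directory. The helper below is a hypothetical sketch; its name and the choice of `app` as the marker directory are assumptions.

```python
import sys
from pathlib import Path


def ensure_repo_on_path(script_file, marker="app"):
    """Put the repository root at the front of sys.path and return it.

    Walks up from script_file until a directory containing `marker`
    (assumed here to be the app/ package directory) is found.
    """
    start = Path(script_file).resolve().parent
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            root = str(candidate)
            if root not in sys.path:
                sys.path.insert(0, root)
            return candidate
    raise RuntimeError(f"no parent of {script_file} contains {marker!r}")


# Intended use at the top of a standalone script (illustrative only):
#   ensure_repo_on_path(__file__)
#   from app.services.crawler._boss_sign import BossSign
```

An installable package removes the need for this helper entirely; the sketch only covers the case where the scripts must keep running without an install step.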
Detection:
- Run diff spiderJobs/platforms/boss/sign.py app/services/crawler/_boss_sign.py; if they differ, a divergence has already occurred
- Any comment containing "复制自" or "copy from" is a warning sign of duplicated code
Phase: Must be the foundational decision in Phase 1 before any platform-specific work. The sharing mechanism must be established first; all subsequent platform rewrites depend on it.
Pitfall 3: Anti-Bot State Is Instance-Local, Not Process-Global
What goes wrong: The refactored SmartIPManager, token cache, and session objects are created fresh per-request when embedded in the FastAPI async service (CleaningService, CompanyCleaner). Each call to CleaningService instantiates a new BossService, which creates a new BossSign and a new HTTPClient with a new session. The anti-bot state (IP failure counters, cooldown timers, session cookies) that Boss直聘 relies on to not trigger rate-limiting is discarded after each request.
Why it happens: The current CleaningService.__init__ creates self.boss_service = BossService() per instance. CleaningService itself is instantiated per API request. This works for the standalone script (one long-lived process, one SmartIPManager) but the FastAPI service context has a completely different lifecycle.
Consequences:
- The standalone scripts maintain session state correctly (SmartIPManager accumulates failure counts, applies cooldowns)
- The embedded service makes "cold" requests every time, resetting the TLS fingerprint session and appearing as new clients — which is exactly what anti-bot systems profile
- The __zp_sseed__/__zp_sname__/__zp_sts__ cookie state managed in boos_api.py (anti-bot response at code=37) exists only in the single-process script
Prevention:
- The shared core (IP manager, session, token cache) must be process-level singletons when used inside FastAPI, not per-request instances
- Use FastAPI's lifespan to initialize a single shared crawler context; inject it via dependency
- For the standalone scripts, the existing per-process instance model is correct and should not change
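A process-level context tied to the application lifespan might look like the sketch below. It uses only the standard library so that it runs without FastAPI installed. `CrawlerContext` and `get_crawler` are hypothetical names, and the dictionaries are stand-ins for the real SmartIPManager and session objects.

```python
from contextlib import asynccontextmanager


class CrawlerContext:
    """Hypothetical container for long-lived anti-bot state."""

    def __init__(self):
        self.ip_failures = {}  # stand-in for SmartIPManager failure counters
        self.cookies = {}      # stand-in for session cookies
        self.closed = False

    def close(self):
        self.closed = True


@asynccontextmanager
async def lifespan(app):
    # Built once at startup; every request then shares this one instance.
    app.state.crawler = CrawlerContext()
    try:
        yield
    finally:
        app.state.crawler.close()


def get_crawler(app):
    """Accessor that returns the shared context and never builds a new one."""
    return app.state.crawler


# Wiring in the service (not executed here):
#   app = FastAPI(lifespan=lifespan)
#   and inside a handler: request.app.state.crawler
```

The point of the design is that CleaningService receives the context instead of constructing BossService itself, so failure counters and cookies survive across requests.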
Detection:
- Boss returns code=35 (IP banned) or code=37 (anti-bot challenge) on the first request from the FastAPI service context
- High 403/429 rate from embedded cleaning service, while standalone scripts on the same IP work fine
Phase: Phase addressing the FastAPI embedded crawler service. Do not assume the standalone script's session management model transfers to async service context.
Pitfall 4: ClickHouse Dedup Scans the Full Table
What goes wrong: batch_dedup_filter in app/services/ingest/dedup.py issues SELECT job_id FROM job_data.boss_job WHERE job_id IN (...) with no time-bound filter. For boss_job after months of operation, this scans every row. Because the MergeTree table is declared with ORDER BY created_at, job_id is not part of the primary key, so the lookup cannot skip any granules and reads the entire job_id column. At scale (millions of rows), this dedup check becomes the bottleneck and eventually times out, at which point the ingest service falls back to check_duplicate=False and silently inserts duplicates.
Why it happens: The current schema uses ORDER BY created_at for all tables (confirmed in clickhouse_init.py). Dedup columns (job_id, number, etc.) are secondary columns with no index. This was fine when tables were small.
Consequences:
- Ingest latency spikes as tables grow, eventually timing out
- Silent duplicate insertion when dedup times out and the batch is inserted anyway
- The CONCERNS.md already flags this, but the fix (adding PREWHERE created_at > now() - INTERVAL N DAY) cannot be done as a view change; it requires changing the dedup query logic
Prevention:
- Add a configurable days_back parameter (defaulting to 30) to all dedup queries: WHERE job_id IN (...) AND created_at >= now() - INTERVAL 30 DAY
- This matches the existing company_cleaner.py days_back=90 pattern already in the codebase
- During refactoring, add this as a mandatory parameter to batch_dedup_filter, not an optional afterthought
- Consider adding a secondary index (INDEX idx_job_id job_id TYPE bloom_filter GRANULARITY 4) in the DDL for tables where dedup is critical
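One way to make the time bound impossible to forget is to build the query in a helper that refuses to run without it. The sketch below only assembles SQL and a parameter dict. The function name is hypothetical, the `{name:Type}` placeholders assume the ClickHouse client in use supports server-side query parameters, and `table` and `key_column` are interpolated directly, so they must come from a fixed allow-list in code rather than from user input.

```python
def build_dedup_query(table, key_column, keys, days_back):
    """Return (sql, params) for a time-bounded dedup lookup.

    Returns None for an empty batch so callers can skip the round trip.
    days_back is required and must be positive.
    """
    if days_back <= 0:
        raise ValueError("days_back must be a positive number of days")
    keys = list(dict.fromkeys(keys))  # drop in-batch repeats, keep order
    if not keys:
        return None
    sql = (
        f"SELECT {key_column} FROM {table} "
        f"WHERE {key_column} IN {{keys:Array(String)}} "
        "AND created_at >= now() - toIntervalDay({days_back:UInt32})"
    )
    return sql, {"keys": keys, "days_back": days_back}
```

For example, `build_dedup_query("job_data.boss_job", "job_id", ids, 30)` yields a statement that a partition or primary-key range on created_at can prune, instead of one that reads the whole job_id column.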
Detection:
- ClickHouse query log shows dedup_filter queries scanning >1M rows with no created_at filter
- Ingest endpoint latency increases week-over-week as table grows
- duplicate count in ingest result drops to 0 even though the same crawler ran yesterday (dedup failing silently)
Phase: Phase 1 (dedup infrastructure). This is a data correctness issue, not a performance optimization. Fix before the refactored crawlers start pushing higher volumes.
Moderate Pitfalls
Pitfall 5: Sync requests Blocking the Async Event Loop
What goes wrong: The refactored crawler services use requests (sync) wrapped in asyncio.to_thread(). Under concurrent cleaning requests or batch company processing, the thread pool fills up. The CleaningService spawns 50 company cleanings in one scheduler tick, and each calls to_thread(boss_service.get_company_detail_by_id). asyncio.to_thread() runs on the event loop's default executor, a ThreadPoolExecutor capped at min(32, os.cpu_count() + 4) threads unless the application replaces it. On a host with 16 or fewer cores that is at most 20 threads, so 20+ concurrent cleaning tasks saturate the pool, and subsequent requests queue behind them.
Why it happens: The crawler libraries (requests, requests_go) are synchronous. Wrapping them with to_thread is the correct pattern, but the thread pool limit is invisible until load hits it.
Prevention:
- Use asyncio.Semaphore to cap concurrency of to_thread calls within the cleaning scheduler; cap at (thread_pool_size // 3) to leave headroom for other async operations
- Alternatively, migrate to httpx.AsyncClient for the embedded service context (not the standalone scripts, which can stay synchronous)
- Set UVICORN_WORKERS and thread pool size explicitly; document the relationship
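The semaphore cap can be sketched in a few lines. `run_bounded` is a hypothetical helper name; `func` stands in for a blocking call such as boss_service.get_company_detail_by_id, and the right value for `limit` depends on the executor size chosen for the service.

```python
import asyncio


async def run_bounded(func, items, limit):
    """Run func(item) in worker threads, never more than `limit` at once.

    Results come back in the same order as `items`.
    """
    semaphore = asyncio.Semaphore(limit)

    async def one(item):
        # Tasks beyond the limit wait here, on the event loop,
        # instead of occupying a slot in the thread pool queue.
        async with semaphore:
            return await asyncio.to_thread(func, item)

    return await asyncio.gather(*(one(item) for item in items))
```

With this in place, a scheduler tick of 50 companies still creates 50 tasks, but only `limit` of them hold a thread at any moment, which leaves the rest of the pool free for other to_thread callers.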
Detection:
- Event loop lag exceeds 100ms during scheduled cleaning runs
- asyncio.to_thread queue depth grows without bound during batch operations
Phase: Phase addressing CompanyCleaner and CleaningService concurrency. Low urgency until company processing batch size increases.
Pitfall 6: ReplacingMergeTree Dedup Is Eventual, Not Immediate
What goes wrong: pending_company uses ENGINE = ReplacingMergeTree(version). The assumption is that inserting the same (source, company_id) twice will result in one row. This is wrong at query time. ClickHouse only deduplicates during background merges, which happen asynchronously. Between insertion and the next merge, both rows exist and will appear in SELECT. The CompanyCleaner.collect_pending_companies() query that reads pending_company may return duplicate company IDs, causing duplicate processing and potentially duplicate entries in MySQL.
Why it happens: ReplacingMergeTree is commonly misunderstood as providing immediate deduplication. It does not. The documentation is clear, but the confusion is pervasive.
Prevention:
- All queries against pending_company must use SELECT ... FROM pending_company FINAL to force merge semantics at query time
- Or: aggregate explicitly, e.g. SELECT source, company_id, max(version) ... GROUP BY source, company_id. Note that a filter such as WHERE created_at = max(created_at) is not valid SQL, because an aggregate cannot appear in a WHERE clause. FINAL is simpler and correct
- The insert logic should also guard against inserting the same (source, company_id) in the same batch (pre-filter in application code)
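The application-side pre-filter from the last bullet could look like the sketch below. It assumes each row is a dict carrying `source`, `company_id`, and the `version` value used by ReplacingMergeTree; the function name and the exact column list in the query string are illustrative.

```python
def prefilter_batch(rows):
    """Keep one row per (source, company_id), preferring the highest version."""
    latest = {}
    for row in rows:
        key = (row["source"], row["company_id"])
        if key not in latest or row["version"] > latest[key]["version"]:
            latest[key] = row
    return list(latest.values())


# Read side: FINAL collapses not-yet-merged duplicates at query time.
PENDING_QUERY = "SELECT source, company_id FROM pending_company FINAL"
```

The two measures cover different windows: the pre-filter removes repeats inside one insert batch, while FINAL handles rows from separate inserts that background merges have not yet collapsed.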
Detection:
- collect_pending_companies() returns the same company_id twice in one batch
- MySQL company table gets duplicate entries from two collect runs in quick succession
Phase: Phase addressing company data pipeline. Must be fixed before increasing collect_pending_companies batch size.
Pitfall 7: The Three-Framework Problem Causes Refactoring Scope Creep
What goes wrong: The project currently has jobs_spider/, spiderJobs/, and app/services/crawler/ — three partially overlapping crawler implementations. During refactoring, there is a strong temptation to "also fix" the spiderJobs/ framework, or to make the new base class compatible with all three. This expands the scope from "refactor three platforms" to "unify three frameworks," which is a completely different project.
Why it happens: spiderJobs/ contains a cleaner architecture (core/, platforms/, runner/) that looks like the intended target. It is tempting to treat it as the destination rather than additional tech debt.
Prevention:
- Make a single explicit decision at the start: spiderJobs/ is deleted or kept. Do not leave it in a half-migrated state.
- If deleted: do it in the first commit of the refactoring phase. An empty spiderJobs/ with a DEPRECATED.md is fine. A half-migrated spiderJobs/ that is 60% replaced is not.
- The refactoring target is the new _base.py, _boss_api.py, _boss_client.py, _boss_sign.py pattern already started in app/services/crawler/; finish that pattern, don't restart with spiderJobs/
Detection:
- A PR diff that touches spiderJobs/ files during "platform rewrite" work is scope creep
- Story points/time estimate doubles from what was planned
Phase: Decision must be made and documented in Phase 1 before any platform work begins.
Pitfall 8: Testing Crawlers by Hitting Real APIs
What goes wrong: Unit tests for crawler code make real HTTP requests to Boss直聘, 前程无忧, or 智联招聘. This causes: (a) tests fail in CI with no network access; (b) real request volume from CI triggers rate limiting or bans the test account's IP; (c) test results are non-deterministic (platform API changes break tests unexpectedly); (d) tests cannot run at all without valid tokens.
Why it happens: There are currently zero crawler tests. When tests are first written, the path of least resistance is to write integration tests against real endpoints. This "just works" locally but creates all of the above problems.
Prevention:
- All unit tests for crawler logic must use unittest.mock.patch or pytest-mock to mock the HTTP layer
- Test the layers independently: sign algorithm (pure function, no HTTP needed), response parser (provide fixture JSON, no HTTP needed), IP manager state machine (mock requests.Session)
- Maintain a tests/fixtures/ directory with saved real API response JSON (anonymized if needed); test parsers against these fixtures, not live responses
- A single integration test suite (marked @pytest.mark.integration) can hit real APIs, but it should be opt-in and never run in CI
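A mocked test in this style might look like the following sketch. Everything in it is illustrative: `parse_job_list`, `fetch_jobs`, the `/search` path, and the response shape are invented for the example and do not describe any platform's real API. The only assumption taken from the text is that the HTTP client exposes a `get` method that can be replaced with a mock.

```python
from unittest.mock import Mock


def parse_job_list(payload):
    """Hypothetical parser: pull (job_id, title) pairs out of a list response."""
    if payload.get("code") != 0:
        raise ValueError(f"unexpected response code: {payload.get('code')}")
    return [(job["job_id"], job["title"]) for job in payload["data"]["jobs"]]


def fetch_jobs(http_client, keyword):
    """Hypothetical crawler step: one GET through the client, then parse."""
    payload = http_client.get("/search", params={"keyword": keyword})
    return parse_job_list(payload)


def test_fetch_jobs_uses_fixture_not_network():
    # The mock replaces the HTTP layer, so no socket is ever opened.
    client = Mock()
    client.get.return_value = {
        "code": 0,
        "data": {"jobs": [{"job_id": "1", "title": "Python"}]},
    }
    assert fetch_jobs(client, "python") == [("1", "Python")]
    client.get.assert_called_once_with("/search", params={"keyword": "python"})
```

In a real suite the inline dict would be loaded from a file under tests/fixtures/, so the parser is exercised against a saved response rather than a hand-written one.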
Detection:
- Test suite requires API_BASE_URL, BOSS_TOKEN, or network access to pass
- Tests fail with ConnectionRefusedError or TimeoutError in CI
Phase: Phase establishing the test infrastructure. Write mocked tests first — before implementing the refactored code — following TDD discipline.
Minor Pitfalls
Pitfall 9: Log File Creation in CWD Breaks When CWD Changes
What goes wrong: Both jobs_spider/boss/boos_api.py and jobs_spider/qcwy/qcwy.py contain os.makedirs("logs", exist_ok=True) and logger.add("logs/log_{time:YYYY-MM-DD}.log", ...). These create a logs/ directory relative to the current working directory when the script is launched. If ECS scripts are run from a different directory (e.g., /root instead of /app), logs appear in an unexpected location, making debugging harder.
Prevention: Use pathlib.Path(__file__).parent / "logs" for script-relative log paths.
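A small helper keeps the path logic in one place. The name `script_log_dir` is hypothetical, and the commented usage line assumes the scripts keep their existing loguru `logger.add` call.

```python
from pathlib import Path


def script_log_dir(script_file):
    """Create and return a logs/ directory next to the given script file.

    The result does not depend on the current working directory.
    """
    log_dir = Path(script_file).resolve().parent / "logs"
    log_dir.mkdir(parents=True, exist_ok=True)
    return log_dir


# Intended use inside a script (illustrative only):
#   logger.add(script_log_dir(__file__) / "log_{time:YYYY-MM-DD}.log")
```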
Phase: Minor cleanup, address during standalone script refactoring.
Pitfall 10: Hardcoded Proxy URL Leaks Credentials
What goes wrong: jobs_spider/qcwy/qcwy.py line 31 hardcodes a real proxy credential in source, in the form PROXY_URL = "http://<user>:<password>@<host>:<port>". This is committed to the repo.
Prevention: Remove immediately. Require PROXY_URL via environment variable with no default. Rotate the credential.
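Requiring the variable with no default can be sketched as below. The function name is hypothetical; the only name taken from the text is `PROXY_URL`.

```python
import os


def load_proxy_url(env=None):
    """Return PROXY_URL from the environment, failing loudly when it is missing."""
    env = os.environ if env is None else env
    url = env.get("PROXY_URL", "").strip()
    if not url:
        # No fallback on purpose: a default here would put a credential
        # back into source control.
        raise RuntimeError("PROXY_URL is not set; refusing to start without it")
    return url
```

Failing at startup is deliberate: a crawler that silently runs without its proxy would expose the host IP to the platforms' rate limiting.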
Phase: Pre-refactoring security fix. Do not include in any commit that goes to a shared remote.
Pitfall 11: ClickHouse Schema Cannot Be Evolved Without Manual DDL
What goes wrong: All six data tables use CREATE TABLE IF NOT EXISTS. Adding a new column to the DDL (e.g., adding province to boss_job) does nothing on a production database where the table already exists. The refactored crawler may produce records with a new field that the ClickHouse schema does not have, causing insert failures or silent field drops.
Prevention: For each schema change, write an explicit ALTER TABLE ... ADD COLUMN IF NOT EXISTS statement in clickhouse_init.py's upgrade path. Run this on startup using a version-tracking approach. Do not rely on CREATE TABLE IF NOT EXISTS for evolution.
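A version-tracked upgrade path could be as simple as an ordered list of statements. The sketch below is an assumption about shape, not the contents of clickhouse_init.py: the helper names are invented, the `String DEFAULT ''` column type is a guess, and where the current version number is persisted is left open. `execute` stands for whatever callable runs one statement against ClickHouse.

```python
# (version, statement) pairs; append new entries, never edit old ones.
MIGRATIONS = [
    (1, "ALTER TABLE job_data.boss_job "
        "ADD COLUMN IF NOT EXISTS province String DEFAULT ''"),
]


def pending_migrations(current_version, migrations=MIGRATIONS):
    """Return the migrations newer than current_version, in version order."""
    return [(v, sql) for v, sql in sorted(migrations) if v > current_version]


def apply_migrations(execute, current_version, migrations=MIGRATIONS):
    """Run pending statements in order via execute(sql); return the new version."""
    for version, sql in pending_migrations(current_version, migrations):
        execute(sql)
        current_version = version
    return current_version
```

Because each statement uses ADD COLUMN IF NOT EXISTS, rerunning a migration after a partial failure is harmless, which matters when the version bookkeeping and the DDL cannot be committed atomically.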
Phase: Any phase that changes what fields the crawler stores. Medium urgency — only becomes critical when schema changes are needed.
Phase-Specific Warnings
| Phase Topic | Likely Pitfall | Mitigation |
|---|---|---|
| Establish shared crawler core | Pitfall 2 (copy-paste instead of share) | Decide the import mechanism first; prove it works with one sign file before touching anything else |
| Rewrite Boss直聘 client | Pitfall 1 (break running crawler) | Keep old boos_api.py running; use env flag to switch; validate side-by-side |
| Rewrite Boss直聘 client | Pitfall 3 (anti-bot state lost) | Test the new client as a standalone script with a real keyword before embedding in FastAPI |
| ClickHouse dedup refactoring | Pitfall 4 (full table scan) | Add days_back parameter as non-optional; add integration test with time bounds |
| Company pipeline optimization | Pitfall 6 (ReplacingMergeTree eventual) | Add FINAL to all pending_company SELECT queries before increasing batch size |
| Write tests | Pitfall 8 (real API in tests) | Mock at the HTTPClient.get/post level, not at the platform API level |
| Delete duplicate framework | Pitfall 7 (scope creep) | Delete spiderJobs/ explicitly in first PR; do not migrate it |
Sources
All findings are grounded in direct codebase inspection of this repository:
- /Users/win/2025/AICoding/JobData/.planning/codebase/CONCERNS.md: security and tech debt audit
- /Users/win/2025/AICoding/JobData/app/services/ingest/dedup.py: dedup implementation
- /Users/win/2025/AICoding/JobData/app/services/ingest/service.py: ingest service
- /Users/win/2025/AICoding/JobData/app/core/clickhouse_init.py: schema DDL (MergeTree ORDER BY created_at, not dedup columns)
- /Users/win/2025/AICoding/JobData/app/services/crawler/_boss_sign.py vs /Users/win/2025/AICoding/JobData/spiderJobs/platforms/boss/sign.py: confirmed verbatim duplication
- /Users/win/2025/AICoding/JobData/app/services/crawler/_base.py: existing base class
- /Users/win/2025/AICoding/JobData/app/services/crawler/_http_client.py: HTTP layer
- /Users/win/2025/AICoding/JobData/app/services/cleaning.py: async service wrapping sync crawlers
- /Users/win/2025/AICoding/JobData/jobs_spider/qcwy/qcwy.py: hardcoded proxy credential, sleep defaults differ from Boss
- /Users/win/2025/AICoding/JobData/jobs_spider/boss/boos_api.py: 2281-line god file, SmartIPManager
- /Users/win/2025/AICoding/JobData/.planning/PROJECT.md: refactoring constraints and goals