win f3005ef525 docs: add research findings (stack, features, architecture, pitfalls, summary)

2026-03-21 16:36:37 +08:00

Domain Pitfalls: Recruitment Data Crawler Refactoring

Domain: Chinese recruitment platform crawler refactoring (Boss直聘 / 前程无忧 / 智联招聘)
Researched: 2026-03-21
Confidence: HIGH (all findings are grounded in direct codebase evidence from this repository)

Critical Pitfalls

Mistakes that cause rewrites, data loss, or crawler shutdown.

Pitfall 1: Breaking the Running Crawler While Refactoring

What goes wrong: You refactor jobs_spider/boss/boos_api.py or app/services/crawler/ in-place. The new base class has a subtly different interface (parameter order, return type, exception handling). The old code path that was working for production data collection silently stops collecting, or starts pushing malformed data into ClickHouse.

Why it happens: This codebase already has three parallel spider implementations (jobs_spider/, spiderJobs/, app/services/crawler/) that evolved independently. The temptation during refactoring is to "just migrate" the live code directly. But the live scripts (jobs_spider/boss/boos_api.py) run as standalone ECS processes with a persistent loop — there is no deployment gate. Any change to the shared module takes effect on the next ECS restart or local re-run, with no rollback.

Consequences:

Zero data ingested for the duration of the broken crawler run (hours or days on ECS)
No test coverage currently means no automated detection of breakage
ClickHouse dedup logic will not surface "nothing came in" as an error

Prevention:

Always keep the old code path alive and parallel until the new one has ingested at least one full keyword sweep successfully
Use the feature-flag pattern: add a USE_NEW_CRAWLER=0/1 env variable; default to old code; validate new code side-by-side before switching default
Write a data-volume canary check: if a keyword cycle produces 0 records when it historically produces 50+, alert immediately rather than silently succeeding
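The flag and the canary described above can be sketched as follows. This is a minimal illustration, not code from the repository: the function names, the `min_ratio` threshold, and the idea of passing a per-keyword `baseline` are assumptions; only the `USE_NEW_CRAWLER` variable name comes from the text.

```python
import os


def use_new_crawler(env=None) -> bool:
    """Return True only when USE_NEW_CRAWLER is explicitly set to "1".

    Any other value (or an unset variable) keeps the old code path.
    """
    env = os.environ if env is None else env
    return env.get("USE_NEW_CRAWLER", "0") == "1"


def volume_canary(record_count: int, baseline: int, min_ratio: float = 0.2) -> bool:
    """Return True when a keyword cycle looks healthy.

    `baseline` is the historical record count for the keyword (hypothetical
    input; where it is stored is left open). `min_ratio` is an assumed
    threshold, not a value taken from the codebase.
    """
    if baseline <= 0:
        return True  # no history yet, nothing to compare against
    return record_count >= baseline * min_ratio
```

A caller would alert (rather than log success) whenever `volume_canary(...)` returns False, for example when a keyword that historically yields 50+ records produces 0.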

Detection (warning signs):

push_job() returns success but ClickHouse row count for today is lower than yesterday's baseline
Crawler log shows "抓取完成" ("crawl complete") with 0 records for keywords that reliably returned results before
batch_dedup_filter ignored count is unexpectedly 100% (all records already existed, possibly because empty batches are silently accepted)

Phase: Address in the very first refactoring phase, before touching any shared path. Establish the parallel-run validation pattern before any code changes.

Pitfall 2: Copying Sign/Algorithm Code Instead of Sharing It

What goes wrong: The sign algorithm (_compute_checksum, _generate_uuid, BossSign) is already duplicated verbatim between spiderJobs/platforms/boss/sign.py and app/services/crawler/_boss_sign.py. The comment in _boss_sign.py says "复制自 spiderJobs/platforms/boss/sign.py" ("copied from spiderJobs/platforms/boss/sign.py"). This copy-paste pattern will be repeated for the new unified base: if the shared core lives in app/, the jobs_spider/ standalone scripts cannot import it without adding the app to sys.path at runtime, which is fragile. The solution chosen in the current code, "copy the file with a comment", is not sharing; it is duplication with extra steps.

Why it happens: Python's import system makes cross-directory sharing feel difficult when one consumer is an app package and the other is a standalone script. The easy path is copy-paste, which maintains independence but recreates the bug-propagation problem.

Consequences:

A future Boss anti-bot update requires changing the sign algorithm. The fix must be applied in 2+ places. One place is missed. One spider silently generates invalid traceid headers and gets 403s — but since there is no test, this is only discovered when data stops appearing.
Same problem will recur for HTTP client, IP manager, and data parser

Prevention:

Extract shared crawler core into a proper installable package (crawler_core/) or a well-defined subdirectory with a single source of truth
The jobs_spider/ scripts should sys.path.insert(0, repo_root) and import from the shared package — one authoritative source
Alternatively: keep app/services/crawler/ as the canonical location, and have jobs_spider/ scripts import directly from app.services.crawler.* after setting PYTHONPATH
Do NOT use the "复制自 X" ("copied from X") comment pattern: it documents a failure, it does not solve one
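One possible shape for the sys.path approach mentioned above is a small bootstrap helper that each standalone script calls before importing shared code. This is a sketch under assumptions: the helper name and the `levels_up` default are hypothetical, and the default of 2 assumes a layout like `<repo>/jobs_spider/boss/boos_api.py`.

```python
import sys
from pathlib import Path


def add_repo_root(script_file: str, levels_up: int = 2) -> Path:
    """Put the repository root on sys.path exactly once and return it.

    For <repo>/jobs_spider/boss/boos_api.py, parents[2] is <repo>.
    """
    root = Path(script_file).resolve().parents[levels_up]
    root_str = str(root)
    if root_str not in sys.path:
        sys.path.insert(0, root_str)
    return root
```

A script would call `add_repo_root(__file__)` at the top and then import from the single shared location, so a sign-algorithm fix lands in one file only.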

Detection:

Run diff spiderJobs/platforms/boss/sign.py app/services/crawler/_boss_sign.py — if they differ, a divergence has already occurred
Any comment containing "复制自" ("copied from") or "copy from" is a warning sign of duplicated code

Phase: Must be the foundational decision in Phase 1 before any platform-specific work. The sharing mechanism must be established first; all subsequent platform rewrites depend on it.

Pitfall 3: Anti-Bot State Is Instance-Local, Not Process-Global

What goes wrong: The refactored SmartIPManager, token cache, and session objects are created fresh per-request when embedded in the FastAPI async service (CleaningService, CompanyCleaner). Each call to CleaningService instantiates a new BossService, which creates a new BossSign and a new HTTPClient with a new session. The anti-bot state (IP failure counters, cooldown timers, session cookies) that Boss直聘 relies on to not trigger rate-limiting is discarded after each request.

Why it happens: The current CleaningService.__init__ creates self.boss_service = BossService() per instance. CleaningService itself is instantiated per API request. This works for the standalone script (one long-lived process, one SmartIPManager) but the FastAPI service context has a completely different lifecycle.

Consequences:

The standalone scripts maintain session state correctly (SmartIPManager accumulates failure counts, applies cooldowns)
The embedded service makes "cold" requests every time, resetting the TLS fingerprint session and appearing as new clients — which is exactly what anti-bot systems profile
The __zp_sseed__ / __zp_sname__ / __zp_sts__ cookie state managed in boos_api.py (anti-bot response at code=37) exists only in the single-process script

Prevention:

The shared core (IP manager, session, token cache) must be process-level singletons when used inside FastAPI, not per-request instances
Use FastAPI's lifespan to initialize a single shared crawler context; inject it via dependency
For the standalone scripts, the existing per-process instance model is correct and should not change
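The lifespan-based singleton can be sketched without depending on the real classes. Everything named here (`CrawlerContext`, its fields, `get_crawler_context`) is hypothetical; the only FastAPI facts relied on are that `FastAPI(lifespan=...)` accepts an async context manager taking the app, and that `app.state` is available for shared objects.

```python
from contextlib import asynccontextmanager
from dataclasses import dataclass, field


@dataclass
class CrawlerContext:
    """Hypothetical container for process-wide anti-bot state."""
    ip_failures: dict = field(default_factory=dict)
    cookies: dict = field(default_factory=dict)


@asynccontextmanager
async def lifespan(app):
    # Built once at startup; every request in this process shares it.
    app.state.crawler_context = CrawlerContext()
    try:
        yield
    finally:
        # Drop the shared state on shutdown.
        app.state.crawler_context = None


def get_crawler_context(request) -> CrawlerContext:
    """Return the shared context instead of constructing a new one."""
    return request.app.state.crawler_context
```

In a FastAPI app this would be wired as `FastAPI(lifespan=lifespan)`, with the `request` parameter annotated as `fastapi.Request` so the framework injects it when the function is used with `Depends`. Services such as CleaningService would then receive the shared context rather than creating their own BossService per request.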

Detection:

Boss returns code=35 (IP banned) or code=37 (anti-bot challenge) on the first request from the FastAPI service context
High 403/429 rate from embedded cleaning service, while standalone scripts on the same IP work fine

Phase: Phase addressing the FastAPI embedded crawler service. Do not assume the standalone script's session management model transfers to async service context.

Pitfall 4: ClickHouse Dedup Scans the Full Table

What goes wrong: batch_dedup_filter in app/services/ingest/dedup.py issues SELECT job_id FROM job_data.boss_job WHERE job_id IN (...) with no time-bound filter. After months of operation, this query scans every row of boss_job. Because the MergeTree table is declared with ORDER BY created_at, job_id is not part of the primary key, so the scan reads the entire job_id column. At scale (millions of rows), the dedup check becomes the bottleneck and eventually times out, at which point the ingest service falls back to check_duplicate=False and silently inserts duplicates.

Why it happens: The current schema uses ORDER BY created_at for all tables (confirmed in clickhouse_init.py). Dedup columns (job_id, number, etc.) are secondary columns with no index. This was fine when tables were small.

Consequences:

Ingest latency spikes as tables grow, eventually timing out
Silent duplicate insertion when dedup times out and the batch is inserted anyway
The CONCERNS.md already flags this — but the fix (adding PREWHERE created_at > now() - INTERVAL N DAY) cannot be done as a view change; it requires changing the dedup query logic

Prevention:

Add a configurable days_back parameter (defaulting to 30) to all dedup queries: WHERE job_id IN (...) AND created_at >= now() - INTERVAL 30 DAY
This matches the existing company_cleaner.py days_back=90 pattern already in the codebase
During refactoring, add this as a mandatory parameter to batch_dedup_filter — not an optional afterthought
Consider adding a secondary index (INDEX idx_job_id job_id TYPE bloom_filter GRANULARITY 4) in the DDL for tables where dedup is critical
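A time-bounded dedup query with a mandatory `days_back` might be built like this. The sketch is not the existing batch_dedup_filter: the function name, the `%(keys)s` placeholder style (pyformat, as used by some ClickHouse Python drivers), and the returned `(sql, params)` pair are assumptions to be adapted to whichever client the project uses.

```python
def build_dedup_query(table: str, key_column: str, keys, *, days_back: int):
    """Build a dedup lookup restricted to the last `days_back` days.

    `days_back` is keyword-only with no default, so callers cannot forget it.
    `table` and `key_column` are interpolated directly and must come from a
    trusted allow-list, never from user input.
    """
    days = int(days_back)
    if days <= 0:
        raise ValueError("days_back must be a positive number of days")
    sql = (
        f"SELECT {key_column} FROM {table} "
        f"WHERE {key_column} IN %(keys)s "
        f"AND created_at >= now() - INTERVAL {days} DAY"
    )
    return sql, {"keys": tuple(keys)}
```

Because the tables are ordered by created_at, the added condition lets ClickHouse skip old data instead of reading the whole key column.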

Detection:

ClickHouse query log shows dedup_filter queries scanning >1M rows with no created_at filter
Ingest endpoint latency increases week-over-week as table grows
duplicate count in ingest result drops to 0 even though the same crawler ran yesterday (dedup failing silently)

Phase: Phase 1 (dedup infrastructure). This is a data correctness issue, not a performance optimization. Fix before the refactored crawlers start pushing higher volumes.

Moderate Pitfalls

Pitfall 5: Sync `requests` Blocking the Async Event Loop

What goes wrong: The refactored crawler services use requests (sync) wrapped in asyncio.to_thread(). Under concurrent cleaning requests or batch company processing, the thread pool fills up. The CleaningService spawns 50 company cleanings in one scheduler tick, and each calls to_thread(boss_service.get_company_detail_by_id). asyncio.to_thread runs on the event loop's default executor, a ThreadPoolExecutor that defaults to min(32, os.cpu_count() + 4) worker threads (this limit comes from Python, not from uvicorn), so 20+ concurrent cleaning tasks can saturate the pool on a small machine, and subsequent calls queue behind them.

Why it happens: The crawler libraries (requests, requests_go) are synchronous. Wrapping them with to_thread is the correct pattern, but the thread pool limit is invisible until load hits it.

Prevention:

Use asyncio.Semaphore to cap concurrency of to_thread calls within the cleaning scheduler — cap at (thread_pool_size // 3) to leave headroom for other async operations
Alternatively, migrate to httpx.AsyncClient for the embedded service context (not the standalone scripts, which can stay synchronous)
Set UVICORN_WORKERS and thread pool size explicitly; document the relationship
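The semaphore cap can be sketched with the standard library alone. `run_bounded` is a hypothetical helper name; `func` stands in for a blocking call such as boss_service.get_company_detail_by_id.

```python
import asyncio


async def run_bounded(func, items, limit: int):
    """Run blocking func(item) in worker threads, at most `limit` at a time."""
    sem = asyncio.Semaphore(limit)

    async def worker(item):
        # The semaphore is acquired before a thread is requested, so waiting
        # tasks do not pile up inside the executor's queue.
        async with sem:
            return await asyncio.to_thread(func, item)

    # gather preserves input order in its result list.
    return await asyncio.gather(*(worker(item) for item in items))
```

With a default executor of N threads, choosing `limit` near N // 3 follows the headroom suggestion above.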

Detection:

Event loop lag exceeds 100ms during scheduled cleaning runs
asyncio.to_thread queue depth grows without bound during batch operations

Phase: Phase addressing CompanyCleaner and CleaningService concurrency. Low urgency until company processing batch size increases.

Pitfall 6: ReplacingMergeTree Dedup Is Eventual, Not Immediate

What goes wrong: pending_company uses ENGINE = ReplacingMergeTree(version). The assumption is that inserting the same (source, company_id) twice will result in one row. This is wrong at query time. ClickHouse only deduplicates during background merges, which happen asynchronously. Between insertion and the next merge, both rows exist and will appear in SELECT. The CompanyCleaner.collect_pending_companies() query that reads pending_company may return duplicate company IDs, causing duplicate processing and potentially duplicate entries in MySQL.

Why it happens: ReplacingMergeTree is commonly misunderstood as providing immediate deduplication. It does not. The documentation is clear, but the confusion is pervasive.

Prevention:

All queries against pending_company must use SELECT ... FROM pending_company FINAL to force merge semantics at query time
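Or: deduplicate explicitly with GROUP BY source, company_id and argMax(<column>, version) for each selected column (an aggregate such as max(created_at) is not allowed in a WHERE clause); FINAL is simpler and correct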
The insert logic should also guard against inserting the same (source, company_id) in the same batch (pre-filter in application code)
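Both guards can be sketched together: a read query that uses FINAL, and an application-side pre-filter for the insert batch. The query text and the helper name are illustrative; only the table name and the (source, company_id) key come from the text, and rows are assumed to be dicts.

```python
# Read side: FINAL collapses not-yet-merged duplicates at query time.
PENDING_QUERY = "SELECT source, company_id FROM pending_company FINAL"


def dedupe_batch(rows):
    """Keep the first row per (source, company_id) within one insert batch."""
    seen = set()
    unique = []
    for row in rows:
        key = (row["source"], row["company_id"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique
```

The pre-filter only removes duplicates inside a single batch; duplicates across batches still rely on FINAL (or an eventual merge) on the read side.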

Detection:

collect_pending_companies() returns the same company_id twice in one batch
MySQL company table gets duplicate entries from two collect runs in quick succession

Phase: Phase addressing company data pipeline. Must be fixed before increasing collect_pending_companies batch size.

Pitfall 7: The Three-Framework Problem Causes Refactoring Scope Creep

What goes wrong: The project currently has jobs_spider/, spiderJobs/, and app/services/crawler/ — three partially overlapping crawler implementations. During refactoring, there is a strong temptation to "also fix" the spiderJobs/ framework, or to make the new base class compatible with all three. This expands the scope from "refactor three platforms" to "unify three frameworks," which is a completely different project.

Why it happens: spiderJobs/ contains a cleaner architecture (core/, platforms/, runner/) that looks like the intended target. It is tempting to treat it as the destination rather than additional tech debt.

Prevention:

Make a single explicit decision at the start: spiderJobs/ is deleted or kept. Do not leave it in a half-migrated state.
If deleted: do it in the first commit of the refactoring phase. An empty spiderJobs/ with a DEPRECATED.md is fine. A half-migrated spiderJobs/ that is 60% replaced is not.
The refactoring target is the new _base.py, _boss_api.py, _boss_client.py, _boss_sign.py pattern already started in app/services/crawler/ — finish that pattern, don't restart with spiderJobs/

Detection:

A PR diff that touches spiderJobs/ files during "platform rewrite" work is scope creep
Story points/time estimate doubles from what was planned

Phase: Decision must be made and documented in Phase 1 before any platform work begins.

Pitfall 8: Testing Crawlers by Hitting Real APIs

What goes wrong: Unit tests for crawler code make real HTTP requests to Boss直聘, 前程无忧, or 智联招聘. This causes: (a) tests fail in CI with no network access; (b) real request volume from CI triggers rate limiting or bans the test account's IP; (c) test results are non-deterministic (platform API changes break tests unexpectedly); (d) tests cannot run at all without valid tokens.

Why it happens: There are currently zero crawler tests. When tests are first written, the path of least resistance is to write integration tests against real endpoints. This "just works" locally but creates all of the above problems.

Prevention:

All unit tests for crawler logic must use unittest.mock.patch or pytest-mock to mock the HTTP layer
Test the layers independently: sign algorithm (pure function, no HTTP needed), response parser (provide fixture JSON, no HTTP needed), IP manager state machine (mock requests.Session)
Maintain a tests/fixtures/ directory with saved real API response JSON (anonymized if needed) — test parsers against these fixtures, not live responses
A single integration test suite (marked @pytest.mark.integration) can hit real APIs, but it should be opt-in and never run in CI
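A mocked test along these lines exercises fetching and parsing without any network access. The parser, the fetch function, and the fixture payload shape are all hypothetical stand-ins; they do not reflect the real platform response format.

```python
from unittest.mock import MagicMock


def parse_job_list(payload: dict) -> list:
    """Hypothetical parser: pull job ids out of a saved response payload."""
    return [job["job_id"] for job in payload.get("jobs", [])]


def fetch_job_ids(session, url: str) -> list:
    """Fetch a page through an injected session and parse it."""
    response = session.get(url, timeout=10)
    response.raise_for_status()
    return parse_job_list(response.json())


def test_fetch_job_ids_uses_no_network():
    # The session is a mock, so no socket is ever opened.
    session = MagicMock()
    session.get.return_value.json.return_value = {
        "jobs": [{"job_id": "a1"}, {"job_id": "b2"}]
    }
    assert fetch_job_ids(session, "https://example.invalid/jobs") == ["a1", "b2"]
    session.get.assert_called_once_with("https://example.invalid/jobs", timeout=10)
```

Injecting the session keeps the HTTP layer replaceable; in practice the payload would be loaded from a file under tests/fixtures/ rather than written inline.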

Detection:

Test suite requires API_BASE_URL, BOSS_TOKEN, or network access to pass
Tests fail with ConnectionRefusedError or TimeoutError in CI

Phase: Phase establishing the test infrastructure. Write mocked tests first — before implementing the refactored code — following TDD discipline.

Minor Pitfalls

Pitfall 9: Log File Creation in CWD Breaks When CWD Changes

What goes wrong: Both jobs_spider/boss/boos_api.py and jobs_spider/qcwy/qcwy.py contain os.makedirs("logs", exist_ok=True) and logger.add("logs/log_{time:YYYY-MM-DD}.log", ...). These create a logs/ directory relative to the current working directory when the script is launched. If ECS scripts are run from a different directory (e.g., /root instead of /app), logs appear in an unexpected location, making debugging harder.

Prevention: Use pathlib.Path(__file__).parent / "logs" for script-relative log paths.
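A minimal sketch of the script-relative path, with a hypothetical helper name:

```python
from pathlib import Path


def script_log_dir(script_file: str) -> Path:
    """Return (and create) a logs/ directory next to the given script file."""
    log_dir = Path(script_file).resolve().parent / "logs"
    log_dir.mkdir(parents=True, exist_ok=True)
    return log_dir
```

A script would then pass `str(script_log_dir(__file__) / "log_{time:YYYY-MM-DD}.log")` to logger.add, so the log location no longer depends on the directory the process was launched from.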

Phase: Minor cleanup, address during standalone script refactoring.

Pitfall 10: Hardcoded Proxy URL Leaks Credentials

What goes wrong: jobs_spider/qcwy/qcwy.py line 31 contains PROXY_URL = "http://<user>:<password>@<host>:<port>" (value redacted here), a real proxy credential hardcoded in source. This is committed to the repo.

Prevention: Remove immediately. Require PROXY_URL via environment variable with no default. Rotate the credential.
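Requiring the variable with no default could look like this; the function name and the error message are illustrative.

```python
import os


def load_proxy_url(env=None) -> str:
    """Read PROXY_URL from the environment and fail fast when it is missing."""
    env = os.environ if env is None else env
    url = env.get("PROXY_URL", "").strip()
    if not url:
        # No fallback value: a missing variable must stop the crawler.
        raise RuntimeError("PROXY_URL is not set")
    return url
```

Failing at startup is deliberate: a silent fallback would either leak a default credential back into source or run the crawler without the expected proxy.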

Phase: Pre-refactoring security fix. Do not include in any commit that goes to a shared remote.

Pitfall 11: ClickHouse Schema Cannot Be Evolved Without Manual DDL

What goes wrong: All six data tables use CREATE TABLE IF NOT EXISTS. Adding a new column to the DDL (e.g., adding province to boss_job) does nothing on a production database where the table already exists. The refactored crawler may produce records with a new field that the ClickHouse schema does not have, causing insert failures or silent field drops.

Prevention: For each schema change, write an explicit ALTER TABLE ... ADD COLUMN IF NOT EXISTS statement in clickhouse_init.py's upgrade path. Run this on startup using a version-tracking approach. Do not rely on CREATE TABLE IF NOT EXISTS for evolution.
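One way to sketch a version-tracked upgrade path is an ordered list of migrations applied through an injected `execute` callable. The structure, the names, and the `province String DEFAULT ''` column definition are assumptions for illustration; how the current version is persisted is left open.

```python
# Ordered (version, statement) pairs; append new entries, never edit old ones.
MIGRATIONS = [
    (1, "ALTER TABLE job_data.boss_job ADD COLUMN IF NOT EXISTS province String DEFAULT ''"),
]


def pending_migrations(current_version: int):
    """Return migrations newer than current_version, in ascending order."""
    return [(v, sql) for v, sql in sorted(MIGRATIONS) if v > current_version]


def apply_migrations(execute, current_version: int) -> int:
    """Run each pending statement via execute(sql); return the new version."""
    for version, sql in pending_migrations(current_version):
        execute(sql)
        current_version = version
    return current_version
```

Since each statement uses ADD COLUMN IF NOT EXISTS, re-running a migration after a partial failure is harmless.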

Phase: Any phase that changes what fields the crawler stores. Medium urgency — only becomes critical when schema changes are needed.

Phase-Specific Warnings

Phase Topic	Likely Pitfall	Mitigation
Establish shared crawler core	Pitfall 2 (copy-paste instead of share)	Decide the import mechanism first; prove it works with one sign file before touching anything else
Rewrite Boss直聘 client	Pitfall 1 (break running crawler)	Keep old `boos_api.py` running; use env flag to switch; validate side-by-side
Rewrite Boss直聘 client	Pitfall 3 (anti-bot state lost)	Test the new client as a standalone script with a real keyword before embedding in FastAPI
ClickHouse dedup refactoring	Pitfall 4 (full table scan)	Add `days_back` parameter as non-optional; add integration test with time bounds
Company pipeline optimization	Pitfall 6 (ReplacingMergeTree eventual)	Add `FINAL` to all `pending_company` SELECT queries before increasing batch size
Write tests	Pitfall 8 (real API in tests)	Mock at the `HTTPClient.get/post` level, not at the platform API level
Delete duplicate framework	Pitfall 7 (scope creep)	Delete `spiderJobs/` explicitly in first PR; do not migrate it

Sources

All findings are grounded in direct codebase inspection of this repository:

/Users/win/2025/AICoding/JobData/.planning/codebase/CONCERNS.md — security and tech debt audit
/Users/win/2025/AICoding/JobData/app/services/ingest/dedup.py — dedup implementation
/Users/win/2025/AICoding/JobData/app/services/ingest/service.py — ingest service
/Users/win/2025/AICoding/JobData/app/core/clickhouse_init.py — schema DDL (MergeTree ORDER BY created_at, not dedup columns)
/Users/win/2025/AICoding/JobData/app/services/crawler/_boss_sign.py vs /Users/win/2025/AICoding/JobData/spiderJobs/platforms/boss/sign.py — confirmed verbatim duplication
/Users/win/2025/AICoding/JobData/app/services/crawler/_base.py — existing base class
/Users/win/2025/AICoding/JobData/app/services/crawler/_http_client.py — HTTP layer
/Users/win/2025/AICoding/JobData/app/services/cleaning.py — async service wrapping sync crawlers
/Users/win/2025/AICoding/JobData/jobs_spider/qcwy/qcwy.py — hardcoded proxy credential, sleep defaults differ from Boss
/Users/win/2025/AICoding/JobData/jobs_spider/boss/boos_api.py — 2281-line god file, SmartIPManager
/Users/win/2025/AICoding/JobData/.planning/PROJECT.md — refactoring constraints and goals
