Technology Stack: Recruitment Data Crawler Refactor
Project: JobData crawler interaction-layer refactor
Researched: 2026-03-21
Confidence: HIGH (existing code examined directly; external libraries verified against PyPI and official docs)
Summary Recommendation
Retain requests_go as the HTTP transport for crawlers: it already works, its latest release is from Dec 2025, and its TLS fingerprint impersonation is critical for Boss. Adopt httpx.AsyncClient inside the FastAPI process only, where the event loop already runs. Treat curl_cffi as the long-term replacement if requests_go shows maintenance gaps; it is the better-documented, more widely adopted alternative with the same fingerprint capabilities. Use tenacity for retry logic, pytest + respx + pytest-anyio for testing, and a hybrid deduplication strategy: an application-side pre-check with a 30-day recency window, backed by ClickHouse-side deduplication (query-time cleanup on the MergeTree job tables, eventual merge-time dedup on ReplacingMergeTree tables).
Recommended Stack
HTTP Client Layer
For External Crawler Scripts (jobs_spider/, spiderJobs/)
| Technology | Version | Purpose | Why |
|---|---|---|---|
| requests_go | 1.0.9 (Dec 2025) | Primary HTTP client with Chrome TLS impersonation | Already in use in spiderJobs/core/http_client.py and app/services/crawler/_http_client.py. Latest release Dec 2025 confirms active maintenance. Provides TLS_CHROME_LATEST with random_ja3=True, which is the correct countermeasure for Chrome 110+ JA3 randomization. API is identical to requests, so the existing HTTPClient wrapper works without changes. |
| curl_cffi | 0.7.x+ (Dec 2025) | Strategic future replacement | curl_cffi has broader community adoption, better documentation, and equivalent or superior fingerprint support (including HTTP/2 frame fingerprinting). If requests_go becomes unmaintained, migrate to curl_cffi. Both share a requests-compatible API. Do NOT switch mid-refactor; validate on a single platform first. |
Do NOT use plain requests for crawler HTTP — its default TLS stack (Python ssl module + OpenSSL) produces a fingerprint that platforms trivially detect. The existing requests imports in jobs_spider/boss/boos_api.py and jobs_spider/qcwy/qcwy.py are the root cause of the detection exposure in those legacy scripts.
Do NOT use aiohttp for crawlers — it offers no TLS fingerprint control and its API is more complex than requests/httpx for this use case.
For Backend Service Calls Inside FastAPI (app/services/crawler/)
| Technology | Version | Purpose | Why |
|---|---|---|---|
| httpx | 0.28.1 (already in Pipfile) | Async HTTP from within FastAPI event loop | app/services/crawler/ runs inside the FastAPI async event loop. requests_go is synchronous; calling it from async code via asyncio.to_thread() is identified in CONCERNS.md as a thread-pool exhaustion risk under high concurrency. httpx.AsyncClient eliminates this. The existing HTTPClient wrapper interface (get, post returning tuple[int, Any]) can be preserved with an async version. |
The two contexts (external scripts vs. embedded in FastAPI) have different constraints and should use different transports. The shared BaseFetcher/BaseSearcher base classes can define the interface; each deployment context plugs in the appropriate client.
TLS Fingerprint Strategy
The HTTPClient class in spiderJobs/core/http_client.py (and its copy in app/services/crawler/_http_client.py) correctly:
- Sets `TLS_CHROME_LATEST` as the TLS config
- Enables `random_ja3=True` to rotate JA3 hashes per Chrome 110+ behavior
- Recreates sessions for tunnel proxies to force new TCP connections
Keep this pattern. Do not remove random JA3: platforms apply JA3 allowlist/denylist heuristics, and randomization within the Chrome profile mimics real Chrome behavior.
Retry Logic
| Technology | Version | Purpose | Why |
|---|---|---|---|
| tenacity | already in Pipfile (Spider-specific) | Retry with exponential backoff + jitter | Already listed as a spider dependency. Supports async/sync equally. Provides stop_after_attempt, wait_exponential, wait_random_exponential, retry_if_exception_type. Replace all bare except: pass retry patterns in jobs_spider/ with structured tenacity decorators. |
Tenacity configuration pattern for crawlers:
```python
import httpx
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_random_exponential


@retry(
    stop=stop_after_attempt(3),
    wait=wait_random_exponential(multiplier=1, min=10, max=30),  # preserves 10s+ minimum
    retry=retry_if_exception_type((httpx.TimeoutException, httpx.NetworkError)),
    reraise=True,  # surface the original exception after the last attempt
)
def fetch_with_retry(*args, **kwargs):
    ...  # perform the request here; the listed exceptions trigger a retry
```
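The min=10 maps to the existing SLEEP_MIN_SECONDS constraint, so retries keep the anti-detection delay. tenacity applies this wait only between retry attempts of a failed call, so it replaces the manual sleeps inside retry loops; the pacing delay between successful requests (the current sleep_random_between() pattern) still has to run in the crawl loop. Also note that with multiplier=1 and three attempts the exponential term stays below the 10 s floor, so recent tenacity releases appear to wait a flat 10 s with no jitter; if jitter is wanted, raise multiplier or combine waits, for example wait_fixed(10) + wait_random(0, 20).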
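Since a retry wait only runs after a failed attempt, a separate pacing helper may still be wanted for the delay between successful requests. The following is a minimal sketch of such a helper, not the project's actual `sleep_random_between()`; the class name `RequestPacer`, the 30 s ceiling, and the injectable `sleep` and `rng` parameters are assumptions made for illustration and testability.

```python
import random
import time
from typing import Callable, Optional

SLEEP_MIN_SECONDS = 10.0  # floor named in the text above
SLEEP_MAX_SECONDS = 30.0  # assumed ceiling, mirroring max=30 in the retry config


class RequestPacer:
    """Hypothetical helper: random 10-30 s pause before every request but the first."""

    def __init__(
        self,
        min_s: float = SLEEP_MIN_SECONDS,
        max_s: float = SLEEP_MAX_SECONDS,
        sleep: Callable[[float], None] = time.sleep,
        rng: Optional[random.Random] = None,
    ) -> None:
        if not 0 <= min_s <= max_s:
            raise ValueError("expected 0 <= min_s <= max_s")
        self._min = min_s
        self._max = max_s
        self._sleep = sleep            # injectable so tests do not really sleep
        self._rng = rng or random.Random()
        self._first = True

    def wait(self) -> float:
        """Block for the pacing delay and return the number of seconds slept."""
        if self._first:
            self._first = False
            return 0.0
        delay = self._rng.uniform(self._min, self._max)
        self._sleep(delay)
        return delay
```

A crawl loop would call `pacer.wait()` immediately before each request, while the tenacity decorator handles only the failure path.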
Base Class Architecture
Recommendation: Keep the existing two-layer pattern from spiderJobs/ but make it the single source of truth.
The spiderJobs/ package has the best-designed structure:
- `core/base.py`: `ApiResult`, `BaseFetcher`, `BaseSearcher` (platform-agnostic)
- `core/http_client.py`: `HTTPClient` (transport-agnostic)
- `platforms/{boss,job51,zhilian}/{sign,client,api}.py`: per-platform implementations
The app/services/crawler/ directory is a copy of this (confirmed by the "复制自" ("copied from") comments). The duplication is the problem, not the pattern.
Resolution: Make spiderJobs/ the authoritative shared package. Install it as an editable package (pip install -e ./spiderJobs) in both the app environment and the standalone spider environment. Remove the copies in app/services/crawler/. This gives one codebase, one set of tests, and eliminates the current copy-paste maintenance burden.
If editable install is not viable for the Docker deployment, use a uv workspace or pyproject.toml at the repo root referencing spiderJobs as a local path dependency.
```toml
# pyproject.toml (repo root, new file)
[tool.uv.workspace]
members = ["spiderJobs", "app"]
```
The base class pattern itself (abstract _build_params, concrete fetch/search) is correct and should be preserved as-is.
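The pattern described here (abstract `_build_params`, concrete `fetch`) can be sketched as a template method with an injected client. This is an illustration only: the `ApiResult` fields, the `Transport` protocol, `DemoFetcher`, `FakeClient`, and the example URL are assumptions, not the actual definitions in `spiderJobs/core/base.py`.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Dict, List, Protocol, Tuple


class Transport(Protocol):
    """Anything with a get() returning (status_code, parsed_body)."""

    def get(self, url: str, params: Dict[str, Any]) -> Tuple[int, Any]: ...


@dataclass
class ApiResult:
    # Assumed fields; the real ApiResult may differ.
    ok: bool
    status: int
    data: Any = None


class BaseFetcher(ABC):
    url: str = ""

    def __init__(self, client: Transport) -> None:
        self._client = client

    @abstractmethod
    def _build_params(self, job_id: str) -> Dict[str, Any]:
        """Platform-specific request parameters."""

    def _parse(self, payload: Any) -> Any:
        """Default parser; platforms may override."""
        return payload

    def fetch(self, job_id: str) -> ApiResult:
        # Concrete flow shared by every platform: build, send, parse.
        status, payload = self._client.get(self.url, self._build_params(job_id))
        if status != 200:
            return ApiResult(ok=False, status=status)
        return ApiResult(ok=True, status=status, data=self._parse(payload))


class DemoFetcher(BaseFetcher):
    url = "https://example.invalid/job/detail"  # placeholder URL

    def _build_params(self, job_id: str) -> Dict[str, Any]:
        return {"id": job_id}


class FakeClient:
    """In-memory transport used in place of a real HTTP client."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, Dict[str, Any]]] = []

    def get(self, url: str, params: Dict[str, Any]) -> Tuple[int, Any]:
        self.calls.append((url, params))
        return 200, {"id": params["id"], "title": "demo"}
```

Because the fetcher only depends on the `get` signature, the same subclass can be exercised with a fake client in tests and with the real transport in production.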
Anti-Detection Libraries and Approaches
| Concern | Approach | Library |
|---|---|---|
| TLS fingerprint | Chrome JA3 impersonation with randomization | requests_go (TLS_CHROME_LATEST, random_ja3=True) |
| HTTP/2 fingerprint | Handled by requests_go/curl_cffi | Same library |
| IP blocking | Proxy pool rotation + tunnel proxy | Custom HTTPClient (already implemented) |
| Session fingerprint | Recreate session on IP switch | _new_session() pattern (already in HTTPClient) |
| Timing detection | Minimum 10s random delay between requests | Pacing sleep in the crawl loop; tenacity wait_random_exponential(min=10, max=30) for retry waits |
| Token expiry | Mark token invalid, fetch next from backend | Existing SmartIPManager pattern — keep |
| Anti-bot challenge (code=35/37) | Session rebuild + cookie refresh | Keep existing logic from BossZhipinAPI |
Do NOT add Playwright to the base crawler loop. Playwright adds 200-500ms startup overhead per page and is only justified for JavaScript-rendered anti-bot challenges. Reserve it for edge-case recovery only, not the main fetch path.
Data Deduplication at Insert Time
Strategy: Application-side pre-check with a 30-day recency window + engine-level backstop.
The current batch_dedup_filter in app/services/ingest/dedup.py does a SELECT key_col FROM table WHERE key_col IN (...) with no time bound. CONCERNS.md identifies this as a full-table scan on every ingest batch — a real performance problem as data grows.
Recommended approach:
```python
# Add PREWHERE recency filter to all dedup queries
query = (
    f"SELECT {key_col} FROM {table} "
    f"PREWHERE created_at >= now() - INTERVAL 30 DAY "  # <-- add this
    f"WHERE {key_col} IN {{keys:Array(String)}}"
)
```
Use PREWHERE (not WHERE) — ClickHouse evaluates PREWHERE before reading full column data, making it significantly cheaper on MergeTree tables. A 30-day window matches the assumption in company_cleaner.py's days_back logic and covers realistic re-crawl cycles.
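To make the time-bounded pre-check concrete, here is a small sketch of a query builder plus the in-memory filtering step. The function names `build_dedup_query` and `filter_new_rows`, the identifier check, and the `created_at` column are assumptions for illustration; the real `batch_dedup_filter` in `app/services/ingest/dedup.py` may be structured differently.

```python
import re
from typing import Any, Dict, Iterable, List

# Table and column names are interpolated into SQL, so restrict them to
# plain identifiers (an assumption; qualified names like db.table would
# need a looser check).
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_dedup_query(table: str, key_col: str, days: int = 30) -> str:
    """Return a ClickHouse query that looks up existing keys within a recency window."""
    for name in (table, key_col):
        if not _IDENT.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    if days <= 0:
        raise ValueError("days must be positive")
    return (
        f"SELECT {key_col} FROM {table} "
        f"PREWHERE created_at >= now() - INTERVAL {int(days)} DAY "
        f"WHERE {key_col} IN {{keys:Array(String)}}"
    )


def filter_new_rows(
    rows: Iterable[Dict[str, Any]], key_col: str, existing_keys: Iterable[str]
) -> List[Dict[str, Any]]:
    """Drop rows whose key already exists in the table or earlier in the same batch."""
    seen = set(existing_keys)
    fresh: List[Dict[str, Any]] = []
    for row in rows:
        key = str(row[key_col])
        if key in seen:
            continue
        seen.add(key)  # also removes duplicates inside the batch itself
        fresh.append(row)
    return fresh
```

The keys themselves still travel as a bound `Array(String)` parameter, so only the table and column names need the identifier check.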
For ClickHouse table engine: The existing MergeTree tables for job data use a composite order key (job_id, created_at or similar). This is correct. Plain MergeTree never deduplicates on its own, and FINAL only collapses rows on engines such as ReplacingMergeTree, so any duplicates that slip past the pre-check have to be cleaned at query time (for example with GROUP BY or LIMIT 1 BY on the key); the ORDER BY key keeps such queries cheap because rows with the same job_id are stored together. Do NOT switch to ReplacingMergeTree for job tables during this refactor: it adds background merge overhead and provides only eventual (not immediate) deduplication, while the application-side pre-check gives immediate guarantees at lower cost.
The existing pending_company table already uses ReplacingMergeTree(updated_at) with (source, company_id) dedup — this is correct and should stay.
Summary: No library change needed for deduplication. Fix the missing PREWHERE time-bound filter in dedup.py. Use PREWHERE over WHERE for all time-range filters in ClickHouse queries.
Testing Stack
| Technology | Version | Purpose | Why |
|---|---|---|---|
| pytest | latest stable (8.x) | Test runner | Standard; already used in tests/ (unittest style, but pytest-compatible) |
| pytest-anyio | 0.0.0+ (or anyio[trio]) | Async test support | Project uses anyio==4.8.0 already in Pipfile. pytest-anyio integrates directly with the anyio backend already present; anyio also ships its own pytest plugin, so the anyio marker is available wherever anyio is installed. Use @pytest.mark.anyio on async tests. |
| respx | 0.21.x | Mock httpx requests | Intercepts httpx.AsyncClient calls without hitting the network. Use in all tests that exercise app/services/crawler/ (the async versions). Pattern-matches on URL, method, headers. |
| unittest.mock | stdlib | Mock requests_go calls in external scripts | For tests of jobs_spider/ and spiderJobs/ which use requests_go (synchronous), standard unittest.mock.patch on requests_go.Session.get/post is sufficient. No extra library needed. |
| pytest-cov | 4.x | Coverage measurement | Enforce 80% coverage minimum on spiderJobs/core/, spiderJobs/platforms/*/sign.py, and all pure-function parsers. |
Testing patterns for crawlers:
- Sign algorithm tests: Pure functions, no HTTP, no mocks needed. `BossSign.generate_traceid()`, `ZhilianSign.sign_headers()`, and `_compute_checksum()` are all deterministic enough to test with snapshot assertions. Test these first; they are the highest-value, lowest-cost tests.
- API parser tests: Feed recorded real response JSON fixtures into `BaseFetcher._parse()` / platform-specific `_parse()` overrides. Verify `ApiResult` fields. Use `pytest.fixture` to load JSON from `tests/fixtures/`. These tests do not need HTTP at all.
- HTTPClient / BaseFetcher integration tests: Use `unittest.mock.patch` on `requests_go.Session` to inject fixed `(status_code, json)` return values. Verify the client calls the right URL, sends correct headers, and handles errors.
- Ingest/dedup tests: The `batch_dedup_filter` function takes an `AsyncClient`; use `respx` to mock ClickHouse query responses. Test: (a) single-column dedup filters correctly, (b) two-column dedup, (c) time-bound filter is present in generated query.
- Integration/smoke tests: One test per platform that replays a recorded full HTTP exchange (using `respx` route replay or the `responses` library) and verifies the full pipeline from `BaseFetcher.fetch()` through `build_insert_row()`. Not against a real server.
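The mock-based client test described in the list above might look like the following sketch. To keep it runnable without `requests_go` installed, a local `Session` class stands in for `requests_go.Session`, and `HTTPClient` here is a hypothetical minimal wrapper rather than the project's real class; in the actual test suite the patch target would be the real session class.

```python
from typing import Any, Tuple
from unittest import mock


class Session:
    """Stand-in for requests_go.Session (assumption, so the sketch has no third-party deps)."""

    def get(self, url: str, **kwargs: Any) -> Any:
        raise RuntimeError("network access is not allowed in tests")


class HTTPClient:
    """Hypothetical minimal wrapper that returns (status_code, json)."""

    def __init__(self, session: Session) -> None:
        self._session = session

    def get(self, url: str, **kwargs: Any) -> Tuple[int, Any]:
        resp = self._session.get(url, **kwargs)
        return resp.status_code, resp.json()


def check_client_get() -> Tuple[int, Any]:
    """Patch Session.get, call the client, and verify what was sent."""
    fake_resp = mock.Mock(status_code=200)
    fake_resp.json.return_value = {"code": 0}

    url = "https://example.invalid/search"  # placeholder URL
    headers = {"User-Agent": "demo"}

    with mock.patch.object(Session, "get", return_value=fake_resp) as fake_get:
        status, body = HTTPClient(Session()).get(
            url, params={"q": "python"}, headers=headers
        )

    # The wrapper must forward the URL, params, and headers unchanged.
    fake_get.assert_called_once_with(url, params={"q": "python"}, headers=headers)
    return status, body
```

In a pytest suite the body of `check_client_get` would live in a `test_*` function with the final values asserted instead of returned.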
Do NOT use pytest-asyncio alongside pytest-anyio — the two event loop managers conflict. The project already has anyio as a hard dependency (FastAPI uses it); pytest-anyio is the natural extension.
Existing test pattern: Current tests in tests/ use unittest.TestCase. These should be migrated to plain pytest style (no class inheritance) for simplicity, but this is not blocking.
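As an illustration of that migration, the sketch below shows the same determinism check written first as a `unittest.TestCase` and then in plain pytest style. `sign_params` is a hypothetical signer invented for the example (MD5 over sorted parameters, a secret, and an injected timestamp); it is not the algorithm used by `BossSign` or `ZhilianSign`.

```python
import hashlib
import unittest
from typing import Any, Dict


def sign_params(params: Dict[str, Any], secret: str, timestamp: int) -> str:
    """Hypothetical signer: the timestamp is passed in so the result is reproducible."""
    canonical = "&".join(f"{k}={params[k]}" for k in sorted(params))
    return hashlib.md5(f"{canonical}|{secret}|{timestamp}".encode()).hexdigest()


# Before: unittest style, as in the current tests/ directory.
class SignTest(unittest.TestCase):
    def test_deterministic(self) -> None:
        a = sign_params({"b": 2, "a": 1}, "s", 1700000000)
        b = sign_params({"a": 1, "b": 2}, "s", 1700000000)
        self.assertEqual(a, b)


# After: plain pytest style, no class inheritance, bare asserts.
def test_sign_is_deterministic_and_order_independent() -> None:
    a = sign_params({"b": 2, "a": 1}, "s", 1700000000)
    b = sign_params({"a": 1, "b": 2}, "s", 1700000000)
    assert a == b
    assert len(a) == 32  # MD5 hex digest length


def test_sign_changes_with_timestamp() -> None:
    a = sign_params({"a": 1}, "s", 1700000000)
    b = sign_params({"a": 1}, "s", 1700000001)
    assert a != b
```

pytest collects both styles, so the two can coexist while files are migrated one at a time.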
Code Organization for Shared Core
Problem: Three parallel implementations of the same thing exist:
- `jobs_spider/boss/boos_api.py`: 2281-line monolith (old)
- `spiderJobs/platforms/boss/`: split correctly into sign/client/api (new)
- `app/services/crawler/`: copy of `spiderJobs/` with import paths changed
Recommended layout after refactor:
```
spiderJobs/                      # Canonical shared package, install as editable dep
  core/
    base.py                      # ApiResult, BaseFetcher, BaseSearcher
    http_client.py               # HTTPClient (requests_go transport)
    http_client_async.py         # AsyncHTTPClient (httpx.AsyncClient transport) <-- NEW
  platforms/
    boss/
      sign.py                    # BossSign (pure, no HTTP)
      client.py                  # BossClient(HTTPClient)
      client_async.py            # BossAsyncClient(AsyncHTTPClient) <-- NEW for app embedding
      api.py                     # Boss search/fetch using BossClient
    job51/
      sign.py
      client.py
      api.py
    zhilian/
      sign.py
      client.py
      api.py
  runner/
    loop.py                      # Main crawl loop (keyword fetch → crawl → push)
    api_client.py                # Push to backend API

jobs_spider/                     # Entry-point scripts only, import from spiderJobs
  boss/main.py
  qcwy/main.py
  zhilian/main.py

app/services/crawler/            # App-embedded adapters, import from spiderJobs
  boss.py                        # BossService: thin wrapper using BossAsyncClient
  qcwy.py
  zhilian.py
```
The *_async.py variant of each client wraps httpx.AsyncClient with the same interface. Sign classes are transport-agnostic and shared.
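A rough sketch of how an async client variant can reuse a transport-agnostic sign class is shown below. All names here (`DemoSign`, `AsyncHTTPClient`, `DemoAsyncClient`, `fake_send`, the header name, and the URL) are hypothetical, and the transport is reduced to an injected coroutine so the example runs without `httpx`.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, Tuple

# Assumed transport shape: an async callable taking (url, headers) and
# returning (status_code, parsed_body).
Sender = Callable[[str, Dict[str, str]], Awaitable[Tuple[int, Any]]]


class DemoSign:
    """Pure signing logic with no HTTP, shared by sync and async clients."""

    @staticmethod
    def headers(path: str) -> Dict[str, str]:
        return {"x-demo-sign": f"sig:{path}"}


class AsyncHTTPClient:
    """Async counterpart of the sync wrapper, same (status, body) return shape."""

    def __init__(self, send: Sender) -> None:
        self._send = send

    async def get(self, url: str, headers: Dict[str, str]) -> Tuple[int, Any]:
        return await self._send(url, headers)


class DemoAsyncClient(AsyncHTTPClient):
    base = "https://example.invalid"  # placeholder host

    async def job_detail(self, job_id: str) -> Tuple[int, Any]:
        path = f"/job/{job_id}"
        # Signing is identical to what a sync client would do.
        return await self.get(self.base + path, DemoSign.headers(path))


async def fake_send(url: str, headers: Dict[str, str]) -> Tuple[int, Any]:
    """Test double that echoes the request instead of touching the network."""
    await asyncio.sleep(0)
    return 200, {"url": url, "sign": headers["x-demo-sign"]}
```

In the application, the injected sender would wrap an `httpx.AsyncClient`, awaiting `client.get(url, headers=headers)` and returning `resp.status_code, resp.json()`.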
Alternatives Considered
| Category | Recommended | Alternative | Why Not |
|---|---|---|---|
| TLS-aware HTTP client | requests_go (crawlers) | curl_cffi | Both equally capable. requests_go already integrated; switching mid-refactor adds risk. curl_cffi is the better long-term bet but not worth the disruption now. |
| Async HTTP (in-app) | httpx.AsyncClient | aiohttp | httpx already in Pipfile at 0.28.1. aiohttp API is more complex, no advantage here. |
| Async HTTP (in-app) | httpx.AsyncClient | requests_go async | requests_go's async mode is less documented; httpx is the FastAPI-idiomatic choice and has respx for testing. |
| Retry | tenacity | Manual sleep loops | Tenacity already in Pipfile; structured retries with jitter are safer and testable. |
| Dedup strategy | App-side + PREWHERE time window | ReplacingMergeTree only | ReplacingMergeTree is eventual (not immediate); duplicates appear in queries until merge runs. App-side gives immediate guarantees at the cost of one extra SELECT per batch — acceptable. |
| Dedup strategy | App-side + PREWHERE time window | Full-table scan (current) | Current approach is a correctness/performance regression as table grows past millions of rows. |
| Test mocking | respx (httpx) + unittest.mock (requests_go) | responses, pytest-mock | respx is the purpose-built httpx mock; unittest.mock is stdlib. Minimum new dependencies. |
| Async test runner | pytest-anyio | pytest-asyncio | Project already depends on anyio 4.8.0. Mixing pytest-asyncio creates event loop conflicts. |
| Shared code structure | Editable install (pip install -e) | File copies (current) | Current copy-paste approach (confirmed by "复制自" ("copied from") comments) guarantees drift. Editable install or uv workspace gives one truth. |
Versions and Installation
```toml
# No new major libraries needed for the refactor.
# Changes to existing dependencies in the Pipfile:

[packages]
requests_go = "==1.0.9"   # Pin to latest verified version
tenacity = "*"            # Already in Pipfile as spider-specific

[dev-packages]
pytest = ">=8.0"
pytest-anyio = "*"
respx = ">=0.21"
pytest-cov = ">=4.0"
```

```bash
# Install spiderJobs as editable (development):
pip install -e ./spiderJobs

# Or with uv workspace (production-safe):
# add pyproject.toml at repo root (see Base Class Architecture above)
```
Key Constraints Respected
- No new HTTP library for the main path: `requests_go` stays for external scripts. Zero regression risk.
- Async for in-app crawlers: `httpx` is already in Pipfile. The `asyncio.to_thread()` workaround in CONCERNS.md is eliminated.
- 10-second minimum delay preserved: `tenacity` `wait_random_exponential(min=10)` enforces this floor on retry waits; the pacing sleep between successful requests stays in the crawl loop.
- Dedup at insert time: Application-side `batch_dedup_filter` with time-bounded PREWHERE. No schema changes.
- Shared signing/parsing code: Single canonical `spiderJobs/` package, installed into both environments.
- ClickHouse table structure unchanged: No DDL changes required for any recommendation here.
Sources
- `requests_go` PyPI: latest release 1.0.9 (December 2025), pypi.org
- `curl_cffi` PyPI: latest release Dec 2025, Python 3.10+ required, pypi.org
- TLS/JA3 fingerprinting landscape 2025: scrapfly.io, capsolver.com
- `httpx` async performance (7x over sync `requests` in concurrent scenarios): proxy-cheap.com, plainenglish.io
- `respx` for `httpx` mocking: rednafi.com, keploy.io
- `pytest-anyio` vs `pytest-asyncio` tradeoffs: FastAPI docs, medium.com
- ClickHouse `ReplacingMergeTree` vs application-side dedup: clickhouse.com, chistadata.com, glassflow.dev
- `tenacity` async retry patterns: towardsai.net, leapcell.io
- uv workspaces for Python monorepos: medium.com
Stack research: 2026-03-21