win f3005ef525 docs: add research findings (stack, features, architecture, pitfalls, summary)

2026-03-21 16:36:37 +08:00

Technology Stack: Recruitment Data Crawler Refactor

Project: JobData crawler interaction layer refactor

Researched: 2026-03-21

Confidence: HIGH (existing code examined directly; external libraries verified against PyPI and official docs)

Summary Recommendation

Retain requests_go as the HTTP transport for crawlers: it already works, its latest release (Dec 2025) indicates active maintenance, and TLS fingerprint impersonation is critical for Boss. Adopt httpx.AsyncClient only inside the FastAPI process, where the event loop already runs. Treat curl_cffi as the long-term replacement if requests_go shows maintenance gaps; it is better documented and more widely adopted, with the same fingerprint capabilities. Use tenacity for retry logic and pytest + respx + anyio's pytest plugin for testing. For deduplication, use a hybrid strategy: an application-side pre-check with a 30-day recency window, with query-time deduplication over the MergeTree sort key as a backstop (plain MergeTree does not deduplicate on its own).

Recommended Stack

HTTP Client Layer

For External Crawler Scripts (`jobs_spider/`, `spiderJobs/`)

Technology	Version	Purpose	Why
`requests_go`	1.0.9 (Dec 2025)	Primary HTTP client with Chrome TLS impersonation	Already in use in `spiderJobs/core/http_client.py` and `app/services/crawler/_http_client.py`. Latest release Dec 2025 confirms active maintenance. Provides `TLS_CHROME_LATEST` with `random_ja3=True`, which is the correct countermeasure for Chrome 110+ JA3 randomization. API is identical to `requests`, so the existing `HTTPClient` wrapper works without changes.
`curl_cffi`	0.7.x+ (Dec 2025)	Strategic future replacement	`curl_cffi` has broader community adoption, better documentation, and equivalent or superior fingerprint support (including HTTP/2 frame fingerprinting). If `requests_go` becomes unmaintained, migrate to `curl_cffi`. Both share a `requests`-compatible API. Do NOT switch mid-refactor; validate on a single platform first.

Do NOT use plain requests for crawler HTTP — its default TLS stack (Python ssl module + OpenSSL) produces a fingerprint that platforms trivially detect. The existing requests imports in jobs_spider/boss/boos_api.py and jobs_spider/qcwy/qcwy.py are the root cause of the detection exposure in those legacy scripts.

Do NOT use aiohttp for crawlers — it offers no TLS fingerprint control and its API is more complex than requests/httpx for this use case.

For Backend Service Calls Inside FastAPI (`app/services/crawler/`)

Technology	Version	Purpose	Why
`httpx`	0.28.1 (already in Pipfile)	Async HTTP from within FastAPI event loop	`app/services/crawler/` runs inside the FastAPI async event loop. `requests_go` is synchronous; calling it from async code via `asyncio.to_thread()` is identified in CONCERNS.md as a thread-pool exhaustion risk under high concurrency. `httpx.AsyncClient` eliminates this. The existing `HTTPClient` wrapper interface (`get`, `post` returning `tuple[int, Any]`) can be preserved with an async version.

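The two contexts (external scripts vs. embedded in FastAPI) have different constraints and should use different transports. The shared BaseFetcher/BaseSearcher base classes can define the interface; each deployment context plugs in the appropriate client. Note that httpx uses Python's standard ssl stack and does not impersonate a browser TLS fingerprint, so in-app calls to fingerprint-sensitive platform endpoints carry the same detection exposure described above for plain requests.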
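One way to express that split is with `typing.Protocol`: both transports return the `(status, body)` tuple that the existing `HTTPClient` wrapper uses, and fetchers depend only on the protocol. This is a minimal sketch; `SyncTransport`, `AsyncTransport`, the two fake classes, and the `example.invalid` URL are hypothetical names for illustration, standing in for the real `requests_go` and `httpx.AsyncClient` wrappers.

```python
import asyncio
from typing import Any, Optional, Protocol


class SyncTransport(Protocol):
    """Shape of the blocking client used by standalone crawler scripts."""

    def get(self, url: str, params: Optional[dict] = None) -> tuple: ...


class AsyncTransport(Protocol):
    """Shape of the non-blocking client used inside the FastAPI process."""

    async def get(self, url: str, params: Optional[dict] = None) -> tuple: ...


class FakeSyncTransport:
    """Stand-in for a requests_go-backed client (hypothetical)."""

    def get(self, url, params=None):
        return 200, {"url": url, "params": params or {}}


class FakeAsyncTransport:
    """Stand-in for an httpx.AsyncClient-backed client (hypothetical)."""

    async def get(self, url, params=None):
        await asyncio.sleep(0)  # yield to the event loop, as real I/O would
        return 200, {"url": url, "params": params or {}}


def fetch_job(transport: SyncTransport, job_id: str) -> Any:
    status, body = transport.get("https://example.invalid/job", {"id": job_id})
    return body if status == 200 else None


async def fetch_job_async(transport: AsyncTransport, job_id: str) -> Any:
    status, body = await transport.get("https://example.invalid/job", {"id": job_id})
    return body if status == 200 else None


sync_body = fetch_job(FakeSyncTransport(), "42")
async_body = asyncio.run(fetch_job_async(FakeAsyncTransport(), "42"))
```

Because the two entry points differ only in `await`, signing and parsing code can stay shared while each deployment picks its own transport.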

TLS Fingerprint Strategy

The HTTPClient class in spiderJobs/core/http_client.py (and its copy in app/services/crawler/_http_client.py) correctly:

Sets TLS_CHROME_LATEST as the TLS config
Enables random_ja3=True to rotate JA3 hashes per Chrome 110+ behavior
Recreates sessions for tunnel proxies to force new TCP connections

Keep this pattern. Do not remove random JA3: platforms apply JA3 allowlist/denylist heuristics, and randomization within the Chrome profile mimics real Chrome behavior.
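The session-recreation step can be sketched without any network access by injecting the session factory. This is an illustration only: `TunnelAwareClient` and `FakeSession` are hypothetical names, and the real `HTTPClient` internals are not reproduced here. The sketch assumes the tunnel proxy picks the exit IP per TCP connection, which is why a reused keep-alive session would defeat rotation.

```python
from typing import Any, Callable


class TunnelAwareClient:
    """Builds a fresh session per request when a tunnel proxy is in use."""

    def __init__(self, session_factory: Callable[[], Any], use_tunnel: bool) -> None:
        self._session_factory = session_factory
        self._use_tunnel = use_tunnel
        self._session = session_factory()

    def _new_session(self) -> None:
        close = getattr(self._session, "close", None)
        if callable(close):
            close()  # drop pooled connections held by the old session
        self._session = self._session_factory()

    def get(self, url: str, **kwargs: Any) -> Any:
        if self._use_tunnel:
            self._new_session()  # new session, so the next request opens a new connection
        return self._session.get(url, **kwargs)


class FakeSession:
    """Counts how many sessions were created; stands in for a real session."""

    created = 0

    def __init__(self) -> None:
        FakeSession.created += 1
        self.closed = False

    def get(self, url: str, **kwargs: Any) -> str:
        return f"GET {url}"

    def close(self) -> None:
        self.closed = True


client = TunnelAwareClient(FakeSession, use_tunnel=True)
first = client.get("https://example.invalid/a")
second = client.get("https://example.invalid/b")
```

With `use_tunnel=False` the same client keeps one session for its lifetime, which is the cheaper choice for a fixed proxy.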

Retry Logic

Technology	Version	Purpose	Why
`tenacity`	already in Pipfile (Spider-specific)	Retry with exponential backoff + jitter	Already listed as a spider dependency. Supports async/sync equally. Provides `stop_after_attempt`, `wait_exponential`, `wait_random_exponential`, `retry_if_exception_type`. Replace all bare `except: pass` retry patterns in `jobs_spider/` with structured tenacity decorators.

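Tenacity configuration pattern for crawlers:

from tenacity import (
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_fixed,
    wait_random,
)
import httpx

@retry(
    stop=stop_after_attempt(3),
    # 10s floor plus up to 20s of jitter: every retry waits between 10s and 30s
    wait=wait_fixed(10) + wait_random(0, 20),
    # httpx types shown; use the transport's own timeout/connection errors for requests_go
    retry=retry_if_exception_type((httpx.TimeoutException, httpx.NetworkError)),
    reraise=True,
)
def fetch_with_retry(client: httpx.Client, url: str) -> httpx.Response:
    # Only transport-level failures are retried; HTTP error statuses are returned as-is
    return client.get(url)

The fixed 10-second component maps to the existing SLEEP_MIN_SECONDS constraint, and the random component adds up to 20 seconds of jitter, so every retry waits between 10 and 30 seconds. wait_random_exponential(multiplier=1, min=10, max=30) is not used here because its exponential term (1s, 2s, 4s) stays below the 10-second floor for the first three attempts, which leaves no usable jitter within a three-attempt budget. Tenacity only waits after a failed attempt, so it replaces the retry-side sleeps; pacing between successful requests still needs sleep_random_between() (or an equivalent) to preserve the anti-detection delay requirement.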
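To see what a bounded, jittered retry policy does without the decorator, here is a stdlib-only sketch of the same idea. `call_with_retry`, `SLEEP_MAX_SECONDS`, and the injectable `sleep` parameter are assumptions made for illustration (the injected sleep lets a test record waits instead of actually sleeping); it is not the project's implementation.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

SLEEP_MIN_SECONDS = 10.0  # floor named in the text
SLEEP_MAX_SECONDS = 30.0  # upper bound used in the text's example (assumed constant name)


def call_with_retry(
    func: Callable[[], T],
    attempts: int = 3,
    retry_on: tuple = (TimeoutError, ConnectionError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, waiting a random 10-30s between failed attempts."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(random.uniform(SLEEP_MIN_SECONDS, SLEEP_MAX_SECONDS))
    raise AssertionError("unreachable: attempts must be >= 1")


waits = []
calls = {"n": 0}


def flaky() -> str:
    # Fails twice with a simulated timeout, then succeeds
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"


result = call_with_retry(flaky, sleep=waits.append)
```

Passing `waits.append` as the sleep function records the two waits, so a unit test can assert the 10-second floor without slowing the suite down.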

Base Class Architecture

Recommendation: Keep the existing two-layer pattern from spiderJobs/ but make it the single source of truth.

The spiderJobs/ package has the best-designed structure:

core/base.py — ApiResult, BaseFetcher, BaseSearcher (platform-agnostic)
core/http_client.py — HTTPClient (transport-agnostic)
platforms/{boss,job51,zhilian}/{sign,client,api}.py — per-platform implementations

The app/services/crawler/ directory is a copy of this (confirmed by the "复制自" ("copied from") comments). The duplication is the problem, not the pattern.

Resolution: Make spiderJobs/ the authoritative shared package. Install it as an editable package (pip install -e ./spiderJobs) in both the app environment and the standalone spider environment. Remove the copies in app/services/crawler/. This gives one codebase, one set of tests, and eliminates the current copy-paste maintenance burden.

If editable install is not viable for the Docker deployment, use a uv workspace or pyproject.toml at the repo root referencing spiderJobs as a local path dependency.

# pyproject.toml (repo root, new file)
[tool.uv.workspace]
members = ["spiderJobs", "app"]

The base class pattern itself (abstract _build_params, concrete fetch/search) is correct and should be preserved as-is.
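A minimal sketch of that template-method shape, assuming field names for `ApiResult` and an injected `http_get` callable (both are illustrative; the real definitions live in `core/base.py` and are not reproduced here):

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class ApiResult:
    # Field names are assumed for this sketch
    ok: bool
    status: int
    data: Any = None
    error: Optional[str] = None


class BaseFetcher(ABC):
    url: str = ""

    def __init__(self, http_get: Callable[[str, dict], tuple]) -> None:
        self._http_get = http_get  # any callable returning (status, body)

    @abstractmethod
    def _build_params(self, item_id: str) -> dict:
        """Platform-specific query parameters."""

    def _parse(self, body: Any) -> Any:
        return body  # platforms override when the payload needs unwrapping

    def fetch(self, item_id: str) -> ApiResult:
        status, body = self._http_get(self.url, self._build_params(item_id))
        if status != 200:
            return ApiResult(ok=False, status=status, error=f"HTTP {status}")
        return ApiResult(ok=True, status=status, data=self._parse(body))


class DemoFetcher(BaseFetcher):
    """Hypothetical platform implementation."""

    url = "https://example.invalid/detail"

    def _build_params(self, item_id: str) -> dict:
        return {"jobId": item_id}

    def _parse(self, body: Any) -> Any:
        return body["job"]


def fake_get(url: str, params: dict) -> tuple:
    # Stand-in for the HTTP client; echoes the requested id back
    return 200, {"job": {"id": params["jobId"], "title": "Engineer"}}


result = DemoFetcher(fake_get).fetch("j1")
```

The concrete `fetch` never changes per platform; only `_build_params` and, where needed, `_parse` do, which is what keeps the per-platform modules small.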

Anti-Detection Libraries and Approaches

Concern	Approach	Library
TLS fingerprint	Chrome JA3 impersonation with randomization	`requests_go` (`TLS_CHROME_LATEST`, `random_ja3=True`)
HTTP/2 fingerprint	Handled by `requests_go`/`curl_cffi`	Same library
IP blocking	Proxy pool rotation + tunnel proxy	Custom `HTTPClient` (already implemented)
Session fingerprint	Recreate session on IP switch	`_new_session()` pattern (already in `HTTPClient`)
Timing detection	Minimum 10s random delay between requests	`sleep_random_between()` between requests; `tenacity` wait with a 10s floor between retries
Token expiry	Mark token invalid, fetch next from backend	Existing `SmartIPManager` pattern — keep
Anti-bot challenge (code=35/37)	Session rebuild + cookie refresh	Keep existing logic from `BossZhipinAPI`

Do NOT add Playwright to the base crawler loop. Playwright adds 200-500ms startup overhead per page and is only justified for JavaScript-rendered anti-bot challenges. Reserve it for edge-case recovery only, not the main fetch path.

Data Deduplication at Insert Time

Strategy: Application-side pre-check with a 30-day recency window, plus query-time deduplication on the sort key as a backstop.

The current batch_dedup_filter in app/services/ingest/dedup.py does a SELECT key_col FROM table WHERE key_col IN (...) with no time bound. CONCERNS.md identifies this as a full-table scan on every ingest batch — a real performance problem as data grows.

Recommended approach:

# Add PREWHERE recency filter to all dedup queries
query = (
    f"SELECT {key_col} FROM {table} "
    f"PREWHERE created_at >= now() - INTERVAL 30 DAY "   # <-- add this
    f"WHERE {key_col} IN {{keys:Array(String)}}"
)

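Use PREWHERE (not WHERE) for the time bound. ClickHouse evaluates PREWHERE first, reading only the columns it references, and then reads the remaining columns only for blocks that pass, which makes the filter significantly cheaper on MergeTree tables. ClickHouse can move WHERE conditions to PREWHERE automatically, but writing it explicitly keeps the behavior independent of server settings. A 30-day window matches the assumption in company_cleaner.py's days_back logic and covers realistic re-crawl cycles; a posting re-crawled after more than 30 days passes the pre-check and is inserted again.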
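Since the table and column names are interpolated into the SQL text, it may be worth wrapping the query construction in a small helper that validates identifiers and can be unit-tested for the time bound. The following is a sketch under assumptions: `build_dedup_query`, the `days` parameter, and the `jobs` table name are hypothetical, and only the key list is passed as a server-side parameter, as in the snippet above.

```python
import re

# Plain or db-qualified identifier: letters, digits, underscores, at most one dot
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?$")


def build_dedup_query(table: str, key_col: str, days: int = 30) -> str:
    """Return a time-bounded dedup lookup with a server-side `keys` parameter."""
    for name in (table, key_col):
        if not _IDENTIFIER.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    if days <= 0:
        raise ValueError("days must be positive")
    return (
        f"SELECT {key_col} FROM {table} "
        f"PREWHERE created_at >= now() - INTERVAL {int(days)} DAY "
        f"WHERE {key_col} IN {{keys:Array(String)}}"
    )


query = build_dedup_query("jobs", "job_id")
```

A test can then assert that `PREWHERE created_at` appears in the generated text, which covers the "time-bound filter is present" case without a ClickHouse server.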

For the ClickHouse table engine: the existing MergeTree tables for job data use a composite order key (job_id, created_at or similar). This is correct. MergeTree does not auto-deduplicate, but because rows are sorted by that key within each part, a query-time GROUP BY or LIMIT 1 BY on the key can clean up duplicates if needed. The FINAL modifier does not help here: it collapses rows only on ReplacingMergeTree-family engines. Do NOT switch to ReplacingMergeTree for job tables during this refactor: it adds background merge overhead and provides only eventual (not immediate) deduplication, while the application-side pre-check filters duplicates before they are written, at lower cost.

The existing pending_company table already uses ReplacingMergeTree(updated_at) with (source, company_id) dedup — this is correct and should stay.

Summary: No library change needed for deduplication. Fix the missing PREWHERE time-bound filter in dedup.py. Use PREWHERE over WHERE for all time-range filters in ClickHouse queries.
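The application-side half of the strategy reduces to a set-membership filter once the existing keys have been fetched. A minimal sketch follows; `filter_new_rows` is a hypothetical helper, not the signature of `batch_dedup_filter`, and `existing_keys` is assumed to be the result of the time-bounded SELECT.

```python
from typing import Any, Iterable


def filter_new_rows(
    rows: Iterable[dict],
    key_col: str,
    existing_keys: set,
) -> list:
    """Drop rows already stored and repeats inside the same batch."""
    seen = set(existing_keys)  # copy so the caller's set is not mutated
    fresh = []
    for row in rows:
        key = str(row[key_col])
        if key in seen:
            continue
        seen.add(key)  # also guards against duplicates within this batch
        fresh.append(row)
    return fresh


batch = [
    {"job_id": "a1", "title": "Backend"},
    {"job_id": "b2", "title": "Data"},
    {"job_id": "a1", "title": "Backend (repeat)"},
    {"job_id": "c3", "title": "QA"},
]
new_rows = filter_new_rows(batch, "job_id", existing_keys={"b2"})
```

Deduplicating within the batch matters because the SELECT only sees rows that were already inserted, not repeats arriving together.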

Testing Stack

Technology	Version	Purpose	Why
`pytest`	latest stable (8.x)	Test runner	Standard; already used in `tests/` (unittest style, but pytest-compatible)
`anyio` pytest plugin	bundled with `anyio` 4.8.0	Async test support	Project uses `anyio==4.8.0` already in Pipfile, and the pytest plugin ships inside `anyio` itself; the `pytest-anyio` package on PyPI is only a 0.0.0 placeholder, so no extra install is needed. Use `@pytest.mark.anyio` on async tests.
`respx`	0.22+	Mock `httpx` requests	Intercepts `httpx.AsyncClient` calls without hitting the network. Use in all tests that exercise `app/services/crawler/` (the async versions). Pattern-matches on URL, method, headers. Version 0.22 or later is needed for compatibility with `httpx` 0.28.
`unittest.mock`	stdlib	Mock `requests_go` calls in external scripts	For tests of `jobs_spider/` and `spiderJobs/` which use `requests_go` (synchronous), standard `unittest.mock.patch` on `requests_go.Session.get/post` is sufficient. No extra library needed.
`pytest-cov`	4.x	Coverage measurement	Enforce 80% coverage minimum on `spiderJobs/core/`, `spiderJobs/platforms/*/sign.py`, and all pure-function parsers.

Testing patterns for crawlers:

Sign algorithm tests — Pure functions, no HTTP, no mocks needed. BossSign.generate_traceid(), ZhilianSign.sign_headers(), _compute_checksum() are all deterministic enough to test with snapshot assertions. Test these first — they are the highest-value, lowest-cost tests.
API parser tests — Feed recorded real response JSON fixtures into BaseFetcher._parse() / platform-specific _parse() overrides. Verify ApiResult fields. Use pytest.fixture to load JSON from tests/fixtures/. These tests do not need HTTP at all.
HTTPClient / BaseFetcher integration tests — Use unittest.mock.patch on requests_go.Session to inject fixed (status_code, json) return values. Verify the client calls the right URL, sends correct headers, and handles errors.
Ingest/dedup tests — The batch_dedup_filter function takes an AsyncClient — use respx to mock ClickHouse query responses. Test: (a) single-column dedup filters correctly, (b) two-column dedup, (c) time-bound filter is present in generated query.
Integration/smoke tests — One test per platform that replays a recorded full HTTP exchange (using respx route replay or responses library) and verifies the full pipeline from BaseFetcher.fetch() through build_insert_row(). Not against a real server.
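The first and third patterns above can be sketched with the standard library alone. Everything here is a stand-in: `sign_headers` is a toy HMAC signer rather than any platform's real algorithm, and `MiniClient` is a hypothetical reduction of `HTTPClient` to the `(status_code, json)` contract.

```python
import hashlib
import hmac
from unittest.mock import Mock


def sign_headers(path: str, timestamp: int, secret: bytes) -> dict:
    """Toy signer (hypothetical): HMAC-SHA256 over path and timestamp."""
    message = f"{path}|{timestamp}".encode()
    digest = hmac.new(secret, message, hashlib.sha256).hexdigest()
    return {"x-ts": str(timestamp), "x-sign": digest}


class MiniClient:
    """Tiny stand-in for HTTPClient: returns (status_code, json)."""

    def __init__(self, session, secret: bytes) -> None:
        self._session = session
        self._secret = secret

    def get(self, base: str, path: str, timestamp: int):
        headers = sign_headers(path, timestamp, self._secret)
        response = self._session.get(base + path, headers=headers, timeout=15)
        return response.status_code, response.json()


def test_sign_is_deterministic():
    # Pure function: same inputs must give the same headers, no mocks needed
    a = sign_headers("/job/1", 1700000000, b"k")
    b = sign_headers("/job/1", 1700000000, b"k")
    assert a == b
    assert a["x-ts"] == "1700000000"
    assert len(a["x-sign"]) == 64  # hex-encoded SHA-256


def test_client_sends_signed_request():
    # The session is mocked, so no network traffic occurs
    session = Mock()
    session.get.return_value = Mock(status_code=200, json=Mock(return_value={"ok": True}))
    client = MiniClient(session, secret=b"k")

    status, body = client.get("https://example.invalid", "/job/1", 1700000000)

    assert (status, body) == (200, {"ok": True})
    session.get.assert_called_once_with(
        "https://example.invalid/job/1",
        headers=sign_headers("/job/1", 1700000000, b"k"),
        timeout=15,
    )
```

Passing the timestamp in as an argument, rather than reading the clock inside the signer, is what makes the snapshot-style assertion possible.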

Do NOT mix pytest-asyncio with the anyio pytest plugin unless pytest-asyncio runs in strict mode: in auto mode both plugins try to manage the same async tests and conflict. The project already has anyio as a hard dependency (FastAPI uses it), so its bundled plugin is the natural choice.

Existing test pattern: Current tests in tests/ use unittest.TestCase. These should be migrated to plain pytest style (no class inheritance) for simplicity, but this is not blocking.

Code Organization for Shared Core

Problem: Three parallel implementations of the same thing exist:

jobs_spider/boss/boos_api.py — 2281-line monolith (old)
spiderJobs/platforms/boss/ — split correctly into sign/client/api (new)
app/services/crawler/ — copy of spiderJobs/ with import paths changed

Recommended layout after refactor:

spiderJobs/              # Canonical shared package — install as editable dep
  core/
    base.py              # ApiResult, BaseFetcher, BaseSearcher
    http_client.py       # HTTPClient (requests_go transport)
    http_client_async.py # AsyncHTTPClient (httpx.AsyncClient transport)  <-- NEW
  platforms/
    boss/
      sign.py            # BossSign (pure, no HTTP)
      client.py          # BossClient(HTTPClient)
      client_async.py    # BossAsyncClient(AsyncHTTPClient)  <-- NEW for app embedding
      api.py             # Boss search/fetch using BossClient
    job51/
      sign.py
      client.py
      api.py
    zhilian/
      sign.py
      client.py
      api.py
  runner/
    loop.py              # Main crawl loop (keyword fetch → crawl → push)
    api_client.py        # Push to backend API

jobs_spider/             # Entry-point scripts only, import from spiderJobs
  boss/main.py
  qcwy/main.py
  zhilian/main.py

app/services/crawler/    # App-embedded adapters, import from spiderJobs
  boss.py                # BossService: thin wrapper using BossAsyncClient
  qcwy.py
  zhilian.py

The *_async.py variant of each client wraps httpx.AsyncClient with the same interface. Sign classes are transport-agnostic and shared.

Alternatives Considered

Category	Recommended	Alternative	Why Not
TLS-aware HTTP client	`requests_go` (crawlers)	`curl_cffi`	Both equally capable. `requests_go` already integrated; switching mid-refactor adds risk. `curl_cffi` is the better long-term bet but not worth the disruption now.
Async HTTP (in-app)	`httpx.AsyncClient`	`aiohttp`	`httpx` already in Pipfile at 0.28.1. `aiohttp` API is more complex, no advantage here.
Async HTTP (in-app)	`httpx.AsyncClient`	`requests_go` async	`requests_go`'s async mode is less documented; `httpx` is the FastAPI-idiomatic choice and has `respx` for testing.
Retry	`tenacity`	Manual sleep loops	Tenacity already in Pipfile; structured retries with jitter are safer and testable.
Dedup strategy	App-side + PREWHERE time window	ReplacingMergeTree only	ReplacingMergeTree is eventual (not immediate); duplicates appear in queries until merge runs. App-side gives immediate guarantees at the cost of one extra SELECT per batch — acceptable.
Dedup strategy	App-side + PREWHERE time window	Full-table scan (current)	Current approach becomes a growing performance problem as the table grows past millions of rows.
Test mocking	`respx` (httpx) + `unittest.mock` (requests_go)	`responses`, `pytest-mock`	`respx` is the purpose-built `httpx` mock; `unittest.mock` is stdlib. Minimum new dependencies.
Async test runner	`anyio` pytest plugin	`pytest-asyncio`	Project already depends on `anyio` 4.8.0, which bundles the plugin. Running `pytest-asyncio` in auto mode alongside it creates event loop conflicts.
Shared code structure	Editable install (`pip install -e`)	File copies (current)	Current copy-paste approach (confirmed by the "复制自" ("copied from") comments) guarantees drift. Editable install or uv workspace gives a single source of truth.

Versions and Installation

# No new major libraries are needed for the refactor.
# Changes to existing dependencies:

# Pipfile [packages]:
requests_go = "==1.0.9"     # Pin to latest verified version
tenacity = "*"               # Already in Pipfile as spider-specific

# Pipfile [dev-packages]:
pytest = ">=8.0"
pytest-cov = ">=4.0"
respx = ">=0.22"             # 0.22+ for compatibility with httpx 0.28
# Async tests: anyio's pytest plugin ships with anyio (already in Pipfile); no separate package needed

# Install spiderJobs as editable (development):
pip install -e ./spiderJobs

# Or with uv workspace (production-safe):
# Add pyproject.toml at repo root (see Base Class Architecture section above)

Key Constraints Respected

No new HTTP library for the main path: requests_go stays for external scripts, so the transport itself introduces no regression risk.
Async for in-app crawlers — httpx is already in Pipfile. The asyncio.to_thread() workaround in CONCERNS.md is eliminated.
10-second minimum delay preserved: the tenacity wait strategy keeps a 10-second floor between retries, and the existing inter-request sleep keeps the same floor between successful requests.
Dedup at insert time — Application-side batch_dedup_filter with time-bounded PREWHERE. No schema changes.
Shared signing/parsing code — Single canonical spiderJobs/ package, installed into both environments.
ClickHouse table structure unchanged — No DDL changes required for any recommendation here.

Sources

requests_go PyPI: latest release 1.0.9 (December 2025) — pypi.org
curl_cffi PyPI: latest release Dec 2025, Python 3.10+ required — pypi.org
TLS/JA3 fingerprinting landscape 2025: scrapfly.io, capsolver.com
httpx async performance (7x over sync requests in concurrent scenarios): proxy-cheap.com, plainenglish.io
respx for httpx mocking: rednafi.com, keploy.io
pytest-anyio vs pytest-asyncio tradeoffs: FastAPI docs, medium.com
ClickHouse ReplacingMergeTree vs application-side dedup: clickhouse.com, chistadata.com, glassflow.dev
tenacity async retry patterns: towardsai.net, leapcell.io
uv workspaces for Python monorepos: medium.com

Stack research: 2026-03-21
