
Technology Stack: Recruitment Data Crawler Refactor

Project: JobData crawler interaction layer refactor
Researched: 2026-03-21
Confidence: HIGH (existing code examined directly; external libraries verified against PyPI and official docs)


Summary Recommendation

Retain requests_go as the HTTP transport for crawlers: it already works, is actively maintained through Dec 2025, and provides the TLS fingerprint impersonation that is critical for Boss. Adopt httpx.AsyncClient inside the FastAPI process only, where the event loop already runs. Treat curl_cffi as the long-term strategic replacement if requests_go shows maintenance gaps; it is the better-documented, more widely adopted alternative with the same fingerprint capabilities. Use tenacity for retry logic, pytest + respx + pytest-anyio for testing, and a hybrid deduplication strategy: an application-side pre-check with a 30-day recency window, with query-time deduplication over the MergeTree sort key as a backstop (plain MergeTree does not deduplicate by itself).


HTTP Client Layer

For External Crawler Scripts (jobs_spider/, spiderJobs/)

| Technology | Version | Purpose | Why |
|---|---|---|---|
| requests_go | 1.0.9 (Dec 2025) | Primary HTTP client with Chrome TLS impersonation | Already in use in spiderJobs/core/http_client.py and app/services/crawler/_http_client.py. Latest release Dec 2025 confirms active maintenance. Provides TLS_CHROME_LATEST with random_ja3=True, which is the correct countermeasure for Chrome 110+ JA3 randomization. API is identical to requests, so the existing HTTPClient wrapper works without changes. |
| curl_cffi | 0.7.x+ (Dec 2025) | Strategic future replacement | curl_cffi has broader community adoption, better documentation, and equivalent or superior fingerprint support (including HTTP/2 frame fingerprinting). If requests_go becomes unmaintained, migrate to curl_cffi. Both share a requests-compatible API. Do NOT switch mid-refactor; validate on a single platform first. |

Do NOT use plain requests for crawler HTTP — its default TLS stack (Python ssl module + OpenSSL) produces a fingerprint that platforms trivially detect. The existing requests imports in jobs_spider/boss/boos_api.py and jobs_spider/qcwy/qcwy.py are the root cause of the detection exposure in those legacy scripts.

Do NOT use aiohttp for crawlers — it offers no TLS fingerprint control and its API is more complex than requests/httpx for this use case.

For Backend Service Calls Inside FastAPI (app/services/crawler/)

| Technology | Version | Purpose | Why |
|---|---|---|---|
| httpx | 0.28.1 (already in Pipfile) | Async HTTP from within FastAPI event loop | app/services/crawler/ runs inside the FastAPI async event loop. requests_go is synchronous; calling it from async code via asyncio.to_thread() is identified in CONCERNS.md as a thread-pool exhaustion risk under high concurrency. httpx.AsyncClient eliminates this. The existing HTTPClient wrapper interface (get, post returning tuple[int, Any]) can be preserved with an async version. |

The two contexts (external scripts vs. embedded in FastAPI) have different constraints and should use different transports. The shared BaseFetcher/BaseSearcher base classes can define the interface; each deployment context plugs in the appropriate client.
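As a minimal sketch of that split, the base class can depend only on a small transport interface while each deployment context supplies its own implementation. The Transport protocol, FakeTransport, and the URL below are hypothetical names for illustration; only the (status, body) return shape comes from the existing HTTPClient wrapper.

```python
from typing import Any, Protocol


class Transport(Protocol):
    """Hypothetical interface mirroring the wrapper's get() -> (status, body) shape."""

    def get(self, url: str, params: dict[str, Any]) -> tuple[int, Any]: ...


class BaseFetcher:
    """Platform-agnostic fetcher that only knows about the Transport interface."""

    def __init__(self, transport: Transport) -> None:
        self._transport = transport

    def fetch(self, url: str, params: dict[str, Any]) -> Any:
        status, body = self._transport.get(url, params)
        if status != 200:
            raise RuntimeError(f"unexpected status {status} for {url}")
        return body


class FakeTransport:
    """Stand-in for a requests_go-backed client; records calls for inspection."""

    def __init__(self) -> None:
        self.calls: list[tuple[str, dict[str, Any]]] = []

    def get(self, url: str, params: dict[str, Any]) -> tuple[int, Any]:
        self.calls.append((url, params))
        return 200, {"jobs": []}


transport = FakeTransport()
fetcher = BaseFetcher(transport)
result = fetcher.fetch("https://example.invalid/search", {"page": 1})
```

Because BaseFetcher never imports a concrete HTTP library, the same class could be handed a requests_go-backed transport in the standalone scripts and a different one inside the app.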

TLS Fingerprint Strategy

The HTTPClient class in spiderJobs/core/http_client.py (and its copy in app/services/crawler/_http_client.py) correctly:

  • Sets TLS_CHROME_LATEST as the TLS config
  • Enables random_ja3=True to rotate JA3 hashes per Chrome 110+ behavior
  • Recreates sessions for tunnel proxies to force new TCP connections

Keep this pattern. Do not remove random JA3 — platforms have JA3 allowlists/denylist heuristics, and randomization within the Chrome profile mimics real Chrome behavior.
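The session-recreation point can be sketched without any HTTP library, assuming a tunnel proxy assigns the exit IP per TCP connection so that a fresh session forces a fresh connection. SessionHolder and its session_factory argument are hypothetical; the real logic lives in the _new_session() pattern of HTTPClient.

```python
from typing import Callable, Generic, TypeVar

S = TypeVar("S")


class SessionHolder(Generic[S]):
    """Keeps one live session and replaces it on demand (hypothetical helper)."""

    def __init__(self, session_factory: Callable[[], S]) -> None:
        self._factory = session_factory
        self._session: S = session_factory()
        self.rebuilds = 0

    def current(self) -> S:
        # Reuse the live session, and with it the pooled TCP connection.
        return self._session

    def rotate(self) -> S:
        # Drop the old session so the next request opens a new connection
        # instead of reusing one tied to the previous exit IP.
        self._session = self._factory()
        self.rebuilds += 1
        return self._session


holder = SessionHolder(session_factory=object)  # object() stands in for a real session
first = holder.current()
again = holder.current()
rotated = holder.rotate()
```

In real use the factory would build the configured requests_go session; it is passed in here so the rotation logic stays testable on its own.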


Retry Logic

| Technology | Version | Purpose | Why |
|---|---|---|---|
| tenacity | already in Pipfile (spider-specific) | Retry with exponential backoff + jitter | Already listed as a spider dependency. Supports async/sync equally. Provides stop_after_attempt, wait_exponential, wait_random_exponential, retry_if_exception_type. Replace all bare except: pass retry patterns in jobs_spider/ with structured tenacity decorators. |

Tenacity configuration pattern for crawlers:

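import httpx
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_random_exponential

@retry(
    stop=stop_after_attempt(3),  # one initial call plus up to two retries
    wait=wait_random_exponential(multiplier=1, min=10, max=30),  # intended 10s+ minimum
    retry=retry_if_exception_type((httpx.TimeoutException, httpx.NetworkError)),  # transient errors only
    reraise=True,  # surface the original exception instead of tenacity.RetryError
)
def fetch_with_retry(client: httpx.Client, url: str) -> httpx.Response:
    """Illustrative signature: GET the URL, retrying on timeouts and network errors."""
    return client.get(url)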

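The min=10 maps to the existing SLEEP_MIN_SECONDS constraint. This replaces the current sleep_random_between() manual sleep pattern for retries while preserving the anti-detection delay requirement. Two caveats apply. First, tenacity waits only between failed attempts, so the regular pacing delay between successful requests still needs its own sleep. Second, whether min acts as a hard lower bound for wait_random_exponential depends on the tenacity version, so verify it against the pinned release or compose wait_fixed(10) + wait_random(0, 20) instead.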
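The intended timing (a hard 10-second floor plus capped, exponentially widening jitter) can be written out in plain Python to make the requirement explicit. This is a standard-library sketch of the target behavior, not tenacity's implementation; retry_delay and its parameters are hypothetical names.

```python
import random
from typing import Optional


def retry_delay(attempt: int, floor: float = 10.0, cap: float = 30.0,
                rng: Optional[random.Random] = None) -> float:
    """Delay before retry number `attempt` (1-based): floor plus capped jitter."""
    rng = rng or random.Random()
    # The jitter window doubles per attempt (2s, 4s, 8s, ...) but never
    # pushes the total delay above `cap`.
    window = min(cap - floor, 2.0 ** attempt)
    return floor + rng.uniform(0.0, window)


# Seeded generators keep the sample reproducible.
delays = [retry_delay(n, rng=random.Random(n)) for n in range(1, 8)]
```

Every value in delays falls between the floor and the cap, which is the property any tenacity wait configuration chosen for the crawlers should also satisfy.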


Base Class Architecture

Recommendation: Keep the existing two-layer pattern from spiderJobs/ but make it the single source of truth.

The spiderJobs/ package has the best-designed structure:

  • core/base.py: ApiResult, BaseFetcher, BaseSearcher (platform-agnostic)
  • core/http_client.py: HTTPClient (transport-agnostic)
  • platforms/{boss,job51,zhilian}/{sign,client,api}.py: per-platform implementations

The app/services/crawler/ directory is a copy of this (confirmed by the "复制自", i.e. "copied from", comments). The duplication is the problem, not the pattern.

Resolution: Make spiderJobs/ the authoritative shared package. Install it as an editable package (pip install -e ./spiderJobs) in both the app environment and the standalone spider environment. Remove the copies in app/services/crawler/. This gives one codebase, one set of tests, and eliminates the current copy-paste maintenance burden.

If editable install is not viable for the Docker deployment, use a uv workspace or pyproject.toml at the repo root referencing spiderJobs as a local path dependency.

# pyproject.toml (repo root, new file)
[tool.uv.workspace]
members = ["spiderJobs", "app"]

The base class pattern itself (abstract _build_params, concrete fetch/search) is correct and should be preserved as-is.
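A minimal sketch of that shape, with an abstract _build_params and a concrete search, might look as follows. The ApiResult fields, the injected send callable, and DemoSearcher are assumptions for illustration; the real definitions in core/base.py are not reproduced in this document.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ApiResult:
    """Hypothetical shape; the real ApiResult fields may differ."""
    ok: bool
    data: Any


class BaseSearcher(ABC):
    def __init__(self, send: Callable[[dict[str, Any]], tuple[int, Any]]) -> None:
        self._send = send  # injected transport call returning (status, body)

    @abstractmethod
    def _build_params(self, keyword: str, page: int) -> dict[str, Any]:
        """Platform-specific query parameters."""

    def search(self, keyword: str, page: int = 1) -> ApiResult:
        # Concrete flow shared by all platforms: build, send, wrap.
        status, body = self._send(self._build_params(keyword, page))
        return ApiResult(ok=(status == 200), data=body)


class DemoSearcher(BaseSearcher):
    def _build_params(self, keyword: str, page: int) -> dict[str, Any]:
        return {"query": keyword, "page": page, "pageSize": 30}


searcher = DemoSearcher(send=lambda params: (200, {"echo": params}))
result = searcher.search("python", page=2)
```

Each platform then overrides only the parameter-building step, and the shared request flow has a single implementation to test.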


Anti-Detection Libraries and Approaches

| Concern | Approach | Library |
|---|---|---|
| TLS fingerprint | Chrome JA3 impersonation with randomization | requests_go (TLS_CHROME_LATEST, random_ja3=True) |
| HTTP/2 fingerprint | Handled by requests_go/curl_cffi | Same library |
| IP blocking | Proxy pool rotation + tunnel proxy | Custom HTTPClient (already implemented) |
| Session fingerprint | Recreate session on IP switch | _new_session() pattern (already in HTTPClient) |
| Timing detection | Minimum 10s random delay between requests | tenacity wait_random_exponential(min=10, max=30) |
| Token expiry | Mark token invalid, fetch next from backend | Existing SmartIPManager pattern (keep) |
| Anti-bot challenge (code=35/37) | Session rebuild + cookie refresh | Keep existing logic from BossZhipinAPI |

Do NOT add Playwright to the base crawler loop. Playwright adds 200-500ms startup overhead per page and is only justified for JavaScript-rendered anti-bot challenges. Reserve it for edge-case recovery only, not the main fetch path.


Data Deduplication at Insert Time

Strategy: Application-side pre-check with a 30-day recency window + query-time backstop.

The current batch_dedup_filter in app/services/ingest/dedup.py does a SELECT key_col FROM table WHERE key_col IN (...) with no time bound. CONCERNS.md identifies this as a full-table scan on every ingest batch — a real performance problem as data grows.

Recommended approach:

# Add PREWHERE recency filter to all dedup queries
query = (
    f"SELECT {key_col} FROM {table} "
    f"PREWHERE created_at >= now() - INTERVAL 30 DAY "   # <-- add this
    f"WHERE {key_col} IN {{keys:Array(String)}}"
)

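PREWHERE makes ClickHouse read only the filter column first and load the remaining columns just for matching rows, which is cheaper on MergeTree tables. ClickHouse usually moves eligible WHERE conditions into PREWHERE on its own (the optimize_move_to_prewhere setting, on by default), so writing PREWHERE explicitly mainly documents intent; most of the saving comes from the time bound itself. A 30-day window matches the assumption in company_cleaner.py's days_back logic and covers realistic re-crawl cycles. The trade-off is that a job re-crawled after more than 30 days passes the pre-check and is inserted again.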
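Put together, the pre-check could look roughly like the sketch below: one function builds the time-bounded lookup, another filters the batch against the keys that came back. The function names, the jobs table, and the identifier check are assumptions for illustration, not the current contents of dedup.py.

```python
import re
from typing import Iterable

_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_dedup_query(table: str, key_col: str, days: int = 30) -> str:
    """Build the time-bounded lookup; `keys` is bound as a query parameter."""
    for name in (table, key_col):
        # Identifiers are interpolated into the SQL text, so validate them.
        if not _IDENT.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    return (
        f"SELECT {key_col} FROM {table} "
        f"PREWHERE created_at >= now() - INTERVAL {int(days)} DAY "
        f"WHERE {key_col} IN {{keys:Array(String)}}"
    )


def filter_new_rows(rows: Iterable[dict], key_col: str, existing: set[str]) -> list[dict]:
    """Drop rows already stored and collapse duplicates inside the batch itself."""
    seen = set(existing)
    fresh = []
    for row in rows:
        key = str(row[key_col])
        if key not in seen:
            seen.add(key)
            fresh.append(row)
    return fresh


query = build_dedup_query("jobs", "job_id")
batch = [{"job_id": "a"}, {"job_id": "b"}, {"job_id": "b"}, {"job_id": "c"}]
fresh = filter_new_rows(batch, "job_id", existing={"a"})
```

Deduplicating inside the batch as well as against stored keys matters because a single crawl page can list the same job twice.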

For the ClickHouse table engine: the existing MergeTree tables for job data use a composite order key (job_id, created_at or similar). This is workable. Plain MergeTree never deduplicates on its own, and the FINAL modifier applies only to the ReplacingMergeTree family, so any backstop cleanup on these tables has to be a query-time GROUP BY or LIMIT BY over the key columns, which the ORDER BY key keeps cheap. Do NOT switch to ReplacingMergeTree for job tables during this refactor: it adds background merge overhead and provides only eventual (not immediate) deduplication, while the application-side pre-check filters duplicates before insert at lower cost.

The existing pending_company table already uses ReplacingMergeTree(updated_at) with (source, company_id) dedup — this is correct and should stay.

Summary: No library change needed for deduplication. Fix the missing PREWHERE time-bound filter in dedup.py. Use PREWHERE over WHERE for all time-range filters in ClickHouse queries.


Testing Stack

| Technology | Version | Purpose | Why |
|---|---|---|---|
| pytest | latest stable (8.x) | Test runner | Standard; already used in tests/ (unittest style, but pytest-compatible). |
| pytest-anyio (anyio's pytest plugin) | ships with anyio 4.8.0 | Async test support | The project already pins anyio==4.8.0 in Pipfile, and anyio bundles its own pytest plugin, so @pytest.mark.anyio works on async tests once anyio and pytest are installed. The separate pytest-anyio 0.0.0 release on PyPI appears to be only a placeholder; the anyio[trio] extra matters only if tests should also run on the trio backend. |
| respx | 0.21.x | Mock httpx requests | Intercepts httpx.AsyncClient calls without hitting the network. Use in all tests that exercise app/services/crawler/ (the async versions). Pattern-matches on URL, method, headers. |
| unittest.mock | stdlib | Mock requests_go calls in external scripts | For tests of jobs_spider/ and spiderJobs/, which use requests_go (synchronous), standard unittest.mock.patch on requests_go.Session.get/post is sufficient. No extra library needed. |
| pytest-cov | 4.x | Coverage measurement | Enforce 80% coverage minimum on spiderJobs/core/, spiderJobs/platforms/*/sign.py, and all pure-function parsers. |

Testing patterns for crawlers:

  1. Sign algorithm tests — Pure functions, no HTTP, no mocks needed. BossSign.generate_traceid(), ZhilianSign.sign_headers(), _compute_checksum() are all deterministic enough to test with snapshot assertions. Test these first — they are the highest-value, lowest-cost tests.

  2. API parser tests — Feed recorded real response JSON fixtures into BaseFetcher._parse() / platform-specific _parse() overrides. Verify ApiResult fields. Use pytest.fixture to load JSON from tests/fixtures/. These tests do not need HTTP at all.

  3. HTTPClient / BaseFetcher integration tests — Use unittest.mock.patch on requests_go.Session to inject fixed (status_code, json) return values. Verify the client calls the right URL, sends correct headers, and handles errors.

  4. Ingest/dedup tests — The batch_dedup_filter function takes an AsyncClient — use respx to mock ClickHouse query responses. Test: (a) single-column dedup filters correctly, (b) two-column dedup, (c) time-bound filter is present in generated query.

  5. Integration/smoke tests — One test per platform that replays a recorded full HTTP exchange (using respx route replay or responses library) and verifies the full pipeline from BaseFetcher.fetch() through build_insert_row(). Not against a real server.
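The parser tests in item 2 can be sketched as plain pytest-style functions with no HTTP involved. The payload shape, parse_job_list, and the inline fixture are hypothetical; a real fixture would be a recorded response loaded from tests/fixtures/.

```python
import json

# Hypothetical recorded response, inlined here to keep the sketch self-contained.
FIXTURE = json.loads("""
{"code": 0, "data": {"jobList": [
  {"jobId": "j1", "title": "Backend Engineer"},
  {"jobId": "j2", "title": "Data Engineer"}
]}}
""")


def parse_job_list(payload: dict) -> list[dict]:
    """Hypothetical parser: flatten a search response into rows keyed by job_id."""
    if payload.get("code") != 0:
        return []  # error or challenge responses yield no rows
    return [
        {"job_id": item["jobId"], "title": item["title"]}
        for item in payload.get("data", {}).get("jobList", [])
    ]


def test_parse_job_list_extracts_rows() -> None:
    rows = parse_job_list(FIXTURE)
    assert [r["job_id"] for r in rows] == ["j1", "j2"]


def test_parse_job_list_returns_empty_on_error_code() -> None:
    assert parse_job_list({"code": 37, "data": {}}) == []
```

Tests of this kind run in milliseconds and keep passing when the transport layer changes, which is why they are worth writing before the HTTP-level tests.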

Do NOT use pytest-asyncio alongside pytest-anyio — the two event loop managers conflict. The project already has anyio as a hard dependency (FastAPI uses it); pytest-anyio is the natural extension.

Existing test pattern: Current tests in tests/ use unittest.TestCase. These should be migrated to plain pytest style (no class inheritance) for simplicity, but this is not blocking.


Code Organization for Shared Core

Problem: Three parallel implementations of the same thing exist:

  • jobs_spider/boss/boos_api.py — 2281-line monolith (old)
  • spiderJobs/platforms/boss/ — split correctly into sign/client/api (new)
  • app/services/crawler/ — copy of spiderJobs/ with import paths changed

Recommended layout after refactor:

spiderJobs/              # Canonical shared package — install as editable dep
  core/
    base.py              # ApiResult, BaseFetcher, BaseSearcher
    http_client.py       # HTTPClient (requests_go transport)
    http_client_async.py # AsyncHTTPClient (httpx.AsyncClient transport)  <-- NEW
  platforms/
    boss/
      sign.py            # BossSign (pure, no HTTP)
      client.py          # BossClient(HTTPClient)
      client_async.py    # BossAsyncClient(AsyncHTTPClient)  <-- NEW for app embedding
      api.py             # Boss search/fetch using BossClient
    job51/
      sign.py
      client.py
      api.py
    zhilian/
      sign.py
      client.py
      api.py
  runner/
    loop.py              # Main crawl loop (keyword fetch → crawl → push)
    api_client.py        # Push to backend API

jobs_spider/             # Entry-point scripts only, import from spiderJobs
  boss/main.py
  qcwy/main.py
  zhilian/main.py

app/services/crawler/    # App-embedded adapters, import from spiderJobs
  boss.py                # BossService: thin wrapper using BossAsyncClient
  qcwy.py
  zhilian.py

The *_async.py variant of each client wraps httpx.AsyncClient with the same interface. Sign classes are transport-agnostic and shared.
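As a rough sketch of how an async client can keep the same get() -> (status, body) shape while sharing a sign class, consider the following. AsyncSend stands in for a call on httpx.AsyncClient, and DemoSign, fake_send, and the URL are hypothetical names used only for illustration.

```python
import asyncio
from typing import Any, Awaitable, Callable

# Hypothetical send function standing in for an httpx.AsyncClient request call.
AsyncSend = Callable[[str, dict[str, str]], Awaitable[tuple[int, Any]]]


class DemoSign:
    """Transport-agnostic signer (hypothetical); usable by sync and async clients."""

    def sign_headers(self, url: str) -> dict[str, str]:
        return {"x-demo-sign": str(len(url))}


class AsyncHTTPClient:
    """Async counterpart exposing the same get() -> (status, body) shape."""

    def __init__(self, send: AsyncSend, sign: DemoSign) -> None:
        self._send = send
        self._sign = sign

    async def get(self, url: str) -> tuple[int, Any]:
        return await self._send(url, self._sign.sign_headers(url))


async def fake_send(url: str, headers: dict[str, str]) -> tuple[int, Any]:
    await asyncio.sleep(0)  # yield to the event loop as a network call would
    return 200, {"url": url, "headers": headers}


client = AsyncHTTPClient(fake_send, DemoSign())
status, body = asyncio.run(client.get("https://example.invalid/a"))
```

Since the signer never touches the transport, the same sign class instance could be passed to the synchronous client without modification.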


Alternatives Considered

| Category | Recommended | Alternative | Why Not |
|---|---|---|---|
| TLS-aware HTTP client | requests_go (crawlers) | curl_cffi | Both equally capable. requests_go is already integrated; switching mid-refactor adds risk. curl_cffi is the better long-term bet but not worth the disruption now. |
| Async HTTP (in-app) | httpx.AsyncClient | aiohttp | httpx is already in Pipfile at 0.28.1. The aiohttp API is more complex and offers no advantage here. |
| Async HTTP (in-app) | httpx.AsyncClient | requests_go async | requests_go's async mode is less documented; httpx is the FastAPI-idiomatic choice and has respx for testing. |
| Retry | tenacity | Manual sleep loops | Tenacity is already in Pipfile; structured retries with jitter are safer and testable. |
| Dedup strategy | App-side + PREWHERE time window | ReplacingMergeTree only | ReplacingMergeTree is eventual (not immediate); duplicates appear in queries until a merge runs. The app-side check filters duplicates before insert at the cost of one extra SELECT per batch, which is acceptable. |
| Dedup strategy | App-side + PREWHERE time window | Full-table scan (current) | The current approach scans the whole table on every ingest batch, which degrades steadily as the table grows past millions of rows. |
| Test mocking | respx (httpx) + unittest.mock (requests_go) | responses, pytest-mock | respx is the purpose-built httpx mock; unittest.mock is stdlib. Minimum new dependencies. |
| Async test runner | pytest-anyio | pytest-asyncio | Project already depends on anyio 4.8.0. Mixing in pytest-asyncio creates event loop conflicts. |
| Shared code structure | Editable install (pip install -e) | File copies (current) | The current copy-paste approach (confirmed by the "复制自", i.e. "copied from", comments) guarantees drift. Editable install or uv workspace gives a single source of truth. |

Versions and Installation

# No new major libraries are needed for the refactor.
# Changes to existing dependencies:

# Pipfile [packages]:
requests_go = "==1.0.9"     # Pin to latest verified version
tenacity = "*"               # Already in Pipfile as spider-specific

# Pipfile [dev-packages]:
pytest = ">=8.0"
pytest-anyio = "*"
respx = ">=0.21"
pytest-cov = ">=4.0"

# Install spiderJobs as editable (development), from a shell:
pip install -e ./spiderJobs

# Or with uv workspace (production-safe):
# Add pyproject.toml at repo root (see Base Class Architecture section above)

Key Constraints Respected

  • No new HTTP library for the main path: requests_go stays for external scripts. Zero regression risk.
  • Async for in-app crawlers: httpx is already in Pipfile. The asyncio.to_thread() workaround in CONCERNS.md is eliminated.
  • 10-second minimum delay preserved: tenacity wait_random_exponential(min=10) enforces this constraint.
  • Dedup at insert time: Application-side batch_dedup_filter with time-bounded PREWHERE. No schema changes.
  • Shared signing/parsing code: Single canonical spiderJobs/ package, installed into both environments.
  • ClickHouse table structure unchanged: No DDL changes required for any recommendation here.


Stack research: 2026-03-21