Architecture Patterns: Shared Crawler Core
Domain: Recruiting data crawler, dual-deployment shared core
Researched: 2026-03-21
Overall confidence: HIGH (based entirely on direct codebase inspection)
Problem Statement
The codebase currently has the same code in two places, copied verbatim:
| File | Copy location |
|---|---|
| spiderJobs/core/http_client.py | app/services/crawler/_http_client.py |
| spiderJobs/core/base.py | app/services/crawler/_base.py |
| spiderJobs/platforms/boss/sign.py | app/services/crawler/_boss_sign.py |
| spiderJobs/platforms/boss/client.py | app/services/crawler/_boss_client.py |
| spiderJobs/platforms/boss/api.py | app/services/crawler/_boss_api.py |
The docstring in app/services/crawler/_http_client.py says, translated from Chinese: "Copied from spiderJobs/core/http_client.py; do not import spiderJobs directly, to avoid cross-module dependency." The comment reveals the source of the problem: the team correctly identified that two separate deployment units cannot cross-import, but solved it by copy-paste rather than extraction.
There are also two competing external spider frameworks: jobs_spider/ (older, monolithic, single large file per platform) and spiderJobs/ (newer, modular, uses runner/loop.py + runner/api_client.py as a proper execution harness). The refactor must unify these.
Recommended Architecture
Core Design Decision: Installable Shared Package
Extract shared crawler code into a proper Python package at crawler_core/ (project root level). Both deployment contexts import from it. No copying. No circular imports.
crawler_core/            # Shared library: zero app/* or spiderJobs/* imports
├── __init__.py
├── http_client.py       # HTTPClient (requests-go, TLS fingerprint, proxy rotation)
├── base.py              # ApiResult, BaseFetcher, BaseSearcher, parse_response()
├── boss/
│   ├── __init__.py
│   ├── sign.py          # BossSign (Traceid generation algorithm)
│   ├── client.py        # BossClient(HTTPClient): header injection
│   └── api.py           # SearchRecJobs, GetBrandDetail, etc.
├── qcwy/
│   ├── __init__.py
│   ├── sign.py
│   ├── client.py
│   └── api.py
└── zhilian/
    ├── __init__.py
    ├── sign.py
    ├── client.py
    └── api.py
Why a root-level package and not inside app/ or spiderJobs/:
- app/ is the FastAPI backend. Importing it from external scripts would pull in Tortoise-ORM, APScheduler, ClickHouse, and every other backend dependency, making external scripts impossibly heavy.
- spiderJobs/ is a deployment directory, not a library.
- A root-level crawler_core/ has no such baggage. It imports only requests-go and stdlib. Both app/ and spiderJobs/ import from it.
Constraints satisfied:
- No circular imports
- No shared runtime state (crawler_core is stateless pure logic)
- Installable as local package (editable install via pip install -e . or pipenv install -e .) for ECS deployment
- Tests can import crawler_core directly without mocking a FastAPI app
Component Boundaries
What Talks to What
┌─────────────────────────────────────────────────────────────┐
│                        crawler_core/                        │
│  HTTPClient · BaseFetcher · BaseSearcher · ApiResult        │
│  boss/ · qcwy/ · zhilian/ (sign + client + api)             │
│  Imports: requests-go, stdlib only                          │
└────────────────┬────────────────────────┬───────────────────┘
                 │                        │
    ┌────────────▼──────────┐   ┌─────────▼──────────────────┐
    │  app/services/        │   │  spiderJobs/               │
    │  crawler/             │   │  platforms/ + runner/      │
    │                       │   │                            │
    │  BossService facade   │   │  run_crawl_loop()          │
    │  QcwyService facade   │   │  RunnerAPIClient           │
    │  ZhilianService facade│   │  (HTTP push to backend)    │
    │                       │   │                            │
    │  Called by:           │   │  Called by:                │
    │  CompanyCleaner       │   │  ECS node scripts          │
    │  CompanyController    │   │  jobs_spider/ (deprecated) │
    └────────────┬──────────┘   └────────────────────────────┘
                 │
    ┌────────────▼──────────────────────────────────────────┐
    │  app/ (FastAPI backend, async boundary)               │
    │  CompanyCleaner (asyncio, calls sync crawler via      │
    │                  asyncio.to_thread())                 │
    │  IngestService (async ClickHouse writes)              │
    │  APScheduler (triggers CompanyCleaner async jobs)     │
    └───────────────────────────────────────────────────────┘
Component Responsibilities
| Component | Responsibility | What It Must NOT Do |
|---|---|---|
| crawler_core/ | Signing, HTTP requests, response parsing (per platform) | Import app.*, use async, maintain global state |
| app/services/crawler/ facades | Wrap crawler_core with logging, token loading, proxy mgmt | Contain signing or HTTP logic |
| spiderJobs/runner/ | Crawl loop orchestration, keyword polling, data upload | Import app.* |
| app/services/ingest/ | Registry-driven ClickHouse insertion, dedup | Know anything about platform HTTP clients |
| app/core/scheduler.py | Schedule jobs, manage distributed lock | Run synchronous I/O on the event loop directly |
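The "must not import" column can be checked mechanically rather than by review. The sketch below is one way to do that with the standard-library `ast` module; the helper name `find_forbidden_imports` and the two sample source strings are illustrative assumptions, not code from the repository. In practice the same function could be run over every file under crawler_core/.

```python
import ast

# Top-level packages that crawler_core must never import (from the table above).
FORBIDDEN_ROOTS = {"app", "spiderJobs"}


def find_forbidden_imports(source: str) -> list[str]:
    """Return the forbidden module names imported by a piece of Python source."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            names = [node.module]
        else:
            continue
        # Compare only the top-level package, so "app.models" matches "app".
        found.extend(n for n in names if n.split(".")[0] in FORBIDDEN_ROOTS)
    return found


# Hypothetical file contents used only to exercise the check.
clean = "import json\nfrom crawler_core.base import ApiResult\n"
leaky = "from app.settings.config import settings\nimport spiderJobs.core.base\n"

print(find_forbidden_imports(clean))
print(find_forbidden_imports(leaky))
```

Relative imports (`from . import x`) are skipped on purpose, since they cannot leave the package.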
Abstract Base Class Hierarchy
# crawler_core/base.py
# Postponed annotations: without this, the field named `list` shadows the
# builtin while its own `list[dict]` annotation is being evaluated.
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Callable, Optional

from crawler_core.http_client import HTTPClient


@dataclass
class ApiResult:
    success: bool
    status_code: int
    data: Any = None
    list: list[dict] = field(default_factory=list)
    count: int = 0
    is_end_page: bool = True
    error: Optional[str] = None


class BaseFetcher:
    """Single-resource GET (company detail, job detail)."""
    ENDPOINT: str = ""

    def __init__(self, http_client: HTTPClient): ...
    def _build_params(self) -> dict: raise NotImplementedError
    def _parse(self, http_code: int, raw: Any) -> ApiResult: ...
    def fetch(self) -> ApiResult: ...  # sync, wraps _http.get()


class BaseSearcher:
    """Paginated search (job list, brand job list)."""
    ENDPOINT: str = ""

    def __init__(self, page_size: int, http_client: HTTPClient): ...
    def _build_params(self, page_index: int) -> dict: raise NotImplementedError
    def _request(self, params: dict) -> tuple[int, Any]: ...
    def _parse(self, http_code: int, raw: Any) -> ApiResult: ...
    def search(self, page_index: int = 1) -> ApiResult: ...  # sync
    def load_all(self, max_pages: int, on_page: Callable | None) -> list[dict]: ...
Platform clients subclass HTTPClient from crawler_core/http_client.py:
# crawler_core/boss/client.py
class BossClient(HTTPClient):
    """Injects mpt/wt2/Traceid headers on every request."""

    def __init__(self, signer: BossSign, ...): ...
    def post(self, path, body, headers=None) -> tuple[int, Any]: ...  # sync
    def get(self, path, params=None, headers=None) -> tuple[int, Any]: ...  # sync
Platform API classes subclass BaseFetcher / BaseSearcher:
# crawler_core/boss/api.py
class SearchRecJobs(BaseSearcher):
    ENDPOINT = "/wapi/zpgeek/miniapp/homepage/recjoblist.json"

    def _build_params(self, page_index: int) -> dict: ...
    def _parse(self, http_code: int, raw: Any) -> ApiResult: ...  # Boss-specific parsing


class GetBrandDetail(BaseFetcher):
    ENDPOINT = "/wapi/zpgeek/miniapp/brand/get.json"

    def _build_params(self) -> dict: ...
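The hierarchy is deliberately flat, with at most one level of subclassing on each side: HTTPClient → platform client (for example BossClient), and BaseFetcher / BaseSearcher → platform API class (for example GetBrandDetail). API classes do not inherit from the client; they receive one through their constructor. No further abstraction.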
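To make the division of labor concrete, here is a minimal runnable sketch of the searcher side. The signatures follow the outline above, but the body of `load_all()` is an assumption (the source shows only its signature), and `FakeClient`, `DemoSearcher`, the `/demo/search` endpoint, and the response shape are hypothetical stand-ins so that the example runs without a network.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional


@dataclass
class ApiResult:
    success: bool
    status_code: int
    # typing.List is used because the field name shadows the builtin `list`.
    list: List[dict] = field(default_factory=list)
    is_end_page: bool = True
    error: Optional[str] = None


class FakeClient:
    """Stand-in for HTTPClient: serves three canned pages, no network."""

    PAGES = {1: [{"id": 1}, {"id": 2}], 2: [{"id": 3}, {"id": 4}], 3: [{"id": 5}]}

    def get(self, path: str, params: dict) -> tuple[int, Any]:
        items = self.PAGES.get(params["page"], [])
        return 200, {"items": items, "has_more": params["page"] < 3}


class BaseSearcher:
    ENDPOINT = ""

    def __init__(self, page_size: int, http_client: FakeClient):
        self.page_size = page_size
        self._http = http_client

    def _build_params(self, page_index: int) -> dict:
        raise NotImplementedError

    def _parse(self, http_code: int, raw: Any) -> ApiResult:
        raise NotImplementedError

    def search(self, page_index: int = 1) -> ApiResult:
        code, raw = self._http.get(self.ENDPOINT, self._build_params(page_index))
        return self._parse(code, raw)

    def load_all(self, max_pages: int, on_page: Optional[Callable] = None) -> List[dict]:
        rows: List[dict] = []
        for page in range(1, max_pages + 1):
            result = self.search(page)
            if not result.success:
                break  # stop on the first failed page
            rows.extend(result.list)
            if on_page:
                on_page(page, result)
            if result.is_end_page:
                break  # the platform reported no further pages
        return rows


class DemoSearcher(BaseSearcher):
    """A platform API class only supplies parameters and parsing."""

    ENDPOINT = "/demo/search"

    def _build_params(self, page_index: int) -> dict:
        return {"page": page_index, "size": self.page_size}

    def _parse(self, http_code: int, raw: Any) -> ApiResult:
        return ApiResult(
            success=http_code == 200,
            status_code=http_code,
            list=raw["items"],
            is_end_page=not raw["has_more"],
        )


seen_pages = []
rows = DemoSearcher(page_size=2, http_client=FakeClient()).load_all(
    max_pages=10, on_page=lambda page, result: seen_pages.append(page)
)
print(len(rows), seen_pages)
```

The subclass contains no HTTP or pagination logic, which is what keeps adding a platform cheap.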
Sync vs Async: The Critical Design Point
Why the Core Must Stay Synchronous
crawler_core/ uses requests-go (a sync library that wraps a TLS fingerprinting mechanism). There is no async version of requests-go. The platform APIs require TLS fingerprint spoofing that httpx does not support without significant additional work. Keeping the core synchronous is the correct choice given the constraint.
How the Backend (async) Calls Synchronous Crawlers
CompanyCleaner runs inside the FastAPI async event loop (called by APScheduler). It cannot block the event loop with synchronous requests calls. The correct pattern is asyncio.to_thread():
# app/services/company_cleaner.py
async def _fetch_company_detail(self, source: str, company_id: str) -> Optional[dict]:
    """Run synchronous crawler call off the event loop thread."""
    if source == "boss":
        return await asyncio.to_thread(
            self.boss_service.get_company_detail_by_id, company_id
        )
    elif source == "qcwy":
        return await asyncio.to_thread(
            self.qcwy_service.get_company_detail_by_id, company_id
        )
    elif source == "zhilian":
        return await asyncio.to_thread(
            self.zhilian_service.get_company_detail_by_id, company_id
        )
asyncio.to_thread() runs the synchronous function in a thread pool, yielding back to the event loop while it executes. This is the standard Python pattern for sync I/O in an async context.
Current state: The existing CompanyCleaner.process_pending_companies() likely calls sync methods directly on the event loop — this is a bug to fix during refactoring. The fix is wrapping each crawler call with asyncio.to_thread().
External Scripts: Pure Synchronous, No Async
spiderJobs/runner/loop.py (run_crawl_loop) and any ECS scripts are pure synchronous while True loops. They import from crawler_core/ directly. No async. This is correct and intentional.
External Script (sync)                  Backend (async)
──────────────────────                  ─────────────────────────────
while True:                             async def process_company(company_id):
    kw = api.fetch_keyword()                detail = await asyncio.to_thread(
    searcher = BossSearcher(...)                boss_service.get_company_detail_by_id, company_id
    result = searcher.search()              )
    api.upload_data(result.list)            await company_storage.save(detail)
Data Flow Patterns
Flow 1: External Spider → Backend Storage
ECS Script (spiderJobs/runner/loop.py)
  │
  ├─ GET /api/v1/keyword/available
  │    → KeywordController marks keyword as "crawling"
  │
  ├─ crawler_core/boss/api.py SearchRecJobs.search(page)
  │    → returns ApiResult.list (raw job dicts)
  │
  ├─ POST /api/v1/universal/data/batch-store-async
  │    body: {platform, channel, data_type, data_list}
  │
  └─ Backend: IngestService.store_batch()
       → registry lookup → dedup → ClickHouse insert
The external script knows nothing about ClickHouse. It only pushes raw dicts over HTTP. This boundary is correct and must stay.
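One iteration of this flow can be sketched as follows. The example is hedged: `crawl_once`, `FakeBackend`, and `fake_search` are hypothetical names, the `channel` and `data_type` values are placeholders, and the fake backend records uploads in memory instead of issuing the two HTTP requests shown in the diagram.

```python
from typing import Callable, Optional


class FakeBackend:
    """Stand-in for RunnerAPIClient; records uploads instead of calling HTTP."""

    def __init__(self, keywords: list[str]):
        self._keywords = list(keywords)
        self.uploads: list[dict] = []

    def fetch_keyword(self) -> Optional[str]:
        # Real client: GET /api/v1/keyword/available
        return self._keywords.pop(0) if self._keywords else None

    def upload_data(self, payload: dict) -> None:
        # Real client: POST /api/v1/universal/data/batch-store-async
        self.uploads.append(payload)


def crawl_once(api: FakeBackend, search: Callable[[str], list[dict]]) -> bool:
    """Run one keyword through fetch -> search -> upload. Returns False when idle."""
    keyword = api.fetch_keyword()
    if keyword is None:
        return False
    rows = search(keyword)
    if rows:
        # Body keys come from the flow above; the values are illustrative.
        api.upload_data(
            {"platform": "boss", "channel": "miniapp", "data_type": "job", "data_list": rows}
        )
    return True


def fake_search(keyword: str) -> list[dict]:
    # Stand-in for SearchRecJobs(...).search(page).list
    return [{"keyword": keyword, "job_id": f"{keyword}-1"}]


backend = FakeBackend(["python", "golang"])
while crawl_once(backend, fake_search):
    pass
print(len(backend.uploads))
```

Nothing in `crawl_once` touches storage, which is the boundary the paragraph above asks to preserve.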
Flow 2: In-Process Company Cleaning (Scheduled)
APScheduler → company_cleaning_job() [async]
  │
  ├─ CompanyCleaner.collect_pending_companies() [async]
  │    → ClickHouse: query job tables for company IDs
  │    → MySQL: check which company IDs already processed
  │    → MySQL: insert new companies into CompanyCleaningQueue
  │
  └─ CompanyCleaner.process_pending_companies() [async]
       ├─ MySQL: fetch N companies from queue
       ├─ asyncio.to_thread(boss_service.get_company_detail_by_id)
       │    → crawler_core/boss/api.py GetBrandDetail.fetch() [sync]
       ├─ company_storage.save_company() [async, ClickHouse]
       └─ MySQL: update company status to "done"
The flow crosses the sync/async boundary exactly once: at asyncio.to_thread(). Everything above (ClickHouse queries, MySQL operations) stays async. Everything below (HTTP to Boss API) stays sync.
Flow 3: Company Jobs Sync (new, post-refactor)
After a company is cleaned, the system should backfill that company's current job listings:
CompanyJobsSyncService.sync_company_jobs(company_id, source) [async]
  ├─ asyncio.to_thread(boss_service.get_company_jobs_by_id, company_id)
  │    → crawler_core/boss/api.py SearchBrandJobs.search() [sync]
  └─ IngestService.store_batch(platform="boss", data_type="company_job")
       → ClickHouse insert into boss_company_job table (new table needed)
Package Structure: Suggested Build Order
Build crawler_core/ first — everything else depends on it.
Phase 1: Extract crawler_core/
- Create crawler_core/__init__.py, crawler_core/http_client.py, crawler_core/base.py
  - These are identical to the current spiderJobs/core/ files with one import path change
  - Zero functional change; pure extraction
- Create crawler_core/boss/, crawler_core/qcwy/, crawler_core/zhilian/
  - Extract from spiderJobs/platforms/boss/, qcwy/, zhilian/
  - Update internal imports to from crawler_core.boss.sign import BossSign, etc.
- Add crawler_core to Pipfile as an editable local dependency:

      [packages]
      crawler_core = {path = ".", editable = true}

  (Or configure pyproject.toml with [tool.setuptools.packages.find] including crawler_core)
Verification gate: spiderJobs/ and app/services/crawler/ both import from crawler_core without errors. spiderJobs/core/ and the sign, client, and api modules under spiderJobs/platforms/ are then redundant and can be deleted.
Phase 2: Rewire app/services/crawler/ Facades
The service facades (boss.py, qcwy.py, zhilian.py) stay in app/services/crawler/ — they belong to the backend. Their job is to:
- Load tokens from MySQL (BossToken)
- Load proxies from ProxyController
- Instantiate crawler_core clients with those credentials
- Expose typed async-friendly methods (get_company_detail_by_id)
- Wrap each sync call in asyncio.to_thread()
The private underscore modules (_boss_client.py, _boss_sign.py, etc.) are deleted. Facades import directly from crawler_core.
# app/services/crawler/boss.py (after refactor)
from crawler_core.boss.client import BossClient, create_client
from crawler_core.boss.sign import BossSign
from crawler_core.boss.api import GetBrandDetail, SearchBrandJobs
Verification gate: CompanyCleaner test passes with mocked crawler_core calls.
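The facade responsibilities listed for Phase 2 might look roughly like the sketch below. It is not the project's implementation: `FakeGetBrandDetail` stands in for crawler_core.boss.api.GetBrandDetail so the example runs without a network, the token is a literal rather than a MySQL lookup, and the `_async` method name is a hypothetical choice.

```python
import asyncio
from typing import Optional


class FakeGetBrandDetail:
    """Stand-in for crawler_core.boss.api.GetBrandDetail (sync, no network)."""

    def __init__(self, company_id: str):
        self.company_id = company_id

    def fetch(self) -> dict:
        return {"id": self.company_id, "source": "boss"}


class BossService:
    """Backend facade: owns credentials and logging, delegates HTTP to the core."""

    def __init__(self, token: str):
        # In the backend this would be loaded from MySQL (BossToken).
        self._token = token

    def get_company_detail_by_id(self, company_id: str) -> Optional[dict]:
        # Sync path: contains no signing or HTTP logic of its own.
        return FakeGetBrandDetail(company_id).fetch()

    async def get_company_detail_by_id_async(self, company_id: str) -> Optional[dict]:
        # Async path: run the sync call off the event loop thread.
        return await asyncio.to_thread(self.get_company_detail_by_id, company_id)


async def main() -> Optional[dict]:
    service = BossService(token="demo-token")
    return await service.get_company_detail_by_id_async("c-42")


detail = asyncio.run(main())
print(detail)
```

Keeping the sync method public lets tests and plain scripts call the facade without an event loop, while async callers use the wrapped variant.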
Phase 3: Rewire spiderJobs/ External Scripts
spiderJobs/runner/ currently imports from spiderJobs.core.base import ...; change this to from crawler_core.base import .... Delete spiderJobs/core/ and the sign.py, client.py, and api.py modules under spiderJobs/platforms/, keeping only the per-platform entry points.
spiderJobs/platforms/boss/main.py becomes a thin entry point:
from crawler_core.boss.client import create_client
from crawler_core.boss.api import SearchRecJobs
from spiderJobs.runner.loop import run_crawl_loop
jobs_spider/ (the older framework with monolithic files) is deprecated in favor of spiderJobs/ + crawler_core/.
Anti-Patterns to Avoid
Anti-Pattern 1: Async Crawler Core
What: Making crawler_core/ async because the backend is async.
Why bad: requests-go is sync. The TLS fingerprint mechanism is sync. Wrapping everything in asyncio.run() or creating fake async wrappers adds complexity with no benefit. The asyncio.to_thread() boundary at the facade layer is the correct and idiomatic solution.
Instead: Keep crawler_core/ synchronous. Use asyncio.to_thread() in the backend facades.
Anti-Pattern 2: Putting RunnerAPIClient in crawler_core/
What: Moving the HTTP push client (backend API caller) into the shared core.
Why bad: RunnerAPIClient is deployment-specific: it knows about /api/v1/keyword/available, /api/v1/universal/data/batch-store-async, and the backend's URL scheme. That is runner orchestration logic, not platform API logic.
Instead: RunnerAPIClient stays in spiderJobs/runner/api_client.py. It is not shared.
Anti-Pattern 3: Importing app.* from External Scripts
What: Having ECS scripts import from app.settings.config, app.models, or app.core.*.
Why bad: That would require Tortoise-ORM initialization, database connections, and all backend startup logic to run on ECS nodes. This would fail.
Instead: External scripts get all config from environment variables. They communicate with the backend only via HTTP.
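A minimal sketch of environment-only configuration for an external script follows. The `RunnerConfig` class and the `CRAWLER_*` variable names are illustrative assumptions; the source does not specify which variables the runner reads.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class RunnerConfig:
    backend_url: str
    platform: str
    page_size: int

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "RunnerConfig":
        # CRAWLER_* variable names are illustrative, not taken from the codebase.
        try:
            backend_url = env["CRAWLER_BACKEND_URL"]
        except KeyError:
            raise RuntimeError("CRAWLER_BACKEND_URL must be set") from None
        return cls(
            backend_url=backend_url.rstrip("/"),
            platform=env.get("CRAWLER_PLATFORM", "boss"),
            page_size=int(env.get("CRAWLER_PAGE_SIZE", "20")),
        )


# Passing a plain dict keeps the example independent of the real environment.
config = RunnerConfig.from_env(
    {"CRAWLER_BACKEND_URL": "http://backend.internal:8000/", "CRAWLER_PAGE_SIZE": "15"}
)
print(config)
```

Because `from_env` accepts any mapping, the same code path can be exercised in tests without touching `os.environ`.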
Anti-Pattern 4: Blocking the Event Loop with Sync Crawler Calls
What: Calling boss_service.get_company_detail_by_id() directly in an async def without asyncio.to_thread().
Why bad: Each HTTP call takes 2-10 seconds. With 30 companies to process, that blocks the event loop for up to 5 minutes. Other requests, heartbeats, and scheduler jobs starve.
Instead: Always wrap sync crawler calls in asyncio.to_thread() inside async context.
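The starvation effect is easy to reproduce on a small scale. In this sketch a 0.3 s `time.sleep` stands in for a multi-second crawler request, and a heartbeat task counts how often the event loop gets to run it; all names are illustrative.

```python
import asyncio
import time


def slow_sync_call() -> str:
    time.sleep(0.3)  # stands in for a 2-10 s crawler HTTP request
    return "detail"


async def count_heartbeats(stop: asyncio.Event) -> int:
    beats = 0
    while not stop.is_set():
        await asyncio.sleep(0.05)
        beats += 1
    return beats


async def run(blocking: bool) -> int:
    stop = asyncio.Event()
    heartbeat = asyncio.create_task(count_heartbeats(stop))
    await asyncio.sleep(0)  # let the heartbeat task start
    if blocking:
        slow_sync_call()  # anti-pattern: the loop is frozen for the whole call
    else:
        await asyncio.to_thread(slow_sync_call)  # loop keeps serving other tasks
    stop.set()
    return await heartbeat


beats_blocked = asyncio.run(run(blocking=True))
beats_threaded = asyncio.run(run(blocking=False))
print(beats_blocked, beats_threaded)
```

With the direct call the heartbeat cannot tick until the sync call returns; with `asyncio.to_thread()` it keeps ticking throughout. Exact counts depend on timing, so none are quoted here.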
Anti-Pattern 5: Monolithic Platform Files
What: Keeping the existing jobs_spider/boss/boos_api.py (2246 lines) pattern in the refactored code.
Why bad: Impossible to test individual components. Signing, HTTP, and business logic are entangled.
Instead: Three files per platform: sign.py (pure functions, no I/O), client.py (HTTP wrapper, no business logic), api.py (endpoint classes).
Scalability Considerations
| Concern | Current | After Refactor |
|---|---|---|
| Adding a 4th platform (e.g., job51) | Copy files into both spiderJobs/ and app/services/crawler/ | Add crawler_core/job51/ once; both consumers get it |
| Testing signing algorithm | Must run full crawler | Test crawler_core/boss/sign.py in isolation: pure functions, no I/O |
| Updating User-Agent headers | Two files to update | One file: crawler_core/boss/client.py |
| Running multiple concurrent companies | Single-threaded | asyncio.to_thread() uses the event loop's default thread pool; default pool size = min(32, CPU count + 4) |
| ECS deployment of crawlers | Must deploy spiderJobs/ + copy of core | Deploy crawler_core/ + spiderJobs/; single source of truth |
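For the concurrency row, the thread pool is only an upper bound; the number of simultaneous requests a platform tolerates is usually lower. A minimal sketch, assuming a hypothetical cap of 3 in-flight calls and a stand-in `fetch_company` function, shows how an `asyncio.Semaphore` can bound the work handed to `asyncio.to_thread()`:

```python
import asyncio
import os
import threading
import time

# Default executor size behind asyncio.to_thread() on Python 3.9-3.12.
default_workers = min(32, (os.cpu_count() or 1) + 4)

MAX_IN_FLIGHT = 3  # illustrative cap; tune to what the target platform tolerates
_lock = threading.Lock()
_active = 0
peak = 0


def fetch_company(company_id: str) -> str:
    """Stand-in for a sync crawler call; tracks how many run at once."""
    global _active, peak
    with _lock:
        _active += 1
        peak = max(peak, _active)
    time.sleep(0.05)
    with _lock:
        _active -= 1
    return company_id


async def fetch_all(company_ids: list[str]) -> list[str]:
    gate = asyncio.Semaphore(MAX_IN_FLIGHT)

    async def one(company_id: str) -> str:
        async with gate:  # at most MAX_IN_FLIGHT threads are busy at a time
            return await asyncio.to_thread(fetch_company, company_id)

    return await asyncio.gather(*(one(cid) for cid in company_ids))


results = asyncio.run(fetch_all([f"c-{i}" for i in range(10)]))
print(default_workers, peak, results)
```

`asyncio.gather()` preserves input order, so results line up with the company IDs even though the calls overlap.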
Suggested Build Order for Roadmap Phases
Dependencies flow in one direction:
Phase 1: crawler_core/ extraction
└─ Produces: installable shared package, 3-platform coverage
Phase 2: app/services/crawler/ facade rewire
└─ Depends on: Phase 1 (imports crawler_core)
└─ Produces: asyncio.to_thread() boundary, deleted underscore copies
Phase 3: spiderJobs/ rewire + jobs_spider/ deprecation
└─ Depends on: Phase 1 (imports crawler_core)
└─ Produces: single external runner codebase
Phase 4: Company cleaning flow optimization
└─ Depends on: Phase 2 (async-safe facades)
└─ Produces: CompanyJobsSyncService using new facades
Phase 5: Unit tests
└─ Depends on: Phase 1 (crawler_core is testable in isolation)
└─ crawler_core tests: no mocking needed for sign.py (pure functions)
└─ facade tests: mock crawler_core, test async wrapping
Phases 2 and 3 can run in parallel (both depend only on Phase 1, not each other).
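The facade tests planned for Phase 5 ("mock crawler_core, test async wrapping") could take a shape like the following sketch. The `BossService` here is a deliberately tiny stand-in that receives its crawler call by injection; the real facade's constructor and the returned fields are assumptions.

```python
import asyncio
import threading
from unittest import mock


class BossService:
    """Minimal facade under test; `fetch_detail` stands in for a crawler_core call."""

    def __init__(self, fetch_detail):
        self._fetch_detail = fetch_detail

    async def get_company_detail_by_id(self, company_id: str):
        return await asyncio.to_thread(self._fetch_detail, company_id)


caller_thread = threading.get_ident()
worker_threads = []


def record_thread(company_id: str) -> dict:
    # Remember which thread executed the (mocked) sync crawler call.
    worker_threads.append(threading.get_ident())
    return {"id": company_id, "name": "Demo Co"}


fake_fetch = mock.Mock(side_effect=record_thread)
result = asyncio.run(BossService(fake_fetch).get_company_detail_by_id("c-7"))

# The facade passes arguments through and returns the crawler's result unchanged.
fake_fetch.assert_called_once_with("c-7")
assert result == {"id": "c-7", "name": "Demo Co"}
# The sync call ran on a worker thread, not on the thread driving the event loop.
assert worker_threads and worker_threads[0] != caller_thread
```

No network, database, or FastAPI app is involved, which is the property the shared-core extraction is meant to provide.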
Sources
- Direct inspection: /Users/win/2025/AICoding/JobData/app/services/crawler/_http_client.py (docstring confirms copy from spiderJobs/core/http_client.py)
- Direct inspection: /Users/win/2025/AICoding/JobData/spiderJobs/core/base.py vs app/services/crawler/_base.py (byte-identical logic)
- Direct inspection: /Users/win/2025/AICoding/JobData/app/services/company_cleaner.py (async consumer of sync crawlers)
- Direct inspection: /Users/win/2025/AICoding/JobData/spiderJobs/runner/loop.py (sync-only external runner)
- Direct inspection: /Users/win/2025/AICoding/JobData/.planning/codebase/ARCHITECTURE.md (existing architecture analysis)
Architecture analysis: 2026-03-21