# Architecture Patterns: Shared Crawler Core

**Domain:** Recruiting data crawler, dual-deployment shared core

**Researched:** 2026-03-21

**Overall confidence:** HIGH (based entirely on direct codebase inspection)

---
## Problem Statement

The codebase currently holds the same code in two places, copied verbatim:

| Original file | Copy location |
|---------------|---------------|
| `spiderJobs/core/http_client.py` | `app/services/crawler/_http_client.py` |
| `spiderJobs/core/base.py` | `app/services/crawler/_base.py` |
| `spiderJobs/platforms/boss/sign.py` | `app/services/crawler/_boss_sign.py` |
| `spiderJobs/platforms/boss/client.py` | `app/services/crawler/_boss_client.py` |
| `spiderJobs/platforms/boss/api.py` | `app/services/crawler/_boss_api.py` |

The docstring in `app/services/crawler/_http_client.py` says so outright: "复制自 spiderJobs/core/http_client.py — 不要直接 import spiderJobs,避免跨模块依赖" ("Copied from spiderJobs/core/http_client.py; do not import spiderJobs directly, to avoid a cross-module dependency"). The comment reveals the source of the problem: the team correctly identified that two separate deployment units cannot cross-import, but solved it by copy-paste rather than by extraction.

There are also two competing external spider frameworks: `jobs_spider/` (older, monolithic, one large file per platform) and `spiderJobs/` (newer, modular, with `runner/loop.py` and `runner/api_client.py` as a proper execution harness). The refactor must unify these.

---
## Recommended Architecture

### Core Design Decision: Installable Shared Package

Extract shared crawler code into a proper Python package at `crawler_core/` (project root level). Both deployment contexts import from it. No copying. No circular imports.

```
crawler_core/                  # Shared library: zero app/* or spiderJobs/* imports
├── __init__.py
├── http_client.py             # HTTPClient (requests-go, TLS fingerprint, proxy rotation)
├── base.py                    # ApiResult, BaseFetcher, BaseSearcher, parse_response()
├── boss/
│   ├── __init__.py
│   ├── sign.py                # BossSign (Traceid generation algorithm)
│   ├── client.py              # BossClient(HTTPClient), header injection
│   └── api.py                 # SearchRecJobs, GetBrandDetail, etc.
├── qcwy/
│   ├── __init__.py
│   ├── sign.py
│   ├── client.py
│   └── api.py
└── zhilian/
    ├── __init__.py
    ├── sign.py
    ├── client.py
    └── api.py
```

**Why a root-level package and not inside `app/` or `spiderJobs/`:**

- `app/` is the FastAPI backend. Importing it from external scripts would pull in Tortoise-ORM, APScheduler, ClickHouse, and every other backend dependency, making external scripts impossibly heavy.
- `spiderJobs/` is a deployment directory, not a library.
- A root-level `crawler_core/` has no such baggage. It imports only `requests-go` and stdlib. Both `app/` and `spiderJobs/` import from it.

**Constraints satisfied:**

- No circular imports
- No shared runtime state (`crawler_core` is stateless pure logic)
- Installable as a local package (editable install via `pip install -e .` or `pipenv install -e .`) for ECS deployment
- Tests can import `crawler_core` directly without mocking a FastAPI app

---
## Component Boundaries

### What Talks to What

```
┌─────────────────────────────────────────────────────────────┐
│                        crawler_core/                        │
│  HTTPClient · BaseFetcher · BaseSearcher · ApiResult        │
│  boss/ · qcwy/ · zhilian/ (sign + client + api)             │
│  Imports: requests-go, stdlib only                          │
└────────────────┬────────────────────────┬───────────────────┘
                 │                        │
    ┌────────────▼──────────┐   ┌─────────▼─────────────────┐
    │  app/services/        │   │  spiderJobs/              │
    │  crawler/             │   │  platforms/ + runner/     │
    │                       │   │                           │
    │  BossService facade   │   │  run_crawl_loop()         │
    │  QcwyService facade   │   │  RunnerAPIClient          │
    │  ZhilianService facade│   │  (HTTP push to backend)   │
    │                       │   │                           │
    │  Called by:           │   │  Called by:               │
    │  CompanyCleaner       │   │  ECS node scripts         │
    │  CompanyController    │   │  jobs_spider/ (deprecated)│
    └────────────┬──────────┘   └───────────────────────────┘
                 │
    ┌────────────▼──────────────────────────────────────────┐
    │  app/ (FastAPI backend, async boundary)               │
    │  CompanyCleaner (asyncio, calls sync crawler via      │
    │    asyncio.to_thread())                               │
    │  IngestService (async ClickHouse writes)              │
    │  APScheduler (triggers CompanyCleaner async jobs)     │
    └───────────────────────────────────────────────────────┘
```

### Component Responsibilities

| Component | Responsibility | What It Must NOT Do |
|-----------|---------------|---------------------|
| `crawler_core/` | Per-platform signing, HTTP requests, response parsing | Import `app.*`, use async, maintain global state |
| `app/services/crawler/` facades | Wrap `crawler_core` with logging, token loading, proxy management | Contain signing or HTTP logic |
| `spiderJobs/runner/` | Crawl loop orchestration, keyword polling, data upload | Import `app.*` |
| `app/services/ingest/` | Registry-driven ClickHouse insertion, dedup | Know anything about platform HTTP clients |
| `app/core/scheduler.py` | Schedule jobs, manage distributed lock | Run synchronous I/O on the event loop directly |

---
## Abstract Base Class Hierarchy

```python
# crawler_core/base.py
from __future__ import annotations  # the `list` field would otherwise shadow the builtin in its own annotation

from dataclasses import dataclass, field
from typing import Any, Callable, Optional

from crawler_core.http_client import HTTPClient


@dataclass
class ApiResult:
    success: bool
    status_code: int
    data: Any = None
    list: list[dict] = field(default_factory=list)
    count: int = 0
    is_end_page: bool = True
    error: Optional[str] = None


class BaseFetcher:
    """Single-resource GET (company detail, job detail)."""
    ENDPOINT: str = ""

    def __init__(self, http_client: HTTPClient): ...
    def _build_params(self) -> dict: raise NotImplementedError
    def _parse(self, http_code: int, raw: Any) -> ApiResult: ...
    def fetch(self) -> ApiResult: ...  # sync, wraps _http.get()


class BaseSearcher:
    """Paginated search (job list, brand job list)."""
    ENDPOINT: str = ""

    def __init__(self, page_size: int, http_client: HTTPClient): ...
    def _build_params(self, page_index: int) -> dict: raise NotImplementedError
    def _request(self, params: dict) -> tuple[int, Any]: ...
    def _parse(self, http_code: int, raw: Any) -> ApiResult: ...
    def search(self, page_index: int = 1) -> ApiResult: ...  # sync
    def load_all(self, max_pages: int, on_page: Callable | None) -> list[dict]: ...
```
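To make the `search()` / `load_all()` contract concrete, here is a minimal runnable sketch. It is an illustration under assumptions, not the code in `crawler_core/base.py`: `FakeClient`, `DemoSearcher`, the trimmed `ApiResult`, the `page` / `pageSize` / `rows` / `hasMore` field names, and the rule "stop on failure or on `is_end_page`" are all hypothetical stand-ins.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional, Tuple


@dataclass
class ApiResult:
    """Trimmed stand-in for crawler_core.base.ApiResult."""
    success: bool
    status_code: int
    # typing.List rather than list[...]: the field name shadows the builtin here.
    list: List[dict] = field(default_factory=list)
    is_end_page: bool = True
    error: Optional[str] = None


class FakeClient:
    """Hypothetical in-memory client that serves five rows, page by page."""
    ROWS = [{"id": i} for i in range(5)]

    def get(self, path: str, params: Optional[dict] = None) -> Tuple[int, Any]:
        size, page = params["pageSize"], params["page"]
        start = (page - 1) * size
        rows = self.ROWS[start:start + size]
        return 200, {"rows": rows, "hasMore": start + size < len(self.ROWS)}


class BaseSearcher:
    ENDPOINT = ""

    def __init__(self, page_size: int, http_client: Any):
        self.page_size = page_size
        self._http = http_client

    def _build_params(self, page_index: int) -> dict:
        raise NotImplementedError

    def _parse(self, http_code: int, raw: Any) -> ApiResult:
        raise NotImplementedError

    def search(self, page_index: int = 1) -> ApiResult:
        code, raw = self._http.get(self.ENDPOINT, self._build_params(page_index))
        return self._parse(code, raw)

    def load_all(self, max_pages: int, on_page: Optional[Callable] = None) -> List[dict]:
        """Collect rows page by page until failure, the last page, or max_pages."""
        rows: List[dict] = []
        for page in range(1, max_pages + 1):
            result = self.search(page)
            if not result.success:
                break
            rows.extend(result.list)
            if on_page is not None:
                on_page(page, result)
            if result.is_end_page:
                break
        return rows


class DemoSearcher(BaseSearcher):
    ENDPOINT = "/demo/search"  # hypothetical endpoint

    def _build_params(self, page_index: int) -> dict:
        return {"page": page_index, "pageSize": self.page_size}

    def _parse(self, http_code: int, raw: Any) -> ApiResult:
        ok = http_code == 200
        return ApiResult(ok, http_code, raw["rows"] if ok else [], not (ok and raw["hasMore"]))


if __name__ == "__main__":
    print(DemoSearcher(page_size=2, http_client=FakeClient()).load_all(max_pages=10))
```

A platform subclass only supplies `_build_params()` and `_parse()`; pagination lives once in the base class, which is what keeps the per-platform `api.py` files small.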

Platform clients subclass `HTTPClient` from `crawler_core/http_client.py`:

```python
# crawler_core/boss/client.py
from typing import Any

from crawler_core.boss.sign import BossSign
from crawler_core.http_client import HTTPClient


class BossClient(HTTPClient):
    """Injects mpt/wt2/Traceid headers on every request."""

    def __init__(self, signer: BossSign, *args, **kwargs): ...  # remaining parameters elided
    def post(self, path, body, headers=None) -> tuple[int, Any]: ...  # sync
    def get(self, path, params=None, headers=None) -> tuple[int, Any]: ...  # sync
```

Platform API classes subclass `BaseFetcher` / `BaseSearcher`:

```python
# crawler_core/boss/api.py
from typing import Any

from crawler_core.base import ApiResult, BaseFetcher, BaseSearcher


class SearchRecJobs(BaseSearcher):
    ENDPOINT = "/wapi/zpgeek/miniapp/homepage/recjoblist.json"

    def _build_params(self, page_index: int) -> dict: ...
    def _parse(self, http_code: int, raw: Any) -> ApiResult: ...  # Boss-specific parsing


class GetBrandDetail(BaseFetcher):
    ENDPOINT = "/wapi/zpgeek/miniapp/brand/get.json"

    def _build_params(self) -> dict: ...
```

**The hierarchy is deliberately flat.** Three layers: `HTTPClient` → platform client (for example `BossClient`) → platform API class (for example `SearchRecJobs`). Only the first arrow is inheritance; an API class extends `BaseFetcher` or `BaseSearcher` and receives the client through its constructor. No further abstraction.

---
## Sync vs Async: The Critical Design Point

### Why the Core Must Stay Synchronous

`crawler_core/` uses `requests-go`, a synchronous library that wraps a TLS fingerprinting mechanism. There is no async version of `requests-go`, and the platform APIs require TLS fingerprint spoofing that `httpx` does not support without significant additional work. Given that constraint, keeping the core synchronous is the correct choice.

### How the Backend (async) Calls Synchronous Crawlers

`CompanyCleaner` runs inside the FastAPI async event loop (called by APScheduler), so it must not block that loop with synchronous HTTP calls. The correct pattern is `asyncio.to_thread()`:

```python
# app/services/company_cleaner.py
import asyncio
from typing import Optional


class CompanyCleaner:
    async def _fetch_company_detail(self, source: str, company_id: str) -> Optional[dict]:
        """Run the synchronous crawler call off the event loop thread."""
        if source == "boss":
            return await asyncio.to_thread(
                self.boss_service.get_company_detail_by_id, company_id
            )
        elif source == "qcwy":
            return await asyncio.to_thread(
                self.qcwy_service.get_company_detail_by_id, company_id
            )
        elif source == "zhilian":
            return await asyncio.to_thread(
                self.zhilian_service.get_company_detail_by_id, company_id
            )
        return None  # unknown source: nothing to fetch
```

`asyncio.to_thread()` (Python 3.9+) runs the synchronous function in the event loop's default thread pool and lets the loop keep serving other tasks while the call executes. This is the standard Python pattern for sync I/O in an async context.

**Current state:** The existing `CompanyCleaner.process_pending_companies()` likely calls sync methods directly on the event loop. This is a bug to fix during refactoring, by wrapping each crawler call in `asyncio.to_thread()`.
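The difference between the two call styles can be observed directly. The following is a small sketch, not project code: `slow_sync_fetch` is a hypothetical stand-in that sleeps instead of doing HTTP, and `heartbeat` plays the role of any other coroutine sharing the loop (request handlers, scheduler jobs).

```python
import asyncio
import time


def slow_sync_fetch(company_id: str) -> dict:
    """Hypothetical blocking crawler call; sleeps instead of doing HTTP."""
    time.sleep(0.3)
    return {"company_id": company_id}


async def heartbeat(ticks: list, stop: asyncio.Event) -> None:
    """Record a timestamp every 50 ms for as long as the loop can schedule us."""
    while not stop.is_set():
        ticks.append(time.monotonic())
        await asyncio.sleep(0.05)


async def ticks_during_fetch(offload: bool) -> int:
    """Return how many heartbeat ticks happen while one fetch is in flight."""
    ticks: list = []
    stop = asyncio.Event()
    beat = asyncio.create_task(heartbeat(ticks, stop))
    await asyncio.sleep(0)  # let the heartbeat record its first tick
    before = len(ticks)
    if offload:
        await asyncio.to_thread(slow_sync_fetch, "c1")  # loop stays free
    else:
        slow_sync_fetch("c1")  # blocks the event loop thread
    during = len(ticks) - before
    stop.set()
    await beat
    return during


if __name__ == "__main__":
    print("direct call:", asyncio.run(ticks_during_fetch(offload=False)))
    print("to_thread:  ", asyncio.run(ticks_during_fetch(offload=True)))
```

With the direct call the heartbeat records nothing while the fetch runs; with `asyncio.to_thread()` it keeps ticking. The same starvation applies to every other task on the backend's loop.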

### External Scripts: Pure Synchronous, No Async

`spiderJobs/runner/loop.py` (`run_crawl_loop`) and any ECS scripts are pure synchronous `while True` loops. They import from `crawler_core/` directly. No async. This is correct and intentional.

```
External Script (sync)                Backend (async)
──────────────────────                ─────────────────────────────
while True:                           async def process_company():
    kw = api.fetch_keyword()              detail = await asyncio.to_thread(
    searcher = SearchRecJobs(...)             boss_service.get_company_detail_by_id,
    result = searcher.search()                company_id,
    api.upload_data(result.list)          )
                                          await company_storage.save(detail)
```

---
## Data Flow Patterns

### Flow 1: External Spider → Backend Storage

```
ECS Script (spiderJobs/runner/loop.py)
  │
  ├─ GET /api/v1/keyword/available
  │    → KeywordController marks keyword as "crawling"
  │
  ├─ crawler_core/boss/api.py  SearchRecJobs.search(page)
  │    → returns ApiResult.list (raw job dicts)
  │
  ├─ POST /api/v1/universal/data/batch-store-async
  │    body: {platform, channel, data_type, data_list}
  │
  └─ Backend: IngestService.store_batch()
       → registry lookup → dedup → ClickHouse insert
```

The external script knows nothing about ClickHouse. It only pushes raw dicts over HTTP. This boundary is correct and must stay.
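One iteration of that flow can be sketched without any network access. Everything here is hypothetical scaffolding: `InMemoryBackend` stands in for `RunnerAPIClient`, `crawl_once` is not the real `run_crawl_loop`, and the `channel` and `data_type` values are made-up examples. Only the four payload keys come from the flow above.

```python
from typing import Callable, List, Optional


def build_batch_payload(platform: str, channel: str, data_type: str, rows: List[dict]) -> dict:
    """Assemble the batch-store body; rows are passed through untouched."""
    return {"platform": platform, "channel": channel, "data_type": data_type, "data_list": rows}


class InMemoryBackend:
    """Hypothetical stand-in for RunnerAPIClient: no HTTP, it only records calls."""

    def __init__(self, keywords: List[str]):
        self._keywords = list(keywords)
        self.uploads: List[dict] = []

    def fetch_keyword(self) -> Optional[str]:
        return self._keywords.pop(0) if self._keywords else None

    def upload_data(self, payload: dict) -> None:
        self.uploads.append(payload)


def crawl_once(api: InMemoryBackend, search: Callable[[str], List[dict]]) -> bool:
    """Claim a keyword, search, push raw rows. Returns False when no keyword is available."""
    keyword = api.fetch_keyword()
    if keyword is None:
        return False
    rows = search(keyword)
    if rows:
        # "miniapp" and "job" are assumed example values, not confirmed constants.
        api.upload_data(build_batch_payload("boss", "miniapp", "job", rows))
    return True


if __name__ == "__main__":
    backend = InMemoryBackend(["python"])
    while crawl_once(backend, lambda kw: [{"title": f"{kw} engineer"}]):
        pass
    print(backend.uploads)
```

The runner never names a table or a dedup rule; it hands over raw dicts plus three routing fields and leaves storage decisions to the backend.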

### Flow 2: In-Process Company Cleaning (Scheduled)

```
APScheduler → company_cleaning_job()  [async]
  │
  ├─ CompanyCleaner.collect_pending_companies()  [async]
  │    → ClickHouse: query job tables for company IDs
  │    → MySQL: check which company IDs already processed
  │    → MySQL: insert new companies into CompanyCleaningQueue
  │
  └─ CompanyCleaner.process_pending_companies()  [async]
       ├─ MySQL: fetch N companies from queue
       ├─ asyncio.to_thread(boss_service.get_company_detail_by_id)
       │    → crawler_core/boss/api.py GetBrandDetail.fetch()  [sync]
       ├─ company_storage.save_company()  [async, ClickHouse]
       └─ MySQL: update company status to "done"
```

The flow crosses the sync/async boundary exactly once: at `asyncio.to_thread()`. Everything above (ClickHouse queries, MySQL operations) stays async. Everything below (HTTP to Boss API) stays sync.
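When a batch of N companies is processed, the single boundary crossing can be fanned out with a cap on in-flight calls. This is a sketch under assumptions rather than the `CompanyCleaner` implementation: `fetch_details`, `max_in_flight`, and `CountingFetcher` are hypothetical names, and returning `None` for a failed company is one possible policy.

```python
import asyncio
import threading
import time
from typing import Callable, List, Optional


async def fetch_details(
    fetch: Callable[[str], Optional[dict]],
    company_ids: List[str],
    max_in_flight: int = 3,
) -> List[Optional[dict]]:
    """Run sync `fetch` per id via to_thread, at most `max_in_flight` at a time."""
    gate = asyncio.Semaphore(max_in_flight)

    async def one(company_id: str) -> Optional[dict]:
        async with gate:
            try:
                return await asyncio.to_thread(fetch, company_id)
            except Exception:
                return None  # assumed policy: one failed company does not abort the batch

    # gather preserves input order, so results line up with company_ids.
    return await asyncio.gather(*(one(cid) for cid in company_ids))


class CountingFetcher:
    """Hypothetical sync fetcher that records its peak concurrency."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.active = 0
        self.peak = 0

    def __call__(self, company_id: str) -> dict:
        with self._lock:
            self.active += 1
            self.peak = max(self.peak, self.active)
        time.sleep(0.05)  # simulated HTTP latency
        with self._lock:
            self.active -= 1
        if company_id == "bad":
            raise RuntimeError("simulated platform error")
        return {"id": company_id}


if __name__ == "__main__":
    fetcher = CountingFetcher()
    print(asyncio.run(fetch_details(fetcher, ["a", "b", "bad", "c"], max_in_flight=2)))
    print("peak concurrency:", fetcher.peak)
```

The semaphore is a politeness limit toward the platform API; it is independent of the thread pool size, which only bounds how many threads exist.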

### Flow 3: Company Jobs Sync (new, post-refactor)

After a company is cleaned, the system should backfill that company's current job listings:

```
CompanyJobsSyncService.sync_company_jobs(company_id, source)  [async]
  ├─ asyncio.to_thread(boss_service.get_company_jobs_by_id, company_id)
  │    → crawler_core/boss/api.py SearchBrandJobs.search()  [sync]
  └─ IngestService.store_batch(platform="boss", data_type="company_job")
       → ClickHouse insert into boss_company_job table (new table needed)
```

---
## Package Structure: Suggested Build Order

Build `crawler_core/` first, because everything else depends on it.

### Phase 1: Extract `crawler_core/`

1. Create `crawler_core/__init__.py`, `crawler_core/http_client.py`, `crawler_core/base.py`
   - These are identical to the current `spiderJobs/core/` files with one import path change
   - Zero functional change; pure extraction
2. Create `crawler_core/boss/`, `crawler_core/qcwy/`, `crawler_core/zhilian/`
   - Extract from `spiderJobs/platforms/boss/`, `qcwy/`, `zhilian/`
   - Update internal imports to `from crawler_core.boss.sign import BossSign`, etc.
3. Add `crawler_core` to `Pipfile` as an editable local dependency:
   ```toml
   [packages]
   crawler_core = {path = ".", editable = true}
   ```
   (An editable install needs packaging metadata at the project root, for example a `pyproject.toml` whose `[tool.setuptools.packages.find]` section includes `crawler_core`.)

**Verification gate:** `spiderJobs/` and `app/services/crawler/` both import from `crawler_core` without errors. `spiderJobs/core/` and the `sign.py`, `client.py`, and `api.py` modules under `spiderJobs/platforms/` can then be deleted.
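The "zero `app/*` or `spiderJobs/*` imports" rule can be checked mechanically as part of this gate. Below is a minimal sketch using the standard-library `ast` module; the function names and the `FORBIDDEN_ROOTS` set are assumptions made for illustration, not existing project tooling.

```python
import ast
from pathlib import Path
from typing import List, Tuple

# Top-level packages that crawler_core must never import (assumed list).
FORBIDDEN_ROOTS = {"app", "spiderJobs", "jobs_spider"}


def forbidden_imports(source: str) -> List[str]:
    """Return imported module names in `source` whose top-level package is forbidden."""
    found: List[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            names = [node.module]  # absolute "from x.y import z"
        else:
            continue  # relative imports stay inside the package and are fine
        found.extend(n for n in names if n.split(".")[0] in FORBIDDEN_ROOTS)
    return found


def check_package(root: Path) -> List[Tuple[str, str]]:
    """Scan every .py file under `root`; return (file, module) pairs that break the rule."""
    return [
        (str(path), module)
        for path in sorted(root.rglob("*.py"))
        for module in forbidden_imports(path.read_text(encoding="utf-8"))
    ]


if __name__ == "__main__":
    violations = check_package(Path("crawler_core"))
    for file_name, module in violations:
        print(f"{file_name}: imports {module}")
    raise SystemExit(1 if violations else 0)
```

Run from the project root, a non-zero exit code means the shared package has picked up a dependency on one of its consumers.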

### Phase 2: Rewire `app/services/crawler/` Facades

The service facades (`boss.py`, `qcwy.py`, `zhilian.py`) stay in `app/services/crawler/` because they belong to the backend. Their job is to:

- Load tokens from MySQL (`BossToken`)
- Load proxies from `ProxyController`
- Instantiate `crawler_core` clients with those credentials
- Expose typed, async-friendly methods (`get_company_detail_by_id`)
- Wrap each sync call in `asyncio.to_thread()`

The private underscore modules (`_boss_client.py`, `_boss_sign.py`, etc.) are deleted. Facades import directly from `crawler_core`.

```python
# app/services/crawler/boss.py (after refactor)
from crawler_core.boss.client import BossClient, create_client
from crawler_core.boss.sign import BossSign
from crawler_core.boss.api import GetBrandDetail, SearchBrandJobs
```

**Verification gate:** The `CompanyCleaner` tests pass with `crawler_core` calls mocked.
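A facade shaped this way is straightforward to test with a mocked core. The sketch below is an assumption about how such a facade could look, not the existing `BossService`: the `make_fetcher` factory parameter and the `_async` method name are hypothetical, and the fetcher is a `unittest.mock.Mock` standing in for something like `GetBrandDetail`.

```python
import asyncio
from typing import Any, Callable, Optional
from unittest.mock import Mock


class BossService:
    """Sketch of a backend facade; the constructor signature is hypothetical."""

    def __init__(self, make_fetcher: Callable[[str], Any]):
        # In a real facade this factory would build a crawler_core fetcher
        # from tokens and proxies loaded by the backend.
        self._make_fetcher = make_fetcher

    def get_company_detail_by_id(self, company_id: str) -> Optional[dict]:
        """Sync path: delegate to the core fetcher and unwrap its result."""
        result = self._make_fetcher(company_id).fetch()
        return result.data if result.success else None

    async def get_company_detail_by_id_async(self, company_id: str) -> Optional[dict]:
        """Async path: the same call, moved off the event loop thread."""
        return await asyncio.to_thread(self.get_company_detail_by_id, company_id)


def make_mock_fetcher(success: bool, data: Optional[dict]) -> Mock:
    """Build a mock whose fetch() returns an object with .success and .data."""
    fetcher = Mock()
    fetcher.fetch.return_value = Mock(success=success, data=data)
    return fetcher


if __name__ == "__main__":
    fetcher = make_mock_fetcher(True, {"name": "Acme"})
    service = BossService(lambda company_id: fetcher)
    print(asyncio.run(service.get_company_detail_by_id_async("42")))
```

Because the facade holds no signing or HTTP logic, the test needs no network, no tokens, and no FastAPI app; it only verifies delegation and the thread hand-off.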

### Phase 3: Rewire `spiderJobs/` External Scripts

`spiderJobs/runner/` currently imports `from spiderJobs.core.base import ...`; change this to `from crawler_core.base import ...`. Delete `spiderJobs/core/` and the `sign.py`, `client.py`, and `api.py` modules under `spiderJobs/platforms/`, keeping only the per-platform entry points.

`spiderJobs/platforms/boss/main.py` becomes a thin entry point:

```python
from crawler_core.boss.client import create_client
from crawler_core.boss.api import SearchRecJobs
from spiderJobs.runner.loop import run_crawl_loop
```

`jobs_spider/` (the older framework with monolithic files) is deprecated in favor of `spiderJobs/` + `crawler_core/`.

---
## Anti-Patterns to Avoid

### Anti-Pattern 1: Async Crawler Core

**What:** Making `crawler_core/` async because the backend is async.

**Why bad:** `requests-go` is sync. The TLS fingerprint mechanism is sync. Wrapping everything in `asyncio.run()` or creating fake async wrappers adds complexity with no benefit. The `asyncio.to_thread()` boundary at the facade layer is the correct and idiomatic solution.

**Instead:** Keep `crawler_core/` synchronous. Use `asyncio.to_thread()` in the backend facades.

### Anti-Pattern 2: Putting `RunnerAPIClient` in `crawler_core/`

**What:** Moving the HTTP push client (backend API caller) into the shared core.

**Why bad:** `RunnerAPIClient` is deployment-specific: it knows about `/api/v1/keyword/available`, `/api/v1/universal/data/batch-store-async`, and the backend's URL scheme. That is runner orchestration logic, not platform API logic.

**Instead:** `RunnerAPIClient` stays in `spiderJobs/runner/api_client.py`. It is not shared.

### Anti-Pattern 3: Importing `app.*` from External Scripts

**What:** Having ECS scripts import from `app.settings.config`, `app.models`, or `app.core.*`.

**Why bad:** That would require Tortoise-ORM initialization, database connections, and all backend startup logic to run on ECS nodes. This would fail.

**Instead:** External scripts get all config from environment variables. They communicate with the backend only via HTTP.
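A small config object read from the environment is usually enough for the runner. The following is a sketch only: `RunnerConfig` and every `CRAWLER_*` variable name are invented for illustration and would need to match whatever the ECS deployment actually sets.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class RunnerConfig:
    """Hypothetical runner settings; all variable names below are assumptions."""
    backend_url: str
    platform: str
    poll_interval_s: float = 5.0
    proxy_url: Optional[str] = None

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "RunnerConfig":
        """Build the config from environment variables; fail fast if a required one is missing."""
        required = ("CRAWLER_BACKEND_URL", "CRAWLER_PLATFORM")
        missing = [key for key in required if not env.get(key)]
        if missing:
            raise RuntimeError("missing environment variables: " + ", ".join(missing))
        return cls(
            backend_url=env["CRAWLER_BACKEND_URL"].rstrip("/"),
            platform=env["CRAWLER_PLATFORM"],
            poll_interval_s=float(env.get("CRAWLER_POLL_INTERVAL_S", "5")),
            proxy_url=env.get("CRAWLER_PROXY_URL") or None,
        )


if __name__ == "__main__":
    demo_env = {"CRAWLER_BACKEND_URL": "http://backend:8000/", "CRAWLER_PLATFORM": "boss"}
    print(RunnerConfig.from_env(demo_env))
```

Accepting the mapping as a parameter keeps the loader testable without touching the real process environment, and nothing in it imports `app.*`.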

### Anti-Pattern 4: Blocking the Event Loop with Sync Crawler Calls

**What:** Calling `boss_service.get_company_detail_by_id()` directly in an `async def` without `asyncio.to_thread()`.

**Why bad:** Each HTTP call takes 2-10 seconds. With 30 companies to process, that blocks the event loop for up to 5 minutes (30 × 10 s). Other requests, heartbeats, and scheduler jobs starve.

**Instead:** Always wrap sync crawler calls in `asyncio.to_thread()` inside async context.
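Once calls are offloaded, the loop's default executor becomes the ceiling on concurrent crawler calls. If CPython's default `ThreadPoolExecutor` size of `min(32, CPU count + 4)` is not what the backend wants, the executor can be replaced at startup. This is a sketch: the `max_workers=8` value and the `crawler` thread-name prefix are arbitrary examples, and `current_thread_name` is a stand-in for a real crawler call.

```python
import asyncio
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import List


def current_thread_name() -> str:
    """Stand-in for a sync crawler call; reports which thread ran it."""
    return threading.current_thread().name


async def main(max_workers: int = 8) -> List[str]:
    loop = asyncio.get_running_loop()
    # asyncio.to_thread() submits work to the loop's default executor, so
    # replacing that executor sets the upper bound on concurrent calls.
    loop.set_default_executor(
        ThreadPoolExecutor(max_workers=max_workers, thread_name_prefix="crawler")
    )
    return await asyncio.gather(*(asyncio.to_thread(current_thread_name) for _ in range(4)))


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Named threads also make it easier to spot crawler work in logs and thread dumps.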

### Anti-Pattern 5: Monolithic Platform Files

**What:** Keeping the existing `jobs_spider/boss/boos_api.py` (2246 lines) pattern in the refactored code.

**Why bad:** Impossible to test individual components. Signing, HTTP, and business logic are entangled.

**Instead:** Three files per platform: `sign.py` (pure functions, no I/O), `client.py` (HTTP wrapper, no business logic), `api.py` (endpoint classes).

---
## Scalability Considerations

| Concern | Current | After Refactor |
|---------|---------|---------------|
| Adding a 4th platform (e.g., job51) | Copy files into both `spiderJobs/` and `app/services/crawler/` | Add `crawler_core/job51/` once; both consumers get it |
| Testing signing algorithm | Must run full crawler | Test `crawler_core/boss/sign.py` in isolation: pure functions, no I/O |
| Updating User-Agent headers | Two files to update | One file: `crawler_core/boss/client.py` |
| Running multiple concurrent companies | Single-threaded | `asyncio.to_thread()` uses the event loop's default thread pool; default size is `min(32, CPU count + 4)` workers |
| ECS deployment of crawlers | Must deploy `spiderJobs/` + copy of core | Deploy `crawler_core/` + `spiderJobs/`; single source of truth |

---
## Suggested Build Order for Roadmap Phases

Dependencies flow in one direction:

```
Phase 1: crawler_core/ extraction
  └─ Produces: installable shared package, 3-platform coverage

Phase 2: app/services/crawler/ facade rewire
  └─ Depends on: Phase 1 (imports crawler_core)
  └─ Produces: asyncio.to_thread() boundary, deleted underscore copies

Phase 3: spiderJobs/ rewire + jobs_spider/ deprecation
  └─ Depends on: Phase 1 (imports crawler_core)
  └─ Produces: single external runner codebase

Phase 4: Company cleaning flow optimization
  └─ Depends on: Phase 2 (async-safe facades)
  └─ Produces: CompanyJobsSyncService using new facades

Phase 5: Unit tests
  └─ Depends on: Phase 1 (crawler_core is testable in isolation)
  └─ crawler_core tests: no mocking needed for sign.py (pure functions)
  └─ facade tests: mock crawler_core, test async wrapping
```
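The Phase 5 claim that `sign.py` needs no mocking can be illustrated with a toy signer. To be clear, `make_trace_id` below is a hypothetical HMAC-based function and is not the `BossSign` algorithm; it only shares the relevant property of being a pure function of its inputs.

```python
import hashlib
import hmac


def make_trace_id(secret: bytes, path: str, timestamp_ms: int) -> str:
    """Hypothetical signer: HMAC-SHA256 over path and timestamp, hex-encoded.

    This is NOT the BossSign algorithm. It only has the same shape:
    inputs in, string out, no I/O, no clock reads.
    """
    message = f"{path}|{timestamp_ms}".encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def test_trace_id_is_deterministic() -> None:
    a = make_trace_id(b"k", "/wapi/demo.json", 1_700_000_000_000)
    b = make_trace_id(b"k", "/wapi/demo.json", 1_700_000_000_000)
    assert a == b
    assert len(a) == 64  # hex digest of a 32-byte SHA-256 MAC


def test_trace_id_depends_on_every_input() -> None:
    base = make_trace_id(b"k", "/wapi/demo.json", 1)
    assert base != make_trace_id(b"k", "/wapi/demo.json", 2)
    assert base != make_trace_id(b"k", "/wapi/other.json", 1)
    assert base != make_trace_id(b"x", "/wapi/demo.json", 1)


if __name__ == "__main__":
    test_trace_id_is_deterministic()
    test_trace_id_depends_on_every_input()
    print("ok")
```

Passing the timestamp in as an argument, rather than reading the clock inside the signer, is what makes the output reproducible in a test.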

Phases 2 and 3 can run in parallel (both depend only on Phase 1, not on each other).

---
## Sources

- Direct inspection: `/Users/win/2025/AICoding/JobData/app/services/crawler/_http_client.py` (docstring confirms copy from `spiderJobs/core/http_client.py`)
- Direct inspection: `/Users/win/2025/AICoding/JobData/spiderJobs/core/base.py` vs `app/services/crawler/_base.py`: byte-identical logic
- Direct inspection: `/Users/win/2025/AICoding/JobData/app/services/company_cleaner.py`: async consumer of sync crawlers
- Direct inspection: `/Users/win/2025/AICoding/JobData/spiderJobs/runner/loop.py`: sync-only external runner
- Direct inspection: `/Users/win/2025/AICoding/JobData/.planning/codebase/ARCHITECTURE.md`: existing architecture analysis

---

*Architecture analysis: 2026-03-21*