# Architecture Patterns: Shared Crawler Core
**Domain:** Recruiting data crawler — dual-deployment shared core
**Researched:** 2026-03-21
**Overall confidence:** HIGH (based entirely on direct codebase inspection)
---
## Problem Statement
The codebase currently has the same code in two places, copied verbatim:
| File | Copy location |
|------|---------------|
| `spiderJobs/core/http_client.py` | `app/services/crawler/_http_client.py` |
| `spiderJobs/core/base.py` | `app/services/crawler/_base.py` |
| `spiderJobs/platforms/boss/sign.py` | `app/services/crawler/_boss_sign.py` |
| `spiderJobs/platforms/boss/client.py` | `app/services/crawler/_boss_client.py` |
| `spiderJobs/platforms/boss/api.py` | `app/services/crawler/_boss_api.py` |
The docstring in `app/services/crawler/_http_client.py` says (translated from Chinese): "Copied from spiderJobs/core/http_client.py; do not import spiderJobs directly, to avoid a cross-module dependency." The comment reveals the source of the problem: the team correctly identified that two separate deployment units cannot cross-import, but solved it by copy-paste rather than extraction.
There are also two competing external spider frameworks: `jobs_spider/` (older, monolithic, single large file per platform) and `spiderJobs/` (newer, modular, uses `runner/loop.py` + `runner/api_client.py` as a proper execution harness). The refactor must unify these.
---
## Recommended Architecture
### Core Design Decision: Installable Shared Package
Extract shared crawler code into a proper Python package at `crawler_core/` (project root level). Both deployment contexts import from it. No copying. No circular imports.
```
crawler_core/              # Shared library: zero app/* or spiderJobs/* imports
├── __init__.py
├── http_client.py         # HTTPClient (requests-go, TLS fingerprint, proxy rotation)
├── base.py                # ApiResult, BaseFetcher, BaseSearcher, parse_response()
├── boss/
│   ├── __init__.py
│   ├── sign.py            # BossSign (Traceid generation algorithm)
│   ├── client.py          # BossClient(HTTPClient), header injection
│   └── api.py             # SearchRecJobs, GetBrandDetail, etc.
├── qcwy/
│   ├── __init__.py
│   ├── sign.py
│   ├── client.py
│   └── api.py
└── zhilian/
    ├── __init__.py
    ├── sign.py
    ├── client.py
    └── api.py
```
**Why a root-level package and not inside `app/` or `spiderJobs/`:**
- `app/` is the FastAPI backend. Importing it from external scripts would pull in Tortoise-ORM, APScheduler, ClickHouse, and every other backend dependency — making external scripts impossibly heavy.
- `spiderJobs/` is a deployment directory, not a library.
- A root-level `crawler_core/` has no such baggage. It imports only `requests-go` and stdlib. Both `app/` and `spiderJobs/` import from it.
**Constraints satisfied:**
- No circular imports
- No shared runtime state (crawler_core is stateless pure logic)
- Installable as local package (editable install via `pip install -e .` or `pipenv install -e .`) for ECS deployment
- Tests can import `crawler_core` directly without mocking a FastAPI app
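
The "zero `app/*` or `spiderJobs/*` imports" constraint can be checked mechanically. The following is a minimal sketch, not part of the codebase: `find_forbidden_imports` and `FORBIDDEN_PREFIXES` are hypothetical names, and the check only looks at static `import` statements.

```python
import ast

# Top-level packages that crawler_core must never import (assumed list).
FORBIDDEN_PREFIXES = ("app", "spiderJobs")


def find_forbidden_imports(source: str, forbidden=FORBIDDEN_PREFIXES) -> list:
    """Return the forbidden module names imported by a piece of Python source."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            names = [node.module]
        else:
            continue
        for name in names:
            # Compare the top-level package only: "app.models" matches, "apple" does not.
            if name.split(".")[0] in forbidden:
                hits.append(name)
    return hits
```

In a test, the function could be applied to the text of every `*.py` file under `crawler_core/`, failing the build if any list is non-empty.
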
---
## Component Boundaries
### What Talks to What
```
┌─────────────────────────────────────────────────────────────┐
│                        crawler_core/                        │
│  HTTPClient · BaseFetcher · BaseSearcher · ApiResult        │
│  boss/ · qcwy/ · zhilian/  (sign + client + api)            │
│  Imports: requests-go, stdlib only                          │
└────────────────┬────────────────────────┬───────────────────┘
                 │                        │
    ┌────────────▼──────────┐   ┌─────────▼─────────────────┐
    │  app/services/        │   │  spiderJobs/              │
    │  crawler/             │   │  platforms/ + runner/     │
    │                       │   │                           │
    │  BossService facade   │   │  run_crawl_loop()         │
    │  QcwyService facade   │   │  RunnerAPIClient          │
    │  ZhilianService facade│   │  (HTTP push to backend)   │
    │                       │   │                           │
    │  Called by:           │   │  Called by:               │
    │  CompanyCleaner       │   │  ECS node scripts         │
    │  CompanyController    │   │  jobs_spider/ (deprecated)│
    └────────────┬──────────┘   └───────────────────────────┘
    ┌────────────▼──────────────────────────────────────────┐
    │  app/  (FastAPI backend, async boundary)              │
    │  CompanyCleaner (asyncio, calls sync crawler via      │
    │    asyncio.to_thread())                               │
    │  IngestService (async ClickHouse writes)              │
    │  APScheduler (triggers CompanyCleaner async jobs)     │
    └───────────────────────────────────────────────────────┘
```
### Component Responsibilities
| Component | Responsibility | What It Must NOT Do |
|-----------|---------------|---------------------|
| `crawler_core/` | Signing, HTTP requests, response parsing — per platform | Import `app.*`, use async, maintain global state |
| `app/services/crawler/` facades | Wrap `crawler_core` with logging, token loading, proxy mgmt | Contain signing or HTTP logic |
| `spiderJobs/runner/` | Crawl loop orchestration, keyword polling, data upload | Import `app.*` |
| `app/services/ingest/` | Registry-driven ClickHouse insertion, dedup | Know anything about platform HTTP clients |
| `app/core/scheduler.py` | Schedule jobs, manage distributed lock | Run synchronous I/O on the event loop directly |
---
## Abstract Base Class Hierarchy
```python
# crawler_core/base.py
# Deferred annotations: without this, the field named `list` shadows the
# builtin while `list[dict]` is evaluated in the class body.
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Callable, Optional

from crawler_core.http_client import HTTPClient


@dataclass
class ApiResult:
    success: bool
    status_code: int
    data: Any = None
    list: list[dict] = field(default_factory=list)
    count: int = 0
    is_end_page: bool = True
    error: Optional[str] = None


class BaseFetcher:
    """Single-resource GET (company detail, job detail)."""
    ENDPOINT: str = ""

    def __init__(self, http_client: HTTPClient): ...
    def _build_params(self) -> dict: raise NotImplementedError
    def _parse(self, http_code: int, raw: Any) -> ApiResult: ...
    def fetch(self) -> ApiResult: ...  # sync; wraps _http.get()


class BaseSearcher:
    """Paginated search (job list, brand job list)."""
    ENDPOINT: str = ""

    def __init__(self, page_size: int, http_client: HTTPClient): ...
    def _build_params(self, page_index: int) -> dict: raise NotImplementedError
    def _request(self, params: dict) -> tuple[int, Any]: ...
    def _parse(self, http_code: int, raw: Any) -> ApiResult: ...
    def search(self, page_index: int = 1) -> ApiResult: ...  # sync
    def load_all(self, max_pages: int, on_page: Callable | None) -> list[dict]: ...
```
Platform clients subclass `HTTPClient` from `crawler_core/http_client.py`:
```python
# crawler_core/boss/client.py
class BossClient(HTTPClient):
    """Injects mpt/wt2/Traceid headers on every request."""

    def __init__(self, signer: BossSign, ...): ...
    def post(self, path, body, headers=None) -> tuple[int, Any]: ...  # sync
    def get(self, path, params=None, headers=None) -> tuple[int, Any]: ...  # sync
```
Platform API classes subclass `BaseFetcher` / `BaseSearcher`:
```python
# crawler_core/boss/api.py
class SearchRecJobs(BaseSearcher):
    ENDPOINT = "/wapi/zpgeek/miniapp/homepage/recjoblist.json"

    def _build_params(self, page_index: int) -> dict: ...
    def _parse(self, http_code: int, raw: Any) -> ApiResult: ...  # Boss-specific parsing


class GetBrandDetail(BaseFetcher):
    ENDPOINT = "/wapi/zpgeek/miniapp/brand/get.json"

    def _build_params(self) -> dict: ...
```
**The hierarchy is deliberately flat.** Three levels: `HTTPClient` → `PlatformClient` → `PlatformAPIClass`. No further abstraction.
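
To illustrate how a platform API class plugs into the base, here is a self-contained sketch with stand-ins. `FakeClient`, `DemoJobSearcher`, the `/demo/jobs` endpoint, and the `jobs`/`hasMore` response keys are all hypothetical, and the `load_all` body is one plausible reading of the signature above, not the code in `crawler_core/base.py`.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional, Tuple


@dataclass
class ApiResult:
    success: bool
    status_code: int
    # typing.List keeps the annotation independent of the field name `list`.
    list: List[dict] = field(default_factory=list)
    is_end_page: bool = True
    error: Optional[str] = None


class FakeClient:
    """Stand-in for a platform client: serves canned pages instead of doing HTTP."""

    def __init__(self, pages: dict):
        self._pages = pages

    def get(self, path: str, params: Optional[dict] = None) -> Tuple[int, Any]:
        page = (params or {}).get("page", 1)
        return 200, self._pages.get(page, {"jobs": [], "hasMore": False})


class BaseSearcher:
    ENDPOINT = ""

    def __init__(self, page_size: int, http_client: FakeClient):
        self.page_size = page_size
        self._http = http_client

    def _build_params(self, page_index: int) -> dict:
        raise NotImplementedError

    def _parse(self, http_code: int, raw: Any) -> ApiResult:
        raise NotImplementedError

    def search(self, page_index: int = 1) -> ApiResult:
        code, raw = self._http.get(self.ENDPOINT, self._build_params(page_index))
        return self._parse(code, raw)

    def load_all(self, max_pages: int, on_page: Optional[Callable] = None) -> List[dict]:
        """Collect rows page by page until a failure, the last page, or max_pages."""
        rows: List[dict] = []
        for page in range(1, max_pages + 1):
            result = self.search(page)
            if not result.success:
                break
            rows.extend(result.list)
            if on_page is not None:
                on_page(page, result)
            if result.is_end_page:
                break
        return rows


class DemoJobSearcher(BaseSearcher):
    ENDPOINT = "/demo/jobs"  # hypothetical endpoint

    def _build_params(self, page_index: int) -> dict:
        return {"page": page_index, "pageSize": self.page_size}

    def _parse(self, http_code: int, raw: Any) -> ApiResult:
        ok = http_code == 200
        return ApiResult(ok, http_code, list=raw["jobs"], is_end_page=not raw["hasMore"])
```

The subclass supplies only parameter building and parsing; pagination stays in the base class, which is what keeps per-platform `api.py` files small.
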
---
## Sync vs Async: The Critical Design Point
### Why the Core Must Stay Synchronous
`crawler_core/` uses `requests-go` (a sync library that wraps a TLS fingerprinting mechanism). There is no async version of `requests-go`. The platform APIs require TLS fingerprint spoofing that `httpx` does not support without significant additional work. Keeping the core synchronous is the correct choice given the constraint.
### How the Backend (async) Calls Synchronous Crawlers
`CompanyCleaner` runs inside the FastAPI async event loop (called by APScheduler). It must not block the event loop with synchronous HTTP calls. The correct pattern is `asyncio.to_thread()`:
```python
# app/services/company_cleaner.py
async def _fetch_company_detail(self, source: str, company_id: str) -> Optional[dict]:
    """Run synchronous crawler call off the event loop thread."""
    if source == "boss":
        return await asyncio.to_thread(
            self.boss_service.get_company_detail_by_id, company_id
        )
    elif source == "qcwy":
        return await asyncio.to_thread(
            self.qcwy_service.get_company_detail_by_id, company_id
        )
    elif source == "zhilian":
        return await asyncio.to_thread(
            self.zhilian_service.get_company_detail_by_id, company_id
        )
    return None  # unknown source
```
`asyncio.to_thread()` runs the synchronous function in a thread pool, yielding back to the event loop while it executes. This is the standard Python pattern for sync I/O in an async context.
**Current state:** The existing `CompanyCleaner.process_pending_companies()` likely calls sync methods directly on the event loop — this is a bug to fix during refactoring. The fix is wrapping each crawler call with `asyncio.to_thread()`.
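
The difference between the two call styles can be observed with a small experiment. This is a sketch with stand-ins: `slow_fetch` simulates a blocking crawler call with `time.sleep`, and `heartbeat` plays the role of any other coroutine that needs the event loop.

```python
import asyncio
import time


def slow_fetch(company_id: str) -> dict:
    """Stand-in for a sync crawler call; time.sleep simulates blocking network I/O."""
    time.sleep(0.2)
    return {"company_id": company_id}


async def heartbeat(ticks: list, stop: asyncio.Event) -> None:
    """Record a tick every 20 ms for as long as the event loop is free to run us."""
    while not stop.is_set():
        ticks.append(time.monotonic())
        await asyncio.sleep(0.02)


async def run_demo(use_thread: bool) -> int:
    """Return how many heartbeat ticks happened while one fetch was in progress."""
    ticks: list = []
    stop = asyncio.Event()
    beat = asyncio.create_task(heartbeat(ticks, stop))
    await asyncio.sleep(0)  # give the heartbeat a chance to start
    if use_thread:
        await asyncio.to_thread(slow_fetch, "c1")  # event loop keeps running
    else:
        slow_fetch("c1")  # event loop is blocked for the whole call
    stop.set()
    await beat
    return len(ticks)
```

Running `asyncio.run(run_demo(False))` should report almost no ticks, because the heartbeat cannot run while the loop is blocked; `asyncio.run(run_demo(True))` should report several.
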
### External Scripts: Pure Synchronous, No Async
`spiderJobs/runner/loop.py` (`run_crawl_loop`) and any ECS scripts are pure synchronous `while True` loops. They import from `crawler_core/` directly. No async. This is correct and intentional.
```
External Script (sync)                  Backend (async)
──────────────────────                  ─────────────────────────────
while True:                             async def process_company():
    kw = api.fetch_keyword()                detail = await asyncio.to_thread(
    searcher = BossSearcher(...)                boss_service.get_company_detail
    result = searcher.search()              )
    api.upload_data(result.list)            await company_storage.save(detail)
```
---
## Data Flow Patterns
### Flow 1: External Spider → Backend Storage
```
ECS Script (spiderJobs/runner/loop.py)
  ├─ GET /api/v1/keyword/available
  │    → KeywordController marks keyword as "crawling"
  ├─ crawler_core/boss/api.py  SearchRecJobs.search(page)
  │    → returns ApiResult.list (raw job dicts)
  ├─ POST /api/v1/universal/data/batch-store-async
  │    body: {platform, channel, data_type, data_list}
  └─ Backend: IngestService.store_batch()
       → registry lookup → dedup → ClickHouse insert
```
The external script knows nothing about ClickHouse. It only pushes raw dicts over HTTP. This boundary is correct and must stay.
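
One iteration of this flow can be sketched with the three steps injected as callables, which keeps the example free of real HTTP. `crawl_once` and `build_batch_body` are hypothetical helpers; the body keys come from the flow above, while the `"miniapp"` channel and `"job"` data type values are assumptions.

```python
from typing import Callable, List, Optional


def build_batch_body(platform: str, channel: str, data_type: str, rows: List[dict]) -> dict:
    """Body for POST /api/v1/universal/data/batch-store-async."""
    return {
        "platform": platform,
        "channel": channel,
        "data_type": data_type,
        "data_list": rows,
    }


def crawl_once(
    fetch_keyword: Callable[[], Optional[str]],
    search: Callable[[str], List[dict]],
    upload: Callable[[dict], None],
) -> int:
    """One loop iteration: take a keyword, search, push raw rows. Returns rows pushed."""
    keyword = fetch_keyword()
    if keyword is None:  # backend has no keyword available right now
        return 0
    rows = search(keyword)
    if rows:
        # The script only forwards raw dicts; storage decisions stay in the backend.
        upload(build_batch_body("boss", "miniapp", "job", rows))
    return len(rows)
```
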
### Flow 2: In-Process Company Cleaning (Scheduled)
```
APScheduler → company_cleaning_job()  [async]
  ├─ CompanyCleaner.collect_pending_companies()  [async]
  │    → ClickHouse: query job tables for company IDs
  │    → MySQL: check which company IDs already processed
  │    → MySQL: insert new companies into CompanyCleaningQueue
  └─ CompanyCleaner.process_pending_companies()  [async]
       ├─ MySQL: fetch N companies from queue
       ├─ asyncio.to_thread(boss_service.get_company_detail_by_id)
       │    → crawler_core/boss/api.py  GetBrandDetail.fetch()  [sync]
       ├─ company_storage.save_company()  [async, ClickHouse]
       └─ MySQL: update company status to "done"
```
The flow crosses the sync/async boundary exactly once: at `asyncio.to_thread()`. Everything above (ClickHouse queries, MySQL operations) stays async. Everything below (HTTP to Boss API) stays sync.
### Flow 3: Company Jobs Sync (new, post-refactor)
After a company is cleaned, the system should backfill that company's current job listings:
```
CompanyJobsSyncService.sync_company_jobs(company_id, source)  [async]
  ├─ asyncio.to_thread(boss_service.get_company_jobs_by_id, company_id)
  │    → crawler_core/boss/api.py  SearchBrandJobs.search()  [sync]
  └─ IngestService.store_batch(platform="boss", data_type="company_job")
       → ClickHouse insert into boss_company_job table (new table needed)
```
---
## Package Structure: Suggested Build Order
Build `crawler_core/` first — everything else depends on it.
### Phase 1: Extract `crawler_core/`
1. Create `crawler_core/__init__.py`, `crawler_core/http_client.py`, `crawler_core/base.py`
- These are identical to the current `spiderJobs/core/` files with one import path change
- Zero functional change; pure extraction
2. Create `crawler_core/boss/`, `crawler_core/qcwy/`, `crawler_core/zhilian/`
- Extract from `spiderJobs/platforms/boss/`, `spiderJobs/platforms/qcwy/`, and `spiderJobs/platforms/zhilian/`
- Update internal imports to `from crawler_core.boss.sign import BossSign`, etc.
3. Add `crawler_core` to `Pipfile` as an editable local dependency:
```toml
[packages]
crawler_core = {path = ".", editable = true}
```
(Or configure `pyproject.toml` with `[tool.setuptools.packages.find]` including `crawler_core`)
**Verification gate:** `spiderJobs/` and `app/services/crawler/` both import from `crawler_core` without errors. `spiderJobs/core/` and `spiderJobs/platforms/` can be deleted.
### Phase 2: Rewire `app/services/crawler/` Facades
The service facades (`boss.py`, `qcwy.py`, `zhilian.py`) stay in `app/services/crawler/` — they belong to the backend. Their job is to:
- Load tokens from MySQL (`BossToken`)
- Load proxies from `ProxyController`
- Instantiate `crawler_core` clients with those credentials
- Expose typed async-friendly methods (`get_company_detail_by_id`)
- Wrap each sync `crawler_core` call in `asyncio.to_thread()`
The private underscore modules (`_boss_client.py`, `_boss_sign.py`, etc.) are deleted. Facades import directly from `crawler_core`.
```python
# app/services/crawler/boss.py (after refactor)
from crawler_core.boss.client import BossClient, create_client
from crawler_core.boss.sign import BossSign
from crawler_core.boss.api import GetBrandDetail, SearchBrandJobs
```
**Verification gate:** `CompanyCleaner` test passes with mocked `crawler_core` calls.
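
A facade and its mocked test might look roughly like the sketch below. It is an illustration under assumptions: the real `BossService` also loads tokens and proxies, the `fetcher_factory` injection point and the `_async` method name are hypothetical, and the mock stands in for `crawler_core.boss.api.GetBrandDetail`.

```python
import asyncio
from typing import Callable, Optional
from unittest.mock import MagicMock


class BossService:
    """Backend facade sketch: delegates HTTP work to a crawler_core fetcher."""

    def __init__(self, fetcher_factory: Callable[[str], object]):
        # fetcher_factory(company_id) returns an object with a sync .fetch()
        # method that yields an ApiResult-like value (success, data).
        self._fetcher_factory = fetcher_factory

    def get_company_detail_by_id(self, company_id: str) -> Optional[dict]:
        """Sync call: blocks the calling thread while the request runs."""
        result = self._fetcher_factory(company_id).fetch()
        return result.data if result.success else None

    async def get_company_detail_by_id_async(self, company_id: str) -> Optional[dict]:
        """Async-friendly wrapper: moves the sync call off the event loop thread."""
        return await asyncio.to_thread(self.get_company_detail_by_id, company_id)


def make_mock_factory(success: bool, data: Optional[dict]) -> MagicMock:
    """Build a factory whose fetchers return a canned ApiResult-like object."""
    fetcher = MagicMock()
    fetcher.fetch.return_value = MagicMock(success=success, data=data)
    return MagicMock(return_value=fetcher)
```

With the factory mocked, a test can assert both the returned payload and that the facade asked for the right company ID, without any network access.
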
### Phase 3: Rewire `spiderJobs/` External Scripts
`spiderJobs/runner/` imports `from spiderJobs.core.base import ...` → change to `from crawler_core.base import ...`. Delete `spiderJobs/core/` and `spiderJobs/platforms/`.
`spiderJobs/platforms/boss/main.py` becomes a thin entry point:
```python
from crawler_core.boss.client import create_client
from crawler_core.boss.api import SearchRecJobs
from spiderJobs.runner.loop import run_crawl_loop
```
`jobs_spider/` (the older framework with monolithic files) is deprecated in favor of `spiderJobs/` + `crawler_core/`.
---
## Anti-Patterns to Avoid
### Anti-Pattern 1: Async Crawler Core
**What:** Making `crawler_core/` async because the backend is async.
**Why bad:** `requests-go` is sync. The TLS fingerprint mechanism is sync. Wrapping everything in `asyncio.run()` or creating fake async wrappers adds complexity with no benefit. The `asyncio.to_thread()` boundary at the facade layer is the correct and idiomatic solution.
**Instead:** Keep `crawler_core/` synchronous. Use `asyncio.to_thread()` in the backend facades.
### Anti-Pattern 2: Putting `RunnerAPIClient` in `crawler_core/`
**What:** Moving the HTTP push client (backend API caller) into the shared core.
**Why bad:** `RunnerAPIClient` is deployment-specific: it knows about `/api/v1/keyword/available`, `/api/v1/universal/data/batch-store-async`, and the backend's URL scheme. That is runner orchestration logic, not platform API logic.
**Instead:** `RunnerAPIClient` stays in `spiderJobs/runner/api_client.py`. It is not shared.
### Anti-Pattern 3: Importing `app.*` from External Scripts
**What:** Having ECS scripts import from `app.settings.config`, `app.models`, or `app.core.*`.
**Why bad:** That would require Tortoise-ORM initialization, database connections, and all backend startup logic to run on ECS nodes. This would fail.
**Instead:** External scripts get all config from environment variables. They communicate with the backend only via HTTP.
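
A minimal sketch of environment-only configuration for an ECS script follows. The `RunnerConfig` class and the `JOBDATA_*` variable names are illustrative and not taken from the codebase.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class RunnerConfig:
    backend_url: str
    proxy_url: Optional[str]
    page_size: int

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "RunnerConfig":
        """Read settings from environment variables; no app.* import required."""
        # Variable names are hypothetical placeholders.
        backend_url = env.get("JOBDATA_BACKEND_URL")
        if not backend_url:
            raise RuntimeError("JOBDATA_BACKEND_URL must be set")
        return cls(
            backend_url=backend_url.rstrip("/"),
            proxy_url=env.get("JOBDATA_PROXY_URL") or None,
            page_size=int(env.get("JOBDATA_PAGE_SIZE", "20")),
        )
```

Passing a plain dict as `env` makes the loader testable without touching the real process environment.
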
### Anti-Pattern 4: Blocking the Event Loop with Sync Crawler Calls
**What:** Calling `boss_service.get_company_detail_by_id()` directly in an `async def` without `asyncio.to_thread()`.
**Why bad:** Each HTTP call takes 2-10 seconds. With 30 companies to process, that blocks the event loop for up to 5 minutes. Other requests, heartbeats, and scheduler jobs starve.
**Instead:** Always wrap sync crawler calls in `asyncio.to_thread()` inside async context.
### Anti-Pattern 5: Monolithic Platform Files
**What:** Keeping the existing `jobs_spider/boss/boos_api.py` (2246 lines) pattern in the refactored code.
**Why bad:** Impossible to test individual components. Signing, HTTP, and business logic are entangled.
**Instead:** Three files per platform: `sign.py` (pure functions, no I/O), `client.py` (HTTP wrapper, no business logic), `api.py` (endpoint classes).
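
The benefit of keeping `sign.py` free of I/O is that signing becomes deterministic and trivially testable. The sketch below shows the shape only: `make_trace_id` is a placeholder built on HMAC-SHA256 and is explicitly not the Boss Traceid algorithm, which is not reproduced here.

```python
import hashlib
import hmac


def make_trace_id(path: str, timestamp_ms: int, nonce: str, secret: bytes) -> str:
    """Placeholder signer: a pure function of its inputs (NOT the real Boss algorithm)."""
    message = f"{path}|{timestamp_ms}|{nonce}".encode("utf-8")
    digest = hmac.new(secret, message, hashlib.sha256).hexdigest()
    # Timestamp and nonce are parameters, never read from the clock or a RNG here,
    # so the same inputs always give the same output.
    return f"{timestamp_ms:x}-{digest[:16]}"
```

Because the clock and randomness are passed in rather than read inside the function, a unit test can pin them and compare outputs directly.
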
---
## Scalability Considerations
| Concern | Current | After Refactor |
|---------|---------|---------------|
| Adding a 4th platform (e.g., job51) | Copy files into both `spiderJobs/` and `app/services/crawler/` | Add `crawler_core/job51/` once; both consumers get it |
| Testing signing algorithm | Must run full crawler | Test `crawler_core/boss/sign.py` in isolation — pure functions, no I/O |
| Updating User-Agent headers | Two files to update | One file: `crawler_core/boss/client.py` |
| Running multiple concurrent companies | Single-threaded | `asyncio.to_thread()` uses the event loop's default thread pool; default size is `min(32, CPU count + 4)` on Python 3.8+ |
| ECS deployment of crawlers | Must deploy `spiderJobs/` + copy of core | Deploy `crawler_core/` + `spiderJobs/`; single source of truth |
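
For the concurrent-companies row, relying on the default pool size alone may send more parallel requests to a platform than intended. One way to cap it is a semaphore around `asyncio.to_thread()`. This is a sketch: `fetch_many` is a hypothetical helper and `fetch_one` stands for any sync facade method.

```python
import asyncio
import time
from typing import Callable, List, Optional


async def fetch_many(
    fetch_one: Callable[[str], Optional[dict]],
    company_ids: List[str],
    max_concurrency: int = 4,
) -> List[Optional[dict]]:
    """Run a sync fetch for many IDs with at most max_concurrency calls in flight."""
    gate = asyncio.Semaphore(max_concurrency)

    async def guarded(company_id: str) -> Optional[dict]:
        async with gate:
            return await asyncio.to_thread(fetch_one, company_id)

    # gather returns results in the same order as company_ids.
    return await asyncio.gather(*(guarded(cid) for cid in company_ids))
```
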
---
## Suggested Build Order for Roadmap Phases
Dependencies flow in one direction:
```
Phase 1: crawler_core/ extraction
  └─ Produces: installable shared package, 3-platform coverage
Phase 2: app/services/crawler/ facade rewire
  ├─ Depends on: Phase 1 (imports crawler_core)
  └─ Produces: asyncio.to_thread() boundary, deleted underscore copies
Phase 3: spiderJobs/ rewire + jobs_spider/ deprecation
  ├─ Depends on: Phase 1 (imports crawler_core)
  └─ Produces: single external runner codebase
Phase 4: Company cleaning flow optimization
  ├─ Depends on: Phase 2 (async-safe facades)
  └─ Produces: CompanyJobsSyncService using new facades
Phase 5: Unit tests
  ├─ Depends on: Phase 1 (crawler_core is testable in isolation)
  ├─ crawler_core tests: no mocking needed for sign.py (pure functions)
  └─ facade tests: mock crawler_core, test async wrapping
```
Phases 2 and 3 can run in parallel (both depend only on Phase 1, not each other).
---
## Sources
- Direct inspection: `/Users/win/2025/AICoding/JobData/app/services/crawler/_http_client.py` (docstring confirms copy from `spiderJobs/core/http_client.py`)
- Direct inspection: `/Users/win/2025/AICoding/JobData/spiderJobs/core/base.py` vs `app/services/crawler/_base.py` — byte-identical logic
- Direct inspection: `/Users/win/2025/AICoding/JobData/app/services/company_cleaner.py` — async consumer of sync crawlers
- Direct inspection: `/Users/win/2025/AICoding/JobData/spiderJobs/runner/loop.py` — sync-only external runner
- Direct inspection: `/Users/win/2025/AICoding/JobData/.planning/codebase/ARCHITECTURE.md` — existing architecture analysis
---
*Architecture analysis: 2026-03-21*