JobData/.planning/research/ARCHITECTURE.md

# Architecture Patterns: Shared Crawler Core

**Domain:** Recruiting data crawler — dual-deployment shared core
**Researched:** 2026-03-21
**Overall confidence:** HIGH (based entirely on direct codebase inspection)

---

## Problem Statement

The codebase currently has the same code in two places, copied verbatim:

| Original (`spiderJobs/`) | Copy (`app/services/crawler/`) |
|------|---------------|
| `spiderJobs/core/http_client.py` | `app/services/crawler/_http_client.py` |
| `spiderJobs/core/base.py` | `app/services/crawler/_base.py` |
| `spiderJobs/platforms/boss/sign.py` | `app/services/crawler/_boss_sign.py` |
| `spiderJobs/platforms/boss/client.py` | `app/services/crawler/_boss_client.py` |
| `spiderJobs/platforms/boss/api.py` | `app/services/crawler/_boss_api.py` |

The docstring in `app/services/crawler/_http_client.py` states the reason outright (translated from Chinese): "Copied from spiderJobs/core/http_client.py; do not import spiderJobs directly, to avoid a cross-module dependency." The comment reveals the root cause: the team correctly recognized that two separate deployment units should not import from each other, but solved it by copy-paste instead of extracting a shared package.

There are also two competing external spider frameworks: `jobs_spider/` (older, monolithic, single large file per platform) and `spiderJobs/` (newer, modular, uses `runner/loop.py` + `runner/api_client.py` as a proper execution harness). The refactor must unify these.

---

## Recommended Architecture

### Core Design Decision: Installable Shared Package

Extract shared crawler code into a proper Python package at `crawler_core/` (project root level). Both deployment contexts import from it. No copying. No circular imports.

```
crawler_core/                     # Shared library — zero app/* or spiderJobs/* imports
├── __init__.py
├── http_client.py                # HTTPClient (requests-go, TLS fingerprint, proxy rotation)
├── base.py                       # ApiResult, BaseFetcher, BaseSearcher, parse_response()
├── boss/
│   ├── __init__.py
│   ├── sign.py                   # BossSign (Traceid generation algorithm)
│   ├── client.py                 # BossClient(HTTPClient) — header injection
│   └── api.py                    # SearchRecJobs, GetBrandDetail, etc.
├── qcwy/
│   ├── __init__.py
│   ├── sign.py
│   ├── client.py
│   └── api.py
└── zhilian/
    ├── __init__.py
    ├── sign.py
    ├── client.py
    └── api.py
```

**Why a root-level package and not inside `app/` or `spiderJobs/`:**
- `app/` is the FastAPI backend. Importing it from external scripts would pull in Tortoise-ORM, APScheduler, ClickHouse, and every other backend dependency — making external scripts impossibly heavy.
- `spiderJobs/` is a deployment directory, not a library.
- A root-level `crawler_core/` has no such baggage. It imports only `requests-go` and stdlib. Both `app/` and `spiderJobs/` import from it.

**Constraints satisfied:**
- No circular imports
- No shared runtime state (`crawler_core` keeps no module-level state; any state lives in the client instances its callers create)
- Installable as local package (editable install via `pip install -e .` or `pipenv install -e .`) for ECS deployment
- Tests can import `crawler_core` directly without mocking a FastAPI app
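
The "zero `app/*` or `spiderJobs/*` imports" rule can be enforced mechanically. The following is a minimal sketch, assuming a helper named `forbidden_imports` (hypothetical, not part of the codebase) that parses a module's source with the standard `ast` module and reports any import whose top-level package is off limits:

```python
import ast

# Assumption: these are the only top-level packages the shared core must avoid.
FORBIDDEN_ROOTS = ("app", "spiderJobs")


def forbidden_imports(source: str, forbidden: tuple[str, ...] = FORBIDDEN_ROOTS) -> list[str]:
    """Return every imported module in `source` whose top-level package is forbidden."""
    found: list[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # level == 0 means an absolute import; relative imports stay inside the package.
            names = [node.module]
        else:
            continue
        found.extend(name for name in names if name.split(".")[0] in forbidden)
    return found
```

A test could read every `*.py` file under `crawler_core/` and assert that this function returns an empty list for each one.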

---

## Component Boundaries

### What Talks to What

```
┌─────────────────────────────────────────────────────────────┐
│                      crawler_core/                          │
│   HTTPClient · BaseFetcher · BaseSearcher · ApiResult       │
│   boss/ · qcwy/ · zhilian/  (sign + client + api)          │
│   Imports: requests-go, stdlib only                         │
└────────────────┬────────────────────────┬───────────────────┘
                 │                        │
    ┌────────────▼──────────┐  ┌─────────▼─────────────────┐
    │  app/services/        │  │  spiderJobs/               │
    │  crawler/             │  │  platforms/ + runner/      │
    │                       │  │                            │
    │  BossService facade   │  │  run_crawl_loop()          │
    │  QcwyService facade   │  │  RunnerAPIClient           │
    │  ZhilianService facade│  │  (HTTP push to backend)    │
    │                       │  │                            │
    │  Called by:           │  │  Called by:                │
    │  CompanyCleaner       │  │  ECS node scripts          │
    │  CompanyController    │  │  jobs_spider/ (deprecated) │
    └────────────┬──────────┘  └────────────────────────────┘
                 │
    ┌────────────▼──────────────────────────────────────────┐
    │  app/  (FastAPI backend — async boundary)             │
    │  CompanyCleaner (asyncio, calls sync crawler via      │
    │  asyncio.to_thread())                                 │
    │  IngestService (async ClickHouse writes)              │
    │  APScheduler (triggers CompanyCleaner async jobs)     │
    └───────────────────────────────────────────────────────┘
```

### Component Responsibilities

| Component | Responsibility | What It Must NOT Do |
|-----------|---------------|---------------------|
| `crawler_core/` | Signing, HTTP requests, response parsing — per platform | Import `app.*`, use async, maintain global state |
| `app/services/crawler/` facades | Wrap `crawler_core` with logging, token loading, proxy mgmt | Contain signing or HTTP logic |
| `spiderJobs/runner/` | Crawl loop orchestration, keyword polling, data upload | Import `app.*` |
| `app/services/ingest/` | Registry-driven ClickHouse insertion, dedup | Know anything about platform HTTP clients |
| `app/core/scheduler.py` | Schedule jobs, manage distributed lock | Run synchronous I/O on the event loop directly |

---

## Abstract Base Class Hierarchy

```python
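# crawler_core/base.py
# Keep annotations lazy: the `list` field below shadows the builtin inside the class body.
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Callable, Optional

from crawler_core.http_client import HTTPClient


@dataclass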
class ApiResult:
    success: bool
    status_code: int
    data: Any = None
    list: list[dict] = field(default_factory=list)
    count: int = 0
    is_end_page: bool = True
    error: Optional[str] = None


class BaseFetcher:
    """Single-resource GET (company detail, job detail)."""
    ENDPOINT: str = ""

    def __init__(self, http_client: HTTPClient): ...
    def _build_params(self) -> dict: raise NotImplementedError
    def _parse(self, http_code: int, raw: Any) -> ApiResult: ...
    def fetch(self) -> ApiResult: ...          # sync — wraps _http.get()


class BaseSearcher:
    """Paginated search (job list, brand job list)."""
    ENDPOINT: str = ""

    def __init__(self, page_size: int, http_client: HTTPClient): ...
    def _build_params(self, page_index: int) -> dict: raise NotImplementedError
    def _request(self, params: dict) -> tuple[int, Any]: ...
    def _parse(self, http_code: int, raw: Any) -> ApiResult: ...
    def search(self, page_index: int = 1) -> ApiResult: ...  # sync
    def load_all(self, max_pages: int, on_page: Callable | None) -> list[dict]: ...
```

Platform clients subclass `HTTPClient` from `crawler_core/http_client.py`:

```python
# crawler_core/boss/client.py
class BossClient(HTTPClient):
    """Injects mpt/wt2/Traceid headers on every request."""
    def __init__(self, signer: BossSign, ...): ...
    def post(self, path, body, headers=None) -> tuple[int, Any]: ...  # sync
    def get(self, path, params=None, headers=None) -> tuple[int, Any]: ...  # sync
```

Platform API classes subclass `BaseFetcher` / `BaseSearcher`:

```python
# crawler_core/boss/api.py
class SearchRecJobs(BaseSearcher):
    ENDPOINT = "/wapi/zpgeek/miniapp/homepage/recjoblist.json"
    def _build_params(self, page_index: int) -> dict: ...
    def _parse(self, http_code: int, raw: Any) -> ApiResult: ...  # Boss-specific parsing

class GetBrandDetail(BaseFetcher):
    ENDPOINT = "/wapi/zpgeek/miniapp/brand/get.json"
    def _build_params(self) -> dict: ...
```

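**The hierarchy is deliberately flat.** There are two short inheritance chains: `HTTPClient` → platform client (e.g. `BossClient`), and `BaseFetcher` / `BaseSearcher` → platform API class (e.g. `GetBrandDetail`). API classes receive a client through their constructor rather than inheriting from it. No further abstraction.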
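
To make the `search()` / `load_all()` contract concrete, here is a small runnable sketch. `FakeSearcher` is a hypothetical stand-in for a `BaseSearcher` subclass that serves canned pages instead of doing HTTP, and the stopping rules in `load_all()` (stop on failure, on the last page, or at `max_pages`) are an assumption about the intended behaviour, not a copy of the existing implementation:

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class ApiResult:
    success: bool
    status_code: int
    data: Any = None
    list: list[dict] = field(default_factory=list)
    is_end_page: bool = True
    error: Optional[str] = None


class FakeSearcher:
    """Hypothetical stand-in for a BaseSearcher subclass; no network access."""

    def __init__(self, pages: list[list[dict]]):
        self._pages = pages

    def search(self, page_index: int = 1) -> ApiResult:
        if page_index > len(self._pages):
            return ApiResult(success=False, status_code=404, error="no such page")
        return ApiResult(
            success=True,
            status_code=200,
            list=self._pages[page_index - 1],
            is_end_page=page_index == len(self._pages),
        )

    def load_all(
        self, max_pages: int, on_page: Callable[[ApiResult], None] | None = None
    ) -> list[dict]:
        """Assumed behaviour: stop on failure, on the last page, or at max_pages."""
        items: list[dict] = []
        for page_index in range(1, max_pages + 1):
            result = self.search(page_index)
            if not result.success:
                break
            items.extend(result.list)
            if on_page is not None:
                on_page(result)
            if result.is_end_page:
                break
        return items
```

Because the searcher only depends on `search()`, the pagination loop can be unit-tested without any platform client.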

---

## Sync vs Async: The Critical Design Point

### Why the Core Must Stay Synchronous

`crawler_core/` uses `requests-go` (a sync library that wraps a TLS fingerprinting mechanism). There is no async version of `requests-go`. The platform APIs require TLS fingerprint spoofing that `httpx` does not support without significant additional work. Keeping the core synchronous is the correct choice given the constraint.

### How the Backend (async) Calls Synchronous Crawlers

`CompanyCleaner` runs inside the FastAPI async event loop (called by APScheduler), so it must not block that loop with synchronous crawler calls. The correct pattern is `asyncio.to_thread()`:

```python
# app/services/company_cleaner.py

async def _fetch_company_detail(self, source: str, company_id: str) -> Optional[dict]:
    """Run synchronous crawler call off the event loop thread."""
    if source == "boss":
        return await asyncio.to_thread(
            self.boss_service.get_company_detail_by_id, company_id
        )
    elif source == "qcwy":
        return await asyncio.to_thread(
            self.qcwy_service.get_company_detail_by_id, company_id
        )
    elif source == "zhilian":
        return await asyncio.to_thread(
            self.zhilian_service.get_company_detail_by_id, company_id
        )
```

`asyncio.to_thread()` (available since Python 3.9) runs the synchronous function in the event loop's default thread pool, and the loop keeps serving other tasks while the call is in progress. This is the standard Python pattern for sync I/O in an async context.
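
The effect is easy to observe in isolation. In the sketch below, `slow_fetch` is a hypothetical stand-in for a sync facade call (it sleeps instead of doing HTTP), and a heartbeat task counts how often the event loop gets to run while the fetches are in flight:

```python
import asyncio
import time


def slow_fetch(company_id: str) -> dict:
    """Hypothetical stand-in for a sync facade call; sleeps instead of doing HTTP."""
    time.sleep(0.2)
    return {"company_id": company_id}


async def fetch_many(company_ids: list[str]) -> tuple[list[dict], int]:
    """Fetch all companies in worker threads while a heartbeat keeps ticking."""
    ticks = 0

    async def heartbeat() -> None:
        nonlocal ticks
        while True:
            await asyncio.sleep(0.02)
            ticks += 1

    beat = asyncio.create_task(heartbeat())
    try:
        # gather() preserves the order of its arguments in the result list.
        details = await asyncio.gather(
            *(asyncio.to_thread(slow_fetch, cid) for cid in company_ids)
        )
    finally:
        beat.cancel()
    return list(details), ticks
```

If `slow_fetch` were called directly inside the coroutine, the heartbeat would not tick at all until every fetch had finished.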

**Current state:** The existing `CompanyCleaner.process_pending_companies()` likely calls sync methods directly on the event loop — this is a bug to fix during refactoring. The fix is wrapping each crawler call with `asyncio.to_thread()`.

### External Scripts: Pure Synchronous, No Async

`spiderJobs/runner/loop.py` (`run_crawl_loop`) and any ECS scripts are pure synchronous `while True` loops. They import from `crawler_core/` directly. No async. This is correct and intentional.

```
External Script (sync)          Backend (async)
──────────────────────          ─────────────────────────────
while True:                     async def process_company():
  kw = api.fetch_keyword()        detail = await asyncio.to_thread(
  searcher = SearchRecJobs(...)       boss_service.get_company_detail_by_id, company_id
  result = searcher.search()      )
  api.upload_data(result.list)    await company_storage.save(detail)
```

---

## Data Flow Patterns

### Flow 1: External Spider → Backend Storage

```
ECS Script (spiderJobs/runner/loop.py)
  │
  ├─ GET /api/v1/keyword/available
  │    → KeywordController marks keyword as "crawling"
  │
  ├─ crawler_core/boss/api.py SearchRecJobs.search(page)
  │    → returns ApiResult.list (raw job dicts)
  │
  ├─ POST /api/v1/universal/data/batch-store-async
  │    body: {platform, channel, data_type, data_list}
  │
  └─ Backend: IngestService.store_batch()
       → registry lookup → dedup → ClickHouse insert
```

The external script knows nothing about ClickHouse. It only pushes raw dicts over HTTP. This boundary is correct and must stay.
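
One iteration of that flow can be sketched with the HTTP pieces injected as plain callables, which keeps the boundary visible. The function name `crawl_keyword`, its parameters, and the example values `"miniapp"` and `"job"` are assumptions for illustration; only the four payload keys come from the flow above:

```python
from typing import Callable


def crawl_keyword(
    search_page: Callable[[int], tuple[list[dict], bool]],
    upload: Callable[[dict], None],
    platform: str,
    channel: str,
    data_type: str,
    max_pages: int = 5,
) -> int:
    """Crawl one keyword page by page, uploading each non-empty page as one batch.

    `search_page(page_index)` returns (rows, is_end_page).
    Returns the number of rows uploaded.
    """
    uploaded = 0
    for page_index in range(1, max_pages + 1):
        rows, is_end_page = search_page(page_index)
        if rows:
            # Payload shape follows the batch-store request body: raw dicts only.
            upload({
                "platform": platform,
                "channel": channel,
                "data_type": data_type,
                "data_list": rows,
            })
            uploaded += len(rows)
        if is_end_page:
            break
    return uploaded
```

In the real runner, `search_page` would wrap a `crawler_core` searcher and `upload` would wrap the `RunnerAPIClient` POST; neither needs to know about ClickHouse.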

### Flow 2: In-Process Company Cleaning (Scheduled)

```
APScheduler → company_cleaning_job() [async]
  │
  ├─ CompanyCleaner.collect_pending_companies()  [async]
  │    → ClickHouse: query job tables for company IDs
  │    → MySQL: check which company IDs already processed
  │    → MySQL: insert new companies into CompanyCleaningQueue
  │
  └─ CompanyCleaner.process_pending_companies()  [async]
       ├─ MySQL: fetch N companies from queue
       ├─ asyncio.to_thread(boss_service.get_company_detail_by_id)
       │    → crawler_core/boss/api.py GetBrandDetail.fetch() [sync]
       ├─ company_storage.save_company()          [async, ClickHouse]
       └─ MySQL: update company status to "done"
```

The flow crosses the sync/async boundary exactly once: at `asyncio.to_thread()`. Everything above (ClickHouse queries, MySQL operations) stays async. Everything below (HTTP to Boss API) stays sync.

### Flow 3: Company Jobs Sync (new, post-refactor)

After a company is cleaned, the system should backfill that company's current job listings:

```
CompanyJobsSyncService.sync_company_jobs(company_id, source) [async]
  ├─ asyncio.to_thread(boss_service.get_company_jobs_by_id, company_id)
  │    → crawler_core/boss/api.py SearchBrandJobs.search() [sync]
  └─ IngestService.store_batch(platform="boss", data_type="company_job")
       → ClickHouse insert into boss_company_job table (new table needed)
```
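
A minimal sketch of this service as a single coroutine follows. The sync fetcher and the async store are passed in as callables; the keyword arguments given to `store_batch` are an assumption, since the real `IngestService.store_batch()` signature is not shown here:

```python
import asyncio
from typing import Awaitable, Callable


async def sync_company_jobs(
    company_id: str,
    fetch_jobs: Callable[[str], list[dict]],      # sync, e.g. a facade method
    store_batch: Callable[..., Awaitable[int]],   # async, e.g. an ingest method
    platform: str = "boss",
) -> int:
    """Fetch a company's jobs off the event loop, then store them asynchronously."""
    jobs = await asyncio.to_thread(fetch_jobs, company_id)
    if not jobs:
        return 0  # nothing to store; skip the ingest call entirely
    return await store_batch(platform=platform, data_type="company_job", data_list=jobs)
```

The coroutine crosses the sync/async boundary once, in the same way as Flow 2.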

---

## Package Structure: Suggested Build Order

Build `crawler_core/` first — everything else depends on it.

### Phase 1: Extract `crawler_core/`

1. Create `crawler_core/__init__.py`, `crawler_core/http_client.py`, `crawler_core/base.py`
   - These are identical to the current `spiderJobs/core/` files with one import path change
   - Zero functional change; pure extraction
2. Create `crawler_core/boss/`, `crawler_core/qcwy/`, `crawler_core/zhilian/`
   - Extract from `spiderJobs/platforms/boss/`, qcwy/, zhilian/
   - Update internal imports to `from crawler_core.boss.sign import BossSign`, etc.
3. Add `crawler_core` to `Pipfile` as an editable local dependency:
   ```toml
   [packages]
   crawler_core = {path = ".", editable = true}
   ```
   (An editable path install needs packaging metadata at the project root, for example a `pyproject.toml` whose `[tool.setuptools.packages.find]` section includes `crawler_core`.)

**Verification gate:** `crawler_core` imports cleanly from both `spiderJobs/` and `app/services/crawler/`. From this point `spiderJobs/core/` and the extracted platform modules under `spiderJobs/platforms/` are redundant; Phases 2 and 3 remove the old copies.
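
Before any copy is deleted, it is worth confirming that the two versions have not drifted apart since they were duplicated. The sketch below is one way to do that, assuming the only expected difference is in single-line import statements; `drift` and `_non_import_lines` are hypothetical helper names, and parenthesized multi-line imports would need extra handling:

```python
import difflib


def _non_import_lines(source: str) -> list[str]:
    """Drop single-line import statements, which are expected to differ between copies."""
    return [
        line for line in source.splitlines()
        if not line.lstrip().startswith(("import ", "from "))
    ]


def drift(original: str, copy: str) -> list[str]:
    """Return unified-diff lines for everything except import statements."""
    return list(difflib.unified_diff(
        _non_import_lines(original),
        _non_import_lines(copy),
        fromfile="spiderJobs",
        tofile="app",
        lineterm="",
    ))
```

An empty result for each pair in the table under Problem Statement means either copy can serve as the extraction source.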

### Phase 2: Rewire `app/services/crawler/` Facades

The service facades (`boss.py`, `qcwy.py`, `zhilian.py`) stay in `app/services/crawler/` — they belong to the backend. Their job is to:
- Load tokens from MySQL (`BossToken`)
- Load proxies from `ProxyController`
- Instantiate `crawler_core` clients with those credentials
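- Expose typed synchronous methods (`get_company_detail_by_id`) that are safe to call from a worker thread
- Act as the sync/async boundary: every async caller reaches these methods through `asyncio.to_thread()`, as `_fetch_company_detail` does above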

The private underscore modules (`_boss_client.py`, `_boss_sign.py`, etc.) are deleted. Facades import directly from `crawler_core`.

```python
# app/services/crawler/boss.py (after refactor)
from crawler_core.boss.client import BossClient, create_client
from crawler_core.boss.sign import BossSign
from crawler_core.boss.api import GetBrandDetail, SearchBrandJobs
```

**Verification gate:** `CompanyCleaner` test passes with mocked `crawler_core` calls.
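
A test of this kind might look like the following sketch. `BossServiceSketch` and `fetch_detail` are hypothetical, simplified stand-ins for the real facade and its async caller; the fetcher factory is injected so that `unittest.mock` can replace the `crawler_core` call:

```python
import asyncio
from unittest import mock


class BossServiceSketch:
    """Hypothetical facade: builds a fetcher per call and unwraps its result."""

    def __init__(self, fetcher_factory):
        # fetcher_factory(company_id) returns an object whose .fetch()
        # yields an ApiResult-like value with .success and .data.
        self._fetcher_factory = fetcher_factory

    def get_company_detail_by_id(self, company_id: str):
        result = self._fetcher_factory(company_id).fetch()
        return result.data if result.success else None


async def fetch_detail(service: BossServiceSketch, company_id: str):
    """Async caller: runs the sync facade method in a worker thread."""
    return await asyncio.to_thread(service.get_company_detail_by_id, company_id)


# Usage: replace the crawler_core fetcher with a mock and drive the async path.
factory = mock.Mock()
factory.return_value.fetch.return_value = mock.Mock(success=True, data={"name": "Acme"})
detail = asyncio.run(fetch_detail(BossServiceSketch(factory), "42"))
```

No network access, token table, or FastAPI app is needed for this test to run.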

### Phase 3: Rewire `spiderJobs/` External Scripts

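`spiderJobs/runner/` imports `from spiderJobs.core.base import ...` → change to `from crawler_core.base import ...`. Delete `spiderJobs/core/` and the extracted `sign.py`, `client.py`, and `api.py` modules under `spiderJobs/platforms/`; only the per-platform entry points remain there.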

`spiderJobs/platforms/boss/main.py` becomes a thin entry point:
```python
from crawler_core.boss.client import create_client
from crawler_core.boss.api import SearchRecJobs
from spiderJobs.runner.loop import run_crawl_loop
```

`jobs_spider/` (the older framework with monolithic files) is deprecated in favor of `spiderJobs/` + `crawler_core/`.

---

## Anti-Patterns to Avoid

### Anti-Pattern 1: Async Crawler Core
**What:** Making `crawler_core/` async because the backend is async.
**Why bad:** `requests-go` is sync. The TLS fingerprint mechanism is sync. Wrapping everything in `asyncio.run()` or creating fake async wrappers adds complexity with no benefit. The `asyncio.to_thread()` boundary at the facade layer is the correct and idiomatic solution.
**Instead:** Keep `crawler_core/` synchronous. Use `asyncio.to_thread()` in the backend facades.

### Anti-Pattern 2: Putting `RunnerAPIClient` in `crawler_core/`
**What:** Moving the HTTP push client (backend API caller) into the shared core.
**Why bad:** `RunnerAPIClient` is deployment-specific: it knows about `/api/v1/keyword/available`, `/api/v1/universal/data/batch-store-async`, and the backend's URL scheme. That is runner orchestration logic, not platform API logic.
**Instead:** `RunnerAPIClient` stays in `spiderJobs/runner/api_client.py`. It is not shared.

### Anti-Pattern 3: Importing `app.*` from External Scripts
**What:** Having ECS scripts import from `app.settings.config`, `app.models`, or `app.core.*`.
**Why bad:** That would require Tortoise-ORM initialization, database connections, and all backend startup logic to run on ECS nodes. This would fail.
**Instead:** External scripts get all config from environment variables. They communicate with the backend only via HTTP.
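
A small sketch of environment-driven configuration for a runner script follows. The class name `RunnerConfig` and the variable names `JOBDATA_BACKEND_URL`, `JOBDATA_PLATFORM`, and `JOBDATA_PAGE_SIZE` are hypothetical; the point is that nothing from `app.*` is imported:

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class RunnerConfig:
    backend_url: str
    platform: str
    page_size: int = 20

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "RunnerConfig":
        # Variable names are hypothetical; fail fast when a required one is missing.
        required = ("JOBDATA_BACKEND_URL", "JOBDATA_PLATFORM")
        missing = [key for key in required if not env.get(key)]
        if missing:
            raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
        return cls(
            backend_url=env["JOBDATA_BACKEND_URL"].rstrip("/"),
            platform=env["JOBDATA_PLATFORM"],
            page_size=int(env.get("JOBDATA_PAGE_SIZE", "20")),
        )
```

Accepting a mapping instead of reading `os.environ` directly lets tests pass a plain dict.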

### Anti-Pattern 4: Blocking the Event Loop with Sync Crawler Calls
**What:** Calling `boss_service.get_company_detail_by_id()` directly in an `async def` without `asyncio.to_thread()`.
**Why bad:** Each HTTP call takes 2-10 seconds. With 30 companies to process, that blocks the event loop for 1 to 5 minutes (30 × 2 s = 60 s at best, 30 × 10 s = 300 s at worst). Other requests, heartbeats, and scheduler jobs starve in the meantime.
**Instead:** Always wrap sync crawler calls in `asyncio.to_thread()` inside async context.

### Anti-Pattern 5: Monolithic Platform Files
**What:** Keeping the existing `jobs_spider/boss/boos_api.py` (2246 lines) pattern in the refactored code.
**Why bad:** Impossible to test individual components. Signing, HTTP, and business logic are entangled.
**Instead:** Three files per platform: `sign.py` (pure functions, no I/O), `client.py` (HTTP wrapper, no business logic), `api.py` (endpoint classes).
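
The value of keeping `sign.py` free of I/O shows up in testing. The function below is an illustrative signer only and is not the real Boss Traceid algorithm; `sign_request` and its parameters are assumptions. What matters is the shape: the timestamp is passed in instead of being read from the clock, so identical inputs always give identical output:

```python
import hashlib
import hmac


def sign_request(path: str, params: dict[str, str], timestamp_ms: int, secret: bytes) -> str:
    """Illustrative signer only; NOT the real Boss Traceid algorithm.

    Pure function: no clock access, no network, no global state.
    """
    # Sort parameters so the signature does not depend on dict ordering.
    canonical = "&".join(f"{key}={params[key]}" for key in sorted(params))
    message = f"{path}?{canonical}#{timestamp_ms}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()
```

A unit test can then pin a fixed timestamp and compare signatures directly, with no mocking.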

---

## Scalability Considerations

| Concern | Current | After Refactor |
|---------|---------|---------------|
| Adding a 4th platform (e.g., job51) | Copy files into both `spiderJobs/` and `app/services/crawler/` | Add `crawler_core/job51/` once; both consumers get it |
| Testing signing algorithm | Must run full crawler | Test `crawler_core/boss/sign.py` in isolation — pure functions, no I/O |
| Updating User-Agent headers | Two files to update | One file: `crawler_core/boss/client.py` |
| Running multiple concurrent companies | Single-threaded | `asyncio.to_thread()` uses the event loop's default thread pool; default size = `min(32, os.cpu_count() + 4)` threads |
| ECS deployment of crawlers | Must deploy `spiderJobs/` + copy of core | Deploy `crawler_core/` + `spiderJobs/`; single source of truth |
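
If the number of concurrent crawler calls needs to be capped, for example to stay under a platform's rate limits, the pool that `asyncio.to_thread()` uses can be sized explicitly. This is a sketch under that assumption; `run_with_pool` is a hypothetical helper, while `loop.set_default_executor()` and `ThreadPoolExecutor` are standard library APIs:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


async def run_with_pool(func, args_list, max_workers: int):
    """Run func(arg) for each arg via asyncio.to_thread on a pool of fixed size."""
    loop = asyncio.get_running_loop()
    executor = ThreadPoolExecutor(max_workers=max_workers, thread_name_prefix="crawler")
    # asyncio.to_thread() submits work to the loop's default executor.
    loop.set_default_executor(executor)
    return await asyncio.gather(*(asyncio.to_thread(func, arg) for arg in args_list))
```

In a long-running backend the executor would be installed once at startup instead of per call; `asyncio.run()` shuts the default executor down when the loop closes.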

---

## Suggested Build Order for Roadmap Phases

Dependencies flow in one direction:

```
Phase 1: crawler_core/ extraction
  └─ Produces: installable shared package, 3-platform coverage

Phase 2: app/services/crawler/ facade rewire
  └─ Depends on: Phase 1 (imports crawler_core)
  └─ Produces: asyncio.to_thread() boundary, deleted underscore copies

Phase 3: spiderJobs/ rewire + jobs_spider/ deprecation
  └─ Depends on: Phase 1 (imports crawler_core)
  └─ Produces: single external runner codebase

Phase 4: Company cleaning flow optimization
  └─ Depends on: Phase 2 (async-safe facades)
  └─ Produces: CompanyJobsSyncService using new facades

Phase 5: Unit tests
  └─ Depends on: Phase 1 (crawler_core is testable in isolation)
  └─ crawler_core tests: no mocking needed for sign.py (pure functions)
  └─ facade tests: mock crawler_core, test async wrapping
```

Phases 2 and 3 can run in parallel (both depend only on Phase 1, not each other).
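
The same conclusion can be derived mechanically from the dependency list. The sketch below transcribes the "Depends on" lines into a mapping and groups phases into waves with the standard library's `graphlib` (Python 3.9+); the `waves` helper is hypothetical:

```python
from graphlib import TopologicalSorter

# Phase -> phases it depends on, transcribed from the list above.
PHASES = {1: set(), 2: {1}, 3: {1}, 4: {2}, 5: {1}}


def waves(deps: dict[int, set[int]]) -> list[list[int]]:
    """Group phases into waves; phases in one wave do not depend on each other."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    result: list[list[int]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        result.append(ready)
        sorter.done(*ready)
    return result
```

Going strictly by the listed dependencies, Phase 5 lands in the same wave as Phases 2 and 3, although its facade tests can only be written once the Phase 2 facades exist.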

---

## Sources

- Direct inspection: `/Users/win/2025/AICoding/JobData/app/services/crawler/_http_client.py` (docstring confirms copy from `spiderJobs/core/http_client.py`)
- Direct inspection: `/Users/win/2025/AICoding/JobData/spiderJobs/core/base.py` vs `app/services/crawler/_base.py` — byte-identical logic
- Direct inspection: `/Users/win/2025/AICoding/JobData/app/services/company_cleaner.py` — async consumer of sync crawlers
- Direct inspection: `/Users/win/2025/AICoding/JobData/spiderJobs/runner/loop.py` — sync-only external runner
- Direct inspection: `/Users/win/2025/AICoding/JobData/.planning/codebase/ARCHITECTURE.md` — existing architecture analysis

---

*Architecture analysis: 2026-03-21*