win f3005ef525 docs: add research findings (stack, features, architecture, pitfalls, summary)

2026-03-21 16:36:37 +08:00


Architecture Patterns: Shared Crawler Core

Domain: Recruiting data crawler, dual-deployment shared core
Researched: 2026-03-21
Overall confidence: HIGH (based entirely on direct codebase inspection)

Problem Statement

The codebase currently has the same code in two places, copied verbatim:

| Original file | Copy location |
| --- | --- |
| `spiderJobs/core/http_client.py` | `app/services/crawler/_http_client.py` |
| `spiderJobs/core/base.py` | `app/services/crawler/_base.py` |
| `spiderJobs/platforms/boss/sign.py` | `app/services/crawler/_boss_sign.py` |
| `spiderJobs/platforms/boss/client.py` | `app/services/crawler/_boss_client.py` |
| `spiderJobs/platforms/boss/api.py` | `app/services/crawler/_boss_api.py` |

The docstring in app/services/crawler/_http_client.py says, in translation: "Copied from spiderJobs/core/http_client.py; do not import spiderJobs directly, to avoid a cross-module dependency." The comment reveals the source of the problem: the team correctly identified that two separate deployment units cannot cross-import, but solved it by copy-paste rather than extraction.

There are also two competing external spider frameworks: jobs_spider/ (older, monolithic, single large file per platform) and spiderJobs/ (newer, modular, uses runner/loop.py + runner/api_client.py as a proper execution harness). The refactor must unify these.

Recommended Architecture

Core Design Decision: Installable Shared Package

Extract shared crawler code into a proper Python package at crawler_core/ (project root level). Both deployment contexts import from it. No copying. No circular imports.

crawler_core/                     # Shared library — zero app/* or spiderJobs/* imports
├── __init__.py
├── http_client.py                # HTTPClient (requests-go, TLS fingerprint, proxy rotation)
├── base.py                       # ApiResult, BaseFetcher, BaseSearcher, parse_response()
├── boss/
│   ├── __init__.py
│   ├── sign.py                   # BossSign (Traceid generation algorithm)
│   ├── client.py                 # BossClient(HTTPClient) — header injection
│   └── api.py                    # SearchRecJobs, GetBrandDetail, etc.
├── qcwy/
│   ├── __init__.py
│   ├── sign.py
│   ├── client.py
│   └── api.py
└── zhilian/
    ├── __init__.py
    ├── sign.py
    ├── client.py
    └── api.py

Why a root-level package and not inside app/ or spiderJobs/:

- app/ is the FastAPI backend. Importing it from external scripts would pull in Tortoise-ORM, APScheduler, ClickHouse, and every other backend dependency, which would make external scripts impossibly heavy.
- spiderJobs/ is a deployment directory, not a library.
- A root-level crawler_core/ has no such baggage. It imports only requests-go and stdlib. Both app/ and spiderJobs/ import from it.

Constraints satisfied:

- No circular imports
- No shared runtime state (crawler_core is stateless pure logic)
- Installable as a local package (editable install via pip install -e . or pipenv install -e .) for ECS deployment
- Tests can import crawler_core directly without mocking a FastAPI app
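
The "zero app/* or spiderJobs/* imports" rule is easy to break by accident, so it may be worth checking mechanically. The following is a minimal sketch, assuming a helper named find_forbidden_imports (hypothetical, not part of the codebase) that parses a module's source with the standard ast module and reports any import whose top-level package is on a deny list.

```python
import ast

# Top-level packages that crawler_core must never import (from the rule above).
FORBIDDEN_ROOTS = ("app", "spiderJobs")


def find_forbidden_imports(source: str, forbidden=FORBIDDEN_ROOTS) -> list[str]:
    """Return the dotted names of forbidden modules imported by `source`."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0:
            # level == 0 means an absolute import; relative imports stay
            # inside the package and are therefore always allowed.
            names = [node.module or ""]
        else:
            continue
        hits += [name for name in names if name.split(".")[0] in forbidden]
    return hits
```

In a test suite this could be run over every *.py file under crawler_core/ (for example via pathlib.Path.rglob) with an assertion that the returned list is empty.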

Component Boundaries

What Talks to What

┌─────────────────────────────────────────────────────────────┐
│                      crawler_core/                          │
│   HTTPClient · BaseFetcher · BaseSearcher · ApiResult       │
│   boss/ · qcwy/ · zhilian/  (sign + client + api)          │
│   Imports: requests-go, stdlib only                         │
└────────────────┬────────────────────────┬───────────────────┘
                 │                        │
    ┌────────────▼──────────┐  ┌─────────▼─────────────────┐
    │  app/services/        │  │  spiderJobs/               │
    │  crawler/             │  │  platforms/ + runner/      │
    │                       │  │                            │
    │  BossService facade   │  │  run_crawl_loop()          │
    │  QcwyService facade   │  │  RunnerAPIClient           │
    │  ZhilianService facade│  │  (HTTP push to backend)    │
    │                       │  │                            │
    │  Called by:           │  │  Called by:                │
    │  CompanyCleaner       │  │  ECS node scripts          │
    │  CompanyController    │  │  jobs_spider/ (deprecated) │
    └────────────┬──────────┘  └────────────────────────────┘
                 │
    ┌────────────▼──────────────────────────────────────────┐
    │  app/  (FastAPI backend — async boundary)             │
    │  CompanyCleaner (asyncio, calls sync crawler via      │
    │  asyncio.to_thread())                                 │
    │  IngestService (async ClickHouse writes)              │
    │  APScheduler (triggers CompanyCleaner async jobs)     │
    └───────────────────────────────────────────────────────┘

Component Responsibilities

| Component | Responsibility | What It Must NOT Do |
| --- | --- | --- |
| `crawler_core/` | Signing, HTTP requests, response parsing (per platform) | Import `app.*`, use async, maintain global state |
| `app/services/crawler/` facades | Wrap `crawler_core` with logging, token loading, proxy mgmt | Contain signing or HTTP logic |
| `spiderJobs/runner/` | Crawl loop orchestration, keyword polling, data upload | Import `app.*` |
| `app/services/ingest/` | Registry-driven ClickHouse insertion, dedup | Know anything about platform HTTP clients |
| `app/core/scheduler.py` | Schedule jobs, manage distributed lock | Run synchronous I/O on the event loop directly |

Abstract Base Class Hierarchy

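# crawler_core/base.py
from __future__ import annotations  # keeps the field named `list` from shadowing the builtin in its own annotation

from dataclasses import dataclass, field
from typing import Any, Callable, Optional

from crawler_core.http_client import HTTPClient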

@dataclass
class ApiResult:
    success: bool
    status_code: int
    data: Any = None
    list: list[dict] = field(default_factory=list)
    count: int = 0
    is_end_page: bool = True
    error: Optional[str] = None


class BaseFetcher:
    """Single-resource GET (company detail, job detail)."""
    ENDPOINT: str = ""

    def __init__(self, http_client: HTTPClient): ...
    def _build_params(self) -> dict: raise NotImplementedError
    def _parse(self, http_code: int, raw: Any) -> ApiResult: ...
    def fetch(self) -> ApiResult: ...          # sync — wraps _http.get()


class BaseSearcher:
    """Paginated search (job list, brand job list)."""
    ENDPOINT: str = ""

    def __init__(self, page_size: int, http_client: HTTPClient): ...
    def _build_params(self, page_index: int) -> dict: raise NotImplementedError
    def _request(self, params: dict) -> tuple[int, Any]: ...
    def _parse(self, http_code: int, raw: Any) -> ApiResult: ...
    def search(self, page_index: int = 1) -> ApiResult: ...  # sync
    def load_all(self, max_pages: int, on_page: Callable | None) -> list[dict]: ...

Platform clients subclass HTTPClient from crawler_core/http_client.py:

# crawler_core/boss/client.py
class BossClient(HTTPClient):
    """Injects mpt/wt2/Traceid headers on every request."""
    def __init__(self, signer: BossSign, **kwargs): ...  # remaining HTTPClient options elided
    def post(self, path, body, headers=None) -> tuple[int, Any]: ...  # sync
    def get(self, path, params=None, headers=None) -> tuple[int, Any]: ...  # sync

Platform API classes subclass BaseFetcher / BaseSearcher:

# crawler_core/boss/api.py
class SearchRecJobs(BaseSearcher):
    ENDPOINT = "/wapi/zpgeek/miniapp/homepage/recjoblist.json"
    def _build_params(self, page_index: int) -> dict: ...
    def _parse(self, http_code: int, raw: Any) -> ApiResult: ...  # Boss-specific parsing

class GetBrandDetail(BaseFetcher):
    ENDPOINT = "/wapi/zpgeek/miniapp/brand/get.json"
    def _build_params(self) -> dict: ...

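The hierarchy is deliberately flat: no class sits more than one level below its base. Platform clients extend HTTPClient, and platform API classes extend BaseFetcher or BaseSearcher and receive a client through their constructor (composition, not inheritance). No further abstraction.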
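
The source does not show the body of load_all, so the following is only a sketch of the pagination contract that ApiResult.is_end_page and the load_all(max_pages, on_page) signature suggest. It is written as a free function taking a search callable rather than as a BaseSearcher method, the ApiResult here is a trimmed copy, and both the on_page(page_index, result) signature and the stop-on-failure behaviour are assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class ApiResult:
    # Trimmed copy of the dataclass above: just what the loop needs.
    success: bool
    list: List[dict] = field(default_factory=list)
    is_end_page: bool = True


def load_all(
    search: Callable[[int], ApiResult],
    max_pages: int,
    on_page: Optional[Callable[[int, ApiResult], None]] = None,
) -> List[dict]:
    """Collect rows from page 1 onward until a stop condition is hit."""
    rows: List[dict] = []
    for page_index in range(1, max_pages + 1):
        result = search(page_index)
        if not result.success:
            break  # assumption: stop at the first failed page, keep earlier rows
        rows.extend(result.list)
        if on_page is not None:
            on_page(page_index, result)  # e.g. upload each page as it arrives
        if result.is_end_page:
            break  # the platform reported that no further pages exist
    return rows
```

The max_pages cap matters because is_end_page defaults to True only in the dataclass; a platform parser that never sets it would otherwise loop indefinitely.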

Sync vs Async: The Critical Design Point

Why the Core Must Stay Synchronous

crawler_core/ uses requests-go (a sync library that wraps a TLS fingerprinting mechanism). There is no async version of requests-go. The platform APIs require TLS fingerprint spoofing that httpx does not support without significant additional work. Keeping the core synchronous is the correct choice given the constraint.

How the Backend (async) Calls Synchronous Crawlers

CompanyCleaner runs inside the FastAPI async event loop (called by APScheduler). It cannot block the event loop with synchronous requests calls. The correct pattern is asyncio.to_thread():

# app/services/company_cleaner.py

async def _fetch_company_detail(self, source: str, company_id: str) -> Optional[dict]:
    """Run synchronous crawler call off the event loop thread."""
    if source == "boss":
        return await asyncio.to_thread(
            self.boss_service.get_company_detail_by_id, company_id
        )
    elif source == "qcwy":
        return await asyncio.to_thread(
            self.qcwy_service.get_company_detail_by_id, company_id
        )
    elif source == "zhilian":
        return await asyncio.to_thread(
            self.zhilian_service.get_company_detail_by_id, company_id
        )

asyncio.to_thread() (available since Python 3.9) runs the synchronous function in a worker thread taken from the event loop's default thread pool, so the event loop stays free to run other tasks while the call blocks. This is the standard Python pattern for sync I/O in an async context.
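
The effect can be observed with a small self-contained experiment. This is a sketch only: slow_fetch is a hypothetical stand-in that sleeps instead of making a network call, and the heartbeat coroutine plays the role of the other work (requests, scheduler jobs) that shares the event loop.

```python
import asyncio
import time


def slow_fetch(company_id: str) -> dict:
    """Stand-in for a blocking crawler call (hypothetical; no network I/O)."""
    time.sleep(0.3)
    return {"company_id": company_id}


async def heartbeat(ticks: list, stop: asyncio.Event) -> None:
    """Append a timestamp every 20 ms whenever the loop is free to run us."""
    while not stop.is_set():
        ticks.append(time.monotonic())
        await asyncio.sleep(0.02)


async def run_demo(use_thread: bool) -> int:
    """Return how many heartbeat ticks happened during one slow_fetch call."""
    ticks: list = []
    stop = asyncio.Event()
    beat = asyncio.create_task(heartbeat(ticks, stop))
    await asyncio.sleep(0)  # give the heartbeat its first turn
    if use_thread:
        await asyncio.to_thread(slow_fetch, "c1")  # loop keeps scheduling tasks
    else:
        slow_fetch("c1")  # loop is frozen until this returns
    stop.set()
    await beat
    return len(ticks)
```

Calling asyncio.run(run_demo(use_thread=False)) leaves the heartbeat with only its initial tick, while the use_thread=True variant lets it keep ticking for the whole 0.3 seconds.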

Current state: The existing CompanyCleaner.process_pending_companies() likely calls sync methods directly on the event loop — this is a bug to fix during refactoring. The fix is wrapping each crawler call with asyncio.to_thread().

External Scripts: Pure Synchronous, No Async

spiderJobs/runner/loop.py (run_crawl_loop) and any ECS scripts are pure synchronous while True loops. They import from crawler_core/ directly. No async. This is correct and intentional.

External Script (sync)          Backend (async)
──────────────────────          ─────────────────────────────
while True:                     async def process_company(company_id):
  kw = api.fetch_keyword()        detail = await asyncio.to_thread(
  searcher = SearchRecJobs(...)       boss_service.get_company_detail_by_id,
  result = searcher.search()          company_id,
  api.upload_data(result.list)    )
                                  await company_storage.save_company(detail)

Data Flow Patterns

Flow 1: External Spider → Backend Storage

ECS Script (spiderJobs/runner/loop.py)
  │
  ├─ GET /api/v1/keyword/available
  │    → KeywordController marks keyword as "crawling"
  │
  ├─ crawler_core/boss/api.py SearchRecJobs.search(page)
  │    → returns ApiResult.list (raw job dicts)
  │
  ├─ POST /api/v1/universal/data/batch-store-async
  │    body: {platform, channel, data_type, data_list}
  │
  └─ Backend: IngestService.store_batch()
       → registry lookup → dedup → ClickHouse insert

The external script knows nothing about ClickHouse. It only pushes raw dicts over HTTP. This boundary is correct and must stay.
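
One iteration of this flow can be sketched as a plain synchronous function. The method names fetch_keyword and upload_data follow the pseudocode earlier in this document; the BackendAPI protocol, the search_pages callable, and the "None means no keyword available" convention are assumptions made for illustration.

```python
from typing import Callable, List, Optional, Protocol


class BackendAPI(Protocol):
    """Assumed shape of RunnerAPIClient; only the two calls used here."""

    def fetch_keyword(self) -> Optional[str]: ...
    def upload_data(self, rows: List[dict]) -> None: ...


def crawl_one_keyword(
    api: BackendAPI,
    search_pages: Callable[[str], List[dict]],
) -> int:
    """Claim a keyword, crawl it, push the rows; return how many were pushed."""
    keyword = api.fetch_keyword()  # GET /api/v1/keyword/available
    if keyword is None:
        return 0  # nothing to crawl right now
    rows = search_pages(keyword)  # crawler_core performs the platform calls
    if rows:
        api.upload_data(rows)  # POST /api/v1/universal/data/batch-store-async
    return len(rows)
```

Because the function only sees raw dicts and an HTTP-facing client, nothing in it depends on ClickHouse or on app.*, which is the boundary the flow above is meant to preserve.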

Flow 2: In-Process Company Cleaning (Scheduled)

APScheduler → company_cleaning_job() [async]
  │
  ├─ CompanyCleaner.collect_pending_companies()  [async]
  │    → ClickHouse: query job tables for company IDs
  │    → MySQL: check which company IDs already processed
  │    → MySQL: insert new companies into CompanyCleaningQueue
  │
  └─ CompanyCleaner.process_pending_companies()  [async]
       ├─ MySQL: fetch N companies from queue
       ├─ asyncio.to_thread(boss_service.get_company_detail_by_id)
       │    → crawler_core/boss/api.py GetBrandDetail.fetch() [sync]
       ├─ company_storage.save_company()          [async, ClickHouse]
       └─ MySQL: update company status to "done"

The flow crosses the sync/async boundary exactly once: at asyncio.to_thread(). Everything above (ClickHouse queries, MySQL operations) stays async. Everything below (HTTP to Boss API) stays sync.

Flow 3: Company Jobs Sync (new, post-refactor)

After a company is cleaned, the system should backfill that company's current job listings:

CompanyJobsSyncService.sync_company_jobs(company_id, source) [async]
  ├─ asyncio.to_thread(boss_service.get_company_jobs_by_id, company_id)
  │    → crawler_core/boss/api.py SearchBrandJobs.search() [sync]
  └─ IngestService.store_batch(platform="boss", data_type="company_job")
       → ClickHouse insert into boss_company_job table (new table needed)

Package Structure: Suggested Build Order

Build crawler_core/ first — everything else depends on it.

Phase 1: Extract `crawler_core/`

Create crawler_core/__init__.py, crawler_core/http_client.py, crawler_core/base.py
- These are identical to the current spiderJobs/core/ files with one import path change
- Zero functional change; pure extraction
Create crawler_core/boss/, crawler_core/qcwy/, crawler_core/zhilian/
- Extract from spiderJobs/platforms/boss/, qcwy/, zhilian/
- Update internal imports to from crawler_core.boss.sign import BossSign, etc.
Add crawler_core to Pipfile as an editable local dependency:
```
[packages]
crawler_core = {path = ".", editable = true}
```
(Or configure pyproject.toml with [tool.setuptools.packages.find] including crawler_core)

Verification gate: spiderJobs/ and app/services/crawler/ both import from crawler_core without errors. spiderJobs/core/ and spiderJobs/platforms/ can be deleted.

Phase 2: Rewire `app/services/crawler/` Facades

The service facades (boss.py, qcwy.py, zhilian.py) stay in app/services/crawler/ — they belong to the backend. Their job is to:

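- Load tokens from MySQL (BossToken)
- Load proxies from ProxyController
- Instantiate crawler_core clients with those credentials
- Expose typed methods, such as get_company_detail_by_id, that are safe to call from async code
- Ensure every sync call goes through asyncio.to_thread() in exactly one place: either inside the facade or at the call site, as in the CompanyCleaner example earlier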

The private underscore modules (_boss_client.py, _boss_sign.py, etc.) are deleted. Facades import directly from crawler_core.

# app/services/crawler/boss.py (after refactor)
from crawler_core.boss.client import BossClient, create_client
from crawler_core.boss.sign import BossSign
from crawler_core.boss.api import GetBrandDetail, SearchBrandJobs

Verification gate: CompanyCleaner test passes with mocked crawler_core calls.
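
A test for this gate might look like the sketch below. The BossService constructor shown here (taking the fetch function as a dependency) is hypothetical and chosen only to make the example self-contained; the real facade also loads tokens and proxies. The test replaces the crawler_core call with a MagicMock and checks that the call ran on a thread other than the event loop's.

```python
import asyncio
import threading
from unittest.mock import MagicMock


class BossService:
    """Hypothetical minimal facade: delegates to an injected sync fetch function."""

    def __init__(self, fetch_brand_detail):
        self._fetch_brand_detail = fetch_brand_detail

    def get_company_detail_by_id(self, company_id: str) -> dict:
        return self._fetch_brand_detail(company_id)


async def fetch_company_detail(service: BossService, company_id: str) -> dict:
    # Same shape as CompanyCleaner._fetch_company_detail in this document.
    return await asyncio.to_thread(service.get_company_detail_by_id, company_id)


def test_detail_is_fetched_off_the_loop_thread():
    seen = {}

    def fake_fetch(company_id):
        seen["thread"] = threading.get_ident()  # record where we actually ran
        return {"id": company_id, "name": "ACME"}

    crawler = MagicMock(side_effect=fake_fetch)
    loop_thread = threading.get_ident()  # asyncio.run uses the current thread
    detail = asyncio.run(fetch_company_detail(BossService(crawler), "42"))

    assert detail == {"id": "42", "name": "ACME"}
    crawler.assert_called_once_with("42")
    assert seen["thread"] != loop_thread
```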

Phase 3: Rewire `spiderJobs/` External Scripts

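Modules under spiderJobs/runner/ currently use from spiderJobs.core.base import ...; change these to from crawler_core.base import .... Then delete spiderJobs/core/ and the sign.py, client.py, and api.py modules under spiderJobs/platforms/, keeping only the per-platform entry points.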
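
The import rewrite is mechanical enough to script. The following is a minimal sketch, assuming the old-to-new prefix mapping implied by the package layout in this document; rewrite_imports and REWRITES are hypothetical names, and the regex only handles single-line from/import statements.

```python
import re

# Old package prefix -> new prefix (assumed from the crawler_core layout).
REWRITES = {
    "spiderJobs.core": "crawler_core",
    "spiderJobs.platforms": "crawler_core",
}

_IMPORT_RE = re.compile(
    r"^(?P<kw>[ \t]*(?:from|import)[ \t]+)(?P<old>%s)(?=[.\s]|$)"
    % "|".join(re.escape(prefix) for prefix in REWRITES),
    flags=re.MULTILINE,
)


def rewrite_imports(source: str) -> str:
    """Rewrite import statements whose module path starts with an old prefix."""
    return _IMPORT_RE.sub(
        lambda m: m.group("kw") + REWRITES[m.group("old")], source
    )
```

The lookahead keeps the match on whole path segments, so a module such as spiderJobs.runner is left untouched. The underscore-prefixed copies in app/services/crawler/ follow a different naming scheme and would need their own mapping.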

spiderJobs/platforms/boss/main.py becomes a thin entry point:

from crawler_core.boss.client import create_client
from crawler_core.boss.api import SearchRecJobs
from spiderJobs.runner.loop import run_crawl_loop

jobs_spider/ (the older framework with monolithic files) is deprecated in favor of spiderJobs/ + crawler_core/.

Anti-Patterns to Avoid

Anti-Pattern 1: Async Crawler Core

What: Making crawler_core/ async because the backend is async.
Why bad: requests-go is sync. The TLS fingerprint mechanism is sync. Wrapping everything in asyncio.run() or creating fake async wrappers adds complexity with no benefit. The asyncio.to_thread() boundary at the facade layer is the correct and idiomatic solution.
Instead: Keep crawler_core/ synchronous. Use asyncio.to_thread() in the backend facades.

Anti-Pattern 2: Putting `RunnerAPIClient` in `crawler_core/`

What: Moving the HTTP push client (backend API caller) into the shared core.
Why bad: RunnerAPIClient is deployment-specific: it knows about /api/v1/keyword/available, /api/v1/universal/data/batch-store-async, and the backend's URL scheme. That is runner orchestration logic, not platform API logic.
Instead: RunnerAPIClient stays in spiderJobs/runner/api_client.py. It is not shared.

Anti-Pattern 3: Importing `app.*` from External Scripts

What: Having ECS scripts import from app.settings.config, app.models, or app.core.*.
Why bad: That would require Tortoise-ORM initialization, database connections, and all backend startup logic to run on ECS nodes. This would fail.
Instead: External scripts get all config from environment variables. They communicate with the backend only via HTTP.
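
Environment-only configuration for a runner can be sketched as follows. The variable names (CRAWLER_BACKEND_URL and so on), the RunnerConfig fields, and the default page size are all hypothetical; the point is that nothing here imports app.*.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class RunnerConfig:
    backend_url: str
    platform: str
    proxy_url: Optional[str] = None
    page_size: int = 20


def load_config(env: Mapping[str, str] = os.environ) -> RunnerConfig:
    """Build the runner config from environment variables (names are assumed)."""
    try:
        backend_url = env["CRAWLER_BACKEND_URL"]
        platform = env["CRAWLER_PLATFORM"]
    except KeyError as exc:
        raise RuntimeError(
            f"missing required environment variable: {exc.args[0]}"
        ) from None
    return RunnerConfig(
        backend_url=backend_url.rstrip("/"),  # avoid double slashes when joining paths
        platform=platform,
        proxy_url=env.get("CRAWLER_PROXY_URL") or None,  # empty string counts as unset
        page_size=int(env.get("CRAWLER_PAGE_SIZE", "20")),
    )
```

Passing the mapping as a parameter keeps the function testable without touching the real process environment.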

Anti-Pattern 4: Blocking the Event Loop with Sync Crawler Calls

What: Calling boss_service.get_company_detail_by_id() directly in an async def without asyncio.to_thread().
Why bad: Each HTTP call takes 2-10 seconds. With 30 companies to process, that blocks the event loop for up to 5 minutes (30 × 10 s = 300 s). Other requests, heartbeats, and scheduler jobs starve.
Instead: Always wrap sync crawler calls in asyncio.to_thread() inside async context.

Anti-Pattern 5: Monolithic Platform Files

What: Keeping the existing jobs_spider/boss/boos_api.py (2246 lines) pattern in the refactored code.
Why bad: Impossible to test individual components. Signing, HTTP, and business logic are entangled.
Instead: Three files per platform: sign.py (pure functions, no I/O), client.py (HTTP wrapper, no business logic), api.py (endpoint classes).
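
What "pure functions, no I/O" buys in sign.py can be illustrated with a stand-in signer. The function below is hypothetical and is NOT the Boss Traceid algorithm, which the source does not describe; it uses HMAC-SHA256 only to show the shape, where every input, including the clock, is passed as an argument.

```python
import hashlib
import hmac


def sign_request(path: str, params: dict, secret: bytes, timestamp_ms: int) -> str:
    """Hypothetical stand-in signer (not the real Boss algorithm).

    HMAC-SHA256 over the path, the params sorted by key, and the timestamp.
    No clock reads, no randomness, no network: same inputs, same output.
    """
    query = "&".join(f"{key}={params[key]}" for key in sorted(params))
    message = f"{path}?{query}|{timestamp_ms}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()
```

Because the timestamp is injected rather than read inside the function, a unit test can pin it to a fixed value and compare signatures directly, with no mocking and no HTTP client involved.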

Scalability Considerations

| Concern | Current | After Refactor |
| --- | --- | --- |
| Adding a 4th platform (e.g., job51) | Copy files into both `spiderJobs/` and `app/services/crawler/` | Add `crawler_core/job51/` once; both consumers get it |
| Testing signing algorithm | Must run full crawler | Test `crawler_core/boss/sign.py` in isolation (pure functions, no I/O) |
| Updating User-Agent headers | Two files to update | One file: `crawler_core/boss/client.py` |
| Running multiple concurrent companies | Single-threaded | `asyncio.to_thread()` uses the event loop's default thread pool; default size is min(32, CPU count + 4) |
| ECS deployment of crawlers | Must deploy `spiderJobs/` + copy of core | Deploy `crawler_core/` + `spiderJobs/`; single source of truth |
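
If the default pool is too generous for a rate-limited platform, concurrency can be capped with a dedicated executor instead of the loop's default one. This is a sketch under stated assumptions: fetch_many and its max_workers default are hypothetical, and default_worker_count mirrors the ThreadPoolExecutor default documented for Python 3.8 through 3.12 (3.13 uses os.process_cpu_count() in place of os.cpu_count()).

```python
import asyncio
import os
from concurrent.futures import ThreadPoolExecutor


def default_worker_count() -> int:
    """Size ThreadPoolExecutor picks when max_workers is None (Python 3.8-3.12)."""
    return min(32, (os.cpu_count() or 1) + 4)


async def fetch_many(fetch, company_ids, max_workers: int = 4) -> list:
    """Run the sync `fetch` for each id on a pool of at most `max_workers` threads.

    Results come back in the same order as `company_ids`.
    """
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [loop.run_in_executor(pool, fetch, cid) for cid in company_ids]
        return await asyncio.gather(*futures)
```

One caveat of this shape: if a fetch raises, leaving the with block waits for the remaining threads to finish, which briefly blocks the loop. A long-lived executor created at startup avoids that.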

Suggested Build Order for Roadmap Phases

Dependencies flow in one direction:

Phase 1: crawler_core/ extraction
  └─ Produces: installable shared package, 3-platform coverage

Phase 2: app/services/crawler/ facade rewire
  └─ Depends on: Phase 1 (imports crawler_core)
  └─ Produces: asyncio.to_thread() boundary, deleted underscore copies

Phase 3: spiderJobs/ rewire + jobs_spider/ deprecation
  └─ Depends on: Phase 1 (imports crawler_core)
  └─ Produces: single external runner codebase

Phase 4: Company cleaning flow optimization
  └─ Depends on: Phase 2 (async-safe facades)
  └─ Produces: CompanyJobsSyncService using new facades

Phase 5: Unit tests
  └─ Depends on: Phase 1 (crawler_core is testable in isolation)
  └─ crawler_core tests: no mocking needed for sign.py (pure functions)
  └─ facade tests: mock crawler_core, test async wrapping

Phases 2 and 3 can run in parallel (both depend only on Phase 1, not each other).

Sources

Direct inspection: /Users/win/2025/AICoding/JobData/app/services/crawler/_http_client.py (docstring confirms copy from spiderJobs/core/http_client.py)
Direct inspection: /Users/win/2025/AICoding/JobData/spiderJobs/core/base.py vs app/services/crawler/_base.py — byte-identical logic
Direct inspection: /Users/win/2025/AICoding/JobData/app/services/company_cleaner.py — async consumer of sync crawlers
Direct inspection: /Users/win/2025/AICoding/JobData/spiderJobs/runner/loop.py — sync-only external runner
Direct inspection: /Users/win/2025/AICoding/JobData/.planning/codebase/ARCHITECTURE.md — existing architecture analysis

Architecture analysis: 2026-03-21
