

# Architecture
**Analysis Date:** 2026-03-21
## Pattern Overview
**Overall:** Layered monolith with embedded pipeline architecture
**Key Characteristics:**
- FastAPI async backend with dual-database persistence (MySQL for RBAC/metadata, ClickHouse for job/company analytics data)
- External crawlers (`jobs_spider/`, `spiderJobs/`) run as independent processes and push data to the backend API via HTTP POST — no shared code path at runtime
- RBAC permission model enforced at router-level via FastAPI `Depends`
- APScheduler drives all recurring automation (cleaning, ECS orchestration, alerting) within the main process
- Registry pattern (`app/services/ingest/registry.py`) maps (platform, channel, data_type) tuples to ClickHouse table configs + dedup rules
## Layers
**Router Layer:**
- Purpose: Parse HTTP request, validate via Pydantic schema, delegate to service or controller, return JSON envelope
- Location: `app/api/v1/`
- Contains: One sub-package per domain (e.g., `job/`, `cleaning/`, `analytics/`, `keyword/`)
- Depends on: Services, Controllers, `app/core/dependency.py` for auth
- Used by: FastAPI application via `app/api/v1/__init__.py` → `app/__init__.py`
**Service Layer:**
- Purpose: Business logic, orchestration across repos/models, side-effect coordination
- Location: `app/services/`
- Contains: `IngestService`, `AnalyticsService`, `CleaningService`, `CompanyCleaner`, `CompanyJobsSyncService`, platform crawler service wrappers (`crawler/boss.py`, `crawler/qcwy.py`, `crawler/zhilian.py`)
- Depends on: Repositories, ORM models, `clickhouse_manager`, external HTTP clients
- Used by: Routers, APScheduler jobs
**Controller Layer (RBAC domain operations):**
- Purpose: Thin CRUD orchestration for MySQL-backed entities (User, Role, API, Menu, Dept, etc.)
- Location: `app/controllers/`
- Contains: `user_controller`, `api_controller`, `role_controller`, `menu_controller`, `dept_controller`, `keyword_controller`, `proxy_controller`, `token_controller`, `cleaning_controller`, `company_controller`
- Depends on: Tortoise-ORM models
- Used by: Routers
**Repository Layer:**
- Purpose: Encapsulate raw ClickHouse query execution behind typed methods
- Location: `app/repositories/`
- Contains: `ClickHouseBaseRepo` (generic query/insert), `JobAnalyticsRepo` (trend, distribution, count queries against `job_analytics` view)
- Depends on: `clickhouse_connect.driver.AsyncClient`
- Used by: `AnalyticsService`
**Model Layer:**
- Purpose: ORM schema definitions for MySQL, migrated via Aerich
- Location: `app/models/`
- Contains: `admin.py` (User, Role, Api, Menu, Dept, AuditLog), `token.py` (BossToken), `cleaning.py` (CleaningTask), `metrics.py` (ScheduledTaskRun, StatsTotal, IpUploadStats), `company.py` (CompanyCleaningQueue), `keyword.py` (BossKeyword, QcwyKeyword, ZhilianKeyword)
- Depends on: `tortoise-orm`
- Used by: Controllers, Services, Scheduler jobs
**Core Infrastructure Layer:**
- Purpose: App wiring, connection management, middleware, cross-cutting utilities
- Location: `app/core/`
- Contains: `init_app.py` (lifespan bootstrap), `scheduler.py` (APScheduler jobs), `clickhouse.py` (ClickHouseManager singleton), `clickhouse_init.py` (ClickHouse DDL), `dependency.py` (AuthControl, PermissionControl), `locks.py` (DistributedLock), `middlewares.py` (BackGroundTaskMiddleware, HttpAuditLogMiddleware, IpTrackingMiddleware), `exceptions.py`, `crud.py`, `ip_tracking.py`
- Depends on: Everything below
- Used by: `app/__init__.py`, all layers above
**Ingest Sub-system:**
- Purpose: Platform-agnostic data ingestion pipeline with registry-driven routing, dedup, and remote push
- Location: `app/services/ingest/`
- Contains: `service.py` (IngestService), `registry.py` (PlatformConfig registry), `dedup.py` (batch dedup filter), `remote_push.py` (async HTTP push to secondary target), `configs/` (per-platform PlatformConfig registrations)
- Depends on: `clickhouse_connect`, registry configs
- Used by: `app/api/v1/job/job.py` router
**Spider Crawler Service Wrappers (in-process):**
- Purpose: Wrap platform HTTP clients so backend scheduler can call crawlers directly (company cleaning, company search)
- Location: `app/services/crawler/`
- Contains: `_base.py` (BaseFetcher, BaseSearcher, ApiResult), `_http_client.py`, `_boss_client.py`, `_boss_api.py`, `_boss_sign.py`, `_zhilian_client.py`, `_zhilian_api.py`, `_zhilian_sign.py`, `_job51_client.py`, `_job51_api.py`, `_job51_sign.py`, `boss.py`, `qcwy.py`, `zhilian.py`
- Depends on: `requests` (sync), platform sign/auth modules
- Used by: `CompanyCleaner`, `CompanyController`
**External Spider Processes:**
- Purpose: Standalone crawlers running on ECS nodes; push data to backend via `POST /api/v1/universal/data/batch-store-async`
- Location: `jobs_spider/boss/boos_api.py`, `jobs_spider/qcwy/qcwy.py`, `jobs_spider/zhilian/zhilian_single.py`, `spiderJobs/`
- Depends on: `requests`, `loguru`, `PyExecJS`; no shared Python imports with `app/`
- Used by: Scheduled via `ecs_full_pipeline.py` or manual launch
## Data Flow
**Crawler → Storage Flow:**
1. External spider (`jobs_spider/boss/boos_api.py`) fetches keyword from `GET /api/v1/keyword/available`
2. Spider calls Boss/QCWY/Zhilian platform API with rotating proxy/token
3. Spider POSTs raw JSON batch to `POST /api/v1/universal/data/batch-store-async`
4. Router (`app/api/v1/job/job.py`) delegates to `IngestService.store_batch()`
5. `IngestService` looks up `PlatformConfig` from registry (`app/services/ingest/registry.py`)
6. Dedup filter queries ClickHouse for existing records (`app/services/ingest/dedup.py`)
7. New rows inserted to ClickHouse table (e.g., `job_data.boss_job`) via `clickhouse_connect`
8. Optional async push to secondary remote endpoint via `remote_push.py`
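Steps 5 to 7 of this flow can be sketched roughly as follows. This is a minimal illustration, not the actual `IngestService` code: the in-memory `table` list and `existing_keys` set stand in for ClickHouse, and the function names and the `key_of` extractor are hypothetical.

```python
from typing import Any, Callable, Iterable

Row = dict[str, Any]


def filter_new_rows(
    rows: Iterable[Row],
    existing_keys: set[str],
    key_of: Callable[[Row], str],
) -> list[Row]:
    """Drop rows that are already stored, plus duplicates inside the batch."""
    seen = set(existing_keys)  # copy so the caller's set is not mutated here
    fresh: list[Row] = []
    for row in rows:
        key = key_of(row)
        if key in seen:
            continue
        seen.add(key)
        fresh.append(row)
    return fresh


def store_batch(
    rows: list[Row],
    table: list[Row],
    existing_keys: set[str],
    key_of: Callable[[Row], str],
) -> dict[str, int]:
    """Hypothetical stand-in for dedup followed by insert."""
    fresh = filter_new_rows(rows, existing_keys, key_of)
    table.extend(fresh)  # the real service would insert into ClickHouse
    existing_keys.update(key_of(row) for row in fresh)
    return {"received": len(rows), "inserted": len(fresh)}
```

Filtering against both stored keys and keys already seen in the same batch keeps a single POST from inserting the same record twice.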
**Analytics Query Flow:**
1. Frontend calls `GET /api/v1/analytics/volume-trend?interval=day&from_dt=...`
2. Router (`app/api/v1/analytics.py`) creates `AnalyticsService` with `clickhouse_manager` client
3. `AnalyticsService` delegates to `JobAnalyticsRepo.get_volume_trend()`
4. `JobAnalyticsRepo` queries the `job_data.job_analytics` VIEW (UNION ALL across three platform job tables)
5. Returns time-bucketed counts grouped by source
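The shape of the result in step 5 can be illustrated in plain Python. This sketch only mimics the grouping that the ClickHouse query is described as performing; the function names, the `(source, timestamp)` row format, and the `hour` interval are assumptions made for the example.

```python
from collections import Counter
from datetime import datetime
from typing import Any, Iterable


def bucket_start(ts: datetime, interval: str) -> datetime:
    """Truncate a timestamp to the start of its bucket."""
    if interval == "hour":
        return ts.replace(minute=0, second=0, microsecond=0)
    if interval == "day":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    raise ValueError(f"unsupported interval: {interval}")


def volume_trend(
    rows: Iterable[tuple[str, datetime]], interval: str = "day"
) -> list[dict[str, Any]]:
    """Count rows per (bucket, source), ordered by bucket and then source."""
    counts = Counter((bucket_start(ts, interval), source) for source, ts in rows)
    return [
        {"bucket": bucket, "source": source, "count": n}
        for (bucket, source), n in sorted(counts.items())
    ]
```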
**RBAC Authentication Flow:**
1. Client sends `POST /api/v1/base/access_token` with credentials and receives a JWT (HS256, 7-day expiry)
2. On each subsequent request, `AuthControl.is_authed()` verifies the JWT carried in the `token` header
3. `PermissionControl.has_permission()` loads user's roles → role's API list
4. Compares `(method, path)` tuple against allowed API set
5. Superusers bypass permission check
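The permission check in steps 3 to 5 amounts to a set membership test. The sketch below assumes simplified `User` and `Role` dataclasses in place of the Tortoise-ORM models; the class fields and the example path in the usage note are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Role:
    name: str
    apis: frozenset[tuple[str, str]]  # allowed (method, path) pairs


@dataclass
class User:
    username: str
    is_superuser: bool = False
    roles: list[Role] = field(default_factory=list)


def has_permission(user: User, method: str, path: str) -> bool:
    """Return True if any of the user's roles allows (method, path)."""
    if user.is_superuser:
        return True  # superusers skip the role lookup entirely
    allowed: set[tuple[str, str]] = set()
    for role in user.roles:
        allowed |= role.apis
    return (method.upper(), path) in allowed
```

For example, a user holding a role with `("GET", "/api/v1/job/list")` passes the check for that route and fails it for a `POST` to the same path.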
**Company Cleaning Flow (Scheduled):**
1. `company_cleaning_job()` in `app/core/scheduler.py` fires every 5 minutes
2. `CompanyCleaner.collect_pending_companies(limit=50)` queries ClickHouse job tables for company IDs not yet in `pending_company` table
3. `CompanyCleaner.process_pending_companies(limit=30)` fetches company details from platform crawlers (BossService, QcwyService, ZhilianService)
4. Cleaned company data stored via `company_storage` to ClickHouse `boss_company` / `qcwy_company` / `zhilian_company`
5. Run protected by `DistributedLock` (file-based fallback, optional Redis)
**State Management:**
- MySQL (via Tortoise-ORM): User sessions, RBAC entities, keyword crawl state, tokens, task run records
- ClickHouse: All raw job/company JSON data + analytics views; no ORM — raw SQL via `clickhouse_connect`
- Context variable `CTX_USER_ID` (Python `contextvars.ContextVar`) carries authenticated user ID within request scope (`app/core/ctx.py`)
- File-based startup lock (`.startup_lock/` dir) prevents multi-worker seed data collisions
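The `CTX_USER_ID` context variable mentioned above follows the standard `contextvars` pattern. A minimal sketch, assuming a default of `0` for unauthenticated requests and hypothetical helper names:

```python
import contextvars

# Assumed default: 0 means "no authenticated user".
CTX_USER_ID: contextvars.ContextVar[int] = contextvars.ContextVar(
    "user_id", default=0
)


def current_user_id() -> int:
    """Read the user ID bound to the current request context."""
    return CTX_USER_ID.get()


def handle_request(user_id: int) -> int:
    """Bind the user ID for the duration of one request, then restore it."""
    token = CTX_USER_ID.set(user_id)
    try:
        return current_user_id()  # downstream code reads it without arguments
    finally:
        CTX_USER_ID.reset(token)
```

Because each asyncio task runs in its own copy of the context, concurrent requests do not see each other's value.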
## Key Abstractions
**PlatformConfig (Ingest Registry):**
- Purpose: Maps `(platform, channel, data_type)` to target ClickHouse table and dedup field extractors
- Examples: `app/services/ingest/registry.py`, `app/services/ingest/configs/`
- Pattern: Immutable `@dataclass(frozen=True)`, registered at module import time via `register()`
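A registry of this kind might look like the sketch below. Only `PlatformConfig`, `register()`, the `(platform, channel, data_type)` key, and the `job_data.boss_job` table name come from this document; the field names, the `lookup()` helper, the `"app"` channel, and the `job_id` dedup field are assumptions.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class PlatformConfig:
    platform: str
    channel: str
    data_type: str
    table: str  # target ClickHouse table
    dedup_key: Callable[[dict[str, Any]], str]  # extracts the dedup value


_REGISTRY: dict[tuple[str, str, str], PlatformConfig] = {}


def register(config: PlatformConfig) -> PlatformConfig:
    """Add a config; called at module import time by each platform module."""
    key = (config.platform, config.channel, config.data_type)
    if key in _REGISTRY:
        raise ValueError(f"duplicate registration for {key}")
    _REGISTRY[key] = config
    return config


def lookup(platform: str, channel: str, data_type: str) -> PlatformConfig:
    """Resolve the config for an incoming batch (hypothetical helper)."""
    try:
        return _REGISTRY[(platform, channel, data_type)]
    except KeyError:
        raise LookupError(
            f"no config for {(platform, channel, data_type)}"
        ) from None


# Hypothetical registration; the channel and field name are illustrative only.
register(
    PlatformConfig(
        platform="boss",
        channel="app",
        data_type="job",
        table="job_data.boss_job",
        dedup_key=lambda row: str(row["job_id"]),
    )
)
```

Freezing the dataclass means a config cannot be altered after registration, so every lookup for a given key returns the same routing rules.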
**DistributedLock:**
- Purpose: Prevents concurrent execution of scheduled jobs across multiple Uvicorn workers
- Examples: `app/core/locks.py`
- Pattern: Redis SET NX with TTL; falls back to filesystem directory creation with TTL metadata
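The filesystem fallback can be sketched as below. It relies on `os.mkdir` being atomic, so only one process can create the lock directory. The class name, the `meta.json` file, and its fields are hypothetical, and the Redis path is not shown. Taking over a stale lock is racy in this simplified form.

```python
import json
import os
import shutil
import time


class DirLock:
    """Hypothetical directory-based lock with TTL metadata."""

    def __init__(self, path: str, ttl: float) -> None:
        self.path = path
        self.ttl = ttl
        self._meta = os.path.join(path, "meta.json")

    def _expired(self) -> bool:
        try:
            with open(self._meta, encoding="utf-8") as fh:
                return time.time() >= json.load(fh)["expires_at"]
        except (OSError, ValueError, KeyError):
            return False  # unreadable metadata: treat the lock as still held

    def acquire(self) -> bool:
        for _ in range(2):  # the second pass runs only after clearing a stale lock
            try:
                os.mkdir(self.path)  # atomic: fails if the directory exists
            except FileExistsError:
                if not self._expired():
                    return False
                shutil.rmtree(self.path, ignore_errors=True)
                continue
            with open(self._meta, "w", encoding="utf-8") as fh:
                json.dump(
                    {"pid": os.getpid(), "expires_at": time.time() + self.ttl},
                    fh,
                )
            return True
        return False

    def release(self) -> None:
        shutil.rmtree(self.path, ignore_errors=True)
```

The TTL matters because a worker that crashes while holding the lock never calls `release()`; without expiry the job would stay blocked until someone removed the directory by hand.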
**ClickHouseManager:**
- Purpose: Singleton lazy-init connection wrapper around `clickhouse_connect.AsyncClient`
- Examples: `app/core/clickhouse.py`
- Pattern: Global module-level instance `clickhouse_manager`, injected into services at call time via `await clickhouse_manager.get_client()`
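The lazy-init singleton pattern can be illustrated without a real ClickHouse connection. In this sketch the `factory` coroutine stands in for whatever creates the async client; the class name and constructor are assumptions, and only `get_client()` mirrors the name used in this document.

```python
import asyncio
from typing import Any, Awaitable, Callable, Optional


class LazyClientManager:
    """Creates the client on first use and reuses it afterwards."""

    def __init__(self, factory: Callable[[], Awaitable[Any]]) -> None:
        self._factory = factory
        self._client: Optional[Any] = None
        self._lock: Optional[asyncio.Lock] = None

    async def get_client(self) -> Any:
        if self._client is not None:
            return self._client  # fast path once initialized
        if self._lock is None:
            # Created lazily so the lock belongs to the running event loop.
            self._lock = asyncio.Lock()
        async with self._lock:
            if self._client is None:  # re-check after waiting for the lock
                self._client = await self._factory()
        return self._client
```

The second check inside the lock ensures that several coroutines calling `get_client()` at startup still trigger only one connection attempt.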
**BaseFetcher / BaseSearcher:**
- Purpose: Abstract base for platform-specific API clients within the in-process crawler wrappers
- Examples: `app/services/crawler/_base.py`
- Pattern: Template method — subclasses implement `_build_params()`, base handles HTTP invocation and `ApiResult` parsing
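A stripped-down version of the template method could look like this. The `ApiResult` fields, the `fetch()` entry point, and the `EchoFetcher` subclass are assumptions for illustration; the real base class performs HTTP calls, which are replaced here by an overridable `_request()` stand-in.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ApiResult:
    success: bool
    data: Any = None
    error: Optional[str] = None


class BaseFetcher(ABC):
    @abstractmethod
    def _build_params(self, **kwargs: Any) -> dict[str, Any]:
        """Subclasses translate caller arguments into platform parameters."""

    def _request(self, params: dict[str, Any]) -> Any:
        """Transport hook; the real base class would issue the HTTP call."""
        raise NotImplementedError

    def fetch(self, **kwargs: Any) -> ApiResult:
        """Template method: build params, send, wrap the outcome."""
        try:
            payload = self._request(self._build_params(**kwargs))
        except Exception as exc:  # any failure becomes a failed ApiResult
            return ApiResult(False, error=str(exc))
        return ApiResult(True, data=payload)


class EchoFetcher(BaseFetcher):
    """Hypothetical subclass that echoes its parameters back."""

    def _build_params(self, **kwargs: Any) -> dict[str, Any]:
        return {"companyId": kwargs["company_id"]}

    def _request(self, params: dict[str, Any]) -> Any:
        return {"echo": params}
```

Callers only ever see an `ApiResult`, so platform-specific failures do not leak out as unhandled exceptions.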
**BaseModel (Tortoise):**
- Purpose: Common ORM base with `BigIntField pk`, `to_dict()` async serializer with M2M support
- Examples: `app/models/base.py`
- Pattern: Mixed-in with `TimestampMixin` (created_at, updated_at auto fields)
## Entry Points
**Application Entry:**
- Location: `run.py`
- Triggers: `python run.py` — reads `APP_HOST`, `APP_PORT`, `UVICORN_WORKERS` env vars; calls `uvicorn.run("app:app")`
- Responsibilities: Configure uvicorn logging format, start ASGI server
**FastAPI App Factory:**
- Location: `app/__init__.py` → `create_app()`
- Triggers: Uvicorn import of `app:app`
- Responsibilities: Register middlewares, exception handlers, routers, static files; define lifespan
**Lifespan Bootstrap:**
- Location: `app/__init__.py` (lifespan context manager) + `app/core/init_app.py` (`init_data()`)
- Triggers: FastAPI startup event
- Responsibilities: Tortoise-ORM init + schema generation → Aerich migrations → seed data (superuser, menus, APIs, roles) → ClickHouse table init → APScheduler start
**Router Aggregation:**
- Location: `app/api/v1/__init__.py`
- Triggers: Imported by `app/api/__init__.py` → `app/__init__.py`
- Responsibilities: Mount all domain routers under `/api/v1/` with appropriate `DependPermission` guards
**ECS Pipeline:**
- Location: `ecs_full_pipeline.py`
- Triggers: Manual execution or scheduled via `scheduler.py` (`ecs_full_pipeline_job`) every 6 hours
- Responsibilities: Create Alibaba Cloud ECS spot instances, push spider scripts, execute crawlers, terminate instances
## Error Handling
**Strategy:** Exception handler registration at FastAPI app level; structured error propagation up through layers
**Patterns:**
- Tortoise `DoesNotExist` → 404 via `DoesNotExistHandle` (registered in `app/core/init_app.py`)
- Pydantic `RequestValidationError` → 422 with field-level detail via `RequestValidationHandle`
- `IntegrityError` (DB constraint) → 400 via `IntegrityHandle`
- Custom `HTTPException` (from `app/core/exceptions.py`) → structured JSON error via `HttpExcHandle`
- Service layer exceptions logged with `loguru` then re-raised or returned as `{"success": False, "error": "..."}`
- Scheduled jobs catch all exceptions, record failure via `_record_task_run()` to `ScheduledTaskRun` table, then continue
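The catch-record-continue behaviour of scheduled jobs can be expressed as a decorator. This is a synchronous sketch under stated assumptions: the `TASK_RUNS` list stands in for the `ScheduledTaskRun` table, and `record_task_run` and `safe_job` are hypothetical names. The actual jobs may be async and structured differently.

```python
import functools
import time
from typing import Any, Callable, Optional

TASK_RUNS: list[dict[str, Any]] = []  # stand-in for the ScheduledTaskRun table


def record_task_run(
    name: str, status: str, duration: float, error: Optional[str] = None
) -> None:
    TASK_RUNS.append(
        {"name": name, "status": status, "duration": duration, "error": error}
    )


def safe_job(func: Callable[..., Any]) -> Callable[..., Any]:
    """Record the outcome of a job and never let an exception escape."""

    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        started = time.monotonic()
        try:
            result = func(*args, **kwargs)
        except Exception as exc:
            record_task_run(
                func.__name__, "failed", time.monotonic() - started, repr(exc)
            )
            return None  # swallow the error so the scheduler keeps running
        record_task_run(func.__name__, "success", time.monotonic() - started)
        return result

    return wrapper
```

Swallowing the exception keeps one failing job from affecting the others, while the recorded row preserves the failure for later inspection.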
## Cross-Cutting Concerns
**Logging:** `loguru`, imported from `app/log/__init__.py`; structured `logger.info/error/warning` calls throughout; external spiders log to daily rotating files in `jobs_spider/*/logs/`
**Validation:** Pydantic v2 schemas in `app/schemas/` validate all inbound HTTP request bodies; ClickHouse inserts use registry-defined column extractors for type safety
**Authentication:** JWT HS256 tokens issued by `/api/v1/base/access_token`; verified per-request by `AuthControl.is_authed()` injected via `DependAuth` / `DependPermission`; dev bypass (`token: dev`) available when `APP_ENV=development`
**Audit Logging:** `HttpAuditLogMiddleware` records all non-excluded mutating requests (method, path, user, request args, response body, timing) to MySQL `auditlog` table
---
*Architecture analysis: 2026-03-21*