JobData/CLAUDE.md
2026-03-21 17:00:12 +08:00

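# JobData - Recruitment Data Collection and Analytics Platform
## Changelog
| Version | Date | Notes |
|------|------|------|
| Initial | 2026-03-20 | First generated architecture document, covering all four core modules |
---
## Project Vision
JobData is a full-stack data collection and analytics platform for the recruitment market. It automatically crawls job and company data from three major job platforms (Boss Zhipin, 51job, and Zhilian Zhaopin), stores it in the ClickHouse columnar database, and exposes data browsing, targeted cleaning, and statistical analysis through a FastAPI backend and a Vue3 frontend. The ECS elastic-instance management module starts and stops crawler nodes in batches on Alibaba Cloud on demand.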
---
## Architecture Overview
```
[Job platforms]        [Crawler layer]                [Backend API]        [Database]
Boss Zhipin      ──►   jobs_spider/boss/     ──►                           ┌── MySQL
51job            ──►   jobs_spider/qcwy/     ──►   app (FastAPI)   ──►     └── ClickHouse
Zhilian Zhaopin  ──►   jobs_spider/zhilian/  ──►
                                                   web (Vue3) ────────┘
                                                   (frontend pages)
ecs_full_pipeline.py ──► Alibaba Cloud ECS ──► batch-launch crawler nodes
```
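**Core technology stack**
| Layer | Technology |
|------|------|
| Backend | Python 3.13, FastAPI 0.111, Tortoise-ORM 0.23, APScheduler |
| Database (business) | MySQL (users, permissions, audit logs, keywords, tokens) |
| Database (collected data) | ClickHouse (raw job/company JSON data plus analytics views) |
| Crawlers | requests / httpx / Playwright (Python scripts) |
| Frontend | Vue 3.3, Vite 4, Naive UI, Pinia, ECharts |
| Infrastructure | Alibaba Cloud ECS (pay-as-you-go spot instances), APScheduler scheduled jobs |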
---
## Module Structure Diagram
```mermaid
graph TD
ROOT["(root) JobData"] --> APP["app - FastAPI backend"]
ROOT --> WEB["web - Vue3 frontend"]
ROOT --> SPIDER["jobs_spider - platform crawlers"]
ROOT --> ECS["ecs_full_pipeline.py - ECS batch deployment"]
APP --> APP_API["app/api - routing layer"]
APP --> APP_SVC["app/services - business logic"]
APP --> APP_CORE["app/core - framework core"]
APP --> APP_MODELS["app/models - ORM models"]
APP --> APP_REPO["app/repositories - data repositories"]
SPIDER --> SP_BOSS["jobs_spider/boss"]
SPIDER --> SP_QCWY["jobs_spider/qcwy"]
SPIDER --> SP_ZL["jobs_spider/zhilian"]
click APP "./app/CLAUDE.md" "View the app module docs"
click WEB "./web/CLAUDE.md" "View the web module docs"
click SPIDER "./jobs_spider/CLAUDE.md" "View the jobs_spider module docs"
```
---
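## Module Index
| Module path | Language | Responsibility |
|----------|------|----------|
| `app/` | Python | FastAPI backend providing the REST API, permission management, scheduled jobs, data ingestion, and analytics |
| `web/` | Vue3/JS | Frontend admin UI: data display, keyword management, proxy management, data-cleaning operations |
| `jobs_spider/` | Python | Crawler scripts for the three platforms; they run independently and push results to the backend over HTTP |
| `ecs_full_pipeline.py` | Python | End-to-end script for batch creation, teardown, and command dispatch on Alibaba Cloud ECS instances |
| `reclean_qcwy_jobs.py` | Python | Standalone script for re-cleaning 51job data |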
---
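## Running and Development
### Starting the backend
```bash
# Install dependencies (pipenv)
pipenv install
# Development mode (default port 9999, 20 workers)
python run.py
# Override via environment variables
APP_HOST=0.0.0.0 APP_PORT=9999 UVICORN_WORKERS=4 python run.py
```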
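**Key environment variables**
| Variable | Default | Description |
|------|--------|------|
| `APP_HOST` | `0.0.0.0` | Listen address |
| `APP_PORT` | `9999` | Listen port |
| `UVICORN_WORKERS` | `20` | Number of workers |
| `CLICKHOUSE_HOST` | `121.4.126.241` | ClickHouse address (change to your actual address) |
| `CLICKHOUSE_USER` / `CLICKHOUSE_PASS` | see config.py | ClickHouse authentication |
| `SMTP_HOST` / `SMTP_USER` / `SMTP_PASS` | see config.py | Email alert configuration |
| `REPORT_ENDPOINT` | empty | Webhook endpoint for reporting statistics |
| `RUN_MIGRATIONS_ON_STARTUP` | `True` | Whether to run migrations automatically at startup |
| `INITIALIZE_SEED_DATA_ON_STARTUP` | `True` | Whether to initialize seed data at startup |
> Security warning: `SECRET_KEY`, the database connection strings, and the SMTP password in `config.py` are hard-coded defaults; production deployments must override them through environment variables.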
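
As a rough illustration of how the overrides in the table resolve, the sketch below reads the three server variables with their documented defaults. It assumes plain `os.environ` lookups; `ServerConfig` and `load_server_config` are hypothetical names, not the project's settings class in `app/settings/config.py`.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ServerConfig:  # hypothetical container, not the project's Settings class
    host: str
    port: int
    workers: int


def load_server_config(env: Optional[Mapping[str, str]] = None) -> ServerConfig:
    """Resolve server settings from environment variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return ServerConfig(
        host=env.get("APP_HOST", "0.0.0.0"),
        port=int(env.get("APP_PORT", "9999")),
        workers=int(env.get("UVICORN_WORKERS", "20")),
    )


if __name__ == "__main__":
    # Only UVICORN_WORKERS is overridden; host and port keep their defaults.
    print(load_server_config({"UVICORN_WORKERS": "4"}))
```

Passing the mapping explicitly keeps the function easy to test without mutating the real process environment.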
### Starting the frontend
```bash
cd web
pnpm install
pnpm dev # Development mode, default http://localhost:5173
pnpm build # Build output goes to web/dist
```
### Batch crawler deployment on ECS
```bash
# Requires Alibaba Cloud credentials (environment variables or ~/.alibabacloud/credentials)
python ecs_full_pipeline.py
```
---
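## Scheduled Jobs
APScheduler registers the following jobs at application startup (`app/core/scheduler.py`):
| Job ID | Frequency | Responsibility |
|---------|------|------|
| `stats_job` | Every 6 hours | Counts total rows in each ClickHouse table and reports them by email/Webhook |
| `ecs_full_pipeline` | Every 6 hours | Invokes `ecs_full_pipeline.py` to refresh crawler nodes in batch |
| `ip_alert_job` | Every 10 minutes | Checks for IP reporting anomalies and raises alerts |
| `company_cleaning_job` | Every 5 minutes | Automatically cleans pending company data (collect 50 + process 30) |
| `daily_cleanup_job` | Daily at 00:05 | Purges historical job-run records |
All jobs use a distributed file lock (or, optionally, a Redis lock) so that each runs only once across multiple workers.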
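
The file-lock fallback can be pictured with the minimal sketch below. It assumes a lock is a directory created with `os.mkdir` (which is atomic) plus a small JSON file holding an expiry time. `FileLock`, `meta.json`, and the field names are hypothetical; the real implementation lives in `app/core/locks.py` and may differ.

```python
import json
import os
import shutil
import tempfile
import time


class FileLock:
    """Hypothetical directory-based lock: only one process can create the directory."""

    def __init__(self, lock_dir: str, ttl_seconds: float) -> None:
        self.lock_dir = lock_dir
        self.ttl_seconds = ttl_seconds
        self._meta = os.path.join(lock_dir, "meta.json")

    def _is_stale(self) -> bool:
        try:
            with open(self._meta, "r", encoding="utf-8") as fh:
                return time.time() >= json.load(fh)["expires_at"]
        except (OSError, ValueError, KeyError):
            # Metadata missing or unreadable: the owner may still be writing it, so do not steal.
            return False

    def acquire(self) -> bool:
        if os.path.isdir(self.lock_dir) and self._is_stale():
            shutil.rmtree(self.lock_dir, ignore_errors=True)  # reclaim an expired lock
        try:
            os.mkdir(self.lock_dir)
        except FileExistsError:
            return False  # another worker holds the lock
        with open(self._meta, "w", encoding="utf-8") as fh:
            json.dump({"pid": os.getpid(), "expires_at": time.time() + self.ttl_seconds}, fh)
        return True

    def release(self) -> None:
        shutil.rmtree(self.lock_dir, ignore_errors=True)


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "stats_job.lock")
    first, second = FileLock(path, 60), FileLock(path, 60)
    # The second caller is rejected while the first still holds the lock.
    print(first.acquire(), second.acquire())
    first.release()
```

This sketch is not race-free when two workers reclaim the same expired lock at once, and a lock whose owner crashed before writing its metadata is never reclaimed. A Redis `SET NX` with a TTL avoids both problems.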
---
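## Testing Strategy
- The codebase currently has **no automated test files** (gap: both unit and integration tests are missing).
- Recommended additions:
  1. Unit tests for the service layer in `app/services/` (using `pytest` + `anyio`)
  2. API integration tests for `app/api/v1/` (using `httpx.AsyncClient`)
  3. Unit tests for the data-parsing functions in `jobs_spider/`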
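
A parsing-function unit test of the kind recommended above might look like the sketch below. `parse_salary_range` and the `15-25K` input format are invented for illustration and are not taken from the spiders. The `test_*` functions follow the pytest naming convention but are plain functions, so they also run without pytest installed.

```python
import re
from typing import Optional, Tuple


def parse_salary_range(text: str) -> Optional[Tuple[int, int]]:
    """Hypothetical parser: turn a string such as '15-25K' into (15000, 25000)."""
    match = re.fullmatch(r"\s*(\d+)\s*-\s*(\d+)\s*[kK]\s*", text)
    if match is None:
        return None
    low, high = int(match.group(1)), int(match.group(2))
    return low * 1000, high * 1000


def test_parses_k_suffixed_range() -> None:
    assert parse_salary_range("15-25K") == (15000, 25000)


def test_rejects_unparseable_text() -> None:
    assert parse_salary_range("negotiable") is None
```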
---
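## Coding Conventions
- Python: lint with `ruff` (already in the Pipfile), format with `black`, sort imports with `isort`.
- Frontend: ESLint (`@zclzone` + `@unocss` rule sets), formatted with `prettier`.
- Types: the backend enforces `pydantic` schemas for input validation; the frontend is mostly JS (strict TS is not enabled).
- Logging: the backend uses `loguru` throughout and logs through `logger.info(...)`-style calls with structured fields.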
---
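## AI Usage Guidelines
- When modifying crawler logic, pay close attention to the anti-crawling countermeasures: `SmartIPManager` and `IPAnomalyDetector` are implemented in `jobs_spider/boss/boos_api.py`, and the random delay is at least 10 seconds.
- After adding an API route, register it in `app/api/v1/__init__.py` and run `api_controller.refresh_api()` to update the permissions table.
- ClickHouse schema changes are maintained in `app/core/clickhouse_init.py` and **do not go through Aerich migrations**.
- MySQL model changes go through Aerich (`aerich migrate && aerich upgrade`).
- New frontend pages must be registered as menus in both `web/src/views/{module}/route.js` and the backend `init_menus()`.
- `config.py` contains hard-coded real MySQL/ClickHouse connection strings and SMTP credentials; **before committing, make sure no sensitive information is leaked**.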
<!-- GSD:project-start source:PROJECT.md -->
## Project
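**JobData crawler interaction refactor**
A refactoring project for the crawler interaction layer of the recruitment data collection platform. It rewrites the crawler code for the three platforms (Boss Zhipin, 51job, and Zhilian Zhaopin), unifies the base classes and shared core logic, and improves ingestion dedup, the company-information cleaning flow, and the frontend monitoring pages.
**Core Value:** Keyword-driven crawlers fetch job data and reliably store it in ClickHouse, while scheduled jobs handle company-information collection and sync.
### Constraints
- **Compatibility**: Existing crawler data collection must not be interrupted during the refactor
- **Shared core**: External scripts and the backend must use the same signing/request/parsing code to avoid duplication
- **Database**: ClickHouse table schemas stay unchanged; MySQL model changes go through Aerich migrations
- **Anti-crawling**: Existing anti-crawling measures must be kept (IP rotation, random delays, proxy management)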
<!-- GSD:project-end -->
<!-- GSD:stack-start source:codebase/STACK.md -->
## Technology Stack
## Languages
- Python 3.13 - Backend API, crawlers, ECS pipeline scripts
- JavaScript (ES Modules) - Vue3 frontend (no strict TypeScript; TS installed but JS used in `.vue` and `.js` files)
- TypeScript 5.1.6 - Type definitions referenced in frontend toolchain only
## Runtime
- Python 3.13 (required: `>=3.13` per `Pipfile`; Dockerfile uses `python:3.11-slim-bullseye` for container build)
- Node.js 18 (Dockerfile `node:18-alpine` for frontend build stage)
- Python: `pipenv` (development), `pip` / `requirements.txt` (Docker production)
- Frontend: `pnpm` (lockfile: `web/pnpm-lock.yaml` present)
- Lockfiles: `Pipfile.lock` (Python), `web/pnpm-lock.yaml` (frontend), `uv.lock` (uv-compatible)
## Frameworks
- FastAPI 0.111.0 - REST API framework (`app/__init__.py` factory via `create_app()`)
- Starlette 0.37.2 - ASGI underpinning (middleware, static files)
- Uvicorn 0.34.0 - ASGI server (20 workers default, configurable via `UVICORN_WORKERS`)
- Uvloop 0.21.0 - High-performance event loop (non-Windows only)
- Tortoise-ORM 0.23.0 - Async ORM for MySQL (`app/models/`)
- Aerich 0.8.1 - Database migrations for Tortoise-ORM (`migrations/` directory)
- aiomysql - Async MySQL driver (Tortoise backend)
- clickhouse-connect 0.7.19 - ClickHouse async client (`app/core/clickhouse.py`)
- asyncpg - Async PostgreSQL driver (listed in Pipfile but not actively used)
- Pydantic 2.10.5 - Request/response schemas (`app/schemas/`)
- pydantic-settings 2.7.1 - Settings management via env vars (`app/settings/config.py`)
- orjson 3.10.14 - Fast JSON serialization
- ujson 5.10.0 - Alternative JSON library
- PyJWT 2.10.1 - JWT token generation/verification (HS256, 7-day expiry)
- passlib 1.7.4 - Password hashing
- argon2-cffi 23.1.0 - Argon2 password hasher
- APScheduler - Async job scheduler (`app/core/scheduler.py`, 6 registered cron tasks)
- httpx 0.28.1 - Async HTTP client (backend service-to-service calls)
- requests - Sync HTTP client (spider scripts in `jobs_spider/`)
- loguru 0.7.3 - Structured logging (unified throughout backend and spiders)
- Vue 3.3.4 - SPA framework (`web/src/main.js`)
- Vue Router 4.2.4 - Client-side routing (`web/src/router/index.js`)
- Pinia 2.1.6 - State management (`web/src/store/`)
- Naive UI 2.34.4 - Component library (admin UI)
- ECharts 6.0.0 - Data visualization charts (`web/src/views/analytics/`)
- axios 1.4.0 - HTTP client (`web/src/utils/http/`)
- vue-i18n 9 - Internationalization
- Vite 4.4.6 - Frontend bundler (`web/vite.config.js`)
- UnoCSS 66.5.10 - Atomic CSS engine
- unplugin-auto-import 20.3.0 - Auto-imports for Vue APIs
- unplugin-vue-components 30.0.0 - Auto-imports for components
- @iconify/vue + @iconify/json - Icon library
- playwright 1.57.0 - Browser automation (anti-detection in crawlers)
- PyExecJS 1.5.1 - Execute JavaScript signing algorithms from Python (`jobs_spider/boss/`)
- PySocks - SOCKS proxy support for spider requests
- tenacity - Retry logic in crawlers
- pandas + openpyxl - Data processing and Excel export
- alibabacloud_ecs20140526 - Alibaba Cloud ECS SDK (`ecs_full_pipeline.py`)
- alibabacloud_credentials - AliCloud credential management
- ruff 0.9.1 - Python linter (`pyproject.toml` config: line-length 120, ignores F403/F405)
- black 24.10.0 - Python formatter (line-length 120, target py310/py311)
- isort 5.13.2 - Import sorter
- ESLint 8.46.0 - Frontend linter (`@zclzone` + `@unocss` rule sets)
- prettier - Frontend formatter
## Key Dependencies
- `clickhouse-connect==0.7.19` - All analytics and job data storage; loss means no data read/write
- `tortoise-orm==0.23.0` - All business data (users, roles, keywords, tokens); paired with `aerich` for migrations
- `fastapi==0.111.0` - API layer; version-pinned for stability
- `APScheduler` - 6 scheduled tasks including ECS pipeline and IP alerting
- `alibabacloud_ecs20140526` - ECS node management; required for crawler scaling
- `uvicorn==0.34.0` + `uvloop==0.21.0` - Production ASGI server stack
- `pydantic==2.10.5` - All input validation; v2 API used throughout
- `loguru==0.7.3` - Unified logging across all modules
- `redis` - Optional distributed lock backend (`app/core/locks.py`; falls back to file locks if Redis unavailable)
## Configuration
- All settings in `app/settings/config.py` via `pydantic-settings.BaseSettings`
- Environment variables override defaults at startup
- No `.env` file detected (not committed); variables set at OS/container level
- Key variables: `APP_HOST`, `APP_PORT`, `UVICORN_WORKERS`, `CLICKHOUSE_HOST`, `CLICKHOUSE_USER`, `CLICKHOUSE_PASS`, `SECRET_KEY`, `SMTP_HOST`, `SMTP_USER`, `SMTP_PASS`, `REDIS_HOST`, `REPORT_ENDPOINT`
- **Security warning**: `config.py` contains hardcoded default values for MySQL, ClickHouse, and SMTP credentials; must be overridden in production
- `pyproject.toml` - Python project metadata, black/ruff tool config
- `Pipfile` / `Pipfile.lock` - Development dependency management
- `requirements.txt` - Production pip install (used in Dockerfile)
- `web/vite.config.js` - Vite build config with proxy support via `VITE_USE_PROXY` env var
- `Dockerfile` - Multi-stage build: Node 18 (frontend) + Python 3.11 (backend) + nginx
## Platform Requirements
- Python 3.13+ (Pipfile requirement)
- Node.js 18+ with pnpm
- MySQL server (Tortoise-ORM connection)
- ClickHouse server (analytics data)
- Optional: Redis (distributed locking upgrade)
- Docker-based deployment (see `Dockerfile`, `deploy/entrypoint.sh`, `deploy/web.conf`)
- Nginx serves frontend static files and proxies API requests
- Alibaba Cloud ECS (cn-shanghai-b zone) for crawler nodes
- Ports: 80 (nginx), 9999 (FastAPI backend direct)
<!-- GSD:stack-end -->
<!-- GSD:conventions-start source:CONVENTIONS.md -->
## Conventions
## Naming Patterns
### Files
- Snake_case for all Python files: `company_storage.py`, `company_cleaner.py`, `clickhouse_repo.py`
- Private/internal modules prefixed with underscore: `_base.py`, `_boss_api.py`, `_boss_client.py`, `_boss_sign.py`, `_http_client.py`
- Platform-named service files: `boss.py`, `qcwy.py`, `zhilian.py` under `app/services/crawler/`
- Router files named after domain: `keyword.py`, `analytics.py`, `cleaning.py`
### Classes
- PascalCase throughout: `CleaningService`, `KeywordController`, `ClickHouseBaseRepo`, `JobAnalyticsRepo`
- Services: `{Domain}Service`, e.g. `BossService`, `QcwyService`, `ZhilianService`, `IngestService`, `AnalyticsService`
- Controllers: `{Domain}Controller`, e.g. `KeywordController`
- Repos: `{Domain}Repo` or `{Domain}BaseRepo`, e.g. `ClickHouseBaseRepo`, `JobAnalyticsRepo`
- Models (Tortoise ORM): `{Platform}{Entity}`, e.g. `BossKeyword`, `QcwyCompany`, `ZhilianCompany`
- Schemas (Pydantic): `{Entity}Base`, `{Entity}Create`, `{Entity}Update`, `{Entity}Out` — see `app/schemas/keyword.py`
### Functions
- Snake_case for all functions and methods: `get_available`, `report_page_progress`, `store_batch`, `build_insert_row`
- Private helpers prefixed with underscore: `_apply_proxy`, `_ensure_boss_token_loaded`, `_pick_first`, `_nested_get`, `_clean_text`, `_model_for_source`
- Async dependency factories follow pattern `get_{service/controller}()`: `get_ingest_service`, `get_analytics_service`, `get_keyword_controller`
### Variables
- Snake_case: `data_list`, `platform_type`, `check_duplicate`, `page_size`
- Module-level constants: UPPER_SNAKE_CASE — `COMPANY_SOURCES`, `QUEUE_TERMINAL_STATUSES`
- Class-level constants: UPPER_SNAKE_CASE prefixed with `_`, e.g. `_TOKEN_REFRESH_INTERVAL = 3600`
### Types
- Enums use PascalCase class name, UPPER_SNAKE_CASE values: `PlatformType.BOSS`, `ChannelType.MINI`, `DataType.JOB`
- Enum values are lowercase strings matching URL slugs: `"boss"`, `"mini"`, `"job"` — see `app/schemas/ingest.py`
- Enums inherit from `(str, Enum)` enabling direct string comparison
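
The enum convention can be illustrated with the short sketch below. `PlatformType.BOSS` and `DataType.JOB` with their string values come from the list above; the other member names are assumptions inferred from the platform slugs, and `is_boss` is a hypothetical helper.

```python
from enum import Enum


class PlatformType(str, Enum):
    BOSS = "boss"
    QCWY = "qcwy"        # assumed member name
    ZHILIAN = "zhilian"  # assumed member name


class DataType(str, Enum):
    JOB = "job"
    COMPANY = "company"  # assumed member


def is_boss(platform: str) -> bool:
    """Accepts raw strings and enum members alike, because members are str instances."""
    return platform == PlatformType.BOSS


if __name__ == "__main__":
    print(is_boss("boss"), is_boss(PlatformType.BOSS), is_boss("qcwy"))
    # A URL slug can be turned back into a member by value lookup.
    print(PlatformType("zhilian") is PlatformType.ZHILIAN)
```

Because the members are real strings, a path parameter such as `"boss"` can be compared against `PlatformType.BOSS` without calling `.value`.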
## Code Style
- Tool: `black` v24.10.0
- Line length: 120 characters (set in `pyproject.toml` `[tool.black]` and `[tool.ruff]`)
- Target Python versions: 3.10, 3.11 (black), 3.13 (Pipfile)
- Tool: `ruff` v0.9.1 (configured in `pyproject.toml`)
- Ignored rules: `F403` (star imports), `F405` (may be undefined from star import)
- Star imports from internal modules are allowed (used in `app/models/__init__.py`, `app/services/ingest/__init__.py`)
- Tool: `isort` v5.13.2
- No explicit isort config found; follows default ordering
## Import Organization
- None; all imports use full `app.` prefix paths
- `from app.log import logger` is the canonical loguru import path
- Used only in `__init__.py` re-export files: `from .admin import *` in `app/models/__init__.py`
- `# noqa: F401, F403` comments suppress lint warnings for intentional star imports
## Error Handling
- Services return `Dict[str, Any]` result objects with `"success"`, `"code"`, `"message"` fields instead of raising exceptions to callers
- Controllers return dict with `"code": 200/400/404` and `"message"` for all outcomes
- API route handlers do NOT use try/except — they rely on services returning structured results
- Service methods wrap low-level calls in `try/except Exception as e` and log then return `False` or error dict
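
A minimal sketch of the result-dict convention described above. It assumes an injected async `insert` callable and uses the standard `logging` module as a stand-in for the project's loguru logger. `save_rows`, the `500` code, and the `data` payload shape are illustrative assumptions, not the project's actual service API.

```python
import asyncio
import logging
from typing import Any, Awaitable, Callable, Dict, List

logger = logging.getLogger("ingest")  # stand-in for the project's loguru logger


async def save_rows(
    rows: List[dict],
    insert: Callable[[List[dict]], Awaitable[int]],
) -> Dict[str, Any]:
    """Hypothetical service method: never raises to the caller, always returns a result dict."""
    if not rows:
        return {"success": False, "code": 400, "message": "empty batch"}
    try:
        inserted = await insert(rows)
    except Exception as e:  # broad on purpose, mirroring the convention described above
        logger.error(f"save_rows failed: {e}")
        return {"success": False, "code": 500, "message": str(e)}
    return {"success": True, "code": 200, "message": "ok", "data": {"inserted": inserted}}


if __name__ == "__main__":
    async def fake_insert(rows: List[dict]) -> int:
        return len(rows)

    print(asyncio.run(save_rows([{"id": 1}, {"id": 2}], fake_insert)))
```

With this shape a route handler can return the dict as-is, which is why the handlers themselves need no `try/except`.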
## Logging
- `logger.info(f"...")` for normal operation events
- `logger.warning(f"...")` for non-fatal recoverable issues (e.g., token not found, API soft failures)
- `logger.error(f"...")` for caught exceptions and operation failures
- F-string interpolation used consistently for message formatting
- No structured fields (no `logger.bind()` usage observed)
## API Response Format
## Comments
- Docstrings on public methods describing purpose, not implementation: `"""获取可用关键词,优先返回断点续爬和失败重试的关键词"""`
- Inline comments for priority logic and algorithm steps: `# 优先级 1: 断点续爬 (partial)`
- Module-level docstrings for context: `"""Boss直聘 Service — 基于新算法文件的封装"""`
- `# noqa` comments for intentional lint suppressions
- Not applicable (Python backend)
- Docstrings are brief single-line or short multi-line Chinese descriptions
## Function Design
- Keyword arguments with defaults preferred for optional params
- Pydantic schemas used for HTTP request bodies (never raw dicts from router params)
- `Optional[str]` with `= None` default for optional parameters
- Services return `Dict[str, Any]` with consistent keys (`code`, `message`, `data`)
- Private helpers return `Optional[T]` or primitive types
- Async functions return awaitable results (no mixing of sync/async)
## Module Design
- `app/models/__init__.py` uses `from .{module} import *` to flatten model imports
- Router modules export a single named router variable: `router = APIRouter(...)` or `{domain}_router = APIRouter(...)`
- Service classes are imported directly by name
- `app/models/__init__.py` — re-exports all model classes
- `app/services/ingest/__init__.py` — re-exports `IngestService` and config registrations
- `app/api/v1/__init__.py` — aggregates all routers into `v1_router`
- FastAPI `Depends()` used for service/controller instantiation in route handlers
- Dependency factory functions named `get_{service}()` and defined in the same file as the router
- Shared auth dependencies: `DependAuth`, `DependPermission` in `app/core/dependency.py`
## Tortoise ORM Model Conventions
## Pydantic Schema Conventions
- All schemas inherit from `pydantic.BaseModel`
- All fields use `Field(...)` with `description=` for documentation
- Enums inherit from `(str, Enum)` for JSON serialization compatibility
- Output schemas include `class Config: from_attributes = True` to support ORM mode
- Validation patterns use `Field(..., pattern="^(boss|qcwy|zhilian)$")` for enum-like string fields
<!-- GSD:conventions-end -->
<!-- GSD:architecture-start source:ARCHITECTURE.md -->
## Architecture
## Pattern Overview
- FastAPI async backend with dual-database persistence (MySQL for RBAC/metadata, ClickHouse for job/company analytics data)
- External crawlers (`jobs_spider/`, `spiderJobs/`) run as independent processes and push data to the backend API via HTTP POST — no shared code path at runtime
- RBAC permission model enforced at router-level via FastAPI `Depends`
- APScheduler drives all recurring automation (cleaning, ECS orchestration, alerting) within the main process
- Registry pattern (`app/services/ingest/registry.py`) maps (platform, channel, data_type) tuples to ClickHouse table configs + dedup rules
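
The registry pattern in the last bullet could be sketched as follows. This assumes a frozen dataclass keyed by the `(platform, channel, data_type)` tuple. The field names `table` and `dedup_fields`, the `get_config` helper, and the example table name are hypothetical, and the real registry may store extractor callables rather than field names.

```python
from dataclasses import FrozenInstanceError, dataclass
from typing import Dict, Tuple

RegistryKey = Tuple[str, str, str]  # (platform, channel, data_type)


@dataclass(frozen=True)
class PlatformConfig:
    platform: str
    channel: str
    data_type: str
    table: str                     # target ClickHouse table (name is illustrative)
    dedup_fields: Tuple[str, ...]  # fields whose values identify a duplicate row

    @property
    def key(self) -> RegistryKey:
        return (self.platform, self.channel, self.data_type)


_REGISTRY: Dict[RegistryKey, PlatformConfig] = {}


def register(config: PlatformConfig) -> PlatformConfig:
    """Add a config to the registry; a second registration for the same key is an error."""
    if config.key in _REGISTRY:
        raise ValueError(f"duplicate registration for {config.key}")
    _REGISTRY[config.key] = config
    return config


def get_config(platform: str, channel: str, data_type: str) -> PlatformConfig:
    key = (platform, channel, data_type)
    try:
        return _REGISTRY[key]
    except KeyError:
        raise LookupError(f"no config registered for {key}") from None


# Registration happens at import time, as the module is loaded.
register(PlatformConfig("boss", "mini", "job", table="boss_mini_job", dedup_fields=("job_id",)))
```

Freezing the dataclass means a registered config cannot be mutated by accident after import, so every worker sees the same routing rules.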
## Layers
### Router layer
- Purpose: Parse HTTP request, validate via Pydantic schema, delegate to service or controller, return JSON envelope
- Location: `app/api/v1/`
- Contains: One sub-package per domain (e.g., `job/`, `cleaning/`, `analytics/`, `keyword/`)
- Depends on: Services, Controllers, `app/core/dependency.py` for auth
- Used by: FastAPI application via `app/api/v1/__init__.py` → `app/__init__.py`
### Service layer
- Purpose: Business logic, orchestration across repos/models, side-effect coordination
- Location: `app/services/`
- Contains: `IngestService`, `AnalyticsService`, `CleaningService`, `CompanyCleaner`, `CompanyJobsSyncService`, platform crawler service wrappers (`crawler/boss.py`, `crawler/qcwy.py`, `crawler/zhilian.py`)
- Depends on: Repositories, ORM models, `clickhouse_manager`, external HTTP clients
- Used by: Routers, APScheduler jobs
### Controller layer
- Purpose: Thin CRUD orchestration for MySQL-backed entities (User, Role, API, Menu, Dept, etc.)
- Location: `app/controllers/`
- Contains: `user_controller`, `api_controller`, `role_controller`, `menu_controller`, `dept_controller`, `keyword_controller`, `proxy_controller`, `token_controller`, `cleaning_controller`, `company_controller`
- Depends on: Tortoise-ORM models
- Used by: Routers
### Repository layer
- Purpose: Encapsulate raw ClickHouse query execution behind typed methods
- Location: `app/repositories/`
- Contains: `ClickHouseBaseRepo` (generic query/insert), `JobAnalyticsRepo` (trend, distribution, count queries against `job_analytics` view)
- Depends on: `clickhouse_connect.driver.AsyncClient`
- Used by: `AnalyticsService`
### Model layer
- Purpose: ORM schema definitions for MySQL, migrated via Aerich
- Location: `app/models/`
- Contains: `admin.py` (User, Role, Api, Menu, Dept, AuditLog), `token.py` (BossToken), `cleaning.py` (CleaningTask), `metrics.py` (ScheduledTaskRun, StatsTotal, IpUploadStats), `company.py` (CompanyCleaningQueue), `keyword.py` (BossKeyword, QcwyKeyword, ZhilianKeyword)
- Depends on: `tortoise-orm`
- Used by: Controllers, Services, Scheduler jobs
### Core layer
- Purpose: App wiring, connection management, middleware, cross-cutting utilities
- Location: `app/core/`
- Contains: `init_app.py` (lifespan bootstrap), `scheduler.py` (APScheduler jobs), `clickhouse.py` (ClickHouseManager singleton), `clickhouse_init.py` (ClickHouse DDL), `dependency.py` (AuthControl, PermissionControl), `locks.py` (DistributedLock), `middlewares.py` (BackGroundTaskMiddleware, HttpAuditLogMiddleware, IpTrackingMiddleware), `exceptions.py`, `crud.py`, `ip_tracking.py`
- Depends on: Everything below
- Used by: `app/__init__.py`, all layers above
### Ingest pipeline
- Purpose: Platform-agnostic data ingestion pipeline with registry-driven routing, dedup, and remote push
- Location: `app/services/ingest/`
- Contains: `service.py` (IngestService), `registry.py` (PlatformConfig registry), `dedup.py` (batch dedup filter), `remote_push.py` (async HTTP push to secondary target), `configs/` (per-platform PlatformConfig registrations)
- Depends on: `clickhouse_connect`, registry configs
- Used by: `app/api/v1/job/job.py` router
### Crawler service wrappers
- Purpose: Wrap platform HTTP clients so backend scheduler can call crawlers directly (company cleaning, company search)
- Location: `app/services/crawler/`
- Contains: `_base.py` (BaseFetcher, BaseSearcher, ApiResult), `_http_client.py`, `_boss_client.py`, `_boss_api.py`, `_boss_sign.py`, `_zhilian_client.py`, `_zhilian_api.py`, `_zhilian_sign.py`, `_job51_client.py`, `_job51_api.py`, `_job51_sign.py`, `boss.py`, `qcwy.py`, `zhilian.py`
- Depends on: `requests` (sync), platform sign/auth modules
- Used by: `CompanyCleaner`, `CompanyController`
### External spiders
- Purpose: Standalone crawlers running on ECS nodes; push data to backend via `POST /api/v1/universal/data/batch-store-async`
- Location: `jobs_spider/boss/boos_api.py`, `jobs_spider/qcwy/qcwy.py`, `jobs_spider/zhilian/zhilian_single.py`, `spiderJobs/`
- Depends on: `requests`, `loguru`, `PyExecJS`; no shared Python imports with `app/`
- Used by: Scheduled via `ecs_full_pipeline.py` or manual launch
## Data Flow
- MySQL (via Tortoise-ORM): User sessions, RBAC entities, keyword crawl state, tokens, task run records
- ClickHouse: All raw job/company JSON data + analytics views; no ORM — raw SQL via `clickhouse_connect`
- Context variable `CTX_USER_ID` (Python `contextvars.ContextVar`) carries authenticated user ID within request scope (`app/core/ctx.py`)
- File-based startup lock (`.startup_lock/` dir) prevents multi-worker seed data collisions
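
The request-scoped `CTX_USER_ID` variable can be illustrated with the sketch below. It assumes an integer default of `0` and hypothetical helper names (`current_user_id`, `handle_request`); the actual definition is in `app/core/ctx.py`.

```python
import asyncio
import contextvars
from typing import List

# Default of 0 is an assumption standing in for "no authenticated user".
CTX_USER_ID = contextvars.ContextVar("user_id", default=0)


def current_user_id() -> int:
    """Read the user ID for whichever request is currently executing."""
    return CTX_USER_ID.get()


async def handle_request(user_id: int) -> int:
    token = CTX_USER_ID.set(user_id)
    try:
        await asyncio.sleep(0)  # yield so the other requests interleave
        return current_user_id()
    finally:
        CTX_USER_ID.reset(token)


async def main() -> List[int]:
    # asyncio.gather wraps each coroutine in a Task, and each Task runs in its own copy of the context.
    return list(await asyncio.gather(handle_request(1), handle_request(2), handle_request(3)))


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Each concurrent request reads back its own ID even though all three share one module-level variable, which is what makes the value safe to read deep inside services without passing it as an argument.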
## Key Abstractions
### `PlatformConfig` registry
- Purpose: Maps `(platform, channel, data_type)` to target ClickHouse table and dedup field extractors
- Examples: `app/services/ingest/registry.py`, `app/services/ingest/configs/`
- Pattern: Immutable `@dataclass(frozen=True)`, registered at module import time via `register()`
### `DistributedLock`
- Purpose: Prevents concurrent execution of scheduled jobs across multiple Uvicorn workers
- Examples: `app/core/locks.py`
- Pattern: Redis SET NX with TTL; falls back to filesystem directory creation with TTL metadata
### `ClickHouseManager`
- Purpose: Singleton lazy-init connection wrapper around `clickhouse_connect.AsyncClient`
- Examples: `app/core/clickhouse.py`
- Pattern: Global module-level instance `clickhouse_manager`, injected into services at call time via `await clickhouse_manager.get_client()`
### `BaseFetcher` / `BaseSearcher`
- Purpose: Abstract base for platform-specific API clients within the in-process crawler wrappers
- Examples: `app/services/crawler/_base.py`
- Pattern: Template method — subclasses implement `_build_params()`, base handles HTTP invocation and `ApiResult` parsing
### ORM base model
- Purpose: Common ORM base with `BigIntField pk`, `to_dict()` async serializer with M2M support
- Examples: `app/models/base.py`
- Pattern: Mixed-in with `TimestampMixin` (created_at, updated_at auto fields)
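
The template-method arrangement described for the crawler base classes could look roughly like the sketch below. Only the names `BaseFetcher`, `ApiResult`, and `_build_params()` come from the text above. The injected `transport` callable, the `ApiResult` fields, the `CompanySearcher` subclass, and the example URL are assumptions made so the sketch runs without network access.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional

Transport = Callable[[str, Dict[str, Any]], Dict[str, Any]]  # (url, params) -> decoded JSON


@dataclass
class ApiResult:
    success: bool
    data: Optional[Any] = None
    error: Optional[str] = None


class BaseFetcher:
    url: str = ""

    def __init__(self, transport: Transport) -> None:
        self._transport = transport

    def _build_params(self, **kwargs: Any) -> Dict[str, Any]:
        """Hook that each platform-specific subclass must implement."""
        raise NotImplementedError

    def fetch(self, **kwargs: Any) -> ApiResult:
        # The base class owns the invariant steps: build params, call, wrap the outcome.
        try:
            payload = self._transport(self.url, self._build_params(**kwargs))
        except Exception as e:
            return ApiResult(success=False, error=str(e))
        return ApiResult(success=True, data=payload)


class CompanySearcher(BaseFetcher):  # hypothetical subclass
    url = "https://example.invalid/search"

    def _build_params(self, **kwargs: Any) -> Dict[str, Any]:
        return {"query": kwargs["keyword"], "page": kwargs.get("page", 1)}


if __name__ == "__main__":
    echo = lambda url, params: {"url": url, "params": params}  # fake transport
    print(CompanySearcher(echo).fetch(keyword="acme"))
```

Subclasses only decide what to send; retries, proxies, or signing would slot into `fetch` (or the transport) once and apply to every platform.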
## Entry Points
### Server launcher
- Location: `run.py`
- Triggers: `python run.py` — reads `APP_HOST`, `APP_PORT`, `UVICORN_WORKERS` env vars; calls `uvicorn.run("app:app")`
- Responsibilities: Configure uvicorn logging format, start ASGI server
### Application factory
- Location: `app/__init__.py` (`create_app()`)
- Triggers: Uvicorn import of `app:app`
- Responsibilities: Register middlewares, exception handlers, routers, static files; define lifespan
### Startup lifespan
- Location: `app/__init__.py` (lifespan context manager) + `app/core/init_app.py` (`init_data()`)
- Triggers: FastAPI startup event
- Responsibilities: Tortoise-ORM init + schema generation → Aerich migrations → seed data (superuser, menus, APIs, roles) → ClickHouse table init → APScheduler start
### API router aggregation
- Location: `app/api/v1/__init__.py`
- Triggers: Imported by `app/api/__init__.py` → `app/__init__.py`
- Responsibilities: Mount all domain routers under `/api/v1/` with appropriate `DependPermission` guards
### ECS pipeline script
- Location: `ecs_full_pipeline.py`
- Triggers: Manual execution or scheduled via `scheduler.py` (`ecs_full_pipeline_job`) every 6 hours
- Responsibilities: Create Alibaba Cloud ECS spot instances, push spider scripts, execute crawlers, terminate instances
## Error Handling
- Tortoise `DoesNotExist` → 404 via `DoesNotExistHandle` (registered in `app/core/init_app.py`)
- Pydantic `RequestValidationError` → 422 with field-level detail via `RequestValidationHandle`
- `IntegrityError` (DB constraint) → 400 via `IntegrityHandle`
- Custom `HTTPException` (from `app/core/exceptions.py`) → structured JSON error via `HttpExcHandle`
- Service layer exceptions logged with `loguru` then re-raised or returned as `{"success": False, "error": "..."}`
- Scheduled jobs catch all exceptions, record failure via `_record_task_run()` to `ScheduledTaskRun` table, then continue
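
The catch-record-continue behaviour of scheduled jobs in the last bullet can be sketched with a small decorator. `recorded_job` and the in-memory `TASK_RUNS` list are hypothetical stand-ins for `_record_task_run()` and the `ScheduledTaskRun` table, whose real signatures are not shown here.

```python
import asyncio
import functools
from typing import Any, Awaitable, Callable, Dict, List

Job = Callable[[], Awaitable[Any]]

TASK_RUNS: List[Dict[str, Any]] = []  # in-memory stand-in for the ScheduledTaskRun table


def recorded_job(job_id: str) -> Callable[[Job], Job]:
    """Wrap a job so that any exception is recorded instead of propagating to the scheduler."""

    def decorator(func: Job) -> Job:
        @functools.wraps(func)
        async def wrapper() -> None:
            try:
                await func()
            except Exception as e:
                TASK_RUNS.append({"job_id": job_id, "status": "failed", "error": str(e)})
            else:
                TASK_RUNS.append({"job_id": job_id, "status": "success", "error": None})

        return wrapper

    return decorator


@recorded_job("ip_alert_job")
async def ip_alert_job() -> None:
    raise RuntimeError("alert backend unreachable")  # simulated failure


if __name__ == "__main__":
    asyncio.run(ip_alert_job())  # does not raise
    print(TASK_RUNS)
```

Because the wrapper swallows the exception after recording it, one failing run does not stop the scheduler from firing the next one.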
## Cross-Cutting Concerns
<!-- GSD:architecture-end -->
<!-- GSD:workflow-start source:GSD defaults -->
## GSD Workflow Enforcement
Before using Edit, Write, or other file-changing tools, start work through a GSD command so planning artifacts and execution context stay in sync.
Use these entry points:
- `/gsd:quick` for small fixes, doc updates, and ad-hoc tasks
- `/gsd:debug` for investigation and bug fixing
- `/gsd:execute-phase` for planned phase work
Do not make direct repo edits outside a GSD workflow unless the user explicitly asks to bypass it.
<!-- GSD:workflow-end -->
<!-- GSD:profile-start -->
## Developer Profile
> Profile not yet configured. Run `/gsd:profile-user` to generate your developer profile.
> This section is managed by `generate-claude-profile` -- do not edit manually.
<!-- GSD:profile-end -->