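# JobData - Recruitment Data Collection and Statistical Analysis Platform

## Changelog

| Version | Date | Notes |
|------|------|------|
| Initial | 2026-03-20 | First generated architecture document, covering all four core modules |

---

## Project Vision

JobData is a full-stack data collection and analysis platform for the recruitment market. The system automatically scrapes job and company data from three major recruitment platforms (Boss直聘, 前程无忧, 智联招聘), stores it in a unified form in the ClickHouse columnar database, and exposes data viewing, targeted cleaning, and statistical analysis through a FastAPI backend and a Vue3 frontend. The ECS elastic instance management module starts and stops crawler nodes in batches on Alibaba Cloud on demand.

---

## Architecture Overview

```
[Platforms]     [Crawler layer]             [Backend API]         [Database]
Boss直聘   ──►  jobs_spider/boss/     ──►                      ┌── MySQL
前程无忧   ──►  jobs_spider/qcwy/     ──►  app (FastAPI)  ──►  └── ClickHouse
智联招聘   ──►  jobs_spider/zhilian/  ──►        ▲
                                                 │
                          web (Vue3) ────────────┘
                          (frontend pages)

ecs_full_pipeline.py ──► Alibaba Cloud ECS ──► batch-start crawler nodes
```

**Core technology stack**

| Layer | Technology |
|------|------|
| Backend | Python 3.13, FastAPI 0.111, Tortoise-ORM 0.23, APScheduler |
| Database (business) | MySQL (users/permissions/audit/keywords/tokens) |
| Database (collection) | ClickHouse (raw job/company JSON data + analytics views) |
| Crawlers | requests / httpx / Playwright (Python scripts) |
| Frontend | Vue 3.3, Vite 4, Naive UI, Pinia, ECharts |
| Infrastructure | Alibaba Cloud ECS (pay-as-you-go spot instances), APScheduler scheduled tasks |

---

## Module Structure Diagram

```mermaid
graph TD
    ROOT["(root) JobData"] --> APP["app - FastAPI backend"]
    ROOT --> WEB["web - Vue3 frontend"]
    ROOT --> SPIDER["jobs_spider - platform crawlers"]
    ROOT --> ECS["ecs_full_pipeline.py - ECS batch deployment"]
    APP --> APP_API["app/api - routing layer"]
    APP --> APP_SVC["app/services - business logic"]
    APP --> APP_CORE["app/core - framework core"]
    APP --> APP_MODELS["app/models - ORM models"]
    APP --> APP_REPO["app/repositories - data repositories"]
    SPIDER --> SP_BOSS["jobs_spider/boss"]
    SPIDER --> SP_QCWY["jobs_spider/qcwy"]
    SPIDER --> SP_ZL["jobs_spider/zhilian"]
    click APP "./app/CLAUDE.md" "View the app module docs"
    click WEB "./web/CLAUDE.md" "View the web module docs"
    click SPIDER "./jobs_spider/CLAUDE.md" "View the jobs_spider module docs"
```

---

## Module Index

| Module path | Language | Responsibility |
|----------|------|----------|
| `app/` | Python | FastAPI backend providing the REST API, permission management, scheduled tasks, data ingestion, and analysis |
| `web/` | Vue3/JS | Frontend admin UI for data display, keyword management, proxy management, and data cleaning operations |
| `jobs_spider/` | Python | Crawler scripts for the three platforms; they run independently and push results to the backend over HTTP |
| `ecs_full_pipeline.py` | Python | End-to-end script for batch creation, destruction, and command dispatch of Alibaba Cloud ECS instances |
| `reclean_qcwy_jobs.py` | Python | Standalone script for re-cleaning 前程无忧 data |

---

## Running and Development

### Backend Startup

```bash
# Install dependencies (pipenv)
pipenv install

# Development mode (default port 9999, 20 workers)
python run.py

# Override with environment variables
APP_HOST=0.0.0.0 APP_PORT=9999 UVICORN_WORKERS=4 python run.py
```

**Key environment variables**

| Variable | Default | Description |
|------|--------|------|
| `APP_HOST` | `0.0.0.0` | Listen address |
| `APP_PORT` | `9999` | Listen port |
| `UVICORN_WORKERS` | `20` | Number of workers |
| `CLICKHOUSE_HOST` | `121.4.126.241` | ClickHouse address (must be changed to the actual address) |
| `CLICKHOUSE_USER` / `CLICKHOUSE_PASS` | see config.py | ClickHouse authentication |
| `SMTP_HOST` / `SMTP_USER` / `SMTP_PASS` | see config.py | Email alert configuration |
| `REPORT_ENDPOINT` | empty | Webhook endpoint for reporting statistics |
| `RUN_MIGRATIONS_ON_STARTUP` | `True` | Whether to run migrations automatically at startup |
| `INITIALIZE_SEED_DATA_ON_STARTUP` | `True` | Whether to initialize seed data at startup |

> Security warning: `SECRET_KEY`, the database connection strings, and the SMTP password in `config.py` are all hardcoded defaults and must be overridden through environment variables in production.

### Frontend Startup

```bash
cd web
pnpm install
pnpm dev    # Development mode, default http://localhost:5173
pnpm build  # Build output goes to web/dist
```

### ECS Batch Crawler Deployment

```bash
# Requires Alibaba Cloud credentials (environment variables or ~/.alibabacloud/credentials)
python ecs_full_pipeline.py
```

---

## Scheduled Tasks

APScheduler registers the following tasks at application startup (`app/core/scheduler.py`):

| Task ID | Frequency | Responsibility |
|---------|------|------|
| `stats_job` | Every 6 hours | Counts the totals of each ClickHouse table and reports them by email/Webhook |
| `ecs_full_pipeline` | Every 6 hours | Calls `ecs_full_pipeline.py` to refresh crawler nodes in batch |
| `ip_alert_job` | Every 10 minutes | Checks for IP reporting anomalies and raises alerts |
| `company_cleaning_job` | Every 5 minutes | Automatically cleans pending company data (collect 50 + process 30) |
| `daily_cleanup_job` | Daily at 00:05 | Cleans up historical task run records |

All tasks use a distributed file lock (or an optional Redis lock) so that each one runs only once when multiple workers are active.

---

## Testing Strategy

- The codebase currently has **no automated test files** (gap: both unit tests and integration tests are missing).
- Recommended additions:
  1. Unit tests for the service layer in `app/services/` (using `pytest` + `anyio`)
  2. API integration tests for `app/api/v1/` (using `httpx.AsyncClient`)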
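  3. Unit tests for the data-parsing functions in `jobs_spider/`

---

## Coding Standards

- Python: lint with `ruff` (already in the Pipfile), format with `black`, sort imports with `isort`.
- Frontend: ESLint (`@zclzone` + `@unocss` rule sets), formatted with `prettier`.
- Types: the backend enforces `pydantic` schemas for input validation; the frontend is mostly JS (strict TS is not enabled).
- Logging: the backend uses `loguru` throughout, emitting structured fields through `logger.info(...)`-style calls.

---

## AI Usage Guidelines

- When modifying crawler logic, pay close attention to the anti-scraping mechanisms: `SmartIPManager` and `IPAnomalyDetector` are implemented in `jobs_spider/boss/boos_api.py`, and the random delay is at least 10 seconds.
- After adding an API route, register it in `app/api/v1/__init__.py` and run `api_controller.refresh_api()` to update the permission table.
- ClickHouse table schema changes are maintained in `app/core/clickhouse_init.py` and **do not go through Aerich migrations**.
- MySQL model changes go through Aerich (`aerich migrate && aerich upgrade`).
- A new frontend page must have its menu registered in both `web/src/views/{module}/route.js` and the backend `init_menus()`.
- `config.py` contains hardcoded real MySQL/ClickHouse connection strings and SMTP credentials; **before committing code, make sure no sensitive information is leaked**.

## Project

**JobData Crawler Interaction Refactor**

A refactoring project for the crawler interaction layer of the recruitment data collection platform. It rewrites the crawler code for the three platforms (Boss直聘, 前程无忧, 智联招聘), unifies the base classes and shared core logic, and improves ingest deduplication, the company-information cleaning flow, and the frontend monitoring pages.

**Core Value:** Keyword-driven crawling of job data, reliable ingestion into ClickHouse, and collection and synchronization of company information through scheduled tasks.

### Constraints

- **Compatibility**: Existing crawler data collection must not be interrupted during the refactor
- **Shared core**: External scripts and the backend must use the same signing/request/parsing code to avoid duplication
- **Database**: ClickHouse table schemas stay unchanged; MySQL model changes go through Aerich migrations
- **Anti-scraping**: The existing anti-scraping mechanisms (IP rotation, random delays, proxy management) must be kept

## Technology Stack

## Languages

- Python 3.13 - Backend API, crawlers, ECS pipeline scripts
- JavaScript (ES Modules) - Vue3 frontend (no strict TypeScript; TS installed but JS used in `.vue` and `.js` files)
- TypeScript 5.1.6 - Type definitions referenced in frontend toolchain only

## Runtime

- Python 3.13 (required: `>=3.13` per `Pipfile`; Dockerfile uses `python:3.11-slim-bullseye` for container build)
- Node.js 18 (Dockerfile `node:18-alpine` for frontend build stage)
- Python: `pipenv` (development), `pip` / `requirements.txt` (Docker production)
- Frontend: `pnpm` (lockfile: `web/pnpm-lock.yaml` present)
- Lockfiles: `Pipfile.lock` (Python), `web/pnpm-lock.yaml` (frontend), `uv.lock` (uv-compatible)

## Frameworks

- FastAPI 0.111.0 - REST API framework (`app/__init__.py` factory via `create_app()`)
- Starlette 0.37.2 - ASGI underpinning (middleware, static files)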
- Uvicorn 0.34.0 - ASGI server (20 workers default, configurable via `UVICORN_WORKERS`)
- Uvloop 0.21.0 - High-performance event loop (non-Windows only)
- Tortoise-ORM 0.23.0 - Async ORM for MySQL (`app/models/`)
- Aerich 0.8.1 - Database migrations for Tortoise-ORM (`migrations/` directory)
- aiomysql - Async MySQL driver (Tortoise backend)
- clickhouse-connect 0.7.19 - ClickHouse async client (`app/core/clickhouse.py`)
- asyncpg - Async PostgreSQL driver (listed in Pipfile but not actively used)
- Pydantic 2.10.5 - Request/response schemas (`app/schemas/`)
- pydantic-settings 2.7.1 - Settings management via env vars (`app/settings/config.py`)
- orjson 3.10.14 - Fast JSON serialization
- ujson 5.10.0 - Alternative JSON library
- PyJWT 2.10.1 - JWT token generation/verification (HS256, 7-day expiry)
- passlib 1.7.4 - Password hashing
- argon2-cffi 23.1.0 - Argon2 password hasher
- APScheduler - Async job scheduler (`app/core/scheduler.py`, 6 registered cron tasks)
- httpx 0.28.1 - Async HTTP client (backend service-to-service calls)
- requests - Sync HTTP client (spider scripts in `jobs_spider/`)
- loguru 0.7.3 - Structured logging (unified throughout backend and spiders)
- Vue 3.3.4 - SPA framework (`web/src/main.js`)
- Vue Router 4.2.4 - Client-side routing (`web/src/router/index.js`)
- Pinia 2.1.6 - State management (`web/src/store/`)
- Naive UI 2.34.4 - Component library (admin UI)
- ECharts 6.0.0 - Data visualization charts (`web/src/views/analytics/`)
- axios 1.4.0 - HTTP client (`web/src/utils/http/`)
- vue-i18n 9 - Internationalization
- Vite 4.4.6 - Frontend bundler (`web/vite.config.js`)
- UnoCSS 66.5.10 - Atomic CSS engine
- unplugin-auto-import 20.3.0 - Auto-imports for Vue APIs
- unplugin-vue-components 30.0.0 - Auto-imports for components
- @iconify/vue + @iconify/json - Icon library
- playwright 1.57.0 - Browser automation (anti-detection in crawlers)
- PyExecJS 1.5.1 - Execute JavaScript signing algorithms from Python (`jobs_spider/boss/`)
- PySocks -
  SOCKS proxy support for spider requests
- tenacity - Retry logic in crawlers
- pandas + openpyxl - Data processing and Excel export
- alibabacloud_ecs20140526 - Alibaba Cloud ECS SDK (`ecs_full_pipeline.py`)
- alibabacloud_credentials - Alibaba Cloud credential management
- ruff 0.9.1 - Python linter (`pyproject.toml` config: line-length 120, ignores F403/F405)
- black 24.10.0 - Python formatter (line-length 120, target py310/py311)
- isort 5.13.2 - Import sorter
- ESLint 8.46.0 - Frontend linter (`@zclzone` + `@unocss` rule sets)
- prettier - Frontend formatter

## Key Dependencies

- `clickhouse-connect==0.7.19` - All analytics and job data storage; without it no data can be read or written
- `tortoise-orm==0.23.0` - All business data (users, roles, keywords, tokens); paired with `aerich` for migrations
- `fastapi==0.111.0` - API layer; version-pinned for stability
- `APScheduler` - 6 scheduled tasks including ECS pipeline and IP alerting
- `alibabacloud_ecs20140526` - ECS node management; required for crawler scaling
- `uvicorn==0.34.0` + `uvloop==0.21.0` - Production ASGI server stack
- `pydantic==2.10.5` - All input validation; v2 API used throughout
- `loguru==0.7.3` - Unified logging across all modules
- `redis` - Optional distributed lock backend (`app/core/locks.py`; falls back to file locks if Redis unavailable)

## Configuration

- All settings in `app/settings/config.py` via `pydantic-settings.BaseSettings`
- Environment variables override defaults at startup
- No `.env` file detected (not committed); variables set at OS/container level
- Key variables: `APP_HOST`, `APP_PORT`, `UVICORN_WORKERS`, `CLICKHOUSE_HOST`, `CLICKHOUSE_USER`, `CLICKHOUSE_PASS`, `SECRET_KEY`, `SMTP_HOST`, `SMTP_USER`, `SMTP_PASS`, `REDIS_HOST`, `REPORT_ENDPOINT`
- **Security warning**: `config.py` contains hardcoded default values for MySQL, ClickHouse, and SMTP credentials; must be overridden in production
- `pyproject.toml` - Python project metadata, black/ruff tool config
- `Pipfile` /
  `Pipfile.lock` - Development dependency management
- `requirements.txt` - Production pip install (used in Dockerfile)
- `web/vite.config.js` - Vite build config with proxy support via `VITE_USE_PROXY` env var
- `Dockerfile` - Multi-stage build: Node 18 (frontend) + Python 3.11 (backend) + nginx

## Platform Requirements

- Python 3.13+ (Pipfile requirement)
- Node.js 18+ with pnpm
- MySQL server (Tortoise-ORM connection)
- ClickHouse server (analytics data)
- Optional: Redis (distributed locking upgrade)
- Docker-based deployment (see `Dockerfile`, `deploy/entrypoint.sh`, `deploy/web.conf`)
- Nginx serves frontend static files and proxies API requests
- Alibaba Cloud ECS (cn-shanghai-b zone) for crawler nodes
- Ports: 80 (nginx), 9999 (FastAPI backend direct)

## Conventions

## Naming Patterns

- Snake_case for all Python files: `company_storage.py`, `company_cleaner.py`, `clickhouse_repo.py`
- Private/internal modules prefixed with underscore: `_base.py`, `_boss_api.py`, `_boss_client.py`, `_boss_sign.py`, `_http_client.py`
- Platform-named service files: `boss.py`, `qcwy.py`, `zhilian.py` under `app/services/crawler/`
- Router files named after domain: `keyword.py`, `analytics.py`, `cleaning.py`
- PascalCase throughout: `CleaningService`, `KeywordController`, `ClickHouseBaseRepo`, `JobAnalyticsRepo`
- Services: `{Domain}Service`, e.g. `BossService`, `QcwyService`, `ZhilianService`, `IngestService`, `AnalyticsService`
- Controllers: `{Domain}Controller`, e.g. `KeywordController`
- Repos: `{Domain}Repo` or `{Domain}BaseRepo`, e.g. `ClickHouseBaseRepo`, `JobAnalyticsRepo`
- Models (Tortoise ORM): `{Platform}{Entity}`, e.g. `BossKeyword`, `QcwyCompany`, `ZhilianCompany`
- Schemas (Pydantic): `{Entity}Base`, `{Entity}Create`, `{Entity}Update`, `{Entity}Out` (see `app/schemas/keyword.py`)
- Snake_case for all functions and methods: `get_available`, `report_page_progress`, `store_batch`, `build_insert_row`
- Private helpers prefixed with underscore: `_apply_proxy`,
  `_ensure_boss_token_loaded`, `_pick_first`, `_nested_get`, `_clean_text`, `_model_for_source`
- Async dependency factories follow pattern `get_{service/controller}()`: `get_ingest_service`, `get_analytics_service`, `get_keyword_controller`
- Snake_case: `data_list`, `platform_type`, `check_duplicate`, `page_size`
- Module-level constants: UPPER_SNAKE_CASE, e.g. `COMPANY_SOURCES`, `QUEUE_TERMINAL_STATUSES`
- Class-level constants: UPPER_SNAKE_CASE prefixed `_`, e.g. `_TOKEN_REFRESH_INTERVAL = 3600`
- Enums use PascalCase class name, UPPER_SNAKE_CASE values: `PlatformType.BOSS`, `ChannelType.MINI`, `DataType.JOB`
- Enum values are lowercase strings matching URL slugs: `"boss"`, `"mini"`, `"job"` (see `app/schemas/ingest.py`)
- Enums inherit from `(str, Enum)` enabling direct string comparison

## Code Style

- Tool: `black` v24.10.0
- Line length: 120 characters (set in `pyproject.toml` `[tool.black]` and `[tool.ruff]`)
- Target Python versions: 3.10, 3.11 (black), 3.13 (Pipfile)
- Tool: `ruff` v0.9.1 (configured in `pyproject.toml`)
- Ignored rules: `F403` (star imports), `F405` (may be undefined from star import)
- Star imports from internal modules are allowed (used in `app/models/__init__.py`, `app/services/ingest/__init__.py`)
- Tool: `isort` v5.13.2
- No explicit isort config found; follows default ordering

## Import Organization

- None; all imports use full `app.` prefix paths
- `from app.log import logger` is the canonical loguru import path
- Used only in `__init__.py` re-export files: `from .admin import *` in `app/models/__init__.py`
- `# noqa: F401, F403` comments suppress lint warnings for intentional star imports

## Error Handling

- Services return `Dict[str, Any]` result objects with `"success"`, `"code"`, `"message"` fields instead of raising exceptions to callers
- Controllers return dict with `"code": 200/400/404` and `"message"` for all outcomes
- API route handlers do NOT use try/except; they rely on services returning structured results
- Service methods
  wrap low-level calls in `try/except Exception as e`, log the error, and then return `False` or an error dict

## Logging

- `logger.info(f"...")` for normal operation events
- `logger.warning(f"...")` for non-fatal recoverable issues (e.g., token not found, API soft failures)
- `logger.error(f"...")` for caught exceptions and operation failures
- F-string interpolation used consistently for message formatting
- No structured fields (no `logger.bind()` usage observed)

## API Response Format

## Comments

- Docstrings on public methods describing purpose, not implementation: `"""获取可用关键词,优先返回断点续爬和失败重试的关键词"""`
- Inline comments for priority logic and algorithm steps: `# 优先级 1: 断点续爬 (partial)`
- Module-level docstrings for context: `"""Boss直聘 Service — 基于新算法文件的封装"""`
- `# noqa` comments for intentional lint suppressions
- Not applicable (Python backend)
- Docstrings are brief single-line or short multi-line Chinese descriptions

## Function Design

- Keyword arguments with defaults preferred for optional params
- Pydantic schemas used for HTTP request bodies (never raw dicts from router params)
- `Optional[str]` with `= None` default for optional parameters
- Services return `Dict[str, Any]` with consistent keys (`code`, `message`, `data`)
- Private helpers return `Optional[T]` or primitive types
- Async functions return awaitable results (no mixing of sync/async)

## Module Design

- `app/models/__init__.py` uses `from .{module} import *` to flatten model imports
- Router modules export a single named router variable: `router = APIRouter(...)` or `{domain}_router = APIRouter(...)`
- Service classes are imported directly by name
- `app/models/__init__.py`: re-exports all model classes
- `app/services/ingest/__init__.py`: re-exports `IngestService` and config registrations
- `app/api/v1/__init__.py`: aggregates all routers into `v1_router`
- FastAPI `Depends()` used for service/controller instantiation in route handlers
- Dependency factory functions named `get_{service}()` and defined in
  the same file as the router
- Shared auth dependencies: `DependAuth`, `DependPermission` in `app/core/dependency.py`

## Tortoise ORM Model Conventions

## Pydantic Schema Conventions

- All schemas inherit from `pydantic.BaseModel`
- All fields use `Field(...)` with `description=` for documentation
- Enums inherit from `(str, Enum)` for JSON serialization compatibility
- Output schemas include `class Config: from_attributes = True` to support ORM mode
- Validation patterns use `Field(..., pattern="^(boss|qcwy|zhilian)$")` for enum-like string fields

## Architecture

## Pattern Overview

- FastAPI async backend with dual-database persistence (MySQL for RBAC/metadata, ClickHouse for job/company analytics data)
- External crawlers (`jobs_spider/`, `spiderJobs/`) run as independent processes and push data to the backend API via HTTP POST; there is no shared code path at runtime
- RBAC permission model enforced at router-level via FastAPI `Depends`
- APScheduler drives all recurring automation (cleaning, ECS orchestration, alerting) within the main process
- Registry pattern (`app/services/ingest/registry.py`) maps (platform, channel, data_type) tuples to ClickHouse table configs + dedup rules

## Layers

- Purpose: Parse HTTP request, validate via Pydantic schema, delegate to service or controller, return JSON envelope
- Location: `app/api/v1/`
- Contains: One sub-package per domain (e.g., `job/`, `cleaning/`, `analytics/`, `keyword/`)
- Depends on: Services, Controllers, `app/core/dependency.py` for auth
- Used by: FastAPI application via `app/api/v1/__init__.py` → `app/__init__.py`

- Purpose: Business logic, orchestration across repos/models, side-effect coordination
- Location: `app/services/`
- Contains: `IngestService`, `AnalyticsService`, `CleaningService`, `CompanyCleaner`, `CompanyJobsSyncService`, platform crawler service wrappers (`crawler/boss.py`, `crawler/qcwy.py`, `crawler/zhilian.py`)
- Depends on: Repositories, ORM models, `clickhouse_manager`, external HTTP clients
- Used by: Routers, APScheduler jobs

- Purpose: Thin CRUD orchestration for MySQL-backed entities (User, Role, API, Menu, Dept, etc.)
- Location: `app/controllers/`
- Contains: `user_controller`, `api_controller`, `role_controller`, `menu_controller`, `dept_controller`, `keyword_controller`, `proxy_controller`, `token_controller`, `cleaning_controller`, `company_controller`
- Depends on: Tortoise-ORM models
- Used by: Routers

- Purpose: Encapsulate raw ClickHouse query execution behind typed methods
- Location: `app/repositories/`
- Contains: `ClickHouseBaseRepo` (generic query/insert), `JobAnalyticsRepo` (trend, distribution, count queries against `job_analytics` view)
- Depends on: `clickhouse_connect.driver.AsyncClient`
- Used by: `AnalyticsService`

- Purpose: ORM schema definitions for MySQL, migrated via Aerich
- Location: `app/models/`
- Contains: `admin.py` (User, Role, Api, Menu, Dept, AuditLog), `token.py` (BossToken), `cleaning.py` (CleaningTask), `metrics.py` (ScheduledTaskRun, StatsTotal, IpUploadStats), `company.py` (CompanyCleaningQueue), `keyword.py` (BossKeyword, QcwyKeyword, ZhilianKeyword)
- Depends on: `tortoise-orm`
- Used by: Controllers, Services, Scheduler jobs

- Purpose: App wiring, connection management, middleware, cross-cutting utilities
- Location: `app/core/`
- Contains: `init_app.py` (lifespan bootstrap), `scheduler.py` (APScheduler jobs), `clickhouse.py` (ClickHouseManager singleton), `clickhouse_init.py` (ClickHouse DDL), `dependency.py` (AuthControl, PermissionControl), `locks.py` (DistributedLock), `middlewares.py` (BackGroundTaskMiddleware, HttpAuditLogMiddleware, IpTrackingMiddleware), `exceptions.py`, `crud.py`, `ip_tracking.py`
- Depends on: Everything below
- Used by: `app/__init__.py`, all layers above

- Purpose: Platform-agnostic data ingestion pipeline with registry-driven routing, dedup, and remote push
- Location: `app/services/ingest/`
- Contains: `service.py` (IngestService), `registry.py` (PlatformConfig registry),
  `dedup.py` (batch dedup filter), `remote_push.py` (async HTTP push to secondary target), `configs/` (per-platform PlatformConfig registrations)
- Depends on: `clickhouse_connect`, registry configs
- Used by: `app/api/v1/job/job.py` router

- Purpose: Wrap platform HTTP clients so backend scheduler can call crawlers directly (company cleaning, company search)
- Location: `app/services/crawler/`
- Contains: `_base.py` (BaseFetcher, BaseSearcher, ApiResult), `_http_client.py`, `_boss_client.py`, `_boss_api.py`, `_boss_sign.py`, `_zhilian_client.py`, `_zhilian_api.py`, `_zhilian_sign.py`, `_job51_client.py`, `_job51_api.py`, `_job51_sign.py`, `boss.py`, `qcwy.py`, `zhilian.py`
- Depends on: `requests` (sync), platform sign/auth modules
- Used by: `CompanyCleaner`, `CompanyController`

- Purpose: Standalone crawlers running on ECS nodes; push data to backend via `POST /api/v1/universal/data/batch-store-async`
- Location: `jobs_spider/boss/boos_api.py`, `jobs_spider/qcwy/qcwy.py`, `jobs_spider/zhilian/zhilian_single.py`, `spiderJobs/`
- Depends on: `requests`, `loguru`, `PyExecJS`; no shared Python imports with `app/`
- Used by: Scheduled via `ecs_full_pipeline.py` or manual launch

## Data Flow

- MySQL (via Tortoise-ORM): User sessions, RBAC entities, keyword crawl state, tokens, task run records
- ClickHouse: All raw job/company JSON data + analytics views; no ORM, only raw SQL via `clickhouse_connect`
- Context variable `CTX_USER_ID` (Python `contextvars.ContextVar`) carries authenticated user ID within request scope (`app/core/ctx.py`)
- File-based startup lock (`.startup_lock/` dir) prevents multi-worker seed data collisions

## Key Abstractions

- Purpose: Maps `(platform, channel, data_type)` to target ClickHouse table and dedup field extractors
- Examples: `app/services/ingest/registry.py`, `app/services/ingest/configs/`
- Pattern: Immutable `@dataclass(frozen=True)`, registered at module import time via `register()`

- Purpose: Prevents concurrent execution of scheduled
  jobs across multiple Uvicorn workers
- Examples: `app/core/locks.py`
- Pattern: Redis SET NX with TTL; falls back to filesystem directory creation with TTL metadata

- Purpose: Singleton lazy-init connection wrapper around `clickhouse_connect.AsyncClient`
- Examples: `app/core/clickhouse.py`
- Pattern: Global module-level instance `clickhouse_manager`, injected into services at call time via `await clickhouse_manager.get_client()`

- Purpose: Abstract base for platform-specific API clients within the in-process crawler wrappers
- Examples: `app/services/crawler/_base.py`
- Pattern: Template method: subclasses implement `_build_params()`, base handles HTTP invocation and `ApiResult` parsing

- Purpose: Common ORM base with `BigIntField pk`, `to_dict()` async serializer with M2M support
- Examples: `app/models/base.py`
- Pattern: Mixed-in with `TimestampMixin` (created_at, updated_at auto fields)

## Entry Points

- Location: `run.py`
- Triggers: `python run.py`, which reads the `APP_HOST`, `APP_PORT`, `UVICORN_WORKERS` env vars and calls `uvicorn.run("app:app")`
- Responsibilities: Configure uvicorn logging format, start ASGI server

- Location: `app/__init__.py` → `create_app()`
- Triggers: Uvicorn import of `app:app`
- Responsibilities: Register middlewares, exception handlers, routers, static files; define lifespan

- Location: `app/__init__.py` (lifespan context manager) + `app/core/init_app.py` (`init_data()`)
- Triggers: FastAPI startup event
- Responsibilities: Tortoise-ORM init + schema generation → Aerich migrations → seed data (superuser, menus, APIs, roles) → ClickHouse table init → APScheduler start

- Location: `app/api/v1/__init__.py`
- Triggers: Imported by `app/api/__init__.py` → `app/__init__.py`
- Responsibilities: Mount all domain routers under `/api/v1/` with appropriate `DependPermission` guards

- Location: `ecs_full_pipeline.py`
- Triggers: Manual execution or scheduled via `scheduler.py` (`ecs_full_pipeline_job`) every 6 hours
- Responsibilities: Create Alibaba
  Cloud ECS spot instances, push spider scripts, execute crawlers, terminate instances

## Error Handling

- Tortoise `DoesNotExist` → 404 via `DoesNotExistHandle` (registered in `app/core/init_app.py`)
- Pydantic `RequestValidationError` → 422 with field-level detail via `RequestValidationHandle`
- `IntegrityError` (DB constraint) → 400 via `IntegrityHandle`
- Custom `HTTPException` (from `app/core/exceptions.py`) → structured JSON error via `HttpExcHandle`
- Service layer exceptions logged with `loguru` then re-raised or returned as `{"success": False, "error": "..."}`
- Scheduled jobs catch all exceptions, record failure via `_record_task_run()` to `ScheduledTaskRun` table, then continue

## Cross-Cutting Concerns

## GSD Workflow Enforcement

Before using Edit, Write, or other file-changing tools, start work through a GSD command so planning artifacts and execution context stay in sync.

Use these entry points:

- `/gsd:quick` for small fixes, doc updates, and ad-hoc tasks
- `/gsd:debug` for investigation and bug fixing
- `/gsd:execute-phase` for planned phase work

Do not make direct repo edits outside a GSD workflow unless the user explicitly asks to bypass it.

## Developer Profile

> Profile not yet configured. Run `/gsd:profile-user` to generate your developer profile.
> This section is managed by `generate-claude-profile` -- do not edit manually.
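
## Appendix: Registry Pattern Sketch

The registry pattern described under Pattern Overview and Key Abstractions (an immutable `@dataclass(frozen=True)` config, registered at import time and keyed by `(platform, channel, data_type)`) might look roughly like the sketch below. It is an illustration only, not the code in `app/services/ingest/registry.py`: the field names, the `lookup` and `dedup_batch` helpers, the table name `boss_mini_job`, and the `job_id` dedup field are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

RegistryKey = Tuple[str, str, str]  # (platform, channel, data_type)


@dataclass(frozen=True)
class PlatformConfig:
    """Immutable routing config for one (platform, channel, data_type) tuple."""

    platform: str
    channel: str
    data_type: str
    table: str  # target ClickHouse table (hypothetical name in this sketch)
    dedup_key: Callable[[Dict[str, Any]], str]  # extracts the dedup field from a row


_REGISTRY: Dict[RegistryKey, PlatformConfig] = {}


def register(config: PlatformConfig) -> PlatformConfig:
    """Add a config to the registry; duplicate keys are rejected."""
    key = (config.platform, config.channel, config.data_type)
    if key in _REGISTRY:
        raise ValueError(f"duplicate registration for {key}")
    _REGISTRY[key] = config
    return config


def lookup(platform: str, channel: str, data_type: str) -> PlatformConfig:
    """Return the config for a tuple; raises KeyError if none is registered."""
    return _REGISTRY[(platform, channel, data_type)]


def dedup_batch(config: PlatformConfig, rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep the first row for each dedup key within one batch, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        key = config.dedup_key(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Registered at module import time, as the document describes.
# The table name and the "job_id" field are hypothetical.
register(
    PlatformConfig(
        platform="boss",
        channel="mini",
        data_type="job",
        table="boss_mini_job",
        dedup_key=lambda row: str(row["job_id"]),
    )
)
```

A caller would resolve the config with `lookup("boss", "mini", "job")` and pass a batch through `dedup_batch` before inserting into `config.table`. Freezing the dataclass means a registered config cannot be mutated after import, which is one plausible reason for the immutability noted above.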