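JobData - Recruitment Data Collection and Statistical Analysis Platform
Changelog
| Version | Date | Notes |
|---|---|---|
| Initial | 2026-03-20 | First generation of the architecture document, covering all four core modules |
Project Vision
JobData is a full-stack data collection and analysis platform for the recruitment market. The system automatically crawls job and company data from three major recruitment platforms (Boss直聘, 前程无忧 / 51job, and 智联招聘 / Zhaopin), stores it in the ClickHouse columnar database, and exposes data browsing, targeted cleaning, and statistical analysis through a FastAPI backend and a Vue3 frontend. The ECS elastic instance management module starts and stops crawler nodes on Alibaba Cloud in batches, on demand.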
Architecture Overview
[Recruitment platforms]   [Crawler layer]            [Backend API]        [Databases]
Boss直聘  ──►  jobs_spider/boss/     ──►                          ┌── MySQL
前程无忧  ──►  jobs_spider/qcwy/     ──►  app (FastAPI)  ──►  └── ClickHouse
智联招聘  ──►  jobs_spider/zhilian/  ──►
                                            ▲
                                            │
                      web (Vue3) ───────────┘
                      (frontend pages)
ecs_full_pipeline.py ──► Alibaba Cloud ECS ──► launches crawler nodes in batches
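Core Technology Stack
| Layer | Technology |
|---|---|
| Backend | Python 3.13, FastAPI 0.111, Tortoise-ORM 0.23, APScheduler |
| Database (business) | MySQL (users / permissions / audit / keywords / tokens) |
| Database (collection) | ClickHouse (raw job and company JSON data + analytics views) |
| Crawlers | requests / httpx / Playwright (Python scripts) |
| Frontend | Vue 3.3, Vite 4, Naive UI, Pinia, ECharts |
| Infrastructure | Alibaba Cloud ECS (pay-as-you-go spot instances), APScheduler scheduled jobs |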
Module Structure Diagram
graph TD
ROOT["(root) JobData"] --> APP["app - FastAPI backend"]
ROOT --> WEB["web - Vue3 frontend"]
ROOT --> SPIDER["jobs_spider - platform crawlers"]
ROOT --> ECS["ecs_full_pipeline.py - ECS batch deployment"]
APP --> APP_API["app/api - routing layer"]
APP --> APP_SVC["app/services - business logic"]
APP --> APP_CORE["app/core - framework core"]
APP --> APP_MODELS["app/models - ORM models"]
APP --> APP_REPO["app/repositories - data repositories"]
SPIDER --> SP_BOSS["jobs_spider/boss"]
SPIDER --> SP_QCWY["jobs_spider/qcwy"]
SPIDER --> SP_ZL["jobs_spider/zhilian"]
click APP "./app/CLAUDE.md" "View the app module docs"
click WEB "./web/CLAUDE.md" "View the web module docs"
click SPIDER "./jobs_spider/CLAUDE.md" "View the jobs_spider module docs"
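Module Index
| Module path | Language | Responsibilities |
|---|---|---|
| `app/` | Python | FastAPI backend: REST API, permission management, scheduled jobs, data ingestion and analytics |
| `web/` | Vue3/JS | Frontend admin UI: data display, keyword management, proxy management, data cleaning operations |
| `jobs_spider/` | Python | Crawler scripts for the three platforms; they run independently and push results to the backend over HTTP |
| `ecs_full_pipeline.py` | Python | End-to-end script for batch creation, teardown, and command dispatch on Alibaba Cloud ECS instances |
| `reclean_qcwy_jobs.py` | Python | Standalone script that re-cleans 前程无忧 data |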
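Running and Development
Starting the Backend
# Install dependencies (pipenv)
pipenv install
# Development mode (default port 9999, 20 workers)
python run.py
# Override settings through environment variables
APP_HOST=0.0.0.0 APP_PORT=9999 UVICORN_WORKERS=4 python run.py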
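Key Environment Variables
| Variable | Default | Description |
|---|---|---|
| `APP_HOST` | `0.0.0.0` | Listen address |
| `APP_PORT` | `9999` | Listen port |
| `UVICORN_WORKERS` | `20` | Number of workers |
| `CLICKHOUSE_HOST` | `121.4.126.241` | ClickHouse address (change to your actual address) |
| `CLICKHOUSE_USER` / `CLICKHOUSE_PASS` | see config.py | ClickHouse authentication |
| `SMTP_HOST` / `SMTP_USER` / `SMTP_PASS` | see config.py | Email alert configuration |
| `REPORT_ENDPOINT` | empty | Webhook endpoint for reporting statistics |
| `RUN_MIGRATIONS_ON_STARTUP` | `True` | Whether to run migrations automatically at startup |
| `INITIALIZE_SEED_DATA_ON_STARTUP` | `True` | Whether to initialize seed data at startup |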
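Security warning: `SECRET_KEY`, the database connection strings, and the SMTP password in `config.py` are all hard-coded defaults; in production they must be overridden through environment variables.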
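As a rough illustration of the override rule, the sketch below reads the server settings from the environment, falls back to the documented defaults for non-secret values, and refuses to start without a `SECRET_KEY`. It uses only the standard library as a stand-in for the project's `pydantic-settings` configuration; `ServerSettings` and `load_server_settings` are hypothetical names, not taken from `config.py`.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ServerSettings:
    """Hypothetical container for the settings this sketch cares about."""
    host: str
    port: int
    workers: int
    secret_key: str


def load_server_settings(env: Optional[Mapping[str, str]] = None) -> ServerSettings:
    """Build settings from environment variables.

    Non-secret values fall back to the documented defaults (0.0.0.0, 9999, 20).
    SECRET_KEY deliberately has no fallback, so a missing value fails fast.
    """
    env = os.environ if env is None else env
    secret_key = env.get("SECRET_KEY")
    if not secret_key:
        raise RuntimeError("SECRET_KEY must be provided via the environment")
    return ServerSettings(
        host=env.get("APP_HOST", "0.0.0.0"),
        port=int(env.get("APP_PORT", "9999")),
        workers=int(env.get("UVICORN_WORKERS", "20")),
        secret_key=secret_key,
    )


# Demo with an explicit mapping instead of the real process environment.
demo = load_server_settings({"SECRET_KEY": "example-only", "UVICORN_WORKERS": "4"})
print(demo.host, demo.port, demo.workers)
```

Passing the mapping explicitly keeps the function easy to test; in a real process the default `os.environ` would be used.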
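Starting the Frontend
cd web
pnpm install
pnpm dev # development mode, defaults to http://localhost:5173
pnpm build # writes the build output to web/dist
ECS Batch Crawler Deployment
# Requires Alibaba Cloud credentials (environment variables or ~/.alibabacloud/credentials)
python ecs_full_pipeline.py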
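Scheduled Jobs
APScheduler registers the following jobs at application startup (`app/core/scheduler.py`):
| Job ID | Frequency | Responsibility |
|---|---|---|
| `stats_job` | Every 6 hours | Tallies the total row count of each ClickHouse table and reports it by email/Webhook |
| `ecs_full_pipeline` | Every 6 hours | Calls `ecs_full_pipeline.py` to refresh the crawler nodes in batches |
| `ip_alert_job` | Every 10 minutes | Checks for anomalies in IP reporting and raises alerts |
| `company_cleaning_job` | Every 5 minutes | Automatically cleans pending company data (collect 50 + process 30) |
| `daily_cleanup_job` | Daily at 00:05 | Cleans up historical job run records |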
All jobs use a distributed file lock (or, optionally, a Redis lock) so that each one runs only once across multiple workers.
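The file-lock idea can be sketched with atomic directory creation: `os.mkdir` either creates the directory or fails if it already exists, so only one worker can win. This is a minimal sketch, not the code in `app/core/locks.py`; the helper names are hypothetical, and handling of stale locks through TTL metadata is left out for brevity.

```python
import os
import tempfile
from typing import Callable


def try_acquire_dir_lock(lock_dir: str) -> bool:
    """Return True if this caller created the lock directory, False if it exists."""
    try:
        os.mkdir(lock_dir)  # atomic: exactly one concurrent caller succeeds
        return True
    except FileExistsError:
        return False


def release_dir_lock(lock_dir: str) -> None:
    """Remove the lock directory; a missing directory is not an error."""
    try:
        os.rmdir(lock_dir)
    except FileNotFoundError:
        pass


def run_once(lock_dir: str, job: Callable[[], None]) -> bool:
    """Run job only if the lock is free; report whether it ran."""
    if not try_acquire_dir_lock(lock_dir):
        return False
    try:
        job()
        return True
    finally:
        release_dir_lock(lock_dir)


# Demo: "worker A" holds the lock, so "worker B" skips the job.
lock_path = os.path.join(tempfile.mkdtemp(), "stats_job.lock")
calls = []
first_holder = try_acquire_dir_lock(lock_path)
skipped = run_once(lock_path, lambda: calls.append("B"))
release_dir_lock(lock_path)
ran = run_once(lock_path, lambda: calls.append("A"))
print(first_holder, skipped, ran, calls)
```

Without a TTL, a worker that crashes while holding the lock blocks the job until the directory is removed by hand, which is why a production lock needs expiry metadata.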
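Testing Strategy
- The codebase currently has no automated test files (gap: both unit tests and integration tests are missing).
- Recommended additions:
  - Unit tests for the service layer in `app/services/` (using `pytest` + `anyio`)
  - API integration tests for `app/api/v1/` (using `httpx.AsyncClient`)
  - Unit tests for the data-parsing functions in `jobs_spider/`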
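Coding Standards
- Python: lint with `ruff` (already in the Pipfile), format with `black`, sort imports with `isort`.
- Frontend: ESLint (`@zclzone` + `@unocss` rule sets), formatted with `prettier`.
- Types: the backend enforces `pydantic` schemas for input validation; the frontend is mostly JS (strict TS is not enabled).
- Logging: the backend uses `loguru` throughout, writing structured fields through `logger.info(...)` calls.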
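AI Usage Guidelines
- When modifying crawler logic, pay close attention to the anti-crawling countermeasures: `SmartIPManager` and `IPAnomalyDetector` are implemented in `jobs_spider/boss/boos_api.py`, and the random delay is at least 10 seconds.
- After adding an API route, register it in `app/api/v1/__init__.py` and run `api_controller.refresh_api()` to update the permission table.
- ClickHouse schema changes are maintained in `app/core/clickhouse_init.py` and do not go through Aerich migrations.
- MySQL model changes go through Aerich (`aerich migrate && aerich upgrade`).
- New frontend pages must be registered as menus both in `web/src/views/{module}/route.js` and in the backend `init_menus()`.
- `config.py` hard-codes real MySQL/ClickHouse connection strings and SMTP credentials; before committing, make sure no sensitive information is leaked.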
Project
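JobData Crawler Interaction Refactor
A project to refactor the crawler interaction layer of the recruitment data collection platform. It rewrites the crawler code for the three platforms (Boss直聘, 前程无忧, 智联招聘), unifies the base classes and shared core logic, and improves ingestion deduplication, the company information cleaning flow, and the frontend monitoring pages.
Core Value: Keyword-driven crawlers fetch job data and ingest it reliably into ClickHouse, while scheduled jobs collect and synchronize company information.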
Constraints
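- Compatibility: existing crawler data collection must not be interrupted during the refactor
- Shared core: external scripts and the backend must use the same signing/request/parsing code to avoid duplication
- Database: the ClickHouse table schema stays unchanged; MySQL model changes go through Aerich migrations
- Anti-crawling: the existing countermeasures (IP rotation, random delays, proxy management) must be preserved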
Technology Stack
Languages
- Python 3.13 - Backend API, crawlers, ECS pipeline scripts
- JavaScript (ES Modules) - Vue3 frontend (no strict TypeScript; TS installed but JS used in `.vue` and `.js` files)
- TypeScript 5.1.6 - Type definitions referenced in frontend toolchain only
Runtime
- Python 3.13 (required: `>=3.13` per `Pipfile`; Dockerfile uses `python:3.11-slim-bullseye` for container build)
- Node.js 18 (Dockerfile `node:18-alpine` for frontend build stage)
- Python: `pipenv` (development), `pip` / `requirements.txt` (Docker production)
- Frontend: `pnpm` (lockfile: `web/pnpm-lock.yaml` present)
- Lockfiles: `Pipfile.lock` (Python), `web/pnpm-lock.yaml` (frontend), `uv.lock` (uv-compatible)
Frameworks
- FastAPI 0.111.0 - REST API framework (`app/__init__.py` factory via `create_app()`)
- Starlette 0.37.2 - ASGI underpinning (middleware, static files)
- Uvicorn 0.34.0 - ASGI server (20 workers default, configurable via `UVICORN_WORKERS`)
- Uvloop 0.21.0 - High-performance event loop (non-Windows only)
- Tortoise-ORM 0.23.0 - Async ORM for MySQL (`app/models/`)
- Aerich 0.8.1 - Database migrations for Tortoise-ORM (`migrations/` directory)
- aiomysql - Async MySQL driver (Tortoise backend)
- clickhouse-connect 0.7.19 - ClickHouse async client (`app/core/clickhouse.py`)
- asyncpg - Async PostgreSQL driver (listed in Pipfile but not actively used)
- Pydantic 2.10.5 - Request/response schemas (`app/schemas/`)
- pydantic-settings 2.7.1 - Settings management via env vars (`app/settings/config.py`)
- orjson 3.10.14 - Fast JSON serialization
- ujson 5.10.0 - Alternative JSON library
- PyJWT 2.10.1 - JWT token generation/verification (HS256, 7-day expiry)
- passlib 1.7.4 - Password hashing
- argon2-cffi 23.1.0 - Argon2 password hasher
- APScheduler - Async job scheduler (`app/core/scheduler.py`, 6 registered cron tasks)
- httpx 0.28.1 - Async HTTP client (backend service-to-service calls)
- requests - Sync HTTP client (spider scripts in `jobs_spider/`)
- loguru 0.7.3 - Structured logging (unified throughout backend and spiders)
- Vue 3.3.4 - SPA framework (`web/src/main.js`)
- Vue Router 4.2.4 - Client-side routing (`web/src/router/index.js`)
- Pinia 2.1.6 - State management (`web/src/store/`)
- Naive UI 2.34.4 - Component library (admin UI)
- ECharts 6.0.0 - Data visualization charts (`web/src/views/analytics/`)
- axios 1.4.0 - HTTP client (`web/src/utils/http/`)
- vue-i18n 9 - Internationalization
- Vite 4.4.6 - Frontend bundler (`web/vite.config.js`)
- UnoCSS 66.5.10 - Atomic CSS engine
- unplugin-auto-import 20.3.0 - Auto-imports for Vue APIs
- unplugin-vue-components 30.0.0 - Auto-imports for components
- @iconify/vue + @iconify/json - Icon library
- playwright 1.57.0 - Browser automation (anti-detection in crawlers)
- PyExecJS 1.5.1 - Execute JavaScript signing algorithms from Python (`jobs_spider/boss/`)
- PySocks - SOCKS proxy support for spider requests
- tenacity - Retry logic in crawlers
- pandas + openpyxl - Data processing and Excel export
- alibabacloud_ecs20140526 - Alibaba Cloud ECS SDK (`ecs_full_pipeline.py`)
- alibabacloud_credentials - AliCloud credential management
- ruff 0.9.1 - Python linter (`pyproject.toml` config: line-length 120, ignores F403/F405)
- black 24.10.0 - Python formatter (line-length 120, target py310/py311)
- isort 5.13.2 - Import sorter
- ESLint 8.46.0 - Frontend linter (`@zclzone` + `@unocss` rule sets)
- prettier - Frontend formatter
Key Dependencies
- `clickhouse-connect==0.7.19` - All analytics and job data storage; loss means no data read/write
- `tortoise-orm==0.23.0` - All business data (users, roles, keywords, tokens); paired with `aerich` for migrations
- `fastapi==0.111.0` - API layer; version-pinned for stability
- `APScheduler` - 6 scheduled tasks including ECS pipeline and IP alerting
- `alibabacloud_ecs20140526` - ECS node management; required for crawler scaling
- `uvicorn==0.34.0` + `uvloop==0.21.0` - Production ASGI server stack
- `pydantic==2.10.5` - All input validation; v2 API used throughout
- `loguru==0.7.3` - Unified logging across all modules
- `redis` - Optional distributed lock backend (`app/core/locks.py`; falls back to file locks if Redis unavailable)
Configuration
- All settings in `app/settings/config.py` via `pydantic-settings.BaseSettings`
- Environment variables override defaults at startup
- No `.env` file detected (not committed); variables set at OS/container level
- Key variables: `APP_HOST`, `APP_PORT`, `UVICORN_WORKERS`, `CLICKHOUSE_HOST`, `CLICKHOUSE_USER`, `CLICKHOUSE_PASS`, `SECRET_KEY`, `SMTP_HOST`, `SMTP_USER`, `SMTP_PASS`, `REDIS_HOST`, `REPORT_ENDPOINT`
- Security warning: `config.py` contains hardcoded default values for MySQL, ClickHouse, and SMTP credentials; must be overridden in production
- `pyproject.toml` - Python project metadata, black/ruff tool config
- `Pipfile` / `Pipfile.lock` - Development dependency management
- `requirements.txt` - Production pip install (used in Dockerfile)
- `web/vite.config.js` - Vite build config with proxy support via `VITE_USE_PROXY` env var
- `Dockerfile` - Multi-stage build: Node 18 (frontend) + Python 3.11 (backend) + nginx
Platform Requirements
- Python 3.13+ (Pipfile requirement)
- Node.js 18+ with pnpm
- MySQL server (Tortoise-ORM connection)
- ClickHouse server (analytics data)
- Optional: Redis (distributed locking upgrade)
- Docker-based deployment (see `Dockerfile`, `deploy/entrypoint.sh`, `deploy/web.conf`)
- Nginx serves frontend static files and proxies API requests
- Alibaba Cloud ECS (cn-shanghai-b zone) for crawler nodes
- Ports: 80 (nginx), 9999 (FastAPI backend direct)
Conventions
Naming Patterns
- Snake_case for all Python files: `company_storage.py`, `company_cleaner.py`, `clickhouse_repo.py`
- Private/internal modules prefixed with underscore: `_base.py`, `_boss_api.py`, `_boss_client.py`, `_boss_sign.py`, `_http_client.py`
- Platform-named service files: `boss.py`, `qcwy.py`, `zhilian.py` under `app/services/crawler/`
- Router files named after domain: `keyword.py`, `analytics.py`, `cleaning.py`
- PascalCase throughout: `CleaningService`, `KeywordController`, `ClickHouseBaseRepo`, `JobAnalyticsRepo`
- Services: `{Domain}Service`, e.g. `BossService`, `QcwyService`, `ZhilianService`, `IngestService`, `AnalyticsService`
- Controllers: `{Domain}Controller`, e.g. `KeywordController`
- Repos: `{Domain}Repo` or `{Domain}BaseRepo`, e.g. `ClickHouseBaseRepo`, `JobAnalyticsRepo`
- Models (Tortoise ORM): `{Platform}{Entity}`, e.g. `BossKeyword`, `QcwyCompany`, `ZhilianCompany`
- Schemas (Pydantic): `{Entity}Base`, `{Entity}Create`, `{Entity}Update`, `{Entity}Out` (see `app/schemas/keyword.py`)
- Snake_case for all functions and methods: `get_available`, `report_page_progress`, `store_batch`, `build_insert_row`
- Private helpers prefixed with underscore: `_apply_proxy`, `_ensure_boss_token_loaded`, `_pick_first`, `_nested_get`, `_clean_text`, `_model_for_source`
- Async dependency factories follow pattern `get_{service/controller}()`: `get_ingest_service`, `get_analytics_service`, `get_keyword_controller`
- Snake_case: `data_list`, `platform_type`, `check_duplicate`, `page_size`
- Module-level constants: UPPER_SNAKE_CASE, e.g. `COMPANY_SOURCES`, `QUEUE_TERMINAL_STATUSES`
- Class-level constants: UPPER_SNAKE_CASE prefixed with `_`, e.g. `_TOKEN_REFRESH_INTERVAL = 3600`
- Enums use a PascalCase class name and UPPER_SNAKE_CASE member names: `PlatformType.BOSS`, `ChannelType.MINI`, `DataType.JOB`
- Enum values are lowercase strings matching URL slugs: `"boss"`, `"mini"`, `"job"` (see `app/schemas/ingest.py`)
- Enums inherit from `(str, Enum)`, enabling direct string comparison
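A short sketch of the enum convention: UPPER_SNAKE_CASE member names, lowercase slug values, and a `(str, Enum)` base so members compare equal to plain strings. `PlatformType.BOSS` and the value `"boss"` come from the conventions above; the other two members and the `parse_platform` helper are assumptions made for illustration.

```python
import json
from enum import Enum


class PlatformType(str, Enum):
    BOSS = "boss"
    QCWY = "qcwy"        # assumed member, mirroring the platform slugs
    ZHILIAN = "zhilian"  # assumed member, mirroring the platform slugs


def parse_platform(slug: str) -> PlatformType:
    """Convert a URL slug to its enum member; unknown slugs raise ValueError."""
    return PlatformType(slug)


# Direct string comparison works because members are also str instances.
is_boss = PlatformType.BOSS == "boss"

# Lookup by value returns the singleton member.
same_member = parse_platform("qcwy") is PlatformType.QCWY

# The standard json module serializes the member as its string value.
encoded = json.dumps({"platform": PlatformType.BOSS})
print(is_boss, same_member, encoded)
```

When building log messages or SQL fragments, reading `.value` explicitly avoids depending on how a given Python version formats mixed-in enums.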
Code Style
- Tool: `black` v24.10.0
- Line length: 120 characters (set in `pyproject.toml` under `[tool.black]` and `[tool.ruff]`)
- Target Python versions: 3.10, 3.11 (black), 3.13 (Pipfile)
- Tool: `ruff` v0.9.1 (configured in `pyproject.toml`)
- Ignored rules: `F403` (star imports), `F405` (may be undefined from star import)
- Star imports from internal modules are allowed (used in `app/models/__init__.py`, `app/services/ingest/__init__.py`)
- Tool: `isort` v5.13.2
- No explicit isort config found; follows default ordering
Import Organization
- None; all imports use full `app.` prefix paths
- `from app.log import logger` is the canonical loguru import path
- Used only in `__init__.py` re-export files: `from .admin import *` in `app/models/__init__.py`
- `# noqa: F401, F403` comments suppress lint warnings for intentional star imports
Error Handling
- Services return `Dict[str, Any]` result objects with `"success"`, `"code"`, `"message"` fields instead of raising exceptions to callers
- Controllers return dict with `"code": 200/400/404` and `"message"` for all outcomes
- API route handlers do NOT use try/except; they rely on services returning structured results
- Service methods wrap low-level calls in `try/except Exception as e`, log the error, then return `False` or an error dict
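The result-object convention might look roughly like the sketch below: the service catches low-level failures, logs them, and hands the caller a dict with `success`, `code`, `message`, and `data` keys instead of raising. The helpers `ok`, `fail`, and `get_page` are hypothetical, and the standard `logging` module stands in for `loguru` to keep the example dependency-free.

```python
import logging
from typing import Any, Dict, List

logger = logging.getLogger("service_sketch")  # stand-in for the loguru logger


def ok(data: Any = None, message: str = "ok") -> Dict[str, Any]:
    """Build a success result with the shared keys."""
    return {"success": True, "code": 200, "message": message, "data": data}


def fail(code: int, message: str) -> Dict[str, Any]:
    """Build a failure result with the shared keys."""
    return {"success": False, "code": code, "message": message, "data": None}


def get_page(items: List[Any], page: Any, page_size: int = 2) -> Dict[str, Any]:
    """Return one page of items; bad input yields an error dict, not an exception."""
    try:
        page_number = int(page)
    except Exception as e:  # mirrors the broad catch described above
        logger.error(f"invalid page value: {e}")
        return fail(400, "page must be an integer")
    if page_number < 1:
        return fail(400, "page must be >= 1")
    start = (page_number - 1) * page_size
    return ok(items[start:start + page_size])


good = get_page(["a", "b", "c"], "2")
bad = get_page(["a", "b", "c"], "two")
print(good)
print(bad)
```

A route handler can then return the dict as-is, which is what lets handlers stay free of try/except blocks.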
Logging
- `logger.info(f"...")` for normal operation events
- `logger.warning(f"...")` for non-fatal recoverable issues (e.g., token not found, API soft failures)
- `logger.error(f"...")` for caught exceptions and operation failures
- F-string interpolation used consistently for message formatting
- No structured fields (no `logger.bind()` usage observed)
API Response Format
Comments
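- Docstrings on public methods describing purpose, not implementation: `"""获取可用关键词,优先返回断点续爬和失败重试的关键词"""` ("Get available keywords, preferring keywords for resumed crawls and failed retries")
- Inline comments for priority logic and algorithm steps: `# 优先级 1: 断点续爬 (partial)` ("Priority 1: resume an interrupted crawl (partial)")
- Module-level docstrings for context: `"""Boss直聘 Service — 基于新算法文件的封装"""` ("Boss直聘 Service: a wrapper built on the new algorithm files")
- `# noqa` comments for intentional lint suppressions
- Not applicable (Python backend)
- Docstrings are brief single-line or short multi-line Chinese descriptions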
Function Design
- Keyword arguments with defaults preferred for optional params
- Pydantic schemas used for HTTP request bodies (never raw dicts from router params)
- `Optional[str]` with `= None` default for optional parameters
- Services return `Dict[str, Any]` with consistent keys (`code`, `message`, `data`)
- Private helpers return `Optional[T]` or primitive types
- Async functions return awaitable results (no mixing of sync/async)
Module Design
- `app/models/__init__.py` uses `from .{module} import *` to flatten model imports
- Router modules export a single named router variable: `router = APIRouter(...)` or `{domain}_router = APIRouter(...)`
- Service classes are imported directly by name
- `app/models/__init__.py`: re-exports all model classes
- `app/services/ingest/__init__.py`: re-exports `IngestService` and config registrations
- `app/api/v1/__init__.py`: aggregates all routers into `v1_router`
- FastAPI `Depends()` used for service/controller instantiation in route handlers
- Dependency factory functions named `get_{service}()` and defined in the same file as the router
- Shared auth dependencies: `DependAuth`, `DependPermission` in `app/core/dependency.py`
Tortoise ORM Model Conventions
Pydantic Schema Conventions
- All schemas inherit from `pydantic.BaseModel`
- All fields use `Field(...)` with `description=` for documentation
- Enums inherit from `(str, Enum)` for JSON serialization compatibility
- Output schemas include `class Config: from_attributes = True` to support ORM mode
- Validation patterns use `Field(..., pattern="^(boss|qcwy|zhilian)$")` for enum-like string fields
Architecture
Pattern Overview
- FastAPI async backend with dual-database persistence (MySQL for RBAC/metadata, ClickHouse for job/company analytics data)
- External crawlers (`jobs_spider/`, `spiderJobs/`) run as independent processes and push data to the backend API via HTTP POST, with no shared code path at runtime
- RBAC permission model enforced at router-level via FastAPI `Depends`
- APScheduler drives all recurring automation (cleaning, ECS orchestration, alerting) within the main process
- Registry pattern (`app/services/ingest/registry.py`) maps (platform, channel, data_type) tuples to ClickHouse table configs + dedup rules
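To make the registry pattern concrete, here is a minimal sketch of a lookup table keyed by `(platform, channel, data_type)`. The `PlatformConfig` and `register()` names appear in this document, but the fields, the `resolve` and `dedup_batch` helpers, the table name, and the `job_id` dedup field are all assumptions for illustration rather than the contents of `registry.py`.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

RegistryKey = Tuple[str, str, str]  # (platform, channel, data_type)


@dataclass(frozen=True)
class PlatformConfig:
    platform: str
    channel: str
    data_type: str
    table: str                        # target ClickHouse table (hypothetical name)
    dedup_key: Callable[[dict], str]  # extracts the value used for deduplication


_REGISTRY: Dict[RegistryKey, PlatformConfig] = {}


def register(config: PlatformConfig) -> None:
    """Add a config; registering the same key twice is treated as a bug."""
    key = (config.platform, config.channel, config.data_type)
    if key in _REGISTRY:
        raise ValueError(f"duplicate registration for {key}")
    _REGISTRY[key] = config


def resolve(platform: str, channel: str, data_type: str) -> PlatformConfig:
    """Look up the config for an incoming batch."""
    key = (platform, channel, data_type)
    if key not in _REGISTRY:
        raise LookupError(f"no config registered for {key}")
    return _REGISTRY[key]


def dedup_batch(config: PlatformConfig, rows: List[dict]) -> List[dict]:
    """Keep the first row for each dedup key, preserving input order."""
    seen = set()
    unique = []
    for row in rows:
        key = config.dedup_key(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Registration happens once, at import time in the real project.
register(PlatformConfig("boss", "mini", "job",
                        table="example_boss_mini_job",
                        dedup_key=lambda row: str(row["job_id"])))

config = resolve("boss", "mini", "job")
rows = [{"job_id": 1}, {"job_id": 1}, {"job_id": 2}]
unique_rows = dedup_batch(config, rows)
print(config.table, unique_rows)
```

Because routing lives in data rather than in `if/elif` chains, supporting a new platform or channel means adding one registration instead of editing the ingestion service.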
Layers
- Purpose: Parse HTTP request, validate via Pydantic schema, delegate to service or controller, return JSON envelope
- Location: `app/api/v1/`
- Contains: One sub-package per domain (e.g., `job/`, `cleaning/`, `analytics/`, `keyword/`)
- Depends on: Services, Controllers, `app/core/dependency.py` for auth
- Used by: FastAPI application via `app/api/v1/__init__.py` → `app/__init__.py`
- Purpose: Business logic, orchestration across repos/models, side-effect coordination
- Location: `app/services/`
- Contains: `IngestService`, `AnalyticsService`, `CleaningService`, `CompanyCleaner`, `CompanyJobsSyncService`, platform crawler service wrappers (`crawler/boss.py`, `crawler/qcwy.py`, `crawler/zhilian.py`)
- Depends on: Repositories, ORM models, `clickhouse_manager`, external HTTP clients
- Used by: Routers, APScheduler jobs
- Purpose: Thin CRUD orchestration for MySQL-backed entities (User, Role, API, Menu, Dept, etc.)
- Location: `app/controllers/`
- Contains: `user_controller`, `api_controller`, `role_controller`, `menu_controller`, `dept_controller`, `keyword_controller`, `proxy_controller`, `token_controller`, `cleaning_controller`, `company_controller`
- Depends on: Tortoise-ORM models
- Used by: Routers
- Purpose: Encapsulate raw ClickHouse query execution behind typed methods
- Location: `app/repositories/`
- Contains: `ClickHouseBaseRepo` (generic query/insert), `JobAnalyticsRepo` (trend, distribution, count queries against `job_analytics` view)
- Depends on: `clickhouse_connect.driver.AsyncClient`
- Used by: `AnalyticsService`
- Purpose: ORM schema definitions for MySQL, migrated via Aerich
- Location: `app/models/`
- Contains: `admin.py` (User, Role, Api, Menu, Dept, AuditLog), `token.py` (BossToken), `cleaning.py` (CleaningTask), `metrics.py` (ScheduledTaskRun, StatsTotal, IpUploadStats), `company.py` (CompanyCleaningQueue), `keyword.py` (BossKeyword, QcwyKeyword, ZhilianKeyword)
- Depends on: `tortoise-orm`
- Used by: Controllers, Services, Scheduler jobs
- Purpose: App wiring, connection management, middleware, cross-cutting utilities
- Location: `app/core/`
- Contains: `init_app.py` (lifespan bootstrap), `scheduler.py` (APScheduler jobs), `clickhouse.py` (ClickHouseManager singleton), `clickhouse_init.py` (ClickHouse DDL), `dependency.py` (AuthControl, PermissionControl), `locks.py` (DistributedLock), `middlewares.py` (BackGroundTaskMiddleware, HttpAuditLogMiddleware, IpTrackingMiddleware), `exceptions.py`, `crud.py`, `ip_tracking.py`
- Depends on: Everything below
- Used by: `app/__init__.py`, all layers above
- Purpose: Platform-agnostic data ingestion pipeline with registry-driven routing, dedup, and remote push
- Location: `app/services/ingest/`
- Contains: `service.py` (IngestService), `registry.py` (PlatformConfig registry), `dedup.py` (batch dedup filter), `remote_push.py` (async HTTP push to secondary target), `configs/` (per-platform PlatformConfig registrations)
- Depends on: `clickhouse_connect`, registry configs
- Used by: `app/api/v1/job/job.py` router
- Purpose: Wrap platform HTTP clients so backend scheduler can call crawlers directly (company cleaning, company search)
- Location: `app/services/crawler/`
- Contains: `_base.py` (BaseFetcher, BaseSearcher, ApiResult), `_http_client.py`, `_boss_client.py`, `_boss_api.py`, `_boss_sign.py`, `_zhilian_client.py`, `_zhilian_api.py`, `_zhilian_sign.py`, `_job51_client.py`, `_job51_api.py`, `_job51_sign.py`, `boss.py`, `qcwy.py`, `zhilian.py`
- Depends on: `requests` (sync), platform sign/auth modules
- Used by: `CompanyCleaner`, `CompanyController`
- Purpose: Standalone crawlers running on ECS nodes; push data to backend via `POST /api/v1/universal/data/batch-store-async`
- Location: `jobs_spider/boss/boos_api.py`, `jobs_spider/qcwy/qcwy.py`, `jobs_spider/zhilian/zhilian_single.py`, `spiderJobs/`
- Depends on: `requests`, `loguru`, `PyExecJS`; no shared Python imports with `app/`
- Used by: Scheduled via `ecs_full_pipeline.py` or manual launch
Data Flow
- MySQL (via Tortoise-ORM): User sessions, RBAC entities, keyword crawl state, tokens, task run records
- ClickHouse: All raw job/company JSON data + analytics views; no ORM, raw SQL via `clickhouse_connect`
- Context variable `CTX_USER_ID` (Python `contextvars.ContextVar`) carries authenticated user ID within request scope (`app/core/ctx.py`)
- File-based startup lock (`.startup_lock/` dir) prevents multi-worker seed data collisions
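The request-scoped context variable can be illustrated with a small asyncio sketch. `CTX_USER_ID` is the name used in this document; the default value of `0`, the `current_user_id` helper, and the `handle_request` coroutine are assumptions. Each task created by `asyncio.gather` runs in its own copy of the context, so concurrent requests do not see each other's user ID.

```python
import asyncio
from contextvars import ContextVar

# Default 0 means "no authenticated user" in this sketch (an assumption).
CTX_USER_ID: ContextVar[int] = ContextVar("user_id", default=0)


def current_user_id() -> int:
    """Read the user ID from deep inside the call stack without passing it down."""
    return CTX_USER_ID.get()


async def handle_request(user_id: int) -> int:
    """Simulate one request: set the variable, yield control, then read it back."""
    token = CTX_USER_ID.set(user_id)
    try:
        await asyncio.sleep(0)  # let other "requests" interleave
        return current_user_id()
    finally:
        CTX_USER_ID.reset(token)  # restore the previous value for this context


async def main() -> list:
    return await asyncio.gather(handle_request(1), handle_request(2))


results = asyncio.run(main())
outside = current_user_id()  # untouched outside the request tasks
print(results, outside)
```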
Key Abstractions
- Purpose: Maps `(platform, channel, data_type)` to target ClickHouse table and dedup field extractors
- Examples: `app/services/ingest/registry.py`, `app/services/ingest/configs/`
- Pattern: Immutable `@dataclass(frozen=True)`, registered at module import time via `register()`
- Purpose: Prevents concurrent execution of scheduled jobs across multiple Uvicorn workers
- Examples: `app/core/locks.py`
- Pattern: Redis SET NX with TTL; falls back to filesystem directory creation with TTL metadata
- Purpose: Singleton lazy-init connection wrapper around `clickhouse_connect.AsyncClient`
- Examples: `app/core/clickhouse.py`
- Pattern: Global module-level instance `clickhouse_manager`, injected into services at call time via `await clickhouse_manager.get_client()`
- Purpose: Abstract base for platform-specific API clients within the in-process crawler wrappers
- Examples: `app/services/crawler/_base.py`
- Pattern: Template method; subclasses implement `_build_params()`, base handles HTTP invocation and `ApiResult` parsing
- Purpose: Common ORM base with `BigIntField pk`, `to_dict()` async serializer with M2M support
- Examples: `app/models/base.py`
- Pattern: Mixed-in with `TimestampMixin` (created_at, updated_at auto fields)
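The template-method pattern described for the crawler base class can be sketched as follows. `BaseFetcher`, `ApiResult`, and `_build_params()` are names from this document; the `ApiResult` fields, the injected `transport` callable (used here in place of a real HTTP client), and the `CompanyFetcher` subclass are assumptions made so the example runs without network access.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional

# A transport takes request params and returns a decoded payload (assumption).
Transport = Callable[[Dict[str, Any]], Dict[str, Any]]


@dataclass
class ApiResult:
    success: bool
    data: Optional[Dict[str, Any]] = None
    error: Optional[str] = None


class BaseFetcher(ABC):
    """Base class owns the call-and-wrap flow; subclasses only build params."""

    def __init__(self, transport: Transport) -> None:
        self._transport = transport

    @abstractmethod
    def _build_params(self, **kwargs: Any) -> Dict[str, Any]:
        """Return the platform-specific request parameters."""

    def fetch(self, **kwargs: Any) -> ApiResult:
        params = self._build_params(**kwargs)
        try:
            payload = self._transport(params)
        except Exception as exc:  # convert transport failures into a result
            return ApiResult(success=False, error=str(exc))
        return ApiResult(success=True, data=payload)


class CompanyFetcher(BaseFetcher):
    """Hypothetical subclass: the only platform-specific code is the params."""

    def _build_params(self, **kwargs: Any) -> Dict[str, Any]:
        return {"company_id": kwargs["company_id"], "page": kwargs.get("page", 1)}


def fake_transport(params: Dict[str, Any]) -> Dict[str, Any]:
    """Echo transport standing in for a real HTTP request."""
    return {"echo": params}


def failing_transport(params: Dict[str, Any]) -> Dict[str, Any]:
    raise TimeoutError("upstream timed out")


good = CompanyFetcher(fake_transport).fetch(company_id="c-1")
bad = CompanyFetcher(failing_transport).fetch(company_id="c-1")
print(good)
print(bad)
```

Injecting the transport also makes the fetchers testable offline, which fits the testing gaps noted elsewhere in this document.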
Entry Points
- Location: `run.py`
- Triggers: `python run.py`; reads `APP_HOST`, `APP_PORT`, `UVICORN_WORKERS` env vars; calls `uvicorn.run("app:app")`
- Responsibilities: Configure uvicorn logging format, start ASGI server
- Location: `app/__init__.py` → `create_app()`
- Triggers: Uvicorn import of `app:app`
- Responsibilities: Register middlewares, exception handlers, routers, static files; define lifespan
- Location: `app/__init__.py` (lifespan context manager) + `app/core/init_app.py` (`init_data()`)
- Triggers: FastAPI startup event
- Responsibilities: Tortoise-ORM init + schema generation → Aerich migrations → seed data (superuser, menus, APIs, roles) → ClickHouse table init → APScheduler start
- Location: `app/api/v1/__init__.py`
- Triggers: Imported by `app/api/__init__.py` → `app/__init__.py`
- Responsibilities: Mount all domain routers under `/api/v1/` with appropriate `DependPermission` guards
- Location: `ecs_full_pipeline.py`
- Triggers: Manual execution or scheduled via `scheduler.py` (`ecs_full_pipeline_job`) every 6 hours
- Responsibilities: Create Alibaba Cloud ECS spot instances, push spider scripts, execute crawlers, terminate instances
Error Handling
- Tortoise `DoesNotExist` → 404 via `DoesNotExistHandle` (registered in `app/core/init_app.py`)
- Pydantic `RequestValidationError` → 422 with field-level detail via `RequestValidationHandle`
- `IntegrityError` (DB constraint) → 400 via `IntegrityHandle`
- Custom `HTTPException` (from `app/core/exceptions.py`) → structured JSON error via `HttpExcHandle`
- Service layer exceptions logged with `loguru` then re-raised or returned as `{"success": False, "error": "..."}`
- Scheduled jobs catch all exceptions, record failure via `_record_task_run()` to `ScheduledTaskRun` table, then continue
Cross-Cutting Concerns
GSD Workflow Enforcement
Before using Edit, Write, or other file-changing tools, start work through a GSD command so planning artifacts and execution context stay in sync.
Use these entry points:
- `/gsd:quick` for small fixes, doc updates, and ad-hoc tasks
- `/gsd:debug` for investigation and bug fixing
- `/gsd:execute-phase` for planned phase work
Do not make direct repo edits outside a GSD workflow unless the user explicitly asks to bypass it.
Developer Profile
Profile not yet configured. Run `/gsd:profile-user` to generate your developer profile.
This section is managed by `generate-claude-profile` -- do not edit manually.