JobData/CLAUDE.md
2026-03-21 17:00:12 +08:00


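JobData - Recruitment Data Collection and Statistical Analysis Platform

Changelog

| Version | Date | Description |
| --- | --- | --- |
| Initial | 2026-03-20 | First generated architecture document, covering all four core modules |

Project Vision

JobData is a full-stack data collection and analysis platform for the recruitment market. The system automatically scrapes job and company data from three major recruitment platforms (Boss Zhipin (Boss直聘), 51job (前程无忧), and Zhaopin (智联招聘)), stores it uniformly in the ClickHouse columnar database, and offers data browsing, targeted cleaning, and statistical analysis through a FastAPI backend and a Vue3 frontend. The ECS elastic instance management module starts and stops crawler nodes in batches on Alibaba Cloud on demand.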


Architecture Overview

[Job platforms]  [Crawler layer]           [Backend API]         [Databases]
Boss直聘  ──►  jobs_spider/boss/     ──►                      ┌── MySQL
前程无忧  ──►  jobs_spider/qcwy/     ──►  app (FastAPI)  ──►  └── ClickHouse
智联招聘  ──►  jobs_spider/zhilian/  ──►
                                                              ▲
                                                              │
                                          web (Vue3)  ────────┘
                                          (frontend pages)

ecs_full_pipeline.py  ──►  Alibaba Cloud ECS  ──►  batch-launch crawler nodes

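Core Technology Stack

| Layer | Technology |
| --- | --- |
| Backend | Python 3.13, FastAPI 0.111, Tortoise-ORM 0.23, APScheduler |
| Database (business) | MySQL (users / permissions / audit / keywords / tokens) |
| Database (collection) | ClickHouse (raw job/company JSON data + analytics views) |
| Crawlers | requests / httpx / Playwright (Python scripts) |
| Frontend | Vue 3.3, Vite 4, Naive UI, Pinia, ECharts |
| Infrastructure | Alibaba Cloud ECS (pay-as-you-go spot instances), APScheduler scheduled jobs |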

Module Structure Diagram

graph TD
    ROOT["(root) JobData"] --> APP["app - FastAPI backend"]
    ROOT --> WEB["web - Vue3 frontend"]
    ROOT --> SPIDER["jobs_spider - platform crawlers"]
    ROOT --> ECS["ecs_full_pipeline.py - ECS batch deployment"]

    APP --> APP_API["app/api - routing layer"]
    APP --> APP_SVC["app/services - business logic"]
    APP --> APP_CORE["app/core - framework core"]
    APP --> APP_MODELS["app/models - ORM models"]
    APP --> APP_REPO["app/repositories - data repositories"]

    SPIDER --> SP_BOSS["jobs_spider/boss"]
    SPIDER --> SP_QCWY["jobs_spider/qcwy"]
    SPIDER --> SP_ZL["jobs_spider/zhilian"]

    click APP "./app/CLAUDE.md" "View the app module docs"
    click WEB "./web/CLAUDE.md" "View the web module docs"
    click SPIDER "./jobs_spider/CLAUDE.md" "View the jobs_spider module docs"

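Module Index

| Module path | Language | Responsibility |
| --- | --- | --- |
| app/ | Python | FastAPI backend: REST API, permission management, scheduled jobs, data ingestion and analysis |
| web/ | Vue3/JS | Frontend admin UI: data display, keyword management, proxy management, data-cleaning operations |
| jobs_spider/ | Python | Crawler scripts for the three platforms; they run independently and push results to the backend over HTTP |
| ecs_full_pipeline.py | Python | End-to-end script for batch creation, destruction, and command dispatch on Alibaba Cloud ECS instances |
| reclean_qcwy_jobs.py | Python | Standalone script for re-cleaning 51job (前程无忧) data |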

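Running and Development

Backend Startup

# Install dependencies (pipenv)
pipenv install

# Development mode (default port 9999, 20 workers)
python run.py

# Override via environment variables
APP_HOST=0.0.0.0 APP_PORT=9999 UVICORN_WORKERS=4 python run.py

Key Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| APP_HOST | 0.0.0.0 | Listen address |
| APP_PORT | 9999 | Listen port |
| UVICORN_WORKERS | 20 | Number of workers |
| CLICKHOUSE_HOST | 121.4.126.241 | ClickHouse address (change to your actual address) |
| CLICKHOUSE_USER / CLICKHOUSE_PASS | see config.py | ClickHouse authentication |
| SMTP_HOST / SMTP_USER / SMTP_PASS | see config.py | Email alert configuration |
| REPORT_ENDPOINT | | Webhook address for reporting statistics |
| RUN_MIGRATIONS_ON_STARTUP | True | Whether to run migrations automatically at startup |
| INITIALIZE_SEED_DATA_ON_STARTUP | True | Whether to initialize seed data at startup |

Security warning: SECRET_KEY, the database connection strings, and the SMTP password in config.py are all hardcoded defaults; production environments must override them via environment variables.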
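
A minimal sketch of how environment variables can take precedence over built-in defaults, using only the standard library. The project itself loads settings through pydantic-settings in config.py; the `ServerSettings` class and `load_server_settings` helper below are hypothetical and only illustrate the override rule for the three server variables in the table.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerSettings:
    """Hypothetical container for the server-related variables in the table."""
    host: str = "0.0.0.0"
    port: int = 9999
    workers: int = 20


def load_server_settings(env=None) -> ServerSettings:
    """Build settings from an environment mapping, falling back to defaults."""
    env = os.environ if env is None else env
    defaults = ServerSettings()
    return ServerSettings(
        host=env.get("APP_HOST", defaults.host),
        port=int(env.get("APP_PORT", defaults.port)),
        workers=int(env.get("UVICORN_WORKERS", defaults.workers)),
    )


if __name__ == "__main__":
    # Only UVICORN_WORKERS is overridden; host and port keep their defaults.
    print(load_server_settings({"UVICORN_WORKERS": "4"}))
```

Passing the mapping explicitly keeps the function easy to test; secrets such as SECRET_KEY would follow the same pattern but should have no usable default in production.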

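Frontend Startup

cd web
pnpm install
pnpm dev        # Development mode, default http://localhost:5173
pnpm build      # Build output goes to web/dist

ECS Batch Crawler Deployment

# Requires Alibaba Cloud credentials (environment variables or ~/.alibabacloud/credentials)
python ecs_full_pipeline.py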

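Scheduled Jobs

APScheduler registers the following jobs at application startup (app/core/scheduler.py):

| Job ID | Frequency | Responsibility |
| --- | --- | --- |
| stats_job | Every 6 hours | Count totals for each ClickHouse table and report them by email/Webhook |
| ecs_full_pipeline | Every 6 hours | Call ecs_full_pipeline.py to refresh crawler nodes in batches |
| ip_alert_job | Every 10 minutes | Check for IP reporting anomalies and raise alerts |
| company_cleaning_job | Every 5 minutes | Automatically clean pending company data (collect 50 + process 30) |
| daily_cleanup_job | Daily at 00:05 | Clean up historical job run records |

All jobs use a distributed file lock (or an optional Redis lock) so that each run executes only once across multiple workers.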
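
A rough sketch of a directory-based lock with a TTL, one common way to make sure only one worker runs a job. The real implementation lives in app/core/locks.py and may differ; the function names, the `meta.json` file, and its fields are assumptions made for illustration.

```python
import json
import os
import shutil
import time


def try_acquire(lock_dir: str, ttl_seconds: float) -> bool:
    """Try to take the lock by creating a directory (atomic on local filesystems)."""
    meta_path = os.path.join(lock_dir, "meta.json")
    try:
        os.mkdir(lock_dir)
    except FileExistsError:
        # Someone holds the lock; take it over only if its TTL has expired.
        try:
            with open(meta_path, encoding="utf-8") as fh:
                expires_at = json.load(fh)["expires_at"]
        except (OSError, ValueError, KeyError):
            return False  # metadata missing or unreadable: treat as held
        if time.time() < expires_at:
            return False
        shutil.rmtree(lock_dir, ignore_errors=True)
        try:
            os.mkdir(lock_dir)
        except FileExistsError:
            return False  # another worker won the race
    with open(meta_path, "w", encoding="utf-8") as fh:
        json.dump({"pid": os.getpid(), "expires_at": time.time() + ttl_seconds}, fh)
    return True


def release(lock_dir: str) -> None:
    """Drop the lock by removing its directory."""
    shutil.rmtree(lock_dir, ignore_errors=True)
```

A worker would call `try_acquire(".locks/stats_job", 300)` before running and skip the run when it returns False. The expiry takeover here is deliberately simple and is not fully race-free when two workers detect expiry at the same moment; a Redis `SET NX` with a TTL avoids that window.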


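Testing Strategy

  • The codebase currently has no automated test files (gap: both unit tests and integration tests are missing).
  • Recommended additions:
    1. Unit tests for the service layer in app/services/ (using pytest + anyio)
    2. API integration tests for app/api/v1/ (using httpx.AsyncClient)
    3. Unit tests for the data-parsing functions in jobs_spider/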

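Coding Standards

  • Python: use ruff (already in the Pipfile), black for formatting, and isort for import sorting.
  • Frontend: ESLint (@zclzone + @unocss rule sets), with prettier for formatting.
  • Types: the backend enforces pydantic schemas for input validation; the frontend is mainly JS (strict TS is not enabled).
  • Logging: the backend uses loguru throughout, emitting structured fields via logger.info(...).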

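AI Usage Guidelines

  • When modifying crawler logic, pay close attention to the anti-crawling mechanisms: SmartIPManager and IPAnomalyDetector are implemented in jobs_spider/boss/boos_api.py, with a random delay of at least 10 seconds.
  • After adding an API route, register it in app/api/v1/__init__.py and run api_controller.refresh_api() to update the permissions table.
  • ClickHouse schema changes are maintained in app/core/clickhouse_init.py and do not go through Aerich migrations.
  • MySQL model changes go through Aerich (aerich migrate && aerich upgrade).
  • New frontend pages must be registered as menu entries in both web/src/views/{module}/route.js and the backend init_menus().
  • config.py contains hardcoded real MySQL/ClickHouse connection strings and SMTP credentials; before committing code, make sure no sensitive information is leaked.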

Project

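JobData Crawler Interaction Refactor

A project to refactor the crawler interaction layer of the recruitment data collection platform. It rewrites the crawler code for the three platforms (Boss Zhipin (Boss直聘), 51job (前程无忧), and Zhaopin (智联招聘)), unifies the base classes and shared core logic, and improves ingestion deduplication, the company-information cleaning flow, and the frontend monitoring pages.

Core Value: Keyword-driven crawlers scrape job data, store it reliably in ClickHouse, and rely on scheduled jobs to collect and synchronize company information.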

Constraints

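  • Compatibility: the refactor must not interrupt existing crawler data collection
  • Shared core: external scripts and the backend must use the same signing/request/parsing code to avoid duplication
  • Database: ClickHouse table schemas stay unchanged; MySQL model changes go through Aerich migrations
  • Anti-crawling: existing anti-crawling mechanisms must be preserved (IP rotation, random delays, proxy management)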

Technology Stack

Languages

  • Python 3.13 - Backend API, crawlers, ECS pipeline scripts
  • JavaScript (ES Modules) - Vue3 frontend (no strict TypeScript; TS installed but JS used in .vue and .js files)
  • TypeScript 5.1.6 - Type definitions referenced in frontend toolchain only

Runtime

  • Python 3.13 (required: >=3.13 per Pipfile; Dockerfile uses python:3.11-slim-bullseye for container build)
  • Node.js 18 (Dockerfile node:18-alpine for frontend build stage)
  • Python: pipenv (development), pip / requirements.txt (Docker production)
  • Frontend: pnpm (lockfile: web/pnpm-lock.yaml present)
  • Lockfiles: Pipfile.lock (Python), web/pnpm-lock.yaml (frontend), uv.lock (uv-compatible)

Frameworks

  • FastAPI 0.111.0 - REST API framework (app/__init__.py factory via create_app())
  • Starlette 0.37.2 - ASGI underpinning (middleware, static files)
  • Uvicorn 0.34.0 - ASGI server (20 workers default, configurable via UVICORN_WORKERS)
  • Uvloop 0.21.0 - High-performance event loop (non-Windows only)
  • Tortoise-ORM 0.23.0 - Async ORM for MySQL (app/models/)
  • Aerich 0.8.1 - Database migrations for Tortoise-ORM (migrations/ directory)
  • aiomysql - Async MySQL driver (Tortoise backend)
  • clickhouse-connect 0.7.19 - ClickHouse async client (app/core/clickhouse.py)
  • asyncpg - Async PostgreSQL driver (listed in Pipfile but not actively used)
  • Pydantic 2.10.5 - Request/response schemas (app/schemas/)
  • pydantic-settings 2.7.1 - Settings management via env vars (app/settings/config.py)
  • orjson 3.10.14 - Fast JSON serialization
  • ujson 5.10.0 - Alternative JSON library
  • PyJWT 2.10.1 - JWT token generation/verification (HS256, 7-day expiry)
  • passlib 1.7.4 - Password hashing
  • argon2-cffi 23.1.0 - Argon2 password hasher
  • APScheduler - Async job scheduler (app/core/scheduler.py, 6 registered cron tasks)
  • httpx 0.28.1 - Async HTTP client (backend service-to-service calls)
  • requests - Sync HTTP client (spider scripts in jobs_spider/)
  • loguru 0.7.3 - Structured logging (unified throughout backend and spiders)
  • Vue 3.3.4 - SPA framework (web/src/main.js)
  • Vue Router 4.2.4 - Client-side routing (web/src/router/index.js)
  • Pinia 2.1.6 - State management (web/src/store/)
  • Naive UI 2.34.4 - Component library (admin UI)
  • ECharts 6.0.0 - Data visualization charts (web/src/views/analytics/)
  • axios 1.4.0 - HTTP client (web/src/utils/http/)
  • vue-i18n 9 - Internationalization
  • Vite 4.4.6 - Frontend bundler (web/vite.config.js)
  • UnoCSS 66.5.10 - Atomic CSS engine
  • unplugin-auto-import 20.3.0 - Auto-imports for Vue APIs
  • unplugin-vue-components 30.0.0 - Auto-imports for components
  • @iconify/vue + @iconify/json - Icon library
  • playwright 1.57.0 - Browser automation (anti-detection in crawlers)
  • PyExecJS 1.5.1 - Execute JavaScript signing algorithms from Python (jobs_spider/boss/)
  • PySocks - SOCKS proxy support for spider requests
  • tenacity - Retry logic in crawlers
  • pandas + openpyxl - Data processing and Excel export
  • alibabacloud_ecs20140526 - Alibaba Cloud ECS SDK (ecs_full_pipeline.py)
  • alibabacloud_credentials - AliCloud credential management
  • ruff 0.9.1 - Python linter (pyproject.toml config: line-length 120, ignores F403/F405)
  • black 24.10.0 - Python formatter (line-length 120, target py310/py311)
  • isort 5.13.2 - Import sorter
  • ESLint 8.46.0 - Frontend linter (@zclzone + @unocss rule sets)
  • prettier - Frontend formatter

Key Dependencies

  • clickhouse-connect==0.7.19 - All analytics and job data storage; loss means no data read/write
  • tortoise-orm==0.23.0 - All business data (users, roles, keywords, tokens); paired with aerich for migrations
  • fastapi==0.111.0 - API layer; version-pinned for stability
  • APScheduler - 6 scheduled tasks including ECS pipeline and IP alerting
  • alibabacloud_ecs20140526 - ECS node management; required for crawler scaling
  • uvicorn==0.34.0 + uvloop==0.21.0 - Production ASGI server stack
  • pydantic==2.10.5 - All input validation; v2 API used throughout
  • loguru==0.7.3 - Unified logging across all modules
  • redis - Optional distributed lock backend (app/core/locks.py; falls back to file locks if Redis unavailable)

Configuration

  • All settings in app/settings/config.py via pydantic-settings.BaseSettings
  • Environment variables override defaults at startup
  • No .env file detected (not committed); variables set at OS/container level
  • Key variables: APP_HOST, APP_PORT, UVICORN_WORKERS, CLICKHOUSE_HOST, CLICKHOUSE_USER, CLICKHOUSE_PASS, SECRET_KEY, SMTP_HOST, SMTP_USER, SMTP_PASS, REDIS_HOST, REPORT_ENDPOINT
  • Security warning: config.py contains hardcoded default values for MySQL, ClickHouse, and SMTP credentials; must be overridden in production
  • pyproject.toml - Python project metadata, black/ruff tool config
  • Pipfile / Pipfile.lock - Development dependency management
  • requirements.txt - Production pip install (used in Dockerfile)
  • web/vite.config.js - Vite build config with proxy support via VITE_USE_PROXY env var
  • Dockerfile - Multi-stage build: Node 18 (frontend) + Python 3.11 (backend) + nginx

Platform Requirements

  • Python 3.13+ (Pipfile requirement)
  • Node.js 18+ with pnpm
  • MySQL server (Tortoise-ORM connection)
  • ClickHouse server (analytics data)
  • Optional: Redis (distributed locking upgrade)
  • Docker-based deployment (see Dockerfile, deploy/entrypoint.sh, deploy/web.conf)
  • Nginx serves frontend static files and proxies API requests
  • Alibaba Cloud ECS (cn-shanghai-b zone) for crawler nodes
  • Ports: 80 (nginx), 9999 (FastAPI backend direct)

Conventions

Naming Patterns

  • Snake_case for all Python files: company_storage.py, company_cleaner.py, clickhouse_repo.py
  • Private/internal modules prefixed with underscore: _base.py, _boss_api.py, _boss_client.py, _boss_sign.py, _http_client.py
  • Platform-named service files: boss.py, qcwy.py, zhilian.py under app/services/crawler/
  • Router files named after domain: keyword.py, analytics.py, cleaning.py
  • PascalCase for all classes: CleaningService, KeywordController, ClickHouseBaseRepo, JobAnalyticsRepo
  • Services: {Domain}Service, e.g. BossService, QcwyService, ZhilianService, IngestService, AnalyticsService
  • Controllers: {Domain}Controller, e.g. KeywordController
  • Repos: {Domain}Repo or {Domain}BaseRepo, e.g. ClickHouseBaseRepo, JobAnalyticsRepo
  • Models (Tortoise ORM): {Platform}{Entity}, e.g. BossKeyword, QcwyCompany, ZhilianCompany
  • Schemas (Pydantic): {Entity}Base, {Entity}Create, {Entity}Update, {Entity}Out — see app/schemas/keyword.py
  • Snake_case for all functions and methods: get_available, report_page_progress, store_batch, build_insert_row
  • Private helpers prefixed with underscore: _apply_proxy, _ensure_boss_token_loaded, _pick_first, _nested_get, _clean_text, _model_for_source
  • Async dependency factories follow pattern get_{service/controller}(): get_ingest_service, get_analytics_service, get_keyword_controller
  • Snake_case for variables and parameters: data_list, platform_type, check_duplicate, page_size
  • Module-level constants: UPPER_SNAKE_CASE — COMPANY_SOURCES, QUEUE_TERMINAL_STATUSES
  • Class-level constants: UPPER_SNAKE_CASE prefixed with an underscore, e.g. _TOKEN_REFRESH_INTERVAL = 3600
  • Enums use PascalCase class name, UPPER_SNAKE_CASE values: PlatformType.BOSS, ChannelType.MINI, DataType.JOB
  • Enum values are lowercase strings matching URL slugs: "boss", "mini", "job" — see app/schemas/ingest.py
  • Enums inherit from (str, Enum) enabling direct string comparison
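
The enum conventions above can be illustrated with a short sketch. `PlatformType.BOSS` and its `"boss"` value come from the text; the other two members and the `parse_platform` helper are assumptions added for illustration.

```python
from enum import Enum


class PlatformType(str, Enum):
    # Member names are UPPER_SNAKE_CASE; values are lowercase URL slugs.
    BOSS = "boss"
    QCWY = "qcwy"        # assumed member
    ZHILIAN = "zhilian"  # assumed member


def parse_platform(slug: str) -> PlatformType:
    """Look up a member by its slug; raises ValueError for unknown slugs."""
    return PlatformType(slug)


if __name__ == "__main__":
    # Because the enum mixes in str, members compare equal to plain strings.
    print(PlatformType.BOSS == "boss")
    print(parse_platform("qcwy") is PlatformType.QCWY)
```

Mixing in `str` lets route handlers compare a path segment against a member directly, without reaching for `.value`.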

Code Style

  • Tool: black v24.10.0
  • Line length: 120 characters (set in pyproject.toml [tool.black] and [tool.ruff])
  • Target Python versions: 3.10, 3.11 (black), 3.13 (Pipfile)
  • Tool: ruff v0.9.1 (configured in pyproject.toml)
  • Ignored rules: F403 (star imports), F405 (may be undefined from star import)
  • Star imports from internal modules are allowed (used in app/models/__init__.py, app/services/ingest/__init__.py)
  • Tool: isort v5.13.2
  • No explicit isort config found; follows default ordering

Import Organization

  • None; all imports use full app. prefix paths
  • from app.log import logger is the canonical loguru import path
  • Star imports are used only in __init__.py re-export files: from .admin import * in app/models/__init__.py
  • # noqa: F401, F403 comments suppress lint warnings for intentional star imports

Error Handling

  • Services return Dict[str, Any] result objects with "success", "code", "message" fields instead of raising exceptions to callers
  • Controllers return dict with "code": 200/400/404 and "message" for all outcomes
  • API route handlers do NOT use try/except — they rely on services returning structured results
  • Service methods wrap low-level calls in try/except Exception as e and log then return False or error dict

Logging

  • logger.info(f"...") for normal operation events
  • logger.warning(f"...") for non-fatal recoverable issues (e.g., token not found, API soft failures)
  • logger.error(f"...") for caught exceptions and operation failures
  • F-string interpolation used consistently for message formatting
  • No structured fields (no logger.bind() usage observed)

API Response Format

Comments

  • Docstrings on public methods describing purpose, not implementation: """获取可用关键词,优先返回断点续爬和失败重试的关键词"""
  • Inline comments for priority logic and algorithm steps: # 优先级 1: 断点续爬 (partial)
  • Module-level docstrings for context: """Boss直聘 Service — 基于新算法文件的封装"""
  • # noqa comments for intentional lint suppressions
  • Not applicable (Python backend)
  • Docstrings are brief single-line or short multi-line Chinese descriptions

Function Design

  • Keyword arguments with defaults preferred for optional params
  • Pydantic schemas used for HTTP request bodies (never raw dicts from router params)
  • Optional[str] with = None default for optional parameters
  • Services return Dict[str, Any] with consistent keys (code, message, data)
  • Private helpers return Optional[T] or primitive types
  • Async functions return awaitable results (no mixing of sync/async)

Module Design

  • app/models/__init__.py uses from .{module} import * to flatten model imports
  • Router modules export a single named router variable: router = APIRouter(...) or {domain}_router = APIRouter(...)
  • Service classes are imported directly by name
  • app/models/__init__.py — re-exports all model classes
  • app/services/ingest/__init__.py — re-exports IngestService and config registrations
  • app/api/v1/__init__.py — aggregates all routers into v1_router
  • FastAPI Depends() used for service/controller instantiation in route handlers
  • Dependency factory functions named get_{service}() and defined in the same file as the router
  • Shared auth dependencies: DependAuth, DependPermission in app/core/dependency.py

Tortoise ORM Model Conventions

Pydantic Schema Conventions

  • All schemas inherit from pydantic.BaseModel
  • All fields use Field(...) with description= for documentation
  • Enums inherit from (str, Enum) for JSON serialization compatibility
  • Output schemas include class Config: from_attributes = True to support ORM mode
  • Validation patterns use Field(..., pattern="^(boss|qcwy|zhilian)$") for enum-like string fields

Architecture

Pattern Overview

  • FastAPI async backend with dual-database persistence (MySQL for RBAC/metadata, ClickHouse for job/company analytics data)
  • External crawlers (jobs_spider/, spiderJobs/) run as independent processes and push data to the backend API via HTTP POST — no shared code path at runtime
  • RBAC permission model enforced at router-level via FastAPI Depends
  • APScheduler drives all recurring automation (cleaning, ECS orchestration, alerting) within the main process
  • Registry pattern (app/services/ingest/registry.py) maps (platform, channel, data_type) tuples to ClickHouse table configs + dedup rules
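
The registry idea in the last bullet can be sketched as follows. This is an illustration under stated assumptions, not the contents of app/services/ingest/registry.py: the `PlatformConfig` fields, the `get_config` and `dedup_batch` helpers, the table name `boss_mini_job`, and the `job_id` dedup field are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set, Tuple

Key = Tuple[str, str, str]  # (platform, channel, data_type)


@dataclass(frozen=True)
class PlatformConfig:
    table: str                     # target ClickHouse table (name is illustrative)
    dedup_fields: Tuple[str, ...]  # fields whose values identify a duplicate


_REGISTRY: Dict[Key, PlatformConfig] = {}


def register(platform: str, channel: str, data_type: str, config: PlatformConfig) -> None:
    """Store the config under its (platform, channel, data_type) key."""
    _REGISTRY[(platform, channel, data_type)] = config


def get_config(platform: str, channel: str, data_type: str) -> PlatformConfig:
    """Return the config for a key; raises KeyError if nothing was registered."""
    return _REGISTRY[(platform, channel, data_type)]


def dedup_batch(rows: Iterable[dict], config: PlatformConfig) -> List[dict]:
    """Keep the first row for each distinct combination of dedup field values."""
    seen: Set[tuple] = set()
    unique = []
    for row in rows:
        key = tuple(row.get(f) for f in config.dedup_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Registration happens at import time, one call per supported combination.
register("boss", "mini", "job", PlatformConfig(table="boss_mini_job", dedup_fields=("job_id",)))
```

With this shape, supporting a new platform or channel means adding one `register()` call rather than a new branch in the ingestion code.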

Layers

  • Purpose: Parse HTTP request, validate via Pydantic schema, delegate to service or controller, return JSON envelope
  • Location: app/api/v1/
  • Contains: One sub-package per domain (e.g., job/, cleaning/, analytics/, keyword/)
  • Depends on: Services, Controllers, app/core/dependency.py for auth
  • Used by: FastAPI application via app/api/v1/__init__.py → app/__init__.py
  • Purpose: Business logic, orchestration across repos/models, side-effect coordination
  • Location: app/services/
  • Contains: IngestService, AnalyticsService, CleaningService, CompanyCleaner, CompanyJobsSyncService, platform crawler service wrappers (crawler/boss.py, crawler/qcwy.py, crawler/zhilian.py)
  • Depends on: Repositories, ORM models, clickhouse_manager, external HTTP clients
  • Used by: Routers, APScheduler jobs
  • Purpose: Thin CRUD orchestration for MySQL-backed entities (User, Role, API, Menu, Dept, etc.)
  • Location: app/controllers/
  • Contains: user_controller, api_controller, role_controller, menu_controller, dept_controller, keyword_controller, proxy_controller, token_controller, cleaning_controller, company_controller
  • Depends on: Tortoise-ORM models
  • Used by: Routers
  • Purpose: Encapsulate raw ClickHouse query execution behind typed methods
  • Location: app/repositories/
  • Contains: ClickHouseBaseRepo (generic query/insert), JobAnalyticsRepo (trend, distribution, count queries against job_analytics view)
  • Depends on: clickhouse_connect.driver.AsyncClient
  • Used by: AnalyticsService
  • Purpose: ORM schema definitions for MySQL, migrated via Aerich
  • Location: app/models/
  • Contains: admin.py (User, Role, Api, Menu, Dept, AuditLog), token.py (BossToken), cleaning.py (CleaningTask), metrics.py (ScheduledTaskRun, StatsTotal, IpUploadStats), company.py (CompanyCleaningQueue), keyword.py (BossKeyword, QcwyKeyword, ZhilianKeyword)
  • Depends on: tortoise-orm
  • Used by: Controllers, Services, Scheduler jobs
  • Purpose: App wiring, connection management, middleware, cross-cutting utilities
  • Location: app/core/
  • Contains: init_app.py (lifespan bootstrap), scheduler.py (APScheduler jobs), clickhouse.py (ClickHouseManager singleton), clickhouse_init.py (ClickHouse DDL), dependency.py (AuthControl, PermissionControl), locks.py (DistributedLock), middlewares.py (BackGroundTaskMiddleware, HttpAuditLogMiddleware, IpTrackingMiddleware), exceptions.py, crud.py, ip_tracking.py
  • Depends on: Everything below
  • Used by: app/__init__.py, all layers above
  • Purpose: Platform-agnostic data ingestion pipeline with registry-driven routing, dedup, and remote push
  • Location: app/services/ingest/
  • Contains: service.py (IngestService), registry.py (PlatformConfig registry), dedup.py (batch dedup filter), remote_push.py (async HTTP push to secondary target), configs/ (per-platform PlatformConfig registrations)
  • Depends on: clickhouse_connect, registry configs
  • Used by: app/api/v1/job/job.py router
  • Purpose: Wrap platform HTTP clients so backend scheduler can call crawlers directly (company cleaning, company search)
  • Location: app/services/crawler/
  • Contains: _base.py (BaseFetcher, BaseSearcher, ApiResult), _http_client.py, _boss_client.py, _boss_api.py, _boss_sign.py, _zhilian_client.py, _zhilian_api.py, _zhilian_sign.py, _job51_client.py, _job51_api.py, _job51_sign.py, boss.py, qcwy.py, zhilian.py
  • Depends on: requests (sync), platform sign/auth modules
  • Used by: CompanyCleaner, CompanyController
  • Purpose: Standalone crawlers running on ECS nodes; push data to backend via POST /api/v1/universal/data/batch-store-async
  • Location: jobs_spider/boss/boos_api.py, jobs_spider/qcwy/qcwy.py, jobs_spider/zhilian/zhilian_single.py, spiderJobs/
  • Depends on: requests, loguru, PyExecJS; no shared Python imports with app/
  • Used by: Scheduled via ecs_full_pipeline.py or manual launch
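
For the standalone crawlers in the last group, the push to the backend might look roughly like the sketch below. Only the endpoint path comes from the text; the body field names and the `build_push_request` helper are assumptions, and `urllib` is used here to stay within the standard library even though the crawlers themselves use requests.

```python
import json
import urllib.request

BATCH_STORE_PATH = "/api/v1/universal/data/batch-store-async"  # path from the text above


def build_push_request(base_url: str, platform: str, data_type: str, data_list: list) -> urllib.request.Request:
    """Prepare (but do not send) a JSON POST carrying one batch of crawled rows."""
    # The body field names below are assumptions, not the backend's confirmed schema.
    body = {"platform": platform, "data_type": data_type, "data_list": data_list}
    return urllib.request.Request(
        url=base_url.rstrip("/") + BATCH_STORE_PATH,
        data=json.dumps(body, ensure_ascii=False).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending it is then a matter of `urllib.request.urlopen(req, timeout=...)`; building the request separately keeps the payload shape testable without a running backend.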

Data Flow

  • MySQL (via Tortoise-ORM): User sessions, RBAC entities, keyword crawl state, tokens, task run records
  • ClickHouse: All raw job/company JSON data + analytics views; no ORM — raw SQL via clickhouse_connect
  • Context variable CTX_USER_ID (Python contextvars.ContextVar) carries authenticated user ID within request scope (app/core/ctx.py)
  • File-based startup lock (.startup_lock/ dir) prevents multi-worker seed data collisions
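
A small sketch of how a context variable such as CTX_USER_ID keeps per-request state separate under asyncio. The variable name comes from the text; the default of 0 and the `handle_request` and `main` functions are assumptions for illustration.

```python
import asyncio
from contextvars import ContextVar

# Same name as in the text; the default of 0 is an assumption for this sketch.
CTX_USER_ID: ContextVar[int] = ContextVar("user_id", default=0)


async def handle_request(user_id: int) -> int:
    """Set the user ID for this task, yield to other tasks, then read it back."""
    CTX_USER_ID.set(user_id)
    await asyncio.sleep(0)  # let other requests interleave
    return CTX_USER_ID.get()


async def main() -> list:
    # Each task created by gather() runs in its own copy of the context.
    return await asyncio.gather(handle_request(1), handle_request(2))


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Each task reads back the value it set even though the two interleave, so code deep in the call stack can fetch the current user without the ID being passed through every function.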

Key Abstractions

  • Purpose: Maps (platform, channel, data_type) to target ClickHouse table and dedup field extractors
  • Examples: app/services/ingest/registry.py, app/services/ingest/configs/
  • Pattern: Immutable @dataclass(frozen=True), registered at module import time via register()
  • Purpose: Prevents concurrent execution of scheduled jobs across multiple Uvicorn workers
  • Examples: app/core/locks.py
  • Pattern: Redis SET NX with TTL; falls back to filesystem directory creation with TTL metadata
  • Purpose: Singleton lazy-init connection wrapper around clickhouse_connect.AsyncClient
  • Examples: app/core/clickhouse.py
  • Pattern: Global module-level instance clickhouse_manager, injected into services at call time via await clickhouse_manager.get_client()
  • Purpose: Abstract base for platform-specific API clients within the in-process crawler wrappers
  • Examples: app/services/crawler/_base.py
  • Pattern: Template method — subclasses implement _build_params(), base handles HTTP invocation and ApiResult parsing
  • Purpose: Common ORM base with BigIntField pk, to_dict() async serializer with M2M support
  • Examples: app/models/base.py
  • Pattern: Mixed-in with TimestampMixin (created_at, updated_at auto fields)
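
The template-method pattern noted for app/services/crawler/_base.py can be sketched like this. `BaseFetcher`, `ApiResult`, and `_build_params()` are names from the text; the `ApiResult` fields, the `fetch` method, the injected `transport` callable, and the `CompanyFetcher` subclass are hypothetical.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional


@dataclass
class ApiResult:
    ok: bool
    data: Optional[Any] = None
    error: Optional[str] = None


class BaseFetcher(ABC):
    """Base class owns the request/parse flow; subclasses only build parameters."""

    def __init__(self, transport: Callable[[Dict[str, Any]], Dict[str, Any]]):
        # `transport` is a hypothetical stand-in for the real HTTP client.
        self._transport = transport

    @abstractmethod
    def _build_params(self, **kwargs: Any) -> Dict[str, Any]:
        ...

    def fetch(self, **kwargs: Any) -> ApiResult:
        """Template method: build params, call the transport, wrap the outcome."""
        params = self._build_params(**kwargs)
        try:
            payload = self._transport(params)
        except Exception as e:
            return ApiResult(ok=False, error=str(e))
        return ApiResult(ok=True, data=payload)


class CompanyFetcher(BaseFetcher):  # hypothetical subclass
    def _build_params(self, **kwargs: Any) -> Dict[str, Any]:
        return {"company_id": kwargs["company_id"], "page": kwargs.get("page", 1)}
```

Injecting the transport keeps the sketch runnable without network access, for example `CompanyFetcher(lambda p: {"echo": p}).fetch(company_id="c1")`.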

Entry Points

  • Location: run.py
  • Triggers: python run.py — reads APP_HOST, APP_PORT, UVICORN_WORKERS env vars; calls uvicorn.run("app:app")
  • Responsibilities: Configure uvicorn logging format, start ASGI server
  • Location: app/__init__.py → create_app()
  • Triggers: Uvicorn import of app:app
  • Responsibilities: Register middlewares, exception handlers, routers, static files; define lifespan
  • Location: app/__init__.py (lifespan context manager) + app/core/init_app.py (init_data())
  • Triggers: FastAPI startup event
  • Responsibilities: Tortoise-ORM init + schema generation → Aerich migrations → seed data (superuser, menus, APIs, roles) → ClickHouse table init → APScheduler start
  • Location: app/api/v1/__init__.py
  • Triggers: Imported by app/api/__init__.py → app/__init__.py
  • Responsibilities: Mount all domain routers under /api/v1/ with appropriate DependPermission guards
  • Location: ecs_full_pipeline.py
  • Triggers: Manual execution or scheduled via scheduler.py (ecs_full_pipeline_job) every 6 hours
  • Responsibilities: Create Alibaba Cloud ECS spot instances, push spider scripts, execute crawlers, terminate instances

Error Handling

  • Tortoise DoesNotExist → 404 via DoesNotExistHandle (registered in app/core/init_app.py)
  • Pydantic RequestValidationError → 422 with field-level detail via RequestValidationHandle
  • IntegrityError (DB constraint) → 400 via IntegrityHandle
  • Custom HTTPException (from app/core/exceptions.py) → structured JSON error via HttpExcHandle
  • Service layer exceptions logged with loguru then re-raised or returned as {"success": False, "error": "..."}
  • Scheduled jobs catch all exceptions, record failure via _record_task_run() to ScheduledTaskRun table, then continue
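
The last bullet, where a scheduled job records its failure and carries on, could be sketched as a decorator. The in-memory `TASK_RUNS` list stands in for the ScheduledTaskRun table, and `guarded_job` and `record_run` are hypothetical names rather than the project's `_record_task_run()`.

```python
import asyncio
import functools
from typing import Awaitable, Callable, List, Tuple

TASK_RUNS: List[Tuple[str, str, str]] = []  # in-memory stand-in for the ScheduledTaskRun table


def record_run(task_id: str, status: str, detail: str = "") -> None:
    """Append one run record; a real version would write to the database."""
    TASK_RUNS.append((task_id, status, detail))


def guarded_job(task_id: str):
    """Decorator: run the job, record success or failure, never propagate errors."""
    def decorator(func: Callable[[], Awaitable[None]]):
        @functools.wraps(func)
        async def wrapper() -> None:
            try:
                await func()
            except Exception as e:
                record_run(task_id, "failed", str(e))
            else:
                record_run(task_id, "success")
        return wrapper
    return decorator


@guarded_job("stats_job")
async def stats_job() -> None:
    # Simulated failure so the recording path can be seen.
    raise RuntimeError("ClickHouse unreachable")


if __name__ == "__main__":
    asyncio.run(stats_job())
    print(TASK_RUNS)
```

Swallowing the exception inside the wrapper means one failing run does not stop the scheduler from firing the next one, while the recorded row keeps the failure visible.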

Cross-Cutting Concerns

GSD Workflow Enforcement

Before using Edit, Write, or other file-changing tools, start work through a GSD command so planning artifacts and execution context stay in sync.

Use these entry points:

  • /gsd:quick for small fixes, doc updates, and ad-hoc tasks
  • /gsd:debug for investigation and bug fixing
  • /gsd:execute-phase for planned phase work

Do not make direct repo edits outside a GSD workflow unless the user explicitly asks to bypass it.

Developer Profile

Profile not yet configured. Run /gsd:profile-user to generate your developer profile. This section is managed by generate-claude-profile -- do not edit manually.