win 44b5f390aa docs: create roadmap (6 phases)

2026-03-21 17:00:12 +08:00

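JobData - Recruitment Data Collection and Statistical Analysis Platform

Changelog

Version	Date	Notes
Initial	2026-03-20	First generated architecture document, covering all four core modules

Project Vision

JobData is a full-stack data collection and analysis platform for the recruitment market. It automatically scrapes job and company data from three major recruitment platforms (Boss直聘, 前程无忧, 智联招聘), stores it in a unified form in the ClickHouse columnar database, and exposes data browsing, targeted cleaning, and statistical analysis through a FastAPI backend and a Vue3 frontend. The ECS elastic-instance management module starts and stops crawler nodes on Alibaba Cloud in batches, on demand.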

Architecture Overview

[Job platforms]  [Crawler layer]           [Backend API]       [Databases]
Boss直聘  ──►  jobs_spider/boss/     ──►                      ┌── MySQL
前程无忧  ──►  jobs_spider/qcwy/     ──►  app (FastAPI)  ──►  └── ClickHouse
智联招聘  ──►  jobs_spider/zhilian/  ──►
                                                              ▲
                                                              │
                                          web (Vue3)  ────────┘
                                          (frontend pages)

ecs_full_pipeline.py  ──►  Alibaba Cloud ECS  ──►  batch-launches crawler nodes

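Core Technology Stack

Layer	Technology
Backend	Python 3.13, FastAPI 0.111, Tortoise-ORM 0.23, APScheduler
Database (business)	MySQL (users / permissions / audit / keywords / tokens)
Database (collection)	ClickHouse (raw job/company JSON data + analytics views)
Crawlers	requests / httpx / Playwright (Python scripts)
Frontend	Vue 3.3, Vite 4, Naive UI, Pinia, ECharts
Infrastructure	Alibaba Cloud ECS (pay-as-you-go spot instances), APScheduler scheduled tasks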

Module Structure Diagram

graph TD
    ROOT["(root) JobData"] --> APP["app - FastAPI backend"]
    ROOT --> WEB["web - Vue3 frontend"]
    ROOT --> SPIDER["jobs_spider - platform crawlers"]
    ROOT --> ECS["ecs_full_pipeline.py - ECS batch deployment"]

    APP --> APP_API["app/api - routing layer"]
    APP --> APP_SVC["app/services - business logic"]
    APP --> APP_CORE["app/core - framework core"]
    APP --> APP_MODELS["app/models - ORM models"]
    APP --> APP_REPO["app/repositories - data repositories"]

    SPIDER --> SP_BOSS["jobs_spider/boss"]
    SPIDER --> SP_QCWY["jobs_spider/qcwy"]
    SPIDER --> SP_ZL["jobs_spider/zhilian"]

    click APP "./app/CLAUDE.md" "View the app module docs"
    click WEB "./web/CLAUDE.md" "View the web module docs"
    click SPIDER "./jobs_spider/CLAUDE.md" "View the jobs_spider module docs"

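Module Index

Module path	Language	Responsibility
`app/`	Python	FastAPI backend: REST API, permission management, scheduled tasks, data ingestion and analysis
`web/`	Vue3/JS	Frontend admin UI: data display, keyword management, proxy management, data-cleaning operations
`jobs_spider/`	Python	Crawler scripts for the three platforms; they run independently and push results to the backend over HTTP
`ecs_full_pipeline.py`	Python	End-to-end script for batch creation, teardown, and command dispatch on Alibaba Cloud ECS instances
`reclean_qcwy_jobs.py`	Python	Standalone script that re-cleans 前程无忧 data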

Running and Development

Backend Startup

# Install dependencies (pipenv)
pipenv install

# Development mode (default port 9999, 20 workers)
python run.py

# Override via environment variables
APP_HOST=0.0.0.0 APP_PORT=9999 UVICORN_WORKERS=4 python run.py

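Key Environment Variables

Variable	Default	Description
`APP_HOST`	`0.0.0.0`	Listen address
`APP_PORT`	`9999`	Listen port
`UVICORN_WORKERS`	`20`	Number of workers
`CLICKHOUSE_HOST`	`121.4.126.241`	ClickHouse address (must be changed to your actual address)
`CLICKHOUSE_USER` / `CLICKHOUSE_PASS`	see config.py	ClickHouse authentication
`SMTP_HOST` / `SMTP_USER` / `SMTP_PASS`	see config.py	Email alert configuration
`REPORT_ENDPOINT`	empty	Webhook address for reporting statistics
`RUN_MIGRATIONS_ON_STARTUP`	`True`	Whether to run migrations automatically at startup
`INITIALIZE_SEED_DATA_ON_STARTUP`	`True`	Whether to initialize seed data at startup

Security warning: SECRET_KEY, the database connection strings, and the SMTP password in config.py are all hardcoded defaults; in production they must be overridden via environment variables.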
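
As a rough illustration of how the documented defaults give way to environment variables, here is a stdlib-only sketch. The project itself loads settings through pydantic-settings in app/settings/config.py; the `Settings` dataclass and `load_settings` helper below are hypothetical stand-ins, not the project's code.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical stand-in for the project's settings object."""
    app_host: str
    app_port: int
    uvicorn_workers: int


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read each variable from the environment, falling back to the documented default."""
    env = os.environ if env is None else env
    return Settings(
        app_host=env.get("APP_HOST", "0.0.0.0"),
        app_port=int(env.get("APP_PORT", "9999")),
        uvicorn_workers=int(env.get("UVICORN_WORKERS", "20")),
    )


# An empty mapping yields the defaults; any variable that is set wins.
defaults = load_settings({})
overridden = load_settings({"APP_PORT": "8080", "UVICORN_WORKERS": "4"})
```

Passing a mapping instead of reading `os.environ` directly keeps the helper easy to exercise without touching the real environment.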

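Frontend Startup

cd web
pnpm install
pnpm dev        # Development mode, default http://localhost:5173
pnpm build      # Build output goes to web/dist

ECS Batch Crawler Deployment

# Requires Alibaba Cloud credentials (environment variables or ~/.alibabacloud/credentials)
python ecs_full_pipeline.py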

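Scheduled Tasks

APScheduler registers the following tasks at application startup (app/core/scheduler.py):

Task ID	Frequency	Responsibility
`stats_job`	Every 6 hours	Counts the totals of each ClickHouse table and reports them via email/Webhook
`ecs_full_pipeline`	Every 6 hours	Calls `ecs_full_pipeline.py` to refresh the crawler nodes in batch
`ip_alert_job`	Every 10 minutes	Checks for IP reporting anomalies and raises alerts
`company_cleaning_job`	Every 5 minutes	Automatically cleans pending company data (collect 50 + process 30)
`daily_cleanup_job`	Daily at 00:05	Cleans up historical task-run records

All tasks use a distributed file lock (or, optionally, a Redis lock) to ensure that each runs only once across multiple workers.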
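
The file-lock idea can be sketched with an atomic directory creation plus an expiry timestamp. This is a minimal stdlib illustration, not the code in app/core/locks.py: the function names, the `meta.json` file, and its fields are assumptions made for the example, and a production lock would need more care around stale-lock cleanup races.

```python
import json
import os
import shutil
import tempfile
import time


def try_acquire(lock_dir: str, ttl_seconds: float) -> bool:
    """Return True only for the single caller that wins the lock."""
    try:
        os.mkdir(lock_dir)  # atomic: raises if the directory already exists
    except FileExistsError:
        try:
            with open(os.path.join(lock_dir, "meta.json")) as fh:
                expires_at = json.load(fh)["expires_at"]
        except (OSError, ValueError, KeyError):
            return False  # metadata unreadable: treat the lock as held
        if time.time() < expires_at:
            return False  # lock is still valid
        shutil.rmtree(lock_dir, ignore_errors=True)  # stale lock: clear it
        try:
            os.mkdir(lock_dir)
        except FileExistsError:
            return False  # another worker re-acquired it first
    with open(os.path.join(lock_dir, "meta.json"), "w") as fh:
        json.dump({"pid": os.getpid(), "expires_at": time.time() + ttl_seconds}, fh)
    return True


def release(lock_dir: str) -> None:
    shutil.rmtree(lock_dir, ignore_errors=True)


lock_path = os.path.join(tempfile.mkdtemp(), "stats_job.lock")
first = try_acquire(lock_path, ttl_seconds=60)   # this caller wins
second = try_acquire(lock_path, ttl_seconds=60)  # a competing caller is refused
release(lock_path)
third = try_acquire(lock_path, ttl_seconds=60)   # free again after release
release(lock_path)
```

The TTL matters because a worker that crashes while holding the lock would otherwise block the job forever.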

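Testing Strategy

The codebase currently has no automated test files (gap: both unit tests and integration tests are missing).
Recommended additions:
1. Unit tests for the service layer in app/services/ (using pytest + anyio)
2. API integration tests for app/api/v1/ (using httpx.AsyncClient)
3. Unit tests for the data-parsing functions in jobs_spider/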

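Coding Standards

Python: lint with ruff (already in the Pipfile), format with black, sort imports with isort.
Frontend: ESLint (@zclzone + @unocss rule sets), formatted with prettier.
Types: the backend enforces pydantic schemas for input validation; the frontend is mainly JS (strict TS is not enabled).
Logging: the backend uses loguru throughout, emitting output with structured fields via logger.info(...).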

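AI Usage Guidelines

When modifying crawler logic, pay close attention to the anti-scraping mechanisms: SmartIPManager and IPAnomalyDetector are implemented in jobs_spider/boss/boos_api.py, with a random delay of at least 10 seconds.
After adding an API route, register it in app/api/v1/__init__.py and run api_controller.refresh_api() to update the permission table.
ClickHouse schema changes are maintained in app/core/clickhouse_init.py and do not go through Aerich migrations.
MySQL model changes go through Aerich (aerich migrate && aerich upgrade).
New frontend pages must have their menus registered both in web/src/views/{module}/route.js and in the backend init_menus().
config.py contains hardcoded real MySQL/ClickHouse connection strings and SMTP credentials; before committing, make sure no sensitive information is leaked.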

Project

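JobData Crawler Interaction Refactor

A refactoring project for the crawler interaction layer of the recruitment data collection platform. It rewrites the crawler code for the three platforms (Boss直聘, 前程无忧, 智联招聘), unifies the base classes and shared core logic, and improves ingestion deduplication, the company-information cleaning flow, and the frontend monitoring pages.

Core Value: Keyword-driven crawlers fetch job data, ingest it reliably into ClickHouse, and rely on scheduled tasks to collect and synchronize company information.

Constraints

Compatibility: existing crawler data collection must not be interrupted during the refactor
Shared core: external scripts and the backend must use the same signing/request/parsing code to avoid duplication
Database: ClickHouse table schemas stay unchanged; MySQL model changes go through Aerich migrations
Anti-scraping: the existing anti-scraping mechanisms (IP rotation, random delays, proxy management) must be preserved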

Technology Stack

Languages

Python 3.13 - Backend API, crawlers, ECS pipeline scripts
JavaScript (ES Modules) - Vue3 frontend (no strict TypeScript; TS installed but JS used in .vue and .js files)
TypeScript 5.1.6 - Type definitions referenced in frontend toolchain only

Runtime

Python 3.13 (required: >=3.13 per Pipfile; Dockerfile uses python:3.11-slim-bullseye for container build)
Node.js 18 (Dockerfile node:18-alpine for frontend build stage)
Python: pipenv (development), pip / requirements.txt (Docker production)
Frontend: pnpm (lockfile: web/pnpm-lock.yaml present)
Lockfiles: Pipfile.lock (Python), web/pnpm-lock.yaml (frontend), uv.lock (uv-compatible)

Frameworks

FastAPI 0.111.0 - REST API framework (app/__init__.py factory via create_app())
Starlette 0.37.2 - ASGI underpinning (middleware, static files)
Uvicorn 0.34.0 - ASGI server (20 workers default, configurable via UVICORN_WORKERS)
Uvloop 0.21.0 - High-performance event loop (non-Windows only)
Tortoise-ORM 0.23.0 - Async ORM for MySQL (app/models/)
Aerich 0.8.1 - Database migrations for Tortoise-ORM (migrations/ directory)
aiomysql - Async MySQL driver (Tortoise backend)
clickhouse-connect 0.7.19 - ClickHouse async client (app/core/clickhouse.py)
asyncpg - Async PostgreSQL driver (listed in Pipfile but not actively used)
Pydantic 2.10.5 - Request/response schemas (app/schemas/)
pydantic-settings 2.7.1 - Settings management via env vars (app/settings/config.py)
orjson 3.10.14 - Fast JSON serialization
ujson 5.10.0 - Alternative JSON library
PyJWT 2.10.1 - JWT token generation/verification (HS256, 7-day expiry)
passlib 1.7.4 - Password hashing
argon2-cffi 23.1.0 - Argon2 password hasher
APScheduler - Async job scheduler (app/core/scheduler.py, 6 registered cron tasks)
httpx 0.28.1 - Async HTTP client (backend service-to-service calls)
requests - Sync HTTP client (spider scripts in jobs_spider/)
loguru 0.7.3 - Structured logging (unified throughout backend and spiders)
Vue 3.3.4 - SPA framework (web/src/main.js)
Vue Router 4.2.4 - Client-side routing (web/src/router/index.js)
Pinia 2.1.6 - State management (web/src/store/)
Naive UI 2.34.4 - Component library (admin UI)
ECharts 6.0.0 - Data visualization charts (web/src/views/analytics/)
axios 1.4.0 - HTTP client (web/src/utils/http/)
vue-i18n 9 - Internationalization
Vite 4.4.6 - Frontend bundler (web/vite.config.js)
UnoCSS 66.5.10 - Atomic CSS engine
unplugin-auto-import 20.3.0 - Auto-imports for Vue APIs
unplugin-vue-components 30.0.0 - Auto-imports for components
@iconify/vue + @iconify/json - Icon library
playwright 1.57.0 - Browser automation (anti-detection in crawlers)
PyExecJS 1.5.1 - Execute JavaScript signing algorithms from Python (jobs_spider/boss/)
PySocks - SOCKS proxy support for spider requests
tenacity - Retry logic in crawlers
pandas + openpyxl - Data processing and Excel export
alibabacloud_ecs20140526 - Alibaba Cloud ECS SDK (ecs_full_pipeline.py)
alibabacloud_credentials - AliCloud credential management
ruff 0.9.1 - Python linter (pyproject.toml config: line-length 120, ignores F403/F405)
black 24.10.0 - Python formatter (line-length 120, target py310/py311)
isort 5.13.2 - Import sorter
ESLint 8.46.0 - Frontend linter (@zclzone + @unocss rule sets)
prettier - Frontend formatter

Key Dependencies

clickhouse-connect==0.7.19 - All analytics and job data storage; loss means no data read/write
tortoise-orm==0.23.0 - All business data (users, roles, keywords, tokens); paired with aerich for migrations
fastapi==0.111.0 - API layer; version-pinned for stability
APScheduler - 6 scheduled tasks including ECS pipeline and IP alerting
alibabacloud_ecs20140526 - ECS node management; required for crawler scaling
uvicorn==0.34.0 + uvloop==0.21.0 - Production ASGI server stack
pydantic==2.10.5 - All input validation; v2 API used throughout
loguru==0.7.3 - Unified logging across all modules
redis - Optional distributed lock backend (app/core/locks.py; falls back to file locks if Redis unavailable)

Configuration

All settings in app/settings/config.py via pydantic-settings.BaseSettings
Environment variables override defaults at startup
No .env file detected (not committed); variables set at OS/container level
Key variables: APP_HOST, APP_PORT, UVICORN_WORKERS, CLICKHOUSE_HOST, CLICKHOUSE_USER, CLICKHOUSE_PASS, SECRET_KEY, SMTP_HOST, SMTP_USER, SMTP_PASS, REDIS_HOST, REPORT_ENDPOINT
Security warning: config.py contains hardcoded default values for MySQL, ClickHouse, and SMTP credentials; must be overridden in production
pyproject.toml - Python project metadata, black/ruff tool config
Pipfile / Pipfile.lock - Development dependency management
requirements.txt - Production pip install (used in Dockerfile)
web/vite.config.js - Vite build config with proxy support via VITE_USE_PROXY env var
Dockerfile - Multi-stage build: Node 18 (frontend) + Python 3.11 (backend) + nginx

Platform Requirements

Python 3.13+ (Pipfile requirement)
Node.js 18+ with pnpm
MySQL server (Tortoise-ORM connection)
ClickHouse server (analytics data)
Optional: Redis (distributed locking upgrade)
Docker-based deployment (see Dockerfile, deploy/entrypoint.sh, deploy/web.conf)
Nginx serves frontend static files and proxies API requests
Alibaba Cloud ECS (cn-shanghai-b zone) for crawler nodes
Ports: 80 (nginx), 9999 (FastAPI backend direct)

Conventions

Naming Patterns

Snake_case for all Python files: company_storage.py, company_cleaner.py, clickhouse_repo.py
Private/internal modules prefixed with underscore: _base.py, _boss_api.py, _boss_client.py, _boss_sign.py, _http_client.py
Platform-named service files: boss.py, qcwy.py, zhilian.py under app/services/crawler/
Router files named after domain: keyword.py, analytics.py, cleaning.py
Classes use PascalCase throughout: CleaningService, KeywordController, ClickHouseBaseRepo, JobAnalyticsRepo
Services: {Domain}Service — BossService, QcwyService, ZhilianService, IngestService, AnalyticsService
Controllers: {Domain}Controller — KeywordController
Repos: {Domain}Repo or {Domain}BaseRepo — ClickHouseBaseRepo, JobAnalyticsRepo
Models (Tortoise ORM): {Platform}{Entity} — BossKeyword, QcwyCompany, ZhilianCompany
Schemas (Pydantic): {Entity}Base, {Entity}Create, {Entity}Update, {Entity}Out — see app/schemas/keyword.py
Snake_case for all functions and methods: get_available, report_page_progress, store_batch, build_insert_row
Private helpers prefixed with underscore: _apply_proxy, _ensure_boss_token_loaded, _pick_first, _nested_get, _clean_text, _model_for_source
Async dependency factories follow pattern get_{service/controller}(): get_ingest_service, get_analytics_service, get_keyword_controller
Variables use snake_case: data_list, platform_type, check_duplicate, page_size
Module-level constants: UPPER_SNAKE_CASE — COMPANY_SOURCES, QUEUE_TERMINAL_STATUSES
Class-level constants: UPPER_SNAKE_CASE prefixed _ — _TOKEN_REFRESH_INTERVAL = 3600
Enums use PascalCase class name, UPPER_SNAKE_CASE values: PlatformType.BOSS, ChannelType.MINI, DataType.JOB
Enum values are lowercase strings matching URL slugs: "boss", "mini", "job" — see app/schemas/ingest.py
Enums inherit from (str, Enum) enabling direct string comparison
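
The enum convention above can be illustrated with a short sketch. Only `PlatformType.BOSS` and its `"boss"` value come from this document; the other two members are assumed here by analogy with the platform slugs used elsewhere.

```python
import json
from enum import Enum


class PlatformType(str, Enum):
    BOSS = "boss"
    QCWY = "qcwy"        # assumed member, mirrors the platform slug
    ZHILIAN = "zhilian"  # assumed member, mirrors the platform slug


# Because the enum also subclasses str, members compare equal to plain strings,
# can be looked up from a URL slug, and serialize to JSON without a custom encoder.
matches_slug = PlatformType.BOSS == "boss"
from_slug = PlatformType("qcwy")
as_json = json.dumps({"platform": PlatformType.BOSS})
```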

Code Style

Tool: black v24.10.0
Line length: 120 characters (set in pyproject.toml [tool.black] and [tool.ruff])
Target Python versions: 3.10, 3.11 (black), 3.13 (Pipfile)
Tool: ruff v0.9.1 (configured in pyproject.toml)
Ignored rules: F403 (star imports), F405 (may be undefined from star import)
Star imports from internal modules are allowed (used in app/models/__init__.py, app/services/ingest/__init__.py)
Tool: isort v5.13.2
No explicit isort config found; follows default ordering

Import Organization

No import aliases; all imports use full paths with the app. prefix
from app.log import logger is the canonical loguru import path
Star imports are used only in __init__.py re-export files: from .admin import * in app/models/__init__.py
# noqa: F401, F403 comments suppress lint warnings for intentional star imports

Error Handling

Services return Dict[str, Any] result objects with "success", "code", "message" fields instead of raising exceptions to callers
Controllers return dict with "code": 200/400/404 and "message" for all outcomes
API route handlers do NOT use try/except — they rely on services returning structured results
Service methods wrap low-level calls in try/except Exception as e and log then return False or error dict
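
A minimal sketch of the result-object convention described above, under stated assumptions: `KeywordService`, the `result` helper, and the in-memory `store` are hypothetical, and stdlib `logging` stands in for loguru so the example stays self-contained.

```python
import logging
from typing import Any, Dict

logger = logging.getLogger("keyword_service")  # the project uses loguru instead


def result(success: bool, code: int, message: str, data: Any = None) -> Dict[str, Any]:
    """Build the structured result a service hands back to its caller."""
    return {"success": success, "code": code, "message": message, "data": data}


class KeywordService:
    """Hypothetical service; `store` stands in for a repository or ORM call."""

    def __init__(self, store: Dict[int, str]) -> None:
        self._store = store

    def get_keyword(self, keyword_id: int) -> Dict[str, Any]:
        try:
            value = self._store[keyword_id]
        except KeyError:
            logger.warning(f"keyword {keyword_id} not found")
            return result(False, 404, "keyword not found")
        except Exception as e:  # log, then return an error dict instead of raising
            logger.error(f"keyword lookup failed: {e}")
            return result(False, 400, "keyword lookup failed")
        return result(True, 200, "ok", data=value)


service = KeywordService({1: "python"})
found = service.get_keyword(1)
missing = service.get_keyword(2)
```

With this shape, a route handler can return the dict as-is, which is why the handlers themselves need no try/except.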

Logging

logger.info(f"...") for normal operation events
logger.warning(f"...") for non-fatal recoverable issues (e.g., token not found, API soft failures)
logger.error(f"...") for caught exceptions and operation failures
F-string interpolation used consistently for message formatting
No structured fields (no logger.bind() usage observed)

API Response Format

Comments

Docstrings on public methods describing purpose, not implementation: """获取可用关键词，优先返回断点续爬和失败重试的关键词"""
Inline comments for priority logic and algorithm steps: # 优先级 1: 断点续爬 (partial)
Module-level docstrings for context: """Boss直聘 Service — 基于新算法文件的封装"""
# noqa comments for intentional lint suppressions
Not applicable (Python backend)
Docstrings are brief single-line or short multi-line Chinese descriptions

Function Design

Keyword arguments with defaults preferred for optional params
Pydantic schemas used for HTTP request bodies (never raw dicts from router params)
Optional[str] with = None default for optional parameters
Services return Dict[str, Any] with consistent keys (code, message, data)
Private helpers return Optional[T] or primitive types
Async functions return awaitable results (no mixing of sync/async)

Module Design

app/models/__init__.py uses from .{module} import * to flatten model imports
Router modules export a single named router variable: router = APIRouter(...) or {domain}_router = APIRouter(...)
Service classes are imported directly by name
app/models/__init__.py — re-exports all model classes
app/services/ingest/__init__.py — re-exports IngestService and config registrations
app/api/v1/__init__.py — aggregates all routers into v1_router
FastAPI Depends() used for service/controller instantiation in route handlers
Dependency factory functions named get_{service}() and defined in the same file as the router
Shared auth dependencies: DependAuth, DependPermission in app/core/dependency.py

Tortoise ORM Model Conventions

Pydantic Schema Conventions

All schemas inherit from pydantic.BaseModel
All fields use Field(...) with description= for documentation
Enums inherit from (str, Enum) for JSON serialization compatibility
Output schemas include class Config: from_attributes = True to support ORM mode
Validation patterns use Field(..., pattern="^(boss|qcwy|zhilian)$") for enum-like string fields

Architecture

Pattern Overview

FastAPI async backend with dual-database persistence (MySQL for RBAC/metadata, ClickHouse for job/company analytics data)
External crawlers (jobs_spider/, spiderJobs/) run as independent processes and push data to the backend API via HTTP POST — no shared code path at runtime
RBAC permission model enforced at router-level via FastAPI Depends
APScheduler drives all recurring automation (cleaning, ECS orchestration, alerting) within the main process
Registry pattern (app/services/ingest/registry.py) maps (platform, channel, data_type) tuples to ClickHouse table configs + dedup rules
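
The registry pattern in the last item could look roughly like the sketch below. `PlatformConfig` and `register()` are named in this document, but the fields, the `resolve` and `dedup_batch` helpers, the table name, and the `job_id` dedup field are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Key = Tuple[str, str, str]  # (platform, channel, data_type)


@dataclass(frozen=True)
class PlatformConfig:
    table: str                        # target ClickHouse table (hypothetical name below)
    dedup_key: Callable[[dict], str]  # extracts the value used for deduplication


_REGISTRY: Dict[Key, PlatformConfig] = {}


def register(platform: str, channel: str, data_type: str, config: PlatformConfig) -> None:
    key = (platform, channel, data_type)
    if key in _REGISTRY:
        raise ValueError(f"duplicate registration for {key}")
    _REGISTRY[key] = config


def resolve(platform: str, channel: str, data_type: str) -> PlatformConfig:
    return _REGISTRY[(platform, channel, data_type)]


def dedup_batch(rows: List[dict], config: PlatformConfig) -> List[dict]:
    """Keep the first row for each dedup key within a batch."""
    seen = set()
    unique = []
    for row in rows:
        key = config.dedup_key(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Registration happens once, at import time in the real layout.
register("boss", "mini", "job",
         PlatformConfig(table="boss_mini_job", dedup_key=lambda r: str(r["job_id"])))

cfg = resolve("boss", "mini", "job")
unique_rows = dedup_batch([{"job_id": 1}, {"job_id": 2}, {"job_id": 1}], cfg)
```

Keeping the routing in data rather than in branching code means a new platform or channel is one more `register()` call.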

Layers

Routers (API layer):
Purpose: Parse HTTP request, validate via Pydantic schema, delegate to service or controller, return JSON envelope
Location: app/api/v1/
Contains: One sub-package per domain (e.g., job/, cleaning/, analytics/, keyword/)
Depends on: Services, Controllers, app/core/dependency.py for auth
Used by: FastAPI application via app/api/v1/__init__.py → app/__init__.py

Services:
Purpose: Business logic, orchestration across repos/models, side-effect coordination
Location: app/services/
Contains: IngestService, AnalyticsService, CleaningService, CompanyCleaner, CompanyJobsSyncService, platform crawler service wrappers (crawler/boss.py, crawler/qcwy.py, crawler/zhilian.py)
Depends on: Repositories, ORM models, clickhouse_manager, external HTTP clients
Used by: Routers, APScheduler jobs

Controllers:
Purpose: Thin CRUD orchestration for MySQL-backed entities (User, Role, API, Menu, Dept, etc.)
Location: app/controllers/
Contains: user_controller, api_controller, role_controller, menu_controller, dept_controller, keyword_controller, proxy_controller, token_controller, cleaning_controller, company_controller
Depends on: Tortoise-ORM models
Used by: Routers

Repositories:
Purpose: Encapsulate raw ClickHouse query execution behind typed methods
Location: app/repositories/
Contains: ClickHouseBaseRepo (generic query/insert), JobAnalyticsRepo (trend, distribution, count queries against job_analytics view)
Depends on: clickhouse_connect.driver.AsyncClient
Used by: AnalyticsService

Models:
Purpose: ORM schema definitions for MySQL, migrated via Aerich
Location: app/models/
Contains: admin.py (User, Role, Api, Menu, Dept, AuditLog), token.py (BossToken), cleaning.py (CleaningTask), metrics.py (ScheduledTaskRun, StatsTotal, IpUploadStats), company.py (CompanyCleaningQueue), keyword.py (BossKeyword, QcwyKeyword, ZhilianKeyword)
Depends on: tortoise-orm
Used by: Controllers, Services, Scheduler jobs

Core:
Purpose: App wiring, connection management, middleware, cross-cutting utilities
Location: app/core/
Contains: init_app.py (lifespan bootstrap), scheduler.py (APScheduler jobs), clickhouse.py (ClickHouseManager singleton), clickhouse_init.py (ClickHouse DDL), dependency.py (AuthControl, PermissionControl), locks.py (DistributedLock), middlewares.py (BackGroundTaskMiddleware, HttpAuditLogMiddleware, IpTrackingMiddleware), exceptions.py, crud.py, ip_tracking.py
Depends on: Everything below
Used by: app/__init__.py, all layers above

Ingest pipeline:
Purpose: Platform-agnostic data ingestion pipeline with registry-driven routing, dedup, and remote push
Location: app/services/ingest/
Contains: service.py (IngestService), registry.py (PlatformConfig registry), dedup.py (batch dedup filter), remote_push.py (async HTTP push to secondary target), configs/ (per-platform PlatformConfig registrations)
Depends on: clickhouse_connect, registry configs
Used by: app/api/v1/job/job.py router

Crawler service wrappers:
Purpose: Wrap platform HTTP clients so backend scheduler can call crawlers directly (company cleaning, company search)
Location: app/services/crawler/
Contains: _base.py (BaseFetcher, BaseSearcher, ApiResult), _http_client.py, _boss_client.py, _boss_api.py, _boss_sign.py, _zhilian_client.py, _zhilian_api.py, _zhilian_sign.py, _job51_client.py, _job51_api.py, _job51_sign.py, boss.py, qcwy.py, zhilian.py
Depends on: requests (sync), platform sign/auth modules
Used by: CompanyCleaner, CompanyController

External spiders:
Purpose: Standalone crawlers running on ECS nodes; push data to backend via POST /api/v1/universal/data/batch-store-async
Location: jobs_spider/boss/boos_api.py, jobs_spider/qcwy/qcwy.py, jobs_spider/zhilian/zhilian_single.py, spiderJobs/
Depends on: requests, loguru, PyExecJS; no shared Python imports with app/
Used by: Scheduled via ecs_full_pipeline.py or manual launch

Data Flow

MySQL (via Tortoise-ORM): User sessions, RBAC entities, keyword crawl state, tokens, task run records
ClickHouse: All raw job/company JSON data + analytics views; no ORM — raw SQL via clickhouse_connect
Context variable CTX_USER_ID (Python contextvars.ContextVar) carries authenticated user ID within request scope (app/core/ctx.py)
File-based startup lock (.startup_lock/ dir) prevents multi-worker seed data collisions
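
To show why a context variable suits the CTX_USER_ID role described above, here is a small stdlib sketch. The variable name matches this document, but the default of 0 and the `audit_action` and `handle_request` functions are hypothetical.

```python
import asyncio
from contextvars import ContextVar

# Default of 0 (no authenticated user) is an assumption for this sketch.
CTX_USER_ID: ContextVar[int] = ContextVar("user_id", default=0)


async def audit_action(action: str) -> str:
    # Deep in the call stack, the user ID is read without being passed as an argument.
    return f"user={CTX_USER_ID.get()} action={action}"


async def handle_request(user_id: int, action: str) -> str:
    token = CTX_USER_ID.set(user_id)
    try:
        await asyncio.sleep(0)  # yield so the two requests interleave
        return await audit_action(action)
    finally:
        CTX_USER_ID.reset(token)


async def main() -> list:
    # gather() runs each coroutine as a task with its own copy of the context,
    # so concurrent requests never see each other's user ID.
    return await asyncio.gather(
        handle_request(1, "list"),
        handle_request(2, "delete"),
    )


results = asyncio.run(main())
```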

Key Abstractions

PlatformConfig registry:
Purpose: Maps (platform, channel, data_type) to target ClickHouse table and dedup field extractors
Examples: app/services/ingest/registry.py, app/services/ingest/configs/
Pattern: Immutable @dataclass(frozen=True), registered at module import time via register()

DistributedLock:
Purpose: Prevents concurrent execution of scheduled jobs across multiple Uvicorn workers
Examples: app/core/locks.py
Pattern: Redis SET NX with TTL; falls back to filesystem directory creation with TTL metadata

ClickHouseManager:
Purpose: Singleton lazy-init connection wrapper around clickhouse_connect.AsyncClient
Examples: app/core/clickhouse.py
Pattern: Global module-level instance clickhouse_manager, injected into services at call time via await clickhouse_manager.get_client()

BaseFetcher / BaseSearcher:
Purpose: Abstract base for platform-specific API clients within the in-process crawler wrappers
Examples: app/services/crawler/_base.py
Pattern: Template method; subclasses implement _build_params(), base handles HTTP invocation and ApiResult parsing

Base ORM model:
Purpose: Common ORM base with BigIntField pk, to_dict() async serializer with M2M support
Examples: app/models/base.py
Pattern: Mixed-in with TimestampMixin (created_at, updated_at auto fields)
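
The template-method arrangement for the crawler base class can be sketched as follows. `BaseFetcher`, `ApiResult`, and `_build_params()` are named in this document; the `ApiResult` fields, the injected `transport` callable, the `code`/`message`/`data` payload shape, and the `CompanyFetcher` subclass are assumptions made so the example runs without network access.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional


@dataclass
class ApiResult:
    success: bool
    data: Optional[Any] = None
    error: Optional[str] = None


Transport = Callable[[str, Dict[str, Any]], Dict[str, Any]]


class BaseFetcher(ABC):
    url = ""

    def __init__(self, transport: Transport) -> None:
        self._transport = transport  # injected so tests can avoid real HTTP

    @abstractmethod
    def _build_params(self, **kwargs: Any) -> Dict[str, Any]:
        """Subclasses decide which request parameters to send."""

    def fetch(self, **kwargs: Any) -> ApiResult:
        # The base class owns the invariant steps: call, error handling, parsing.
        params = self._build_params(**kwargs)
        try:
            payload = self._transport(self.url, params)
        except Exception as exc:
            return ApiResult(False, error=str(exc))
        if payload.get("code") != 0:
            return ApiResult(False, error=str(payload.get("message")))
        return ApiResult(True, data=payload.get("data"))


class CompanyFetcher(BaseFetcher):  # hypothetical subclass
    url = "https://example.invalid/company"

    def _build_params(self, **kwargs: Any) -> Dict[str, Any]:
        return {"id": kwargs["company_id"]}


def fake_transport(url: str, params: Dict[str, Any]) -> Dict[str, Any]:
    return {"code": 0, "data": {"id": params["id"]}}


def failing_transport(url: str, params: Dict[str, Any]) -> Dict[str, Any]:
    raise TimeoutError("request timed out")


ok_result = CompanyFetcher(fake_transport).fetch(company_id="c1")
bad_result = CompanyFetcher(failing_transport).fetch(company_id="c1")
```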

Entry Points

Server launcher:
Location: run.py
Triggers: python run.py, which reads APP_HOST, APP_PORT, UVICORN_WORKERS env vars and calls uvicorn.run("app:app")
Responsibilities: Configure uvicorn logging format, start ASGI server

Application factory:
Location: app/__init__.py → create_app()
Triggers: Uvicorn import of app:app
Responsibilities: Register middlewares, exception handlers, routers, static files; define lifespan

Lifespan startup:
Location: app/__init__.py (lifespan context manager) + app/core/init_app.py (init_data())
Triggers: FastAPI startup event
Responsibilities: Tortoise-ORM init + schema generation → Aerich migrations → seed data (superuser, menus, APIs, roles) → ClickHouse table init → APScheduler start

API v1 router:
Location: app/api/v1/__init__.py
Triggers: Imported by app/api/__init__.py → app/__init__.py
Responsibilities: Mount all domain routers under /api/v1/ with appropriate DependPermission guards

ECS pipeline:
Location: ecs_full_pipeline.py
Triggers: Manual execution or scheduled via scheduler.py (ecs_full_pipeline_job) every 6 hours
Responsibilities: Create Alibaba Cloud ECS spot instances, push spider scripts, execute crawlers, terminate instances

Error Handling

Tortoise DoesNotExist → 404 via DoesNotExistHandle (registered in app/core/init_app.py)
Pydantic RequestValidationError → 422 with field-level detail via RequestValidationHandle
IntegrityError (DB constraint) → 400 via IntegrityHandle
Custom HTTPException (from app/core/exceptions.py) → structured JSON error via HttpExcHandle
Service layer exceptions logged with loguru then re-raised or returned as {"success": False, "error": "..."}
Scheduled jobs catch all exceptions, record failure via _record_task_run() to ScheduledTaskRun table, then continue
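
The last item, where a scheduled job swallows its own failure and records it, might be sketched like this. The real jobs are async and write to the ScheduledTaskRun table through `_record_task_run()`; here a synchronous decorator and an in-memory list stand in for both, and `guarded_job` is a hypothetical name.

```python
import functools
import logging
from typing import Any, Callable, Dict, List, Optional

logger = logging.getLogger("scheduler")  # the project uses loguru instead

# In-memory stand-in for the ScheduledTaskRun table.
TASK_RUNS: List[Dict[str, Optional[str]]] = []


def record_task_run(task_id: str, status: str, error: Optional[str] = None) -> None:
    TASK_RUNS.append({"task_id": task_id, "status": status, "error": error})


def guarded_job(task_id: str) -> Callable:
    """Wrap a job so an exception is logged and recorded instead of propagating."""
    def decorator(func: Callable[..., Any]) -> Callable[..., None]:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> None:
            try:
                func(*args, **kwargs)
            except Exception as exc:
                logger.error(f"{task_id} failed: {exc}")
                record_task_run(task_id, "failed", str(exc))
            else:
                record_task_run(task_id, "success")
        return wrapper
    return decorator


@guarded_job("stats_job")
def stats_job() -> None:
    raise RuntimeError("clickhouse unreachable")  # simulated failure


@guarded_job("daily_cleanup_job")
def daily_cleanup_job() -> None:
    pass


stats_job()          # fails, but the scheduler keeps running
daily_cleanup_job()  # still executes afterwards
```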

Cross-Cutting Concerns

GSD Workflow Enforcement

Before using Edit, Write, or other file-changing tools, start work through a GSD command so planning artifacts and execution context stay in sync.

Use these entry points:

/gsd:quick for small fixes, doc updates, and ad-hoc tasks
/gsd:debug for investigation and bug fixing
/gsd:execute-phase for planned phase work

Do not make direct repo edits outside a GSD workflow unless the user explicitly asks to bypass it.

Developer Profile

Profile not yet configured. Run /gsd:profile-user to generate your developer profile. This section is managed by generate-claude-profile -- do not edit manually.
