Codebase Structure

Analysis Date: 2026-03-21

Directory Layout

JobData/                          # Project root
├── app/                          # FastAPI backend (Python)
│   ├── __init__.py               # App factory create_app()
│   ├── api/
│   │   └── v1/                   # All REST API routes (/api/v1/...)
│   │       ├── __init__.py       # Router aggregation + permission guards
│   │       ├── analytics.py      # Data analytics endpoints
│   │       ├── stats.py          # Table count stats endpoints
│   │       ├── pipeline.py       # ECS pipeline trigger endpoint
│   │       ├── base/             # Login, user-info, menu-tree (no auth)
│   │       ├── job/              # Data ingest endpoints (batch/single/async)
│   │       ├── cleaning/         # Targeted cleaning endpoints
│   │       ├── company/          # Company search endpoints
│   │       ├── keyword/          # Keyword CRUD endpoints
│   │       ├── token/            # Boss token management endpoints
│   │       ├── proxy/            # Proxy IP management endpoints
│   │       ├── users/            # User CRUD
│   │       ├── roles/            # Role + permission assignment
│   │       ├── menus/            # Menu tree CRUD
│   │       ├── apis/             # API registration management
│   │       ├── depts/            # Department tree CRUD
│   │       └── auditlog/         # Audit log query
│   ├── controllers/              # CRUD orchestration for MySQL entities
│   │   ├── api.py                # ApiController (refresh_api(), scan routes)
│   │   ├── user.py               # UserController
│   │   ├── role.py               # RoleController
│   │   ├── menu.py               # MenuController
│   │   ├── dept.py               # DeptController
│   │   ├── keyword.py            # KeywordController
│   │   ├── proxy.py              # ProxyController
│   │   ├── token.py              # TokenController
│   │   ├── cleaning.py           # CleaningController
│   │   └── company.py            # CompanyController (search via crawler services)
│   ├── core/                     # Framework infrastructure
│   │   ├── init_app.py           # lifespan: migrations, seed, ClickHouse init
│   │   ├── scheduler.py          # APScheduler job definitions + registration
│   │   ├── clickhouse.py         # ClickHouseManager singleton
│   │   ├── clickhouse_init.py    # ClickHouse DDL (tables + analytics view)
│   │   ├── dependency.py         # AuthControl, PermissionControl, DependPermission
│   │   ├── locks.py              # DistributedLock (Redis + file fallback)
│   │   ├── middlewares.py        # BackGroundTaskMiddleware, HttpAuditLogMiddleware
│   │   ├── ip_tracking.py        # IpTrackingMiddleware
│   │   ├── exceptions.py         # Custom exception classes + handlers
│   │   ├── crud.py               # Generic ListParams, pagination helpers
│   │   ├── ctx.py                # CTX_USER_ID ContextVar
│   │   ├── bgtask.py             # BgTasks helper for BackGroundTaskMiddleware
│   │   ├── proxy_rule.py         # Proxy selection rules
│   │   └── algorithms/
│   │       └── antispider.py     # Anti-spider signature helpers
│   ├── models/                   # Tortoise-ORM models (MySQL)
│   │   ├── base.py               # BaseModel, TimestampMixin, UUIDModel
│   │   ├── admin.py              # User, Role, Api, Menu, Dept, AuditLog
│   │   ├── token.py              # BossToken
│   │   ├── cleaning.py           # CleaningTask / CleaningRecord
│   │   ├── metrics.py            # ScheduledTaskRun, StatsTotal, IpUploadStats
│   │   ├── company.py            # CompanyCleaningQueue
│   │   ├── keyword.py            # BossKeyword, QcwyKeyword, ZhilianKeyword
│   │   └── enums.py              # MethodType enum
│   ├── repositories/             # ClickHouse query layer
│   │   └── clickhouse_repo.py    # ClickHouseBaseRepo, JobAnalyticsRepo
│   ├── schemas/                  # Pydantic request/response schemas
│   │   ├── ingest.py             # IngestBatchRequest, PlatformType, DataType enums
│   │   ├── analytics.py          # Analytics query params + response schemas
│   │   ├── keyword.py            # Keyword schemas
│   │   ├── token.py              # Token schemas
│   │   ├── base.py               # Common response envelopes
│   │   ├── login.py              # Login request/response
│   │   ├── users.py              # User schemas
│   │   ├── roles.py              # Role schemas
│   │   ├── menus.py              # Menu schemas (MenuType enum)
│   │   ├── apis.py               # API schemas
│   │   ├── depts.py              # Dept schemas
│   │   ├── cleaning.py           # Cleaning schemas
│   │   └── proxy.py              # Proxy schemas
│   ├── services/                 # Business logic
│   │   ├── analytics_service.py  # AnalyticsService → JobAnalyticsRepo
│   │   ├── cleaning.py           # CleaningService (targeted cleaning)
│   │   ├── company_cleaner.py    # CompanyCleaner (automated company pipeline)
│   │   ├── company_jobs_sync.py  # CompanyJobsSyncService
│   │   ├── company_storage.py    # company_storage (ClickHouse company insert)
│   │   ├── ingest/               # Ingest pipeline subsystem
│   │   │   ├── service.py        # IngestService.store_batch/store_single/query_data
│   │   │   ├── registry.py       # PlatformConfig registry + get_config()
│   │   │   ├── dedup.py          # batch_dedup_filter() against ClickHouse
│   │   │   ├── remote_push.py    # Async HTTP push to secondary target
│   │   │   └── configs/          # Per-platform PlatformConfig registrations
│   │   └── crawler/              # In-process platform HTTP clients
│   │       ├── _base.py          # BaseFetcher, BaseSearcher, ApiResult
│   │       ├── _http_client.py   # HTTPClient wrapper (sync requests)
│   │       ├── _boss_client.py   # Boss HTTP client
│   │       ├── _boss_api.py      # Boss API methods
│   │       ├── _boss_sign.py     # Boss request signing
│   │       ├── _zhilian_client.py
│   │       ├── _zhilian_api.py
│   │       ├── _zhilian_sign.py
│   │       ├── _job51_client.py
│   │       ├── _job51_api.py
│   │       ├── _job51_sign.py
│   │       ├── boss.py           # BossService (high-level facade)
│   │       ├── qcwy.py           # QcwyService (high-level facade)
│   │       └── zhilian.py        # ZhilianService (high-level facade)
│   ├── settings/
│   │   └── config.py             # Settings (pydantic-settings BaseSettings)
│   ├── log/
│   │   └── __init__.py           # loguru logger export
│   └── utils/                    # Shared utilities
├── web/                          # Vue 3 frontend
│   ├── src/
│   │   ├── main.js               # App entry, Pinia + Router setup
│   │   ├── App.vue               # Root component
│   │   ├── api/                  # Axios API call modules
│   │   │   ├── index.js          # System management APIs
│   │   │   ├── analytics.js      # Analytics APIs
│   │   │   ├── keyword.js        # Keyword APIs
│   │   │   ├── proxy.js          # Proxy APIs
│   │   │   └── token.js          # Token APIs
│   │   ├── components/           # Shared components (CrudTable, CrudModal, QueryBar)
│   │   ├── layout/               # App shell (sidebar, topbar, tabs)
│   │   ├── router/               # Vue Router config + dynamic route loading
│   │   │   ├── index.js          # createRouter(), addDynamicRoutes()
│   │   │   ├── guard/            # Navigation guards (auth, loading, title)
│   │   │   └── routes/           # Static route definitions
│   │   ├── store/                # Pinia stores
│   │   │   └── modules/          # user, permission, app, tags
│   │   ├── utils/
│   │   │   ├── auth/             # JWT token read/write
│   │   │   ├── http/             # Axios instance + interceptors
│   │   │   └── storage/          # localStorage wrapper
│   │   └── views/                # Page-level Vue components
│   │       ├── login/            # Login page
│   │       ├── analytics/        # ECharts dashboard (trend + distribution)
│   │       ├── recruitment/      # Job data browser (boss/, qcwy/, zhilian/)
│   │       ├── cleaning/         # Cleaning operation + monitor pages
│   │       ├── keyword/          # Keyword management
│   │       ├── profile/          # User profile
│   │       └── system/           # RBAC admin (user, role, menu, api, dept, auditlog, proxy, token)
│   ├── vite.config.js            # Vite build config (plugins, proxy)
│   └── package.json              # Frontend dependencies
├── jobs_spider/                  # Standalone crawlers (external process)
│   ├── boss/
│   │   ├── boos_api.py           # BossZhipinAPI core + SmartIPManager + main (~2246 lines)
│   │   ├── company_spider.py     # Boss company batch spider
│   │   └── enumerate_combos.py   # City × keyword combo enumeration
│   ├── qcwy/
│   │   ├── qcwy.py               # QCWY API wrapper
│   │   ├── search_company_jobs.py
│   │   └── run_company_search.py
│   └── zhilian/
│       ├── zhilian_single.py     # Zhilian single-job spider
│       └── company_spider.py     # Zhilian company spider
├── spiderJobs/                   # Second-generation spider framework (structured)
│   ├── core/
│   │   ├── base.py               # Shared base classes
│   │   └── http_client.py        # HTTP client abstraction
│   ├── platforms/                # Platform-specific implementations
│   │   ├── boss/
│   │   ├── job51/
│   │   └── zhilian/
│   └── runner/                   # Execution runners
│       ├── api_client.py         # Backend API push client
│       ├── loop.py               # Job crawl loop runner
│       └── company_loop.py       # Company crawl loop runner
├── migrations/
│   └── models/                   # Aerich migration files (auto-generated)
├── static/                       # Served at /static/ (built frontend or assets)
├── deploy/                       # Deployment configs and sample images
├── docs/                         # Project documentation
├── ecs_full_pipeline.py          # Alibaba Cloud ECS batch provisioning script
├── create_ecs_instances.py       # ECS instance creation utilities
├── reclean_qcwy_jobs.py          # Standalone QCWY re-cleaning script
├── run.py                        # Uvicorn launch entry point
├── Pipfile / Pipfile.lock        # Python dependency management (pipenv)
├── pyproject.toml                # ruff + black + isort tool config
└── Dockerfile                    # Container build spec

Directory Purposes

app/api/v1/:

  • Purpose: One sub-package per API domain; each contains __init__.py exporting its APIRouter
  • Contains: Route handler functions, dependency injection wiring, Pydantic schema imports
  • Key files: app/api/v1/__init__.py (aggregation), app/api/v1/job/job.py (ingest routes), app/api/v1/analytics.py

app/controllers/:

  • Purpose: Business-operation classes for MySQL-backed entities; encapsulate Tortoise-ORM queries
  • Contains: One controller per domain entity; typically expose create, update, delete, list, get methods
  • Key files: app/controllers/api.py (scan and register FastAPI routes → DB)

app/services/ingest/:

  • Purpose: Self-contained data pipeline: registry lookup → dedup check → ClickHouse insert → remote push
  • Contains: service.py (main orchestrator), registry.py (config store), dedup.py, remote_push.py, configs/ (registrations per platform)
  • Key files: app/services/ingest/service.py, app/services/ingest/registry.py
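
The four pipeline stages above can be sketched with in-memory stand-ins. This is only an illustration of the flow, not the actual IngestService: the `DEDUP_FIELDS` mapping, `FakeStore`, the table name, and the return shape are all assumptions made for the example.

```python
from dataclasses import dataclass, field

# Hypothetical registry: (platform, data_type) -> (table name, dedup fields).
DEDUP_FIELDS = {("boss", "job"): ("boss_job", ("job_id",))}


@dataclass
class FakeStore:
    """In-memory stand-in for ClickHouse and the remote push target."""
    rows: dict = field(default_factory=dict)
    pushed: list = field(default_factory=list)


def store_batch(store, platform, data_type, records):
    # 1. Registry lookup: find the table and dedup fields for this platform.
    table, dedup_fields = DEDUP_FIELDS[(platform, data_type)]
    table_rows = store.rows.setdefault(table, [])

    # 2. Dedup check: drop records whose key is already stored or
    #    repeated earlier in the same batch.
    seen = {tuple(row[f] for f in dedup_fields) for row in table_rows}
    fresh = []
    for record in records:
        key = tuple(record[f] for f in dedup_fields)
        if key not in seen:
            seen.add(key)
            fresh.append(record)

    # 3. Insert (a list append stands in for the ClickHouse insert).
    table_rows.extend(fresh)

    # 4. Remote push (recorded here instead of sent over HTTP).
    if fresh:
        store.pushed.append((table, len(fresh)))

    return {"received": len(records), "inserted": len(fresh)}
```

Calling `store_batch` twice with overlapping records inserts each dedup key only once, which is the property the dedup stage exists to guarantee.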

app/services/crawler/:

  • Purpose: In-process HTTP clients for platform APIs, used by the backend scheduler (company cleaning) and the company search controller
  • Contains: Underscore-prefixed private modules (_boss_client.py, _boss_sign.py, etc.) and public service facades (boss.py, qcwy.py, zhilian.py)
  • Key files: app/services/crawler/_base.py (shared base classes)

app/core/:

  • Purpose: Framework plumbing — startup, middleware chain, auth, distributed locking, ClickHouse connection
  • Key files: app/core/dependency.py (auth guards), app/core/scheduler.py (all cron jobs), app/core/locks.py

app/models/:

  • Purpose: All Tortoise-ORM model definitions for MySQL, migrated via Aerich; ClickHouse tables are not defined here (see app/core/clickhouse_init.py)
  • Key files: app/models/admin.py (RBAC), app/models/keyword.py (crawl state tracking)

app/repositories/:

  • Purpose: Typed ClickHouse query encapsulation following the Repository pattern
  • Key files: app/repositories/clickhouse_repo.py

jobs_spider/:

  • Purpose: Standalone scripts that run on ECS nodes; communicate with backend only via HTTP
  • Key files: jobs_spider/boss/boos_api.py (2246-line core with SmartIPManager, BossZhipinAPI)

spiderJobs/:

  • Purpose: Newer modular spider framework; parallel architecture to jobs_spider/ with cleaner separation
  • Key files: spiderJobs/runner/api_client.py, spiderJobs/runner/loop.py

web/src/views/:

  • Purpose: One directory per route group; each view owns its local route definition in route.js
  • Key files: web/src/views/analytics/index.vue (ECharts dashboard), web/src/views/cleaning/monitor.vue

Key File Locations

Entry Points:

  • run.py: Uvicorn startup; reads env vars APP_HOST, APP_PORT, UVICORN_WORKERS
  • app/__init__.py: FastAPI app factory create_app() and lifespan context manager
  • web/src/main.js: Vue 3 app entry; mounts Pinia, Router, directives
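
As a rough sketch of how run.py might turn those environment variables into Uvicorn arguments: the helper name `uvicorn_kwargs` and all three default values are assumptions for illustration and are not taken from run.py.

```python
import os


def uvicorn_kwargs(env=None):
    """Build keyword arguments for uvicorn.run() from environment variables."""
    env = os.environ if env is None else env
    return {
        # The defaults below are assumptions, not values read from run.py.
        "host": env.get("APP_HOST", "0.0.0.0"),
        "port": int(env.get("APP_PORT", "8000")),
        "workers": int(env.get("UVICORN_WORKERS", "1")),
    }
```

The resulting dict could then be unpacked into `uvicorn.run(...)` alongside the application import string that run.py actually uses.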

Configuration:

  • app/settings/config.py: All backend config (Settings extends pydantic_settings.BaseSettings); override via env vars
  • web/vite.config.js: Vite build and dev server config
  • pyproject.toml: ruff, black, isort tool configuration

Core Logic:

  • app/core/init_app.py: Startup sequence (migrations, seed data, ClickHouse init)
  • app/core/scheduler.py: All APScheduler job functions and start_scheduler()
  • app/core/clickhouse_init.py: All ClickHouse CREATE TABLE IF NOT EXISTS DDL statements
  • app/services/ingest/registry.py: PlatformConfig dataclass + global registry dict
  • app/services/ingest/service.py: IngestService.store_batch() — main data ingestion path
  • app/services/company_cleaner.py: CompanyCleaner — automated company data pipeline

ClickHouse Schema:

  • app/core/clickhouse_init.py: Defines the DDL for all 6 MergeTree tables and 1 analytics view; applied via ClickHouseInitializer.initialize_all_tables()

MySQL Migrations:

  • migrations/models/: Auto-generated by aerich migrate; applied via aerich upgrade or automatically on startup

Testing:

  • tests/test_company_jobs_sync.py
  • tests/test_company_storage.py

Naming Conventions

Files (Python):

  • Modules: snake_case.py (e.g., company_cleaner.py, clickhouse_init.py)
  • Private crawler modules: leading underscore prefix (e.g., _boss_client.py, _zhilian_sign.py)
  • Spider scripts in jobs_spider/: descriptive names without prefix (e.g., boos_api.py, zhilian_single.py)

Files (Vue/JS):

  • Pages: index.vue for primary view, monitor.vue / descriptive name for secondary views
  • API modules: snake_case.js matching domain (e.g., analytics.js, keyword.js)
  • Router guard files: Grouped in router/guard/ directory

Classes (Python):

  • Services: PascalCaseService (e.g., IngestService, AnalyticsService, CompanyCleaner)
  • Controllers: PascalCaseController with module-level singleton instance in snake_case (e.g., user_controller = UserController())
  • Repos: PascalCaseRepo (e.g., ClickHouseBaseRepo, JobAnalyticsRepo)
  • Models: PascalCase matching table concept (e.g., BossToken, CompanyCleaningQueue)

Directories:

  • Domain groupings: lowercase (e.g., cleaning/, recruitment/, keyword/)
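  • Platform names are lowercase across all layers: boss, qcwy, zhilian. The 51job platform appears as qcwy in most layers and as job51 in the crawler client modules (_job51_client.py etc.) and in spiderJobs/platforms/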

Where to Add New Code

New API Endpoint:

  1. Create or update router file in app/api/v1/{domain}/
  2. Register the router in app/api/v1/__init__.py with v1_router.include_router(...)
  3. Run api_controller.refresh_api() on next app startup to sync new routes to the api table
  4. Add corresponding Pydantic schemas in app/schemas/{domain}.py

New Platform Data Type (Ingest):

  1. Define PlatformConfig with dedup fields in app/services/ingest/configs/
  2. Call register(config) at module import in app/services/ingest/configs/__init__.py
  3. ClickHouse table DDL goes in app/core/clickhouse_init.py under _TABLE_DDLS
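
A minimal sketch of the registration pattern these steps describe. The document names `PlatformConfig`, `register()` and `get_config()`; the field names, signatures, duplicate check, and example values below are assumptions, not the real definitions in registry.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlatformConfig:
    # Field names are illustrative assumptions.
    platform: str
    data_type: str
    table: str
    dedup_fields: tuple


# Global registry dict keyed by (platform, data_type).
_REGISTRY = {}


def register(config):
    """Add a config to the registry, refusing silent overwrites."""
    key = (config.platform, config.data_type)
    if key in _REGISTRY:
        raise ValueError(f"duplicate registration for {key}")
    _REGISTRY[key] = config
    return config


def get_config(platform, data_type):
    """Look up the config registered for a platform and data type."""
    key = (platform, data_type)
    if key not in _REGISTRY:
        raise KeyError(f"no PlatformConfig registered for {key}")
    return _REGISTRY[key]


# A module under configs/ would run a call like this at import time.
register(PlatformConfig("zhilian", "job", "zhilian_job", ("job_number",)))
```

Registering at import time means a new platform becomes available as soon as its configs module is imported, with no change to the service code.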

New MySQL Model:

  1. Add model class to appropriate file in app/models/ (or create new file)
  2. Register the model module in app/settings/config.py under TORTOISE_ORM.apps.models.models
  3. Run aerich migrate && aerich upgrade (or rely on RUN_MIGRATIONS_ON_STARTUP=True)

New Scheduled Job:

  1. Define async job function in app/core/scheduler.py
  2. Wrap execution body with DistributedLock context from app/core/locks.py
  3. Add scheduler.add_job(...) call inside start_scheduler()
  4. Record run result via _record_task_run() to ScheduledTaskRun table

New Frontend Page:

  1. Create view component at web/src/views/{module}/index.vue
  2. Add route.js in the same directory with route meta
  3. Register the route in the static routes under web/src/router/routes/, or rely on the backend menu system
  4. Add the corresponding backend menu entry in init_menus() in app/core/init_app.py (for seed data)

New In-Process Platform Crawler:

  1. Add private client modules _platform_client.py, _platform_api.py, _platform_sign.py under app/services/crawler/
  2. Create public facade platform.py extending or wrapping BaseFetcher / BaseSearcher from app/services/crawler/_base.py
  3. Instantiate in CompanyCleaner.__init__() if needed for company pipeline
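
The layering in these steps (private client, searcher built on a shared base, public facade) can be sketched as follows. `ApiResult` and `BaseSearcher` here are simplified stand-ins written for this example, not the classes in _base.py, and `ExamplePlatformSearcher`, `ExamplePlatformService`, and the `transport` callable are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class ApiResult:
    """Simplified stand-in; the real ApiResult lives in _base.py."""
    ok: bool
    data: list
    error: str = ""


class BaseSearcher:
    """Simplified stand-in for the shared base class."""

    def _request(self, keyword, page):
        raise NotImplementedError

    def search(self, keyword, page=1):
        # Wrap platform errors so callers always receive an ApiResult.
        try:
            return ApiResult(ok=True, data=self._request(keyword, page))
        except Exception as exc:
            return ApiResult(ok=False, data=[], error=str(exc))


class ExamplePlatformSearcher(BaseSearcher):
    def __init__(self, transport):
        # transport stands in for the private _platform_client module.
        self._transport = transport

    def _request(self, keyword, page):
        payload = self._transport({"keyword": keyword, "page": page})
        return payload["items"]


class ExamplePlatformService:
    """Public facade, analogous in role to boss.py, qcwy.py, zhilian.py."""

    def __init__(self, transport):
        self._searcher = ExamplePlatformSearcher(transport)

    def search_company(self, name):
        return self._searcher.search(name)
```

Keeping the transport injectable makes the facade testable with a plain function in place of a live HTTP client.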

Shared Utilities:

  • Backend Python utilities: app/utils/
  • Frontend JS utilities: web/src/utils/

Special Directories

.startup_lock/:

  • Purpose: File-based mutex directory that prevents multiple workers from racing to write seed data at startup
  • Generated: Yes (created and deleted at startup)
  • Committed: No (in .gitignore)
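
One common way to build a directory-based mutex like this relies on `os.mkdir` being atomic: only one process can create the directory, and the others get `FileExistsError`. The sketch below illustrates that technique under assumed names; it is not the project's actual startup code.

```python
import os
import tempfile


def run_once_with_dir_lock(lock_dir, action):
    """Run action() only if this process wins the directory lock."""
    try:
        os.mkdir(lock_dir)  # atomic: fails if the directory already exists
    except FileExistsError:
        return False        # another worker holds the lock
    try:
        action()
        return True
    finally:
        os.rmdir(lock_dir)  # release the lock by deleting the directory


# Demonstration in a throwaway directory.
demo_lock = os.path.join(tempfile.mkdtemp(), ".startup_lock")
seeded = []
won = run_once_with_dir_lock(demo_lock, lambda: seeded.append("seed"))
```

With this approach a worker that arrives after the lock has been released would run the action again, so the seed step itself would still need to be idempotent.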

migrations/:

  • Purpose: Aerich migration history for MySQL schema
  • Generated: Yes (by aerich migrate)
  • Committed: Yes

static/:

  • Purpose: Served at /static/ path; can hold built frontend assets or other static files
  • Generated: Partially (built frontend output can be placed here)
  • Committed: Yes (currently contains static assets)

.planning/:

  • Purpose: GSD planning documents (codebase analysis, phase plans)
  • Generated: Yes (by GSD tooling)
  • Committed: Yes

jobs_spider/*/logs/:

  • Purpose: Rotating daily log files for standalone spider processes
  • Generated: Yes (by loguru in spider scripts)
  • Committed: No

Structure analysis: 2026-03-21