Codebase Structure
Analysis Date: 2026-03-21
Directory Layout
JobData/  # Project root
├── app/  # FastAPI backend (Python)
│   ├── __init__.py  # App factory create_app()
│   ├── api/
│   │   └── v1/  # All REST API routes (/api/v1/...)
│   │       ├── __init__.py  # Router aggregation + permission guards
│   │       ├── analytics.py  # Data analytics endpoints
│   │       ├── stats.py  # Table count stats endpoints
│   │       ├── pipeline.py  # ECS pipeline trigger endpoint
│   │       ├── base/  # Login, user-info, menu-tree (no auth)
│   │       ├── job/  # Data ingest endpoints (batch/single/async)
│   │       ├── cleaning/  # Targeted cleaning endpoints
│   │       ├── company/  # Company search endpoints
│   │       ├── keyword/  # Keyword CRUD endpoints
│   │       ├── token/  # Boss token management endpoints
│   │       ├── proxy/  # Proxy IP management endpoints
│   │       ├── users/  # User CRUD
│   │       ├── roles/  # Role + permission assignment
│   │       ├── menus/  # Menu tree CRUD
│   │       ├── apis/  # API registration management
│   │       ├── depts/  # Department tree CRUD
│   │       └── auditlog/  # Audit log query
│   ├── controllers/  # CRUD orchestration for MySQL entities
│   │   ├── api.py  # ApiController (refresh_api(), scan routes)
│   │   ├── user.py  # UserController
│   │   ├── role.py  # RoleController
│   │   ├── menu.py  # MenuController
│   │   ├── dept.py  # DeptController
│   │   ├── keyword.py  # KeywordController
│   │   ├── proxy.py  # ProxyController
│   │   ├── token.py  # TokenController
│   │   ├── cleaning.py  # CleaningController
│   │   └── company.py  # CompanyController (search via crawler services)
│   ├── core/  # Framework infrastructure
│   │   ├── init_app.py  # lifespan: migrations, seed, ClickHouse init
│   │   ├── scheduler.py  # APScheduler job definitions + registration
│   │   ├── clickhouse.py  # ClickHouseManager singleton
│   │   ├── clickhouse_init.py  # ClickHouse DDL (tables + analytics view)
│   │   ├── dependency.py  # AuthControl, PermissionControl, DependPermission
│   │   ├── locks.py  # DistributedLock (Redis + file fallback)
│   │   ├── middlewares.py  # BackGroundTaskMiddleware, HttpAuditLogMiddleware
│   │   ├── ip_tracking.py  # IpTrackingMiddleware
│   │   ├── exceptions.py  # Custom exception classes + handlers
│   │   ├── crud.py  # Generic ListParams, pagination helpers
│   │   ├── ctx.py  # CTX_USER_ID ContextVar
│   │   ├── bgtask.py  # BgTasks helper for BackGroundTaskMiddleware
│   │   ├── proxy_rule.py  # Proxy selection rules
│   │   └── algorithms/
│   │       └── antispider.py  # Anti-spider signature helpers
│   ├── models/  # Tortoise-ORM models (MySQL)
│   │   ├── base.py  # BaseModel, TimestampMixin, UUIDModel
│   │   ├── admin.py  # User, Role, Api, Menu, Dept, AuditLog
│   │   ├── token.py  # BossToken
│   │   ├── cleaning.py  # CleaningTask / CleaningRecord
│   │   ├── metrics.py  # ScheduledTaskRun, StatsTotal, IpUploadStats
│   │   ├── company.py  # CompanyCleaningQueue
│   │   ├── keyword.py  # BossKeyword, QcwyKeyword, ZhilianKeyword
│   │   └── enums.py  # MethodType enum
│   ├── repositories/  # ClickHouse query layer
│   │   └── clickhouse_repo.py  # ClickHouseBaseRepo, JobAnalyticsRepo
│   ├── schemas/  # Pydantic request/response schemas
│   │   ├── ingest.py  # IngestBatchRequest, PlatformType, DataType enums
│   │   ├── analytics.py  # Analytics query params + response schemas
│   │   ├── keyword.py  # Keyword schemas
│   │   ├── token.py  # Token schemas
│   │   ├── base.py  # Common response envelopes
│   │   ├── login.py  # Login request/response
│   │   ├── users.py  # User schemas
│   │   ├── roles.py  # Role schemas
│   │   ├── menus.py  # Menu schemas (MenuType enum)
│   │   ├── apis.py  # API schemas
│   │   ├── depts.py  # Dept schemas
│   │   ├── cleaning.py  # Cleaning schemas
│   │   └── proxy.py  # Proxy schemas
│   ├── services/  # Business logic
│   │   ├── analytics_service.py  # AnalyticsService → JobAnalyticsRepo
│   │   ├── cleaning.py  # CleaningService (targeted cleaning)
│   │   ├── company_cleaner.py  # CompanyCleaner (automated company pipeline)
│   │   ├── company_jobs_sync.py  # CompanyJobsSyncService
│   │   ├── company_storage.py  # company_storage (ClickHouse company insert)
│   │   ├── ingest/  # Ingest pipeline subsystem
│   │   │   ├── service.py  # IngestService.store_batch/store_single/query_data
│   │   │   ├── registry.py  # PlatformConfig registry + get_config()
│   │   │   ├── dedup.py  # batch_dedup_filter() against ClickHouse
│   │   │   ├── remote_push.py  # Async HTTP push to secondary target
│   │   │   └── configs/  # Per-platform PlatformConfig registrations
│   │   └── crawler/  # In-process platform HTTP clients
│   │       ├── _base.py  # BaseFetcher, BaseSearcher, ApiResult
│   │       ├── _http_client.py  # HTTPClient wrapper (sync requests)
│   │       ├── _boss_client.py  # Boss HTTP client
│   │       ├── _boss_api.py  # Boss API methods
│   │       ├── _boss_sign.py  # Boss request signing
│   │       ├── _zhilian_client.py
│   │       ├── _zhilian_api.py
│   │       ├── _zhilian_sign.py
│   │       ├── _job51_client.py
│   │       ├── _job51_api.py
│   │       ├── _job51_sign.py
│   │       ├── boss.py  # BossService (high-level facade)
│   │       ├── qcwy.py  # QcwyService (high-level facade)
│   │       └── zhilian.py  # ZhilianService (high-level facade)
│   ├── settings/
│   │   └── config.py  # Settings (pydantic-settings BaseSettings)
│   ├── log/
│   │   └── __init__.py  # loguru logger export
│   └── utils/  # Shared utilities
├── web/  # Vue 3 frontend
│   ├── src/
│   │   ├── main.js  # App entry, Pinia + Router setup
│   │   ├── App.vue  # Root component
│   │   ├── api/  # Axios API call modules
│   │   │   ├── index.js  # System management APIs
│   │   │   ├── analytics.js  # Analytics APIs
│   │   │   ├── keyword.js  # Keyword APIs
│   │   │   ├── proxy.js  # Proxy APIs
│   │   │   └── token.js  # Token APIs
│   │   ├── components/  # Shared components (CrudTable, CrudModal, QueryBar)
│   │   ├── layout/  # App shell (sidebar, topbar, tabs)
│   │   ├── router/  # Vue Router config + dynamic route loading
│   │   │   ├── index.js  # createRouter(), addDynamicRoutes()
│   │   │   ├── guard/  # Navigation guards (auth, loading, title)
│   │   │   └── routes/  # Static route definitions
│   │   ├── store/  # Pinia stores
│   │   │   └── modules/  # user, permission, app, tags
│   │   ├── utils/
│   │   │   ├── auth/  # JWT token read/write
│   │   │   ├── http/  # Axios instance + interceptors
│   │   │   └── storage/  # localStorage wrapper
│   │   └── views/  # Page-level Vue components
│   │       ├── login/  # Login page
│   │       ├── analytics/  # ECharts dashboard (trend + distribution)
│   │       ├── recruitment/  # Job data browser (boss/, qcwy/, zhilian/)
│   │       ├── cleaning/  # Cleaning operation + monitor pages
│   │       ├── keyword/  # Keyword management
│   │       ├── profile/  # User profile
│   │       └── system/  # RBAC admin (user, role, menu, api, dept, auditlog, proxy, token)
│   ├── vite.config.js  # Vite build config (plugins, proxy)
│   └── package.json  # Frontend dependencies
├── jobs_spider/  # Standalone crawlers (external process)
│   ├── boss/
│   │   ├── boos_api.py  # BossZhipinAPI core + SmartIPManager + main (~2246 lines)
│   │   ├── company_spider.py  # Boss company batch spider
│   │   └── enumerate_combos.py  # City × keyword combo enumeration
│   ├── qcwy/
│   │   ├── qcwy.py  # QCWY API wrapper
│   │   ├── search_company_jobs.py
│   │   └── run_company_search.py
│   └── zhilian/
│       ├── zhilian_single.py  # Zhilian single-job spider
│       └── company_spider.py  # Zhilian company spider
├── spiderJobs/  # Second-generation spider framework (structured)
│   ├── core/
│   │   ├── base.py  # Shared base classes
│   │   └── http_client.py  # HTTP client abstraction
│   ├── platforms/  # Platform-specific implementations
│   │   ├── boss/
│   │   ├── job51/
│   │   └── zhilian/
│   └── runner/  # Execution runners
│       ├── api_client.py  # Backend API push client
│       ├── loop.py  # Job crawl loop runner
│       └── company_loop.py  # Company crawl loop runner
├── migrations/
│   └── models/  # Aerich migration files (auto-generated)
├── static/  # Served at /static/ (built frontend or assets)
├── deploy/  # Deployment configs and sample images
├── docs/  # Project documentation
├── ecs_full_pipeline.py  # Alibaba Cloud ECS batch provisioning script
├── create_ecs_instances.py  # ECS instance creation utilities
├── reclean_qcwy_jobs.py  # Standalone QCWY re-cleaning script
├── run.py  # Uvicorn launch entry point
├── Pipfile / Pipfile.lock  # Python dependency management (pipenv)
├── pyproject.toml  # ruff + black + isort tool config
└── Dockerfile  # Container build spec
Directory Purposes
app/api/v1/:
- Purpose: One sub-package per API domain; each contains __init__.py exporting its APIRouter
- Contains: Route handler functions, dependency injection wiring, Pydantic schema imports
- Key files: app/api/v1/__init__.py (aggregation), app/api/v1/job/job.py (ingest routes), app/api/v1/analytics.py
app/controllers/:
- Purpose: Business-operation classes for MySQL-backed entities; encapsulate Tortoise-ORM queries
- Contains: One controller per domain entity; typically expose create, update, delete, list, get methods
- Key files: app/controllers/api.py (scan and register FastAPI routes → DB)
app/services/ingest/:
- Purpose: Self-contained data pipeline: registry lookup → dedup check → ClickHouse insert → remote push
- Contains: service.py (main orchestrator), registry.py (config store), dedup.py, remote_push.py, configs/ (registrations per platform)
- Key files: app/services/ingest/service.py, app/services/ingest/registry.py
app/services/crawler/:
- Purpose: In-process HTTP clients for platform APIs, used by backend scheduler (company cleaning) and company search controller
- Contains: Underscore-prefixed private modules (_boss_client.py, _boss_sign.py, etc.) and public service facades (boss.py, qcwy.py, zhilian.py)
- Key files: app/services/crawler/_base.py (shared base classes)
app/core/:
- Purpose: Framework plumbing — startup, middleware chain, auth, distributed locking, ClickHouse connection
- Key files: app/core/dependency.py (auth guards), app/core/scheduler.py (all cron jobs), app/core/locks.py
app/models/:
- Purpose: All Tortoise-ORM model definitions for MySQL; schema changes are migrated via Aerich (ClickHouse tables are not)
- Key files: app/models/admin.py (RBAC), app/models/keyword.py (crawl state tracking)
app/repositories/:
- Purpose: Typed ClickHouse query encapsulation following the Repository pattern
- Key files: app/repositories/clickhouse_repo.py
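As a rough illustration of the Repository pattern mentioned above, the sketch below keeps SQL inside a repo class and hands callers a typed result. FakeClient, BaseRepo, JobCountRepo, and the jobs table are hypothetical names chosen for the example; the real ClickHouseBaseRepo and JobAnalyticsRepo may look quite different.

```python
class FakeClient:
    """Stand-in for a ClickHouse client: records queries, returns canned rows."""

    def __init__(self, rows):
        self.rows = rows
        self.calls = []

    def query(self, sql, params):
        self.calls.append((sql, params))
        return self.rows


class BaseRepo:
    """Owns the client so that subclasses only describe queries."""

    def __init__(self, client):
        self._client = client

    def _fetch(self, sql, params=None):
        return self._client.query(sql, params or {})


class JobCountRepo(BaseRepo):
    """Hypothetical repo: callers receive an int and never see raw SQL."""

    def count_by_platform(self, platform):
        rows = self._fetch(
            "SELECT count() FROM jobs WHERE platform = %(platform)s",
            {"platform": platform},
        )
        return int(rows[0][0])
```

Because the client is injected, a service can be tested against FakeClient without a running ClickHouse instance.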
jobs_spider/:
- Purpose: Standalone scripts that run on ECS nodes; communicate with backend only via HTTP
- Key files: jobs_spider/boss/boos_api.py (2246-line core with SmartIPManager, BossZhipinAPI)
spiderJobs/:
- Purpose: Newer modular spider framework; parallel architecture to jobs_spider/ with cleaner separation
- Key files: spiderJobs/runner/api_client.py, spiderJobs/runner/loop.py
web/src/views/:
- Purpose: One directory per route group; each view owns its local route definition in route.js
- Key files: web/src/views/analytics/index.vue (ECharts dashboard), web/src/views/cleaning/monitor.vue
Key File Locations
Entry Points:
- run.py: Uvicorn startup; reads env vars APP_HOST, APP_PORT, UVICORN_WORKERS
- app/__init__.py: FastAPI app factory create_app() and lifespan context manager
- web/src/main.js: Vue 3 app entry; mounts Pinia, Router, directives
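A launch script that reads APP_HOST, APP_PORT, and UVICORN_WORKERS might parse them along these lines. The helper name load_server_options and every default value are assumptions made for illustration; the actual run.py is not shown in this document.

```python
import os


def load_server_options(environ=None):
    """Read launch options from the environment (defaults are assumptions)."""
    env = os.environ if environ is None else environ
    return {
        "host": env.get("APP_HOST", "0.0.0.0"),
        "port": int(env.get("APP_PORT", "8000")),
        "workers": int(env.get("UVICORN_WORKERS", "1")),
    }
```

The resulting dict maps onto the host, port, and workers keyword arguments of uvicorn.run(); accepting an explicit environ mapping keeps the parsing testable without touching the real process environment.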
Configuration:
- app/settings/config.py: All backend config (Settings extends pydantic_settings.BaseSettings); override via env vars
- web/vite.config.js: Vite build and dev server config
- pyproject.toml: ruff, black, isort tool configuration
Core Logic:
- app/core/init_app.py: Startup sequence (migrations, seed data, ClickHouse init)
- app/core/scheduler.py: All APScheduler job functions and start_scheduler()
- app/core/clickhouse_init.py: All ClickHouse CREATE TABLE IF NOT EXISTS DDL statements
- app/services/ingest/registry.py: PlatformConfig dataclass + global registry dict
- app/services/ingest/service.py: IngestService.store_batch(), the main data ingestion path
- app/services/company_cleaner.py: CompanyCleaner, the automated company data pipeline
ClickHouse Schema:
- app/core/clickhouse_init.py: Defines the DDL for all 6 MergeTree tables and 1 analytics VIEW; apply via ClickHouseInitializer.initialize_all_tables()
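Because the DDL statements use CREATE TABLE IF NOT EXISTS, running the initializer on every startup should be harmless. The sketch below shows that idea only: the two example tables, the RecordingClient stand-in, and the body of initialize_all_tables are assumptions, since the real _TABLE_DDLS contents are not shown here.

```python
# Hypothetical DDL list; the project's real tables and columns are not
# documented in this file.
_TABLE_DDLS = [
    "CREATE TABLE IF NOT EXISTS example_jobs (id String) "
    "ENGINE = MergeTree ORDER BY id",
    "CREATE TABLE IF NOT EXISTS example_companies (id String) "
    "ENGINE = MergeTree ORDER BY id",
]


class RecordingClient:
    """Stand-in for a ClickHouse client; it only records the statements."""

    def __init__(self):
        self.executed = []

    def command(self, sql):
        self.executed.append(sql)


def initialize_all_tables(client, ddls=_TABLE_DDLS):
    """Send every DDL statement; refuse any that is not idempotent."""
    for ddl in ddls:
        if "IF NOT EXISTS" not in ddl.upper():
            raise ValueError("DDL must be idempotent: " + ddl)
        client.command(ddl)
    return len(ddls)
```

The guard against non-idempotent statements is an extra safety check added for the example, not something the source describes.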
MySQL Migrations:
- migrations/models/: Auto-generated by aerich migrate; applied via aerich upgrade or automatically on startup
Testing:
- tests/test_company_jobs_sync.py
- tests/test_company_storage.py
Naming Conventions
Files (Python):
- Modules: snake_case.py (e.g., company_cleaner.py, clickhouse_init.py)
- Private crawler modules: leading underscore prefix (e.g., _boss_client.py, _zhilian_sign.py)
- Spider scripts in jobs_spider/: descriptive names without prefix (e.g., boos_api.py, zhilian_single.py)
Files (Vue/JS):
- Pages: index.vue for primary view, monitor.vue / descriptive name for secondary views
- API modules: snake_case.js matching domain (e.g., analytics.js, keyword.js)
- Router guard files: Grouped in router/guard/ directory
Classes (Python):
- Services: PascalCaseService (e.g., IngestService, AnalyticsService, CompanyCleaner)
- Controllers: PascalCaseController with module-level singleton instance in snake_case (e.g., user_controller = UserController())
- Repos: PascalCaseRepo (e.g., ClickHouseBaseRepo, JobAnalyticsRepo)
- Models: PascalCase matching table concept (e.g., BossToken, CompanyCleaningQueue)
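The controller convention (a PascalCaseController class plus a snake_case module-level singleton) could look like the sketch below. It stores users in a dict purely for illustration; the real controllers wrap Tortoise-ORM queries, and their method signatures are not shown in this document.

```python
class UserController:
    """Hypothetical in-memory controller following the naming convention."""

    def __init__(self):
        self._users = {}
        self._next_id = 1

    def create(self, name):
        user = {"id": self._next_id, "name": name}
        self._users[user["id"]] = user
        self._next_id += 1
        return user

    def get(self, user_id):
        # Returns None for unknown ids instead of raising.
        return self._users.get(user_id)


# Module-level singleton: every importer shares this one instance.
user_controller = UserController()
```

Route handlers would then import user_controller rather than constructing their own instance.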
Directories:
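- Domain groupings: lowercase (e.g., cleaning/, recruitment/, keyword/)
- Platform names are lowercase across all layers: boss, qcwy, zhilian (51job appears as qcwy in most layers, but as job51 in spiderJobs/platforms/ and in the app/services/crawler/_job51_*.py modules)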
Where to Add New Code
New API Endpoint:
- Create or update router file in app/api/v1/{domain}/
- Register the router in app/api/v1/__init__.py with v1_router.include_router(...)
- Run api_controller.refresh_api() on next app startup to sync new routes to the api table
- Add corresponding Pydantic schemas in app/schemas/{domain}.py
New Platform Data Type (Ingest):
- Define PlatformConfig with dedup fields in app/services/ingest/configs/
- Call register(config) at module import in app/services/ingest/configs/__init__.py
- ClickHouse table DDL goes in app/core/clickhouse_init.py under _TABLE_DDLS
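The first two steps above amount to a dataclass plus a registry dict that is filled at import time. The sketch below assumes the field names (platform, data_type, table, dedup_fields) and the duplicate-registration check; only the names PlatformConfig, register, and get_config come from this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlatformConfig:
    """Hypothetical shape; the real dataclass may carry other fields."""
    platform: str
    data_type: str
    table: str
    dedup_fields: tuple


_REGISTRY = {}


def register(config):
    """Add a config to the global registry, rejecting duplicates."""
    key = (config.platform, config.data_type)
    if key in _REGISTRY:
        raise ValueError(f"duplicate config for {key}")
    _REGISTRY[key] = config
    return config


def get_config(platform, data_type):
    """Look up a registered config or raise a descriptive KeyError."""
    try:
        return _REGISTRY[(platform, data_type)]
    except KeyError:
        raise KeyError(f"no config for {platform}/{data_type}") from None


# Runs at import time, mirroring "call register(config) at module import".
register(PlatformConfig("boss", "job", "boss_job", ("job_id",)))
```

Registering at import time means a new platform only becomes visible once its config module is actually imported, which is presumably why the registration lives in configs/__init__.py.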
New MySQL Model:
- Add model class to appropriate file in app/models/ (or create new file)
- Register the model module in app/settings/config.py under TORTOISE_ORM.apps.models.models
- Run aerich migrate && aerich upgrade (or rely on RUN_MIGRATIONS_ON_STARTUP=True)
New Scheduled Job:
- Define async job function in app/core/scheduler.py
- Wrap execution body with DistributedLock context from app/core/locks.py
- Add scheduler.add_job(...) call inside start_scheduler()
- Record run result via _record_task_run() to ScheduledTaskRun table
New Frontend Page:
- Create view component at web/src/views/{module}/index.vue
- Add route.js in the same directory with route meta
- Register the route in web/src/router/routes/ static routes or rely on backend menu system
- Add corresponding backend menu entry in app/core/init_app.py → init_menus() (for seed data)
New In-Process Platform Crawler:
- Add private client modules _platform_client.py, _platform_api.py, _platform_sign.py under app/services/crawler/
- Create public facade platform.py extending or wrapping BaseFetcher / BaseSearcher from app/services/crawler/_base.py
- Instantiate in CompanyCleaner.__init__() if needed for company pipeline
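A facade built on a shared base class might follow the pattern below, where the base class turns exceptions into a uniform result object. The fields of ApiResult, the _request and search method names, and ExampleSearcher are all assumptions for illustration; the real classes in _base.py are not shown in this document.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ApiResult:
    """Hypothetical result envelope returned by every searcher."""
    ok: bool
    data: Any = None
    error: Optional[str] = None


class BaseSearcher:
    """Wraps the platform-specific request in uniform error handling."""

    def _request(self, keyword):
        raise NotImplementedError

    def search(self, keyword):
        try:
            return ApiResult(ok=True, data=self._request(keyword))
        except Exception as exc:
            return ApiResult(ok=False, error=str(exc))


class ExampleSearcher(BaseSearcher):
    """Stand-in platform searcher; returns a canned response, no HTTP."""

    def _request(self, keyword):
        if not keyword:
            raise ValueError("keyword must not be empty")
        return [{"company": f"{keyword} Ltd."}]
```

With this split, a new platform only overrides _request, and callers such as the company pipeline can check result.ok without catching platform-specific exceptions.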
Shared Utilities:
- Backend Python utilities: app/utils/
- Frontend JS utilities: web/src/utils/
Special Directories
.startup_lock/:
- Purpose: File-based mutex directory used to prevent multi-worker race on startup seed data
- Generated: Yes (created and deleted at startup)
- Committed: No (in .gitignore)
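A directory can serve as a cross-process mutex because os.mkdir either creates it or fails atomically. The sketch below shows that general technique under the assumption that only the worker that creates the directory runs the seed step; it is not the project's actual startup code, and the helper name startup_lock is made up for the example.

```python
import os
from contextlib import contextmanager


@contextmanager
def startup_lock(lock_dir):
    """Yield True for the one caller that creates lock_dir, else False."""
    try:
        os.mkdir(lock_dir)  # atomic: raises if the directory already exists
        acquired = True
    except FileExistsError:
        acquired = False
    try:
        yield acquired
    finally:
        if acquired:
            os.rmdir(lock_dir)  # release so the next startup can acquire it
```

A worker would wrap its seed step in "with startup_lock(".startup_lock") as is_first:" and skip seeding when is_first is False. One known weakness of this approach is that a crash between mkdir and rmdir leaves a stale directory behind.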
migrations/:
- Purpose: Aerich migration history for MySQL schema
- Generated: Yes (by aerich migrate)
- Committed: Yes
static/:
- Purpose: Served at /static/ path; can hold built frontend assets or other static files
- Generated: Partially (built frontend output can be placed here)
- Committed: Yes (currently contains static assets)
.planning/:
- Purpose: GSD planning documents (codebase analysis, phase plans)
- Generated: Yes (by GSD tooling)
- Committed: Yes
jobs_spider/*/logs/:
- Purpose: Rotating daily log files for standalone spider processes
- Generated: Yes (by loguru in spider scripts)
- Committed: No
Structure analysis: 2026-03-21