External Integrations
Analysis Date: 2026-03-21
APIs & External Services
Recruitment Platforms (scraped, not official APIs):
- Boss Zhipin (Boss直聘) - Job and company data via WeChat mini-program endpoints
  - Client: `requests` (sync) in `jobs_spider/boss/boos_api.py`
  - Endpoints scraped: `/wapi/zpgeek/miniapp/search/joblist.json`, `/wapi/batch/batchRunV3`, `/wapi/zpgeek/miniapp/brand/detail.json`
  - Auth: MPT Token (stored in MySQL `boss_token` table, managed via `app/api/v1/token/`)
  - Anti-detection: `SmartIPManager`, `IPAnomalyDetector`, random 10-20s delays, session rebuilding on ban
  - JS signing: `PyExecJS` executes JS sign algorithms (`jobs_spider/boss/boos_api.py`)
- 51job / Qiancheng Wuyou (前程无忧) - Job data via platform API
  - Client: `httpx` + `requests` in `jobs_spider/qcwy/qcwy.py`
  - Auth: HMAC signature computed per-request
  - Proxy: Hardcoded proxy URL present in `jobs_spider/qcwy/qcwy.py` (line 31) - security risk
- Zhilian Zhaopin (智联招聘) - Job and company data
  - Client: `requests` in `jobs_spider/zhilian/zhilian_single.py`
  - Sign client: `app/services/crawler/_zhilian_sign.py`
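The per-request HMAC signing noted for 51job can be pictured with a minimal sketch. This audit does not record the real canonical string, hash function, or key, so `sign_request`, the path-plus-sorted-query layout, the example path, and the choice of HMAC-SHA256 are all assumptions made for illustration only.

```python
import hashlib
import hmac
from urllib.parse import urlencode


def sign_request(path: str, params: dict, secret: bytes) -> str:
    """Return a hex HMAC-SHA256 signature over the path and sorted query string.

    Hypothetical scheme: the actual 51job message layout and key are not
    documented in this audit.
    """
    # Sort parameters so the same logical request always signs identically.
    query = urlencode(sorted(params.items()))
    message = f"{path}?{query}".encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


if __name__ == "__main__":
    # "/api/job/search" and the parameters are placeholders, not real endpoints.
    print(sign_request("/api/job/search", {"keyword": "python", "page": 1}, b"demo-secret"))
```

Sorting the parameters before signing keeps the signature independent of dictionary ordering, which is a common requirement when a server recomputes the signature on its side.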
Qixin.com (企信) - Company data enrichment (disabled):
- Endpoint: `http://external-data.qixin.com/extend/extend_data_push`
- Auth: MD5 HMAC with timestamp + salt (salt hardcoded in `app/services/ingest/remote_push.py` line 48)
- Status: Push code commented out; function `push_to_remote()` currently only logs
Webhook / Report Endpoint:
- Outgoing: Configurable via `REPORT_ENDPOINT` env var (default: none)
- Used by: `app/core/scheduler.py` `_post_with_retry()` to POST stats JSON every 6 hours
- Timeout: `REPORT_TIMEOUT` (default 10s), retries: `REPORT_MAX_RETRIES` (default 3)
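A rough sketch of how a report POST with timeout and bounded retries might look is given below. It is not the project's `_post_with_retry()`: the function names, the linear backoff, and the reading of `REPORT_MAX_RETRIES` as the total number of attempts are assumptions. Only the three environment variable names and their defaults come from the list above.

```python
import json
import os
import time
import urllib.error
import urllib.request


def send_json(url: str, payload: dict, timeout: float) -> int:
    """POST a JSON payload with urllib and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


def post_with_retry(payload: dict, endpoint=None, sender=send_json,
                    backoff_seconds: float = 1.0) -> bool:
    """Try to deliver the payload; return True on the first 2xx response."""
    endpoint = endpoint or os.getenv("REPORT_ENDPOINT")
    if not endpoint:
        return False  # No endpoint configured, so reporting is skipped.
    timeout = float(os.getenv("REPORT_TIMEOUT", "10"))
    # Assumption: the setting counts total attempts, not retries after the first.
    max_attempts = int(os.getenv("REPORT_MAX_RETRIES", "3"))
    for attempt in range(1, max_attempts + 1):
        try:
            if 200 <= sender(endpoint, payload, timeout) < 300:
                return True
        except (urllib.error.URLError, TimeoutError, OSError):
            pass  # Network failure or non-2xx HTTPError: fall through and retry.
        if attempt < max_attempts:
            time.sleep(backoff_seconds * attempt)  # Linear backoff (assumed).
    return False
```

Passing the sender as a parameter keeps the retry loop testable without a live endpoint.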
Data Storage
Databases:
MySQL (Primary business database):
- Connection: `mysql://root:jobdata123@121.4.126.241:3306/job_data` (default hardcoded in `app/settings/config.py`; override via `TORTOISE_ORM` env or individual components)
- Client: Tortoise-ORM 0.23.0 with aiomysql driver
- Migrations: Aerich (`migrations/` directory, runs on startup if `RUN_MIGRATIONS_ON_STARTUP=True`)
- Tables: `user`, `role`, `api`, `menu`, `dept`, `auditlog`, `boss_token`, `cleaning_*`, `scheduled_task_run`, `stats_total`, `ip_upload_stats`
- Models: `app/models/admin.py`, `app/models/token.py`, `app/models/cleaning.py`, `app/models/metrics.py`, `app/models/keyword.py`, `app/models/company.py`
ClickHouse (Analytics / raw job data):
- Connection: host `121.4.126.241`, port `8123`, DB `job_data` (configured in `app/settings/config.py`)
- Env vars: `CLICKHOUSE_HOST`, `CLICKHOUSE_PORT`, `CLICKHOUSE_USER`, `CLICKHOUSE_PASS`, `CLICKHOUSE_DB`
- Client: `clickhouse-connect==0.7.19` async client (`app/core/clickhouse.py`)
- Connection pool: `urllib3.PoolManager` with `num_pools=2`, `maxsize=5` per worker
- Tables managed via DDL in `app/core/clickhouse_init.py` (NOT Aerich migrations)
- Tables: `boss_job`, `boss_company`, `qcwy_job`, `qcwy_company`, `zhilian_job`, `zhilian_company` (MergeTree engine), `pending_company` (ReplacingMergeTree), `job_analytics` (VIEW)
- Queries: Repository pattern in `app/repositories/clickhouse_repo.py`
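As an illustration of driving the ClickHouse connection purely from the environment variables listed above, here is a minimal sketch. The helper names and the fallback values `localhost`, `default`, and an empty password are assumptions. The example uses the synchronous `clickhouse_connect.get_client` for brevity, whereas the project is described as using the async client.

```python
import os
from typing import Any, Dict, Mapping


def clickhouse_kwargs(env: Mapping[str, str] = os.environ) -> Dict[str, Any]:
    """Collect connection settings from CLICKHOUSE_* environment variables."""
    return {
        "host": env.get("CLICKHOUSE_HOST", "localhost"),      # fallback is assumed
        "port": int(env.get("CLICKHOUSE_PORT", "8123")),      # HTTP port named above
        "username": env.get("CLICKHOUSE_USER", "default"),    # fallback is assumed
        "password": env.get("CLICKHOUSE_PASS", ""),           # fallback is assumed
        "database": env.get("CLICKHOUSE_DB", "job_data"),
    }


def make_client():
    """Create a client; the import is lazy so the helper above has no hard dependency."""
    import clickhouse_connect

    return clickhouse_connect.get_client(**clickhouse_kwargs())
```

Keeping the settings in a plain dictionary means no host or credential needs to be written into `config.py` as a default.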
File Storage:
- Local filesystem only (no S3 / object storage detected)
- Static files served from `static/` directory mounted via FastAPI (`app/__init__.py`)
- Spider logs written to `logs/` directory relative to spider working dir
- ECS pipeline log: `ecs_full_pipeline.log` (root project directory)
Caching:
- Redis (optional): Used only for distributed locking in `app/core/locks.py`
  - Env vars: `REDIS_HOST`, `REDIS_PORT` (default 6379), `REDIS_DB` (default 0), `REDIS_PASS`
  - Falls back to filesystem-based TTL locks when Redis unavailable
  - Lock files stored in system temp directory: `{tempdir}/jobdata_lock_{name}/`
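The filesystem fallback can be sketched as a directory-based lock with a time-to-live. This is only one plausible design and not the code in `app/core/locks.py`: the function names, the use of `os.mkdir` for atomic acquisition, and the use of the directory's modification time as the TTL clock are assumptions. Only the `{tempdir}/jobdata_lock_{name}/` path pattern is taken from the list above.

```python
import os
import shutil
import tempfile
import time


def lock_dir(name: str) -> str:
    """Path of the lock directory for a given lock name."""
    return os.path.join(tempfile.gettempdir(), f"jobdata_lock_{name}")


def acquire_lock(name: str, ttl_seconds: float) -> bool:
    """Try to take the lock; a lock older than ttl_seconds is treated as stale."""
    path = lock_dir(name)
    try:
        os.mkdir(path)  # Atomic: raises if the directory already exists.
        return True
    except FileExistsError:
        pass
    try:
        age = time.time() - os.path.getmtime(path)
    except FileNotFoundError:
        age = None  # The holder released the lock between the two checks.
    if age is not None and age < ttl_seconds:
        return False  # Lock is held and still fresh.
    shutil.rmtree(path, ignore_errors=True)  # Reclaim the stale lock.
    try:
        os.mkdir(path)
        return True
    except FileExistsError:
        return False  # Another process reclaimed it first.


def release_lock(name: str) -> None:
    """Drop the lock if it exists."""
    shutil.rmtree(lock_dir(name), ignore_errors=True)
```

Reclaiming a stale lock is not fully race-free in this sketch, since two processes can both decide the lock is stale at the same moment. That weakness is one reason a Redis-backed lock is preferable when Redis is available.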
Authentication & Identity
Auth Provider: Custom JWT implementation (no third-party auth provider)
- Implementation: `app/core/dependency.py` (`DependPermission` dependency)
- Algorithm: HS256
- Expiry: 7 days (60 * 24 * 7 minutes, `app/settings/config.py`)
- Secret: `SECRET_KEY` (default `"CHANGE_ME_DEV_ONLY"` - must override in production)
- Storage (frontend): JWT stored in `localStorage`, injected via axios request interceptor (`web/src/utils/http/interceptors.js`)
- RBAC: Route-level permission check via `DependPermission`; roles link to APIs (many-to-many in MySQL)
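To make the HS256 and 7-day expiry settings concrete, the following stdlib-only sketch signs and verifies a token by hand. It illustrates the mechanism and is not the project's implementation, which presumably relies on a JWT library. The function names and the `user_id` claim are hypothetical, and the sketch does not validate the `alg` header on decode.

```python
import base64
import hashlib
import hmac
import json
import time

TOKEN_TTL_MINUTES = 60 * 24 * 7  # 7 days, matching the expiry stated above


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(signing_input: str, secret: str) -> str:
    digest = hmac.new(secret.encode("utf-8"), signing_input.encode("ascii"), hashlib.sha256)
    return _b64url(digest.digest())


def encode_token(claims: dict, secret: str, now=None) -> str:
    """Build header.payload.signature with an exp claim 7 days ahead."""
    now = int(time.time()) if now is None else now
    header = {"alg": "HS256", "typ": "JWT"}
    body = dict(claims, exp=now + TOKEN_TTL_MINUTES * 60)
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, body)
    )
    return f"{signing_input}.{_sign(signing_input, secret)}"


def decode_token(token: str, secret: str, now=None) -> dict:
    """Verify signature and expiry; raise ValueError on any failure."""
    now = int(time.time()) if now is None else now
    try:
        signing_input, signature = token.rsplit(".", 1)
        payload_part = signing_input.split(".")[1]
    except (ValueError, IndexError):
        raise ValueError("malformed token")
    expected = _sign(signing_input, secret)
    # Constant-time comparison avoids leaking how many characters matched.
    if not hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8")):
        raise ValueError("bad signature")
    claims = json.loads(_b64url_decode(payload_part))
    if claims["exp"] <= now:
        raise ValueError("token expired")
    return claims
```

Because anyone who knows the secret can mint valid tokens, leaving `SECRET_KEY` at its development default would let an attacker forge a token for any user.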
Boss Token Management:
- Boss MPT tokens stored in `boss_token` MySQL table (`app/models/token.py`)
- Managed via `app/api/v1/token/` CRUD endpoints
- Spiders fetch active tokens from `/api/v1/token/tokens` and mark invalid ones
Monitoring & Observability
Error Tracking: None (no Sentry or equivalent detected)
Logs:
- Backend: `loguru` structured logging throughout all modules
- Spider: `loguru` with daily rotation (`logs/log_{YYYY-MM-DD}.log`, 30-day retention)
- ECS pipeline: File log appended to `ecs_full_pipeline.log`
- No centralized log aggregation (no ELK, CloudWatch, etc.)
Alerting:
- Email alerts via SMTP on: stats job completion, IP anomaly detection
- SMTP provider: QQ Mail (`smtp.qq.com:587`) with STARTTLS
- Env vars: `SMTP_HOST`, `SMTP_PORT`, `SMTP_USER`, `SMTP_PASS`, `SMTP_FROM`, `SMTP_TO`
- Default recipient hardcoded in `app/settings/config.py` line 47
- IP anomaly alerts triggered every 10 minutes by `ip_alert_job` in `app/core/scheduler.py`
- Stats reports sent every 6 hours by `stats_job` in `app/core/scheduler.py`
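The alerting path described above amounts to an SMTP submission over STARTTLS. A minimal sketch with the standard library follows. The function names, the 10-second socket timeout, and the fallback of `SMTP_FROM` to `SMTP_USER` are assumptions. The environment variable names and port 587 are taken from the list above.

```python
import os
import smtplib
from email.message import EmailMessage


def build_alert(subject: str, body: str, sender: str, recipient: str) -> EmailMessage:
    """Assemble a plain-text alert message."""
    message = EmailMessage()
    message["Subject"] = subject
    message["From"] = sender
    message["To"] = recipient
    message.set_content(body)
    return message


def send_alert(subject: str, body: str) -> None:
    """Send an alert using the SMTP_* environment variables."""
    host = os.environ["SMTP_HOST"]
    port = int(os.getenv("SMTP_PORT", "587"))
    user = os.environ["SMTP_USER"]
    # Requiring SMTP_TO from the environment avoids a hardcoded default recipient.
    message = build_alert(subject, body, os.getenv("SMTP_FROM", user), os.environ["SMTP_TO"])
    with smtplib.SMTP(host, port, timeout=10) as client:
        client.starttls()  # Upgrade the connection before sending credentials.
        client.login(user, os.environ["SMTP_PASS"])
        client.send_message(message)
```

A scheduler job such as the IP anomaly check could call `send_alert("IP anomaly detected", details)` once it has assembled its findings.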
Metrics:
- IP upload tracking: `IpTrackingMiddleware` (`app/core/ip_tracking.py`) records per-source, per-IP upload counts to `ip_upload_stats` MySQL table
- Task run history: `ScheduledTaskRun` model (`app/models/metrics.py`) records all scheduler task outcomes
- Stats totals: `StatsTotal` model records ClickHouse row counts per table per run
CI/CD & Deployment
Hosting:
- Production: Alibaba Cloud ECS (manually provisioned + Docker container)
- Crawler nodes: Alibaba Cloud ECS spot instances (cn-shanghai-b, `ecs.t5-lc1m1.small`, Ubuntu 24.04)
- ECS node config: `ecs_full_pipeline.py` (20 spot instances, `SpotAsPriceGo` strategy, auto-release)
- Vswitch: `vsw-uf6twb2lrd02fi7q2i605`, Security Group: `sg-uf60380ctrux2uusucad`
CI Pipeline: None detected (no GitHub Actions, Jenkins, or equivalent)
Container:
- Multi-stage `Dockerfile` in project root
- Stage 1: `node:18-alpine` builds Vue3 frontend
- Stage 2: `python:3.11-slim-bullseye` installs Python deps + nginx
- Entrypoint: `deploy/entrypoint.sh`
- Nginx config: `deploy/web.conf`
- Exposed port: 80
Environment Configuration
Required env vars (backend):
- `SECRET_KEY` - JWT signing secret (must override `"CHANGE_ME_DEV_ONLY"`)
- `CLICKHOUSE_HOST` - ClickHouse server address
- `CLICKHOUSE_USER` / `CLICKHOUSE_PASS` - ClickHouse credentials
- MySQL connection (embedded in `TORTOISE_ORM` dict or override connection string)
Optional env vars:
- `APP_HOST` / `APP_PORT` / `UVICORN_WORKERS` - Server binding
- `REDIS_HOST` / `REDIS_PORT` / `REDIS_PASS` - Distributed lock backend
- `SMTP_HOST` / `SMTP_USER` / `SMTP_PASS` / `SMTP_FROM` / `SMTP_TO` - Email alerting
- `REPORT_ENDPOINT` - Webhook for stats reporting
- `RUN_MIGRATIONS_ON_STARTUP` / `INITIALIZE_SEED_DATA_ON_STARTUP` - Startup behavior flags
Spider env vars:
- `API_BASE_URL` - Backend URL for data push (default: `http://124.222.106.226:9999`)
- `SLEEP_MIN_SECONDS` / `SLEEP_MAX_SECONDS` - Request rate limiting
- `PROXY_USERNAME` / `PROXY_PASSWORD` / `PROXY_TUNNEL` / `PROXY_SCHEME` - Proxy pool
- `MAX_PAGES` / `PAGE_SIZE` - Crawl depth control
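A spider could turn `SLEEP_MIN_SECONDS` and `SLEEP_MAX_SECONDS` into a randomized pause between requests roughly as sketched below. The function names are hypothetical, and the 10 and 20 second fallbacks are an assumption that simply mirrors the 10-20s delay reported for the Boss spider. The real defaults are not recorded in this audit.

```python
import os
import random
import time
from typing import Mapping


def pick_delay(env: Mapping[str, str] = os.environ, rng=random) -> float:
    """Choose a random delay in seconds within the configured bounds."""
    low = float(env.get("SLEEP_MIN_SECONDS", "10"))   # assumed fallback
    high = float(env.get("SLEEP_MAX_SECONDS", "20"))  # assumed fallback
    if low > high:
        low, high = high, low  # Tolerate swapped settings instead of failing.
    return rng.uniform(low, high)


def polite_sleep() -> None:
    """Pause between requests so traffic does not arrive at a fixed cadence."""
    time.sleep(pick_delay())
```

Drawing the delay from a range rather than using a constant makes request timing less regular, which is the stated purpose of the rate-limiting variables.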
Alibaba Cloud (ECS pipeline):
- Uses `alibabacloud_credentials` default credential chain (env vars `ALIBABA_CLOUD_ACCESS_KEY_ID` / `ALIBABA_CLOUD_ACCESS_KEY_SECRET` or `~/.alibabacloud/credentials`)
Secrets location: All secrets are environment variables; no secrets manager. config.py contains hardcoded defaults that are insecure for production use.
Webhooks & Callbacks
Incoming:
- `/api/v1/universal/data/batch-store-async` - Spider data ingestion endpoint (called by all three crawlers)
  - Header auth: `token: dev` (or `API_TOKEN` env var)
  - Body: `{"data_list": [...], "data_type": "job"|"company", "platform": "boss"|"qcwy"|"zhilian"}`
- `/api/v1/keyword/available` - Spiders poll this to get next keyword to crawl
- `/api/v1/token/tokens` - Spiders fetch active Boss MPT tokens
- `/api/v1/pipeline` - Manual trigger for ECS full pipeline
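The ingestion contract above (a `token` header plus a three-field JSON body) can be assembled as in the following sketch. The helper name, the record fields, and the `http://localhost:9999` fallback are placeholders. The path, header name, body keys, and allowed values come from the endpoint description, while the assumption that the endpoint accepts a JSON POST is not confirmed by this audit.

```python
import json
import os
from typing import Dict, List, Optional, Tuple

INGEST_PATH = "/api/v1/universal/data/batch-store-async"
DATA_TYPES = {"job", "company"}
PLATFORMS = {"boss", "qcwy", "zhilian"}


def build_ingest_request(
    records: List[dict],
    data_type: str,
    platform: str,
    base_url: Optional[str] = None,
    token: Optional[str] = None,
) -> Tuple[str, Dict[str, str], bytes]:
    """Return (url, headers, body) for a batch upload, validating the enums first."""
    if data_type not in DATA_TYPES:
        raise ValueError(f"unsupported data_type: {data_type!r}")
    if platform not in PLATFORMS:
        raise ValueError(f"unsupported platform: {platform!r}")
    # Placeholder fallback; the spiders' real default is listed under Spider env vars.
    base = (base_url or os.getenv("API_BASE_URL", "http://localhost:9999")).rstrip("/")
    headers = {
        "token": token or os.getenv("API_TOKEN", "dev"),
        "Content-Type": "application/json",
    }
    body = json.dumps(
        {"data_list": records, "data_type": data_type, "platform": platform},
        ensure_ascii=False,  # Keep Chinese job titles readable in the payload.
    ).encode("utf-8")
    return base + INGEST_PATH, headers, body
```

Rejecting unknown `data_type` and `platform` values on the spider side surfaces configuration mistakes before any network call is made.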
Outgoing:
- `REPORT_ENDPOINT` (configurable) - Stats payload POST every 6 hours (`app/core/scheduler.py`)
- `http://external-data.qixin.com/...` - Company data push to Qixin (currently disabled/commented out in `app/services/ingest/remote_push.py`)
- SMTP email notifications on scheduler events
Integration audit: 2026-03-21