JobData/.planning/codebase/INTEGRATIONS.md

External Integrations

Analysis Date: 2026-03-21

APIs & External Services

Recruitment Platforms (scraped, not official APIs):

  • Boss Zhipin (Boss直聘) - Job and company data via WeChat mini-program endpoints

    • Client: requests (sync) in jobs_spider/boss/boos_api.py
    • Endpoints scraped: /wapi/zpgeek/miniapp/search/joblist.json, /wapi/batch/batchRunV3, /wapi/zpgeek/miniapp/brand/detail.json
    • Auth: MPT Token (stored in MySQL boss_token table, managed via app/api/v1/token/)
    • Anti-detection: SmartIPManager, IPAnomalyDetector, random 10-20s delays, session rebuilding on ban
    • JS signing: PyExecJS executes JS sign algorithms (jobs_spider/boss/boos_api.py)
  • 51job / Qiancheng Wuyou (前程无忧) - Job data via platform API

    • Client: httpx + requests in jobs_spider/qcwy/qcwy.py
    • Auth: HMAC signature computed per-request
    • Proxy: Hardcoded proxy URL present in jobs_spider/qcwy/qcwy.py (line 31) - security risk
  • Zhilian Zhaopin (智联招聘) - Job and company data

    • Client: requests in jobs_spider/zhilian/zhilian_single.py
    • Sign client: app/services/crawler/_zhilian_sign.py
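
The 51job entry above mentions an HMAC signature computed per request. A minimal sketch of that general pattern follows, assuming HMAC-SHA256 over the request path plus the sorted query string. The real key, hash algorithm, and message layout used in jobs_spider/qcwy/qcwy.py are not documented here, so `sign_request` and `SIGN_KEY` are hypothetical names.

```python
import hashlib
import hmac
from urllib.parse import urlencode

# Hypothetical key; the real signing key is not documented in this file.
SIGN_KEY = b"example-key"


def sign_request(path: str, params: dict, key: bytes = SIGN_KEY) -> str:
    """Return a hex HMAC-SHA256 over the path and the sorted query string."""
    # Sorting makes the signature independent of parameter order.
    query = urlencode(sorted(params.items()))
    message = f"{path}?{query}".encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


signature = sign_request("/search", {"keyword": "python", "page": 1})
```

Because the signature covers the parameters, it has to be recomputed for every page or keyword change, which is what "per-request" implies.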

Qixin.com (企信) - Company data enrichment (disabled):

  • Endpoint: http://external-data.qixin.com/extend/extend_data_push
  • Auth: MD5 HMAC with timestamp + salt (salt hardcoded in app/services/ingest/remote_push.py line 48)
  • Status: Push code commented out; function push_to_remote() currently only logs

Webhook / Report Endpoint:

  • Outgoing: Configurable via REPORT_ENDPOINT env var (default: none)
  • Used by: app/core/scheduler.py _post_with_retry() to POST stats JSON every 6 hours
  • Timeout: REPORT_TIMEOUT (default 10s), retries: REPORT_MAX_RETRIES (default 3)
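
The timeout and retry settings above could combine roughly as follows. This is a sketch, not the code in app/core/scheduler.py: it assumes REPORT_MAX_RETRIES counts total attempts, uses a linear backoff that the source does not mention, and takes the sender as a parameter (`send`, a hypothetical hook) so the retry logic can be exercised without a network.

```python
import json
import os
import time
import urllib.error
import urllib.request

REPORT_ENDPOINT = os.getenv("REPORT_ENDPOINT")  # unset means reporting is off
REPORT_TIMEOUT = float(os.getenv("REPORT_TIMEOUT", "10"))
REPORT_MAX_RETRIES = int(os.getenv("REPORT_MAX_RETRIES", "3"))


def http_post_json(url, payload, timeout):
    """POST payload as JSON and return the HTTP status code."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status


def post_with_retry(url, payload, send=http_post_json,
                    timeout=REPORT_TIMEOUT, max_retries=REPORT_MAX_RETRIES,
                    backoff=1.0):
    """Try the POST up to max_retries times; return True on a 2xx response."""
    if not url:
        return False  # endpoint not configured
    for attempt in range(1, max_retries + 1):
        try:
            if 200 <= send(url, payload, timeout) < 300:
                return True
        except (urllib.error.URLError, OSError):
            pass  # network and HTTP errors count as a failed attempt
        if attempt < max_retries:
            time.sleep(backoff * attempt)  # linear backoff (assumption)
    return False
```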

Data Storage

Databases:

MySQL (Primary business database):

  • Connection: mysql://root:jobdata123@121.4.126.241:3306/job_data (default hardcoded in app/settings/config.py; override via TORTOISE_ORM env or individual components)
  • Client: Tortoise-ORM 0.23.0 with aiomysql driver
  • Migrations: Aerich (migrations/ directory, runs on startup if RUN_MIGRATIONS_ON_STARTUP=True)
  • Tables: user, role, api, menu, dept, auditlog, boss_token, cleaning_*, scheduled_task_run, stats_total, ip_upload_stats
  • Models: app/models/admin.py, app/models/token.py, app/models/cleaning.py, app/models/metrics.py, app/models/keyword.py, app/models/company.py
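
One way to avoid the hardcoded default is to assemble the connection URL from individual environment variables. The sketch below assumes hypothetical variable names (MYSQL_USER, MYSQL_PASSWORD, MYSQL_HOST, MYSQL_PORT, MYSQL_DATABASE) and a hypothetical "app.models" module path; the names actually read by app/settings/config.py are not listed in this document. The dict layout follows the standard Tortoise-ORM config shape.

```python
import os
from urllib.parse import quote_plus


def build_mysql_url(env=os.environ):
    """Assemble a mysql:// URL from individual (hypothetical) env vars."""
    user = env.get("MYSQL_USER", "root")
    # Percent-encode the password so characters like "@" do not break the URL.
    password = quote_plus(env.get("MYSQL_PASSWORD", ""))
    host = env.get("MYSQL_HOST", "127.0.0.1")
    port = env.get("MYSQL_PORT", "3306")
    database = env.get("MYSQL_DATABASE", "job_data")
    return f"mysql://{user}:{password}@{host}:{port}/{database}"


TORTOISE_ORM = {
    "connections": {"default": build_mysql_url()},
    "apps": {
        "models": {
            # "app.models" is a placeholder; aerich.models is needed for migrations.
            "models": ["app.models", "aerich.models"],
            "default_connection": "default",
        }
    },
}
```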

ClickHouse (Analytics / raw job data):

  • Connection: host 121.4.126.241, port 8123, DB job_data (configured in app/settings/config.py)
  • Env vars: CLICKHOUSE_HOST, CLICKHOUSE_PORT, CLICKHOUSE_USER, CLICKHOUSE_PASS, CLICKHOUSE_DB
  • Client: clickhouse-connect==0.7.19 async client (app/core/clickhouse.py)
  • Connection pool: urllib3.PoolManager with num_pools=2, maxsize=5 per worker
  • Tables managed via DDL in app/core/clickhouse_init.py (NOT Aerich migrations)
  • Tables: boss_job, boss_company, qcwy_job, qcwy_company, zhilian_job, zhilian_company (MergeTree engine), pending_company (ReplacingMergeTree), job_analytics (VIEW)
  • Queries: Repository pattern in app/repositories/clickhouse_repo.py

File Storage:

  • Local filesystem only (no S3 / object storage detected)
  • Static files served from static/ directory mounted via FastAPI (app/__init__.py)
  • Spider logs written to logs/ directory relative to spider working dir
  • ECS pipeline log: ecs_full_pipeline.log (root project directory)

Caching:

  • Redis (optional): Used only for distributed locking in app/core/locks.py
    • Env vars: REDIS_HOST, REDIS_PORT (default 6379), REDIS_DB (default 0), REDIS_PASS
    • Falls back to filesystem-based TTL locks when Redis unavailable
    • Lock files stored in system temp directory: {tempdir}/jobdata_lock_{name}/
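
The filesystem fallback can be pictured as a directory-based lock with an expiry. This is a sketch of the idea only, assuming the lock is the directory itself and that its modification time marks when it was taken; the actual logic in app/core/locks.py may differ, and `acquire_lock` / `release_lock` are hypothetical names.

```python
import os
import shutil
import tempfile
import time


def lock_dir(name):
    """Path of the lock directory: {tempdir}/jobdata_lock_{name}."""
    return os.path.join(tempfile.gettempdir(), f"jobdata_lock_{name}")


def acquire_lock(name, ttl_seconds, now=time.time):
    """Return True if the lock was taken, False if a live lock exists."""
    path = lock_dir(name)
    try:
        os.mkdir(path)  # atomic: fails if the directory already exists
        return True
    except FileExistsError:
        pass
    try:
        age = now() - os.path.getmtime(path)
    except FileNotFoundError:
        age = ttl_seconds + 1  # holder released it in the meantime
    if age <= ttl_seconds:
        return False
    # Stale lock: remove it and try once more.
    shutil.rmtree(path, ignore_errors=True)
    try:
        os.mkdir(path)
        return True
    except FileExistsError:
        return False  # another process won the race


def release_lock(name):
    shutil.rmtree(lock_dir(name), ignore_errors=True)
```

The TTL keeps a crashed worker from holding the lock forever, at the cost of a small window in which two workers can race to reclaim a stale lock.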

Authentication & Identity

Auth Provider: Custom JWT implementation (no third-party auth provider)

  • Implementation: app/core/dependency.py (DependPermission dependency)
  • Algorithm: HS256
  • Expiry: 7 days (60 * 24 * 7 minutes, app/settings/config.py)
  • Secret: SECRET_KEY (default "CHANGE_ME_DEV_ONLY" - must override in production)
  • Storage (frontend): JWT stored in localStorage, injected via axios request interceptor (web/src/utils/http/interceptors.js)
  • RBAC: Route-level permission check via DependPermission; roles link to APIs (many-to-many in MySQL)
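
To make the token settings concrete, here is a standard-library sketch of an HS256 token with the 7-day expiry listed above. The backend presumably uses a JWT library rather than hand-rolled code, so `encode_hs256` and `decode_hs256` are illustrative names only, and the claim layout beyond `exp` is an assumption.

```python
import base64
import hashlib
import hmac
import json
import time

JWT_EXPIRE_MINUTES = 60 * 24 * 7  # 7 days


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def encode_hs256(claims, secret, now=None):
    """Build header.payload.signature with an exp claim 7 days ahead."""
    now = int(time.time()) if now is None else now
    payload = dict(claims, exp=now + JWT_EXPIRE_MINUTES * 60)
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode())
        for part in (header, payload)
    )
    sig = hmac.new(secret.encode(), signing_input.encode(), hashlib.sha256).digest()
    return f"{signing_input}.{_b64url(sig)}"


def decode_hs256(token, secret, now=None):
    """Verify the signature and expiry, then return the payload dict."""
    signing_input, _, sig = token.rpartition(".")
    expected = hmac.new(secret.encode(), signing_input.encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(_b64url(expected), sig):
        raise ValueError("bad signature")
    payload = json.loads(_b64url_decode(signing_input.split(".")[1]))
    now = int(time.time()) if now is None else now
    if payload["exp"] < now:
        raise ValueError("token expired")
    return payload
```

With the default SECRET_KEY left in place, anyone who reads the source can mint valid tokens, which is why the override is mandatory in production.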

Boss Token Management:

  • Boss MPT tokens stored in boss_token MySQL table (app/models/token.py)
  • Managed via app/api/v1/token/ CRUD endpoints
  • Spiders fetch active tokens from /api/v1/token/tokens and mark invalid ones

Monitoring & Observability

Error Tracking: None (no Sentry or equivalent detected)

Logs:

  • Backend: loguru structured logging throughout all modules
  • Spider: loguru with daily rotation (logs/log_{YYYY-MM-DD}.log, 30-day retention)
  • ECS pipeline: File log appended to ecs_full_pipeline.log
  • No centralized log aggregation (no ELK, CloudWatch, etc.)

Alerting:

  • Email alerts via SMTP on: stats job completion, IP anomaly detection
  • SMTP provider: QQ Mail (smtp.qq.com:587) with STARTTLS
  • Env vars: SMTP_HOST, SMTP_PORT, SMTP_USER, SMTP_PASS, SMTP_FROM, SMTP_TO
  • Default recipient hardcoded in app/settings/config.py line 47
  • IP anomaly alerts triggered every 10 minutes by ip_alert_job in app/core/scheduler.py
  • Stats reports sent every 6 hours by stats_job in app/core/scheduler.py
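
Sending an alert through smtp.qq.com:587 with STARTTLS could look roughly like this. It is a sketch built on the standard library and the env vars listed above; `build_alert` and `send_alert` are hypothetical names, and the project's real mail helper may be structured differently.

```python
import os
import smtplib
from email.message import EmailMessage


def build_alert(subject, body, env=os.environ):
    """Build the alert message from SMTP_FROM / SMTP_TO."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = env.get("SMTP_FROM", env.get("SMTP_USER", ""))
    msg["To"] = env.get("SMTP_TO", "")
    msg.set_content(body)
    return msg


def send_alert(msg, env=os.environ):
    """Send the message over an SMTP connection upgraded with STARTTLS."""
    host = env.get("SMTP_HOST", "smtp.qq.com")
    port = int(env.get("SMTP_PORT", "587"))
    with smtplib.SMTP(host, port, timeout=10) as smtp:
        smtp.starttls()  # upgrade the plain connection before logging in
        smtp.login(env["SMTP_USER"], env["SMTP_PASS"])
        smtp.send_message(msg)
```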

Metrics:

  • IP upload tracking: IpTrackingMiddleware (app/core/ip_tracking.py) records per-source, per-IP upload counts to ip_upload_stats MySQL table
  • Task run history: ScheduledTaskRun model (app/models/metrics.py) records all scheduler task outcomes
  • Stats totals: StatsTotal model records ClickHouse row counts per table per run

CI/CD & Deployment

Hosting:

  • Production: Alibaba Cloud ECS (manually provisioned + Docker container)
  • Crawler nodes: Alibaba Cloud ECS spot instances (cn-shanghai-b, ecs.t5-lc1m1.small, Ubuntu 24.04)
  • ECS node config: ecs_full_pipeline.py (20 spot instances, SpotAsPriceGo strategy, auto-release)
  • Vswitch: vsw-uf6twb2lrd02fi7q2i605, Security Group: sg-uf60380ctrux2uusucad

CI Pipeline: None detected (no GitHub Actions, Jenkins, or equivalent)

Container:

  • Multi-stage Dockerfile in project root
  • Stage 1: node:18-alpine builds Vue3 frontend
  • Stage 2: python:3.11-slim-bullseye installs Python deps + nginx
  • Entrypoint: deploy/entrypoint.sh
  • Nginx config: deploy/web.conf
  • Exposed port: 80

Environment Configuration

Required env vars (backend):

  • SECRET_KEY - JWT signing secret (must override "CHANGE_ME_DEV_ONLY")
  • CLICKHOUSE_HOST - ClickHouse server address
  • CLICKHOUSE_USER / CLICKHOUSE_PASS - ClickHouse credentials
  • MySQL connection - defined in the TORTOISE_ORM dict in app/settings/config.py; override it via the TORTOISE_ORM env var or individual connection components

Optional env vars:

  • APP_HOST / APP_PORT / UVICORN_WORKERS - Server binding
  • REDIS_HOST / REDIS_PORT / REDIS_DB / REDIS_PASS - Distributed lock backend
  • SMTP_HOST / SMTP_PORT / SMTP_USER / SMTP_PASS / SMTP_FROM / SMTP_TO - Email alerting
  • REPORT_ENDPOINT - Webhook for stats reporting
  • RUN_MIGRATIONS_ON_STARTUP / INITIALIZE_SEED_DATA_ON_STARTUP - Startup behavior flags

Spider env vars:

  • API_BASE_URL - Backend URL for data push (default: http://124.222.106.226:9999)
  • SLEEP_MIN_SECONDS / SLEEP_MAX_SECONDS - Request rate limiting
  • PROXY_USERNAME / PROXY_PASSWORD / PROXY_TUNNEL / PROXY_SCHEME - Proxy pool
  • MAX_PAGES / PAGE_SIZE - Crawl depth control
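
A spider might turn SLEEP_MIN_SECONDS and SLEEP_MAX_SECONDS into a randomized pause like this. The 10 and 20 second defaults are an assumption borrowed from the Boss delay range mentioned earlier, and `next_delay` is a hypothetical helper name.

```python
import os
import random


def next_delay(env=os.environ, rng=random):
    """Pick a random pause between SLEEP_MIN_SECONDS and SLEEP_MAX_SECONDS."""
    low = float(env.get("SLEEP_MIN_SECONDS", "10"))   # assumed default
    high = float(env.get("SLEEP_MAX_SECONDS", "20"))  # assumed default
    if low > high:
        low, high = high, low  # tolerate swapped values
    return rng.uniform(low, high)
```

A crawl loop would call `time.sleep(next_delay())` between requests so that the request timing is not perfectly regular.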

Alibaba Cloud (ECS pipeline):

  • Uses alibabacloud_credentials default credential chain (env vars ALIBABA_CLOUD_ACCESS_KEY_ID / ALIBABA_CLOUD_ACCESS_KEY_SECRET or ~/.alibabacloud/credentials)

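Secrets location: Secrets are supplied through environment variables; no secrets manager is used. app/settings/config.py contains hardcoded defaults (MySQL credentials, SECRET_KEY) that are insecure for production use, and a proxy URL and a signing salt are hardcoded in source (jobs_spider/qcwy/qcwy.py, app/services/ingest/remote_push.py).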

Webhooks & Callbacks

Incoming:

  • /api/v1/universal/data/batch-store-async - Spider data ingestion endpoint (called by all three crawlers)
    • Header auth: `token` header, with value dev by default or the value of the API_TOKEN env var
    • Body: {"data_list": [...], "data_type": "job"|"company", "platform": "boss"|"qcwy"|"zhilian"}
  • /api/v1/keyword/available - Spiders poll this to get next keyword to crawl
  • /api/v1/token/tokens - Spiders fetch active Boss MPT tokens
  • /api/v1/pipeline - Manual trigger for ECS full pipeline
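
A crawler's call to the ingestion endpoint could be assembled as shown below. The path, header name, and body keys come from the list above; the POST method and the `build_ingest_request` helper are assumptions, and the request is only constructed here, not sent.

```python
import json
import os
import urllib.request

API_BASE_URL = os.getenv("API_BASE_URL", "http://124.222.106.226:9999")
INGEST_PATH = "/api/v1/universal/data/batch-store-async"


def build_ingest_request(data_list, data_type, platform,
                         base_url=API_BASE_URL, token=None):
    """Build (but do not send) the batch-store request for one platform."""
    if data_type not in ("job", "company"):
        raise ValueError(f"unknown data_type: {data_type}")
    if platform not in ("boss", "qcwy", "zhilian"):
        raise ValueError(f"unknown platform: {platform}")
    body = {"data_list": data_list, "data_type": data_type, "platform": platform}
    return urllib.request.Request(
        base_url.rstrip("/") + INGEST_PATH,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Falls back to API_TOKEN, then to the documented default "dev".
            "token": token or os.getenv("API_TOKEN", "dev"),
        },
        method="POST",  # assumption: the source does not state the method
    )
```

Passing the result to `urllib.request.urlopen(req, timeout=10)` would perform the upload.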

Outgoing:

  • REPORT_ENDPOINT (configurable) - Stats payload POST every 6 hours (app/core/scheduler.py)
  • http://external-data.qixin.com/... - Company data push to Qixin (currently disabled/commented out in app/services/ingest/remote_push.py)
  • SMTP email notifications on scheduler events

Integration audit: 2026-03-21