JobData/.planning/STATE.md
2026-03-21 17:00:12 +08:00


STATE: JobData Crawler Interaction Refactor

Last updated: 2026-03-21
Milestone: Crawler Interaction Refactor v1


Project Reference

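Core value: Keyword-driven crawlers fetch job-posting data, load it reliably into ClickHouse, and collect and sync company information on a schedule.

Current focus: Phase 1, extracting the crawler_core/ shared core package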


Current Position

| Field | Value |
|---|---|
| Phase | 1 |
| Phase name | Shared core package |
| Plan | None (not started) |
| Status | Not started |

Progress:

Phase 1 ░░░░░░░░░░░░░░░░░░░░  0%
Phase 2 ░░░░░░░░░░░░░░░░░░░░  0%
Phase 3 ░░░░░░░░░░░░░░░░░░░░  0%
Phase 4 ░░░░░░░░░░░░░░░░░░░░  0%
Phase 5 ░░░░░░░░░░░░░░░░░░░░  0%
Phase 6 ░░░░░░░░░░░░░░░░░░░░  0%
─────────────────────────────
Overall  0/6 phases complete

Performance Metrics

| Metric | Value |
|---|---|
| Plans completed | 0 |
| Plans total (estimated) | TBD |
| Phases completed | 0/6 |
| Requirements mapped | 19/19 |

Accumulated Context

Key Decisions

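| Decision | Rationale | Status |
|---|---|---|
| Extract crawler_core/ first, then change platform code | Prevents the signing code from being copy-pasted again | Active |
| Run old and new code in parallel behind a feature flag | Live crawlers keep running during the refactor | Active |
| Retire jobs_spider/ eventually | Migrate to spiderJobs/ + crawler_core | Active |
| Synchronous core + asyncio.to_thread() bridge | requests_go does not support async; keeps the TLS fingerprint | Active |
| Deduplicate within a 30-day time window | Avoids full-table scans in ClickHouse | Active |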
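
The 30-day deduplication decision could be sketched roughly as below. This is only an illustration under assumptions: the table and column names (`jobs`, `job_id`, `crawl_time`) and both helper functions are hypothetical, and the pyformat-style placeholders would need adapting to whichever ClickHouse client the project uses.

```python
from datetime import datetime, timedelta

DEDUP_WINDOW_DAYS = 30  # window size taken from the decision table


def build_seen_ids_query(candidate_ids, now, window_days=DEDUP_WINDOW_DAYS):
    """Return (sql, params) that looks up candidate IDs inside the window only.

    Table and column names (jobs, job_id, crawl_time) are hypothetical.
    """
    window_start = now - timedelta(days=window_days)
    sql = (
        "SELECT job_id FROM jobs "
        "WHERE crawl_time >= %(window_start)s "
        "AND job_id IN %(candidate_ids)s"
    )
    params = {
        "window_start": window_start,
        "candidate_ids": tuple(candidate_ids),
    }
    return sql, params


def drop_seen(records, seen_ids):
    """Keep only records whose job_id was not found inside the window."""
    seen = set(seen_ids)
    return [r for r in records if r["job_id"] not in seen]
```

Bounding the lookup by `crawl_time` lets ClickHouse skip older data instead of scanning the whole table, at the cost of re-inserting a posting that reappears after more than 30 days.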

Architecture Notes

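  • crawler_core/ lives in the project root and is installable locally with pip install -e ./crawler_core
  • The core stays synchronous (requests_go); the backend calls it through asyncio.to_thread()
  • SmartIPManager must be a singleton in FastAPI to avoid losing session state
  • ClickHouse table schemas stay unchanged; MySQL model changes go through Aerich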
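
A minimal sketch of the sync-core bridge and the singleton manager described in the notes above. The `SmartIPManager` class here is only a stand-in (the real class is not shown in this file), and `get_ip_manager`, `fetch_jobs_sync`, and `fetch_jobs` are hypothetical names; the real blocking call would go through requests_go.

```python
import asyncio
import threading
from functools import lru_cache


class SmartIPManager:
    """Stand-in for the project's SmartIPManager; the real class is not shown here."""

    def __init__(self):
        self._lock = threading.Lock()
        self.request_count = 0  # placeholder for real session state

    def record_request(self):
        # Worker threads may call this concurrently, so guard the counter.
        with self._lock:
            self.request_count += 1


@lru_cache(maxsize=1)
def get_ip_manager():
    """Create the manager once per process and return the same instance afterwards."""
    return SmartIPManager()


def fetch_jobs_sync(keyword):
    """Hypothetical blocking core call; the real one would use requests_go."""
    get_ip_manager().record_request()
    return {"keyword": keyword, "items": []}


async def fetch_jobs(keyword):
    """Async facade: run the blocking call in a worker thread."""
    return await asyncio.to_thread(fetch_jobs_sync, keyword)
```

In a FastAPI route, `get_ip_manager` could be passed to `Depends` so every request shares one instance. Note that `lru_cache` may run the factory more than once if two threads race on the very first call, so calling `get_ip_manager()` once at startup is a safer choice.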

Known Risks

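| Risk | Mitigation |
|---|---|
| Refactor interrupts live crawlers | Run old and new code in parallel behind a feature flag; keep the old path until Phase 4 is complete |
| Anti-crawling session state is lost | Make SmartIPManager a singleton; verify closely in Phase 4 |
| No test baseline | Phase 1 starts with tests for the pure signing functions; Phase 2 adds HTTP mock tests alongside |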
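
To illustrate what the Phase 1 tests for pure signing functions might look like, here is a pytest-style sketch. The signer is purely hypothetical (HMAC-SHA256 over sorted parameters) and is not the algorithm any target platform actually uses; only the shape of the tests is the point.

```python
import hashlib
import hmac


def sign_params(params, secret):
    """Hypothetical signer: HMAC-SHA256 over the sorted key=value pairs."""
    canonical = "&".join(f"{k}={params[k]}" for k in sorted(params))
    return hmac.new(secret.encode(), canonical.encode(), hashlib.sha256).hexdigest()


def test_signature_is_deterministic():
    params = {"keyword": "python", "page": 1}
    assert sign_params(params, "s3cret") == sign_params(params, "s3cret")


def test_signature_ignores_param_order():
    a = {"keyword": "python", "page": 1}
    b = {"page": 1, "keyword": "python"}
    assert sign_params(a, "s3cret") == sign_params(b, "s3cret")


def test_signature_changes_with_input():
    base = sign_params({"keyword": "python"}, "s3cret")
    assert sign_params({"keyword": "golang"}, "s3cret") != base
    assert sign_params({"keyword": "python"}, "other") != base
```

Because the function takes all of its inputs as arguments and touches no network or clock, these tests need no mocks. For the real signers, outputs captured from the current code could be pinned as fixed expected values before the extraction.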

Open TODOs

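  • Confirm the Python package name for crawler_core/ (crawler_core is suggested, to avoid clashing with system packages)
  • Confirm how the feature flag is implemented (environment variable vs. config file)
  • Phase 3 can run in parallel with Phase 2 (the two platforms are independent); decide based on progress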
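
If the environment-variable option is chosen for the feature flag, the switch might look roughly like this. The variable name `USE_CRAWLER_CORE` and the two crawl functions are assumptions made for illustration, not names from the codebase.

```python
import os

_TRUE_VALUES = {"1", "true", "yes", "on"}


def use_new_crawler(environ=None):
    """Read the hypothetical USE_CRAWLER_CORE flag; default to the old path."""
    env = os.environ if environ is None else environ
    return env.get("USE_CRAWLER_CORE", "").strip().lower() in _TRUE_VALUES


def crawl_legacy(keyword):
    """Placeholder for the existing code path."""
    return f"legacy:{keyword}"


def crawl_with_core(keyword):
    """Placeholder for the new crawler_core-based path."""
    return f"core:{keyword}"


def crawl(keyword, environ=None):
    """Dispatch to the new or old implementation based on the flag."""
    impl = crawl_with_core if use_new_crawler(environ) else crawl_legacy
    return impl(keyword)
```

Defaulting to the legacy path means an unset or misspelled variable keeps the live crawlers on known-good code. A config-file variant would only need to replace `use_new_crawler`.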

Blockers

None currently.


Session Continuity

To resume this project:

  1. Check current phase in this file
  2. Read .planning/ROADMAP.md for phase goals and success criteria
  3. Read .planning/REQUIREMENTS.md for requirement coverage
  4. Run git status to see uncommitted work
  5. Start with /gsd:plan-phase {current_phase} if no active plan

Key files:

  • .planning/ROADMAP.md — Phase structure and success criteria
  • .planning/REQUIREMENTS.md — All v1 requirements with traceability
  • .planning/PROJECT.md — Project context and constraints
  • crawler_core/ — (to be created) shared package
  • spiderJobs/ — external scripts using crawler_core
  • app/services/crawler/ — backend facade layer

State initialized: 2026-03-21