JobData/.planning/STATE.md


| Field | Value |
| --- | --- |
| gsd_state_version | 1.0 |
| milestone | v1.0 |
| milestone_name | milestone |
| status | unknown |
| stopped_at | Phase 6 UI-SPEC approved |
| last_updated | 2026-03-21T14:56:38.784Z |
| progress.total_phases | 7 |
| progress.completed_phases | 4 |
| progress.total_plans | 12 |
| progress.completed_plans | 10 |

STATE: JobData Crawler Interaction Refactor

Last updated: 2026-03-21T10:23Z
Milestone: Crawler Interaction Refactor v1
Stopped At: Phase 6 UI-SPEC approved


Project Reference

Core value: Keyword-driven crawlers fetch job data, load it reliably into ClickHouse, and sync company information on a schedule.

Current focus: Phase 6: Quality & Frontend


Current Position

Phase: 6
Plan: Not started

Performance Metrics

| Metric | Value |
| --- | --- |
| Plans completed | 2 |
| Plans total (estimated) | TBD |
| Phases completed | 0/6 |
| Requirements mapped | 19/19 |

| Phase / Plan | Duration | Tasks | Files |
| --- | --- | --- | --- |
| Phase 01-shared-core P01 | 15min | 3 tasks | 8 files |
| Phase 01-shared-core P02 | 15min | 2 tasks | 7 files |

Accumulated Context

Key Decisions

| Decision | Rationale | Status |
| --- | --- | --- |
| Extract crawler_core/ first, then change platform code | Prevents signing code from being copy-pasted again | Active |
| Run old and new code in parallel behind a feature flag | Keeps live crawlers running during the refactor | Active |
| jobs_spider/ to be retired eventually | Migrate to spiderJobs/ + crawler_core | Active |
| Synchronous core + asyncio.to_thread() bridge | requests_go does not support async; keeps the TLS fingerprint | Active |
| Deduplicate within a 30-day time window | Avoids full-table scans in ClickHouse | Active |
| pyproject.toml uses where=['..'] | pip install -e ./crawler_core resolves from repo root | Decided (01-01) |
| tenacity min=10s wait | STACK.md mandatory 10s anti-detection delay preserved | Decided (01-01) |
| stdlib logging in crawler_core | No loguru dependency; loguru bridge deferred to Phase 5 | Decided (01-01) |
| Result[T] replaces ApiResult | Typed generic eliminates untyped Any in return values | Decided (01-01) |
| Sign algorithms ported verbatim | Behavioral parity with originals; no regression risk | Decided (01-02) |
| No tests/crawler_core/\_\_init\_\_.py | Would shadow crawler_core package causing import failures | Decided (01-02) |
| conftest.py at project root | Reliable sys.path setup for pytest crawler_core imports | Decided (01-02) |
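The "Result[T] replaces ApiResult" decision could look roughly like the sketch below. This is only an illustration of a typed generic result wrapper: the field names (`ok`, `data`, `error`), the `success`/`failure` constructors, and the `parse_job_count` caller are assumptions, not the actual crawler_core definitions.

```python
from dataclasses import dataclass
from typing import Generic, Optional, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Result(Generic[T]):
    """Hypothetical typed result wrapper; field names are assumptions."""

    ok: bool
    data: Optional[T] = None
    error: Optional[str] = None

    @classmethod
    def success(cls, data: T) -> "Result[T]":
        # Successful outcome carrying a typed payload.
        return cls(ok=True, data=data)

    @classmethod
    def failure(cls, error: str) -> "Result[T]":
        # Failed outcome carrying only an error message.
        return cls(ok=False, error=error)


def parse_job_count(raw: str) -> Result[int]:
    """Illustrative caller: the return type says the payload is an int."""
    try:
        return Result.success(int(raw))
    except ValueError as exc:
        return Result.failure(str(exc))
```

With a signature such as `Result[int]`, a type checker can verify how callers use `data`, which is the benefit the rationale column describes over an untyped `Any`.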

Architecture Notes

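  • crawler_core/ lives at the project root and can be installed locally with pip install -e ./crawler_core
  • The core stays synchronous (requests_go); the backend calls it through asyncio.to_thread()
  • SmartIPManager must be a singleton inside FastAPI so that session state is not lost
  • ClickHouse table schemas stay unchanged; MySQL model changes go through Aerich
  • Sign algorithm tests: 41 pure-function unit tests with no network dependency; pytest tests/crawler_core/ runs them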
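Two of the notes above, the synchronous core bridged with asyncio.to_thread() and the single shared manager instance, can be sketched together. Everything here is a stand-in: `SessionManager` is a placeholder for SmartIPManager (whose real API is not shown in this file), and `fetch_sync`, `fetch`, and `get_session_manager` are hypothetical names. The sketch assumes Python 3.9+ for `asyncio.to_thread`.

```python
import asyncio
import threading
import time


class SessionManager:
    """Placeholder for SmartIPManager; the real class is not shown here."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.requests_made = 0

    def record_request(self) -> int:
        # Called from worker threads, so guard the shared counter.
        with self._lock:
            self.requests_made += 1
            return self.requests_made


# One module-level instance shared by every caller in the process.
_MANAGER = SessionManager()


def get_session_manager() -> SessionManager:
    """Always return the same instance so session state persists."""
    return _MANAGER


def fetch_sync(keyword: str) -> dict:
    """Hypothetical blocking fetch standing in for the synchronous core."""
    time.sleep(0.01)  # simulate blocking network I/O
    request_no = get_session_manager().record_request()
    return {"keyword": keyword, "request_no": request_no}


async def fetch(keyword: str) -> dict:
    # Run the blocking call in a worker thread; the event loop stays free.
    return await asyncio.to_thread(fetch_sync, keyword)


async def main() -> list:
    # Both fetches run concurrently; gather keeps the argument order.
    return await asyncio.gather(fetch("python"), fetch("golang"))
```

Calling `asyncio.run(main())` returns one result per keyword, and both calls increment the same shared counter. In a FastAPI app the accessor could be exposed as a dependency, though how the project wires that up is not stated here.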

Known Risks

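| Risk | Mitigation |
| --- | --- |
| Live crawlers interrupted during the refactor | Feature flag runs old and new code in parallel; the old path is kept until Phase 4 completes |
| Anti-crawling session state lost | Make SmartIPManager a singleton; verify closely in Phase 4 |
| No test baseline | Phase 1 writes pure-function signing tests first; Phase 2 adds HTTP mock tests alongside |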
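The feature-flag mitigation could be sketched as follows, assuming the environment-variable option. The actual mechanism is still undecided, and the flag name `USE_CRAWLER_CORE` and both search functions are hypothetical.

```python
import os
from typing import List, Mapping

FLAG_NAME = "USE_CRAWLER_CORE"  # hypothetical flag name


def legacy_search(keyword: str) -> List[str]:
    """Stand-in for the old code path, kept until the migration is done."""
    return [f"legacy:{keyword}"]


def core_search(keyword: str) -> List[str]:
    """Stand-in for the new crawler_core-based code path."""
    return [f"core:{keyword}"]


def flag_enabled(env: Mapping[str, str] = os.environ) -> bool:
    # Treat a small set of truthy strings as "on"; anything else is "off".
    return env.get(FLAG_NAME, "").strip().lower() in {"1", "true", "yes", "on"}


def search(keyword: str, env: Mapping[str, str] = os.environ) -> List[str]:
    # Pick the implementation per call, so flipping the flag needs no redeploy
    # of the other path and the legacy code remains the default.
    impl = core_search if flag_enabled(env) else legacy_search
    return impl(keyword)
```

Passing `env` explicitly keeps the switch easy to test, and defaulting to the legacy path means an unset flag leaves live crawlers unchanged.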

Open TODOs

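  • Confirm the Python package name for crawler_core/ (crawler_core is suggested, to avoid clashing with system packages)
  • Confirm how the feature flag is implemented (environment variable vs. config file)
  • Phase 3 can run in parallel with Phase 2 (the two platforms are independent of each other); decide based on progress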

Blockers

None currently.


Session Continuity

To resume this project:

  1. Check current phase in this file
  2. Read .planning/ROADMAP.md for phase goals and success criteria
  3. Read .planning/REQUIREMENTS.md for requirement coverage
  4. Run git status to see uncommitted work
  5. Start with /gsd:plan-phase {current_phase} if no active plan

Key files:

  • .planning/ROADMAP.md — Phase structure and success criteria
  • .planning/REQUIREMENTS.md — All v1 requirements with traceability
  • .planning/PROJECT.md — Project context and constraints
  • crawler_core/ — CREATED: shared package with HTTPClient, BaseFetcher, BaseSearcher, Result[T], plus sign algorithm subpackages
  • tests/crawler_core/ — CREATED: 41 pure-function sign algorithm unit tests
  • spiderJobs/ — external scripts using crawler_core
  • app/services/crawler/ — backend facade layer

State initialized: 2026-03-21