---
gsd_state_version: 1.0
milestone: v1.0
milestone_name: milestone
status: unknown
stopped_at: Phase 6 UI-SPEC approved
last_updated: "2026-03-21T14:56:38.784Z"
progress:
  total_phases: 7
  completed_phases: 4
  total_plans: 12
  completed_plans: 10
---

# STATE: JobData Crawler Interaction Refactor

**Last updated:** 2026-03-21T10:23Z
**Milestone:** Crawler Interaction Refactor v1
**Stopped At:** Phase 6 UI-SPEC approved

---

## Project Reference

**Core value:** Keyword-driven crawlers fetch job postings, store them reliably in ClickHouse, and collect and sync company information on a schedule
**Current focus:** Phase 6 - Quality & Frontend

---

## Current Position

Phase: 6
Plan: Not started

## Performance Metrics

| Metric | Value |
|--------|-------|
| Plans completed | 2 |
| Plans total (estimated) | TBD |
| Phases completed | 0/6 |
| Requirements mapped | 19/19 |

---

| Phase / Plan | Duration | Tasks | Files |
|--------------|----------|-------|-------|
| Phase 01-shared-core P01 | 15min | 3 tasks | 8 files |
| Phase 01-shared-core P02 | 15min | 2 tasks | 7 files |

## Accumulated Context

### Key Decisions

| Decision | Rationale | Status |
|----------|-----------|--------|
| Extract crawler_core/ first, then change platform code | Prevents the signing code from being copy-pasted again | Active |
| Run old and new code in parallel behind a feature flag | Keeps the live crawlers running during the refactor | Active |
| jobs_spider/ will eventually be deprecated | Migrating to spiderJobs/ + crawler_core | Active |
| Synchronous core + asyncio.to_thread() bridge | requests_go does not support async; keeps the TLS fingerprint | Active |
| 30-day time-window deduplication | Avoids full-table scans on ClickHouse | Active |
| pyproject.toml uses where=['..'] | pip install -e ./crawler_core resolves from repo root | Decided (01-01) |
| tenacity min=10s wait | STACK.md mandatory 10s anti-detection delay preserved | Decided (01-01) |
| stdlib logging in crawler_core | No loguru dependency; loguru bridge deferred to Phase 5 | Decided (01-01) |
| Result[T] replaces ApiResult | Typed generic eliminates untyped Any in return values | Decided (01-01) |
| Sign algorithms ported verbatim | Behavioral parity with originals; no regression risk | Decided (01-02) |
| No `tests/crawler_core/__init__.py` | Would shadow crawler_core package causing import failures | Decided (01-02) |
| conftest.py at project root | Reliable sys.path setup for pytest crawler_core imports | Decided (01-02) |

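
The "synchronous core + asyncio.to_thread() bridge" and "Result[T]" decisions above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual `crawler_core` code: the `Result` fields and the `fetch_jobs_sync` / `fetch_jobs` functions are hypothetical names, and `time.sleep` stands in for a blocking `requests_go` call.

```python
import asyncio
import time
from dataclasses import dataclass
from typing import Generic, Optional, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Result(Generic[T]):
    """Hypothetical typed result; the real crawler_core Result[T] may differ."""
    ok: bool
    data: Optional[T] = None
    error: Optional[str] = None


def fetch_jobs_sync(keyword: str) -> Result[list[str]]:
    """Stand-in for a blocking core call (assumed to use requests_go in practice)."""
    if not keyword:
        return Result(ok=False, error="empty keyword")
    time.sleep(0.01)  # simulate blocking network I/O
    return Result(ok=True, data=[f"{keyword}-job-1", f"{keyword}-job-2"])


async def fetch_jobs(keyword: str) -> Result[list[str]]:
    """Async facade: run the blocking call in a worker thread so the event loop stays free."""
    return await asyncio.to_thread(fetch_jobs_sync, keyword)


async def main() -> list[Result[list[str]]]:
    # Two searches run concurrently, each in its own worker thread.
    return list(await asyncio.gather(fetch_jobs("python"), fetch_jobs("")))


results = asyncio.run(main())
```

Callers branch on `ok` rather than catching exceptions or inspecting an untyped payload, and the synchronous function stays usable from plain scripts without an event loop.
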
### Architecture Notes

- `crawler_core/` lives at the project root and is installable locally with `pip install -e ./crawler_core`
- The core stays synchronous (requests_go); the backend calls it via asyncio.to_thread()
- SmartIPManager must be a singleton in FastAPI to avoid losing session state
- ClickHouse table schemas are unchanged; MySQL model changes go through Aerich
- Sign algorithm tests: 41 pure-function unit tests with no network dependency; run them with `pytest tests/crawler_core/`

### Known Risks

| Risk | Mitigation |
|------|-----------|
| Disrupting the live crawlers during the refactor | Run old and new code in parallel behind a feature flag; keep the old path until Phase 4 is complete |
| Losing anti-crawling session state | Make SmartIPManager a singleton; verify this as a priority in Phase 4 |
| No test baseline | Phase 1 writes pure-function sign tests first; Phase 2 adds HTTP mock tests alongside its work |

### Open TODOs

- [ ] Confirm the Python package name for crawler_core/ (suggested: `crawler_core`, to avoid clashing with system packages)
- [ ] Confirm how the feature flag is implemented (environment variable vs. config file)
- [ ] Phase 3 can run in parallel with Phase 2 (the two platforms are independent); decide based on progress

### Blockers

None currently.

---

## Session Continuity

**To resume this project:**

1. Check current phase in this file
2. Read `.planning/ROADMAP.md` for phase goals and success criteria
3. Read `.planning/REQUIREMENTS.md` for requirement coverage
4. Run `git status` to see uncommitted work
5. Start with `/gsd:plan-phase {current_phase}` if no active plan

**Key files:**

- `.planning/ROADMAP.md` - Phase structure and success criteria
- `.planning/REQUIREMENTS.md` - All v1 requirements with traceability
- `.planning/PROJECT.md` - Project context and constraints
- `crawler_core/` - CREATED: shared package with HTTPClient, BaseFetcher, BaseSearcher, Result[T], plus sign algorithm subpackages
- `tests/crawler_core/` - CREATED: 41 pure-function sign algorithm unit tests
- `spiderJobs/` - external scripts using crawler_core
- `app/services/crawler/` - backend facade layer

---

*State initialized: 2026-03-21*