JobData/.planning/STATE.md

---
gsd_state_version: 1.0
milestone: v1.0
milestone_name: milestone
status: unknown
stopped_at: Phase 6 UI-SPEC approved
last_updated: "2026-03-21T14:56:38.784Z"
progress:
  total_phases: 7
  completed_phases: 4
  total_plans: 12
  completed_plans: 10
---
# STATE: JobData Crawler Interaction Refactor
**Last updated:** 2026-03-21T10:23Z
**Milestone:** Crawler Interaction Refactor v1
**Stopped At:** Phase 6 UI-SPEC approved
---
## Project Reference
**Core value:** Keyword-driven crawlers fetch job postings, load them reliably into ClickHouse, and sync company information on a schedule

**Current focus:** Phase 6: Quality & Frontend
---
## Current Position
Phase: 6
Plan: Not started
## Performance Metrics
| Metric | Value |
|--------|-------|
| Plans completed | 2 |
| Plans total (estimated) | TBD |
| Phases completed | 0/6 |
| Requirements mapped | 19/19 |
---
| Phase / Plan | Duration | Tasks | Files |
|--------------|----------|-------|-------|
| Phase 01-shared-core P01 | 15min | 3 tasks | 8 files |
| Phase 01-shared-core P02 | 15min | 2 tasks | 7 files |
## Accumulated Context
### Key Decisions
| Decision | Rationale | Status |
|----------|-----------|--------|
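| Extract crawler_core/ first, then change platform code | Avoids the signing code being copy-pasted again | Active |
| Run old and new code in parallel behind a feature flag | Keeps the live crawlers running during the refactor | Active |
| jobs_spider/ to be deprecated eventually | Migrates to spiderJobs/ + crawler_core | Active |
| Synchronous core + asyncio.to_thread() bridge | requests_go has no async support; staying on it preserves the TLS fingerprint | Active |
| Deduplicate within a 30-day time window | Avoids full-table scans in ClickHouse | Active |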
| pyproject.toml uses where=['..'] | pip install -e ./crawler_core resolves from repo root | Decided (01-01) |
| tenacity min=10s wait | STACK.md mandatory 10s anti-detection delay preserved | Decided (01-01) |
| stdlib logging in crawler_core | No loguru dependency; loguru bridge deferred to Phase 5 | Decided (01-01) |
| Result[T] replaces ApiResult | Typed generic eliminates untyped Any in return values | Decided (01-01) |
| Sign algorithms ported verbatim | Behavioral parity with originals; no regression risk | Decided (01-02) |
| No tests/crawler_core/__init__.py | Would shadow crawler_core package causing import failures | Decided (01-02) |
| conftest.py at project root | Reliable sys.path setup for pytest crawler_core imports | Decided (01-02) |
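
The "Result[T] replaces ApiResult" decision can be illustrated with a minimal sketch of a typed generic result wrapper. The field names (`ok`, `value`, `error`), the helper constructors, and `parse_job_count` are assumptions for illustration; the actual `Result[T]` in `crawler_core` may be shaped differently.

```python
from dataclasses import dataclass
from typing import Generic, Optional, TypeVar

T = TypeVar("T")


@dataclass
class Result(Generic[T]):
    """Hypothetical typed result wrapper; field names are assumptions."""

    ok: bool
    value: Optional[T] = None
    error: Optional[str] = None

    @classmethod
    def success(cls, value: T) -> "Result[T]":
        return cls(ok=True, value=value)

    @classmethod
    def failure(cls, error: str) -> "Result[T]":
        return cls(ok=False, error=error)

    def unwrap(self) -> T:
        # Raise instead of silently handing back None on failure.
        if not self.ok:
            raise ValueError(self.error)
        return self.value  # type: ignore[return-value]


def parse_job_count(raw: str) -> Result[int]:
    """Illustrative caller: the return type states what a success carries."""
    try:
        return Result.success(int(raw))
    except ValueError:
        return Result.failure(f"not an integer: {raw!r}")
```

A caller that receives `Result[int]` gets a type checker's help on the payload, which an untyped `Any` return value cannot offer.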
### Architecture Notes
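- `crawler_core/` lives at the project root and is locally installable with `pip install -e ./crawler_core`
- The core stays synchronous (requests_go); the backend calls it through asyncio.to_thread()
- SmartIPManager must be a singleton in FastAPI to avoid losing session state
- ClickHouse table schemas stay unchanged; MySQL model changes go through Aerich
- Sign algorithm tests: 41 pure-function unit tests with no network dependency; run them with `pytest tests/crawler_core/`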
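
The synchronous-core note can be sketched as follows, assuming a blocking fetch function that stands in for the requests_go-based client. `fetch_page_sync`, `fetch_page`, and `fetch_many` are hypothetical names, and `time.sleep` simulates the blocking network call.

```python
import asyncio
import time


def fetch_page_sync(keyword: str) -> dict:
    """Stand-in for a blocking crawler call (hypothetical)."""
    time.sleep(0.05)  # simulates blocking network I/O
    return {"keyword": keyword, "jobs": []}


async def fetch_page(keyword: str) -> dict:
    # Run the blocking function in a worker thread so the event loop stays free.
    return await asyncio.to_thread(fetch_page_sync, keyword)


async def fetch_many(keywords: list[str]) -> list[dict]:
    # gather() preserves the order of its inputs in the returned list.
    return await asyncio.gather(*(fetch_page(k) for k in keywords))
```

From an async endpoint this would be awaited directly, for example `await fetch_many(["python", "go"])`. `asyncio.to_thread()` requires Python 3.9 or later.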
### Known Risks
| Risk | Mitigation |
|------|-----------|
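| Refactor interrupts the live crawlers | Feature flag runs old and new code in parallel; the old path stays until Phase 4 is complete |
| Anti-scraping session state is lost | Make SmartIPManager a singleton; Phase 4 focuses on verifying this |
| No test baseline | Phase 1 writes the pure-function sign tests first; Phase 2 adds HTTP mock tests in step |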
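
One common way to get the singleton behaviour named in the mitigation is a cached factory function. The sketch below assumes that approach; `SmartIPManagerStub` and `get_ip_manager` are hypothetical, since the real SmartIPManager interface is not shown in this file.

```python
from functools import lru_cache


class SmartIPManagerStub:
    """Hypothetical stand-in for SmartIPManager; holds per-process session state."""

    def __init__(self) -> None:
        self.session_state: dict[str, str] = {}


@lru_cache(maxsize=1)
def get_ip_manager() -> SmartIPManagerStub:
    # The first call builds the instance; later calls return the cached one,
    # so session state survives across requests in the same process.
    return SmartIPManagerStub()
```

In FastAPI such a factory could be passed to `Depends(get_ip_manager)`. Note that `lru_cache` may run the factory more than once if several threads make the first call at the same moment, so creating the instance once at application startup is the safer variant when worker threads are involved.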
### Open TODOs
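- [ ] Confirm the Python package name for crawler_core/ (suggested: `crawler_core`, to avoid clashing with system packages)
- [ ] Confirm how the feature flag is implemented (environment variable vs. config file)
- [ ] Phase 3 can run in parallel with Phase 2 (the two platforms are independent); decide based on progress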
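
For the open feature-flag question, here is a minimal sketch of the environment-variable option. The flag name `USE_CRAWLER_CORE` and the two search functions are hypothetical placeholders, not names from the codebase.

```python
import os
from typing import Mapping, Optional

FLAG_NAME = "USE_CRAWLER_CORE"  # hypothetical flag name


def flag_enabled(name: str = FLAG_NAME,
                 env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when the flag is set to a truthy string."""
    source = os.environ if env is None else env
    return source.get(name, "").strip().lower() in {"1", "true", "yes", "on"}


def legacy_search(keyword: str) -> str:
    """Placeholder for the old code path."""
    return f"legacy:{keyword}"


def core_search(keyword: str) -> str:
    """Placeholder for the new crawler_core-based path."""
    return f"core:{keyword}"


def search(keyword: str, env: Optional[Mapping[str, str]] = None) -> str:
    # Both paths stay deployed; the flag picks one per call.
    impl = core_search if flag_enabled(env=env) else legacy_search
    return impl(keyword)
```

Defaulting to the legacy path when the flag is unset keeps the live crawlers on known behaviour until the new path is switched on explicitly. A config-file variant would only need a different `env` mapping passed in.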
### Blockers
None currently.
---
## Session Continuity
**To resume this project:**
1. Check current phase in this file
2. Read `.planning/ROADMAP.md` for phase goals and success criteria
3. Read `.planning/REQUIREMENTS.md` for requirement coverage
4. Run `git status` to see uncommitted work
5. Start with `/gsd:plan-phase {current_phase}` if no active plan
**Key files:**
- `.planning/ROADMAP.md` — Phase structure and success criteria
- `.planning/REQUIREMENTS.md` — All v1 requirements with traceability
- `.planning/PROJECT.md` — Project context and constraints
- `crawler_core/` — CREATED: shared package with HTTPClient, BaseFetcher, BaseSearcher, Result[T], plus sign algorithm subpackages
- `tests/crawler_core/` — CREATED: 41 pure-function sign algorithm unit tests
- `spiderJobs/` — external scripts using crawler_core
- `app/services/crawler/` — backend facade layer
---
*State initialized: 2026-03-21*