JobData/.planning/STATE.md
win d7c8bec287 docs(01-01): complete crawler_core package plan — SUMMARY, STATE, ROADMAP updates
- Create 01-01-SUMMARY.md with implementation details and interface contracts
- STATE.md: advance to plan 2, record metrics, add decisions from plan 01
- ROADMAP.md: update phase 1 plan progress (1/2 plans complete)
- REQUIREMENTS.md: mark ARCH-01, ARCH-02, QUAL-04, QUAL-05 complete
- crawler_core/__init__.py: preserve linter-added try/except ImportError guard
2026-03-21 18:14:19 +08:00

---
gsd_state_version: 1.0
milestone: v1.0
milestone_name: milestone
status: unknown
last_updated: "2026-03-21T10:12:08.079Z"
progress:
  total_phases: 6
  completed_phases: 0
  total_plans: 2
  completed_plans: 1
---
# STATE: JobData Crawler Interaction Refactor
**Last updated:** 2026-03-21T10:12Z
**Milestone:** Crawler Interaction Refactor v1
**Stopped At:** Completed 01-shared-core Plan 01 (crawler_core package)
---
## Project Reference
**Core value:** Drive the crawlers by keyword to fetch job data, load it reliably into ClickHouse, and collect and sync company information on a schedule
**Current focus:** Phase 01 — shared-core
---
## Current Position
Phase: 01 (shared-core) — EXECUTING
Plan: 2 of 2
## Performance Metrics
| Metric | Value |
|--------|-------|
| Plans completed | 1 |
| Plans total (estimated) | TBD |
| Phases completed | 0/6 |
| Requirements mapped | 19/19 |
---
| Phase / Plan | Duration | Tasks | Files |
|--------------|----------|-------|-------|
| Phase 01-shared-core P01 | 15min | 3 tasks | 8 files |
## Accumulated Context
### Key Decisions
| Decision | Rationale | Status |
|----------|-----------|--------|
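| Extract crawler_core/ first, then change platform code | Prevents the signing code from being copy-pasted again | Active |
| Run old and new code in parallel behind a feature flag | Keeps production crawlers running during the refactor | Active |
| Deprecate jobs_spider/ eventually | Migrate to spiderJobs/ + crawler_core | Active |
| Synchronous core + asyncio.to_thread() bridge | requests_go does not support async; keeps the TLS fingerprint | Active |
| Deduplicate within a 30-day time window | Avoids full-table scans in ClickHouse | Active |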
| pyproject.toml uses where=['..'] | pip install -e ./crawler_core resolves from repo root | Decided (01-01) |
| tenacity min=10s wait | STACK.md mandatory 10s anti-detection delay preserved | Decided (01-01) |
| stdlib logging in crawler_core | No loguru dependency; loguru bridge deferred to Phase 5 | Decided (01-01) |
| Result[T] replaces ApiResult | Typed generic eliminates untyped Any in return values | Decided (01-01) |
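
The `Result[T]` decision in the table above can be illustrated with a minimal sketch. The field names (`ok`, `value`, `error`), the `success`/`failure` constructors, and the `parse_job_count` caller are all assumptions made for illustration; the actual interface is the one recorded in 01-01-SUMMARY.md.

```python
from dataclasses import dataclass
from typing import Generic, Optional, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Result(Generic[T]):
    """Hypothetical typed result wrapper; field names are assumptions."""

    ok: bool
    value: Optional[T] = None
    error: Optional[str] = None

    @classmethod
    def success(cls, value: T) -> "Result[T]":
        # Wrap a successful payload of type T.
        return cls(ok=True, value=value)

    @classmethod
    def failure(cls, error: str) -> "Result[T]":
        # Carry an error message and no payload.
        return cls(ok=False, error=error)


def parse_job_count(raw: str) -> Result[int]:
    """Illustrative caller: returns Result[int] instead of an untyped payload."""
    try:
        return Result.success(int(raw))
    except ValueError as exc:
        return Result.failure(str(exc))
```

With a return type of `Result[int]`, a type checker can flag callers that treat `value` as anything other than an optional integer, which is the benefit the rationale column describes.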
### Architecture Notes
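- `crawler_core/` lives at the project root and is locally installable via `pip install -e ./crawler_core`
- The core stays synchronous (requests_go); the backend calls it through asyncio.to_thread()
- SmartIPManager must be a singleton in FastAPI to avoid losing session state
- ClickHouse table schemas stay unchanged; MySQL model changes go through Aerich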
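
The synchronous-core plus `asyncio.to_thread()` bridge noted above could look roughly like the sketch below. `fetch_sync` is a hypothetical stand-in for a blocking `crawler_core` call (the real one would go through requests_go), and `fetch_async` is an assumed name for the backend-side wrapper. `asyncio.to_thread()` requires Python 3.9 or later.

```python
import asyncio
import time


def fetch_sync(keyword: str) -> dict:
    """Hypothetical blocking call; stands in for a requests_go-based fetch."""
    time.sleep(0.05)  # simulates blocking network I/O
    return {"keyword": keyword, "jobs": []}


async def fetch_async(keyword: str) -> dict:
    """Async facade: runs the blocking call in a worker thread."""
    return await asyncio.to_thread(fetch_sync, keyword)


async def main() -> list:
    # The two fetches overlap because each runs in its own worker thread,
    # so the event loop is never blocked by the synchronous core.
    return await asyncio.gather(fetch_async("python"), fetch_async("golang"))
```

Running `asyncio.run(main())` returns the two results in call order. The synchronous function is left untouched, so the TLS fingerprint behaviour of the underlying client is unaffected by the bridge.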
### Known Risks
| Risk | Mitigation |
|------|-----------|
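| Refactor interrupts production crawlers | Run old and new code in parallel behind a feature flag; keep the legacy path until Phase 4 is complete |
| Anti-bot session state is lost | Make SmartIPManager a singleton; verify this as a priority in Phase 4 |
| No test baseline | Phase 1 writes tests for the pure signing functions first; Phase 2 adds HTTP mock tests alongside |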
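
One possible way to get the SmartIPManager singleton mentioned in the mitigation column is a cached factory function. The class body below is a placeholder (the real attributes and methods are not described in this file), and `get_ip_manager` is a hypothetical name.

```python
import threading
from functools import lru_cache
from typing import Dict


class SmartIPManager:
    """Placeholder for the real class; attributes here are assumptions."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.sessions: Dict[str, str] = {}

    def remember(self, platform: str, session_id: str) -> None:
        # Guard shared state, since worker threads may call this concurrently.
        with self._lock:
            self.sessions[platform] = session_id


@lru_cache(maxsize=1)
def get_ip_manager() -> SmartIPManager:
    """Builds the manager on first call; later calls return the cached object."""
    return SmartIPManager()
```

A function like this could be passed to FastAPI's `Depends` so that every request handler shares one instance. Note that `lru_cache` does not stop two threads from both running the factory if they race on the very first call, so creating the instance once at application startup is the safer variant.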
### Open TODOs
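- [ ] Confirm the Python package name for crawler_core/ (suggested: `crawler_core`, to avoid clashing with system packages)
- [ ] Confirm how the feature flag is implemented (environment variable vs. config file)
- [ ] Phase 3 can run in parallel with Phase 2 (the two platforms are independent); decide based on progress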
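
For the open feature-flag question, the environment-variable option might be sketched as follows. The variable name `USE_CRAWLER_CORE`, the accepted truthy values, and both function names are assumptions; the returned strings are placeholders for the real code paths.

```python
import os
from typing import Mapping, Optional

_TRUTHY = {"1", "true", "yes", "on"}


def use_crawler_core(env: Optional[Mapping[str, str]] = None) -> bool:
    """Reads the hypothetical USE_CRAWLER_CORE flag; unset means legacy path."""
    env = os.environ if env is None else env
    return env.get("USE_CRAWLER_CORE", "").strip().lower() in _TRUTHY


def search_jobs(keyword: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Dispatches to the new or the legacy path; return values are placeholders."""
    if use_crawler_core(env):
        return f"crawler_core:{keyword}"
    return f"legacy:{keyword}"
```

Defaulting to the legacy path when the variable is unset keeps existing deployments unchanged until the flag is switched on explicitly. Accepting an `env` mapping makes both branches testable without touching the real process environment.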
### Blockers
None currently.
---
## Session Continuity
**To resume this project:**
1. Check current phase in this file
2. Read `.planning/ROADMAP.md` for phase goals and success criteria
3. Read `.planning/REQUIREMENTS.md` for requirement coverage
4. Run `git status` to see uncommitted work
5. Start with `/gsd:plan-phase {current_phase}` if no active plan
**Key files:**
- `.planning/ROADMAP.md` — Phase structure and success criteria
- `.planning/REQUIREMENTS.md` — All v1 requirements with traceability
- `.planning/PROJECT.md` — Project context and constraints
- `crawler_core/` — CREATED: shared package with HTTPClient, BaseFetcher, BaseSearcher, Result[T]
- `spiderJobs/` — external scripts using crawler_core
- `app/services/crawler/` — backend facade layer
---
*State initialized: 2026-03-21*