- Create 01-01-SUMMARY.md with implementation details and interface contracts
- STATE.md: advance to plan 2, record metrics, add decisions from plan 01
- ROADMAP.md: update phase 1 plan progress (1/2 plans complete)
- REQUIREMENTS.md: mark ARCH-01, ARCH-02, QUAL-04, QUAL-05 complete
- crawler_core/__init__.py: preserve linter-added try/except ImportError guard

112 lines
3.6 KiB
Markdown
---
gsd_state_version: 1.0
milestone: v1.0
milestone_name: milestone
status: unknown
last_updated: "2026-03-21T10:12:08.079Z"
progress:
  total_phases: 6
  completed_phases: 0
  total_plans: 2
  completed_plans: 1
---

# STATE: JobData Crawler Interaction Refactor

**Last updated:** 2026-03-21T10:12Z
**Milestone:** Crawler Interaction Refactor v1
**Stopped At:** Completed 01-shared-core Plan 01 (crawler_core package)

---

## Project Reference

**Core value:** Keyword-driven crawlers fetch job data and load it reliably into ClickHouse, with scheduled collection and sync of company information

**Current focus:** Phase 01 (shared-core)

---

## Current Position

Phase: 01 (shared-core), EXECUTING
Plan: 2 of 2

## Performance Metrics

| Metric | Value |
|--------|-------|
| Plans completed | 0 |
| Plans total (estimated) | TBD |
| Phases completed | 0/6 |
| Requirements mapped | 19/19 |

| Plan | Duration | Tasks | Files |
|------|----------|-------|-------|
| Phase 01-shared-core P01 | 15min | 3 tasks | 8 files |

---

## Accumulated Context

### Key Decisions

| Decision | Rationale | Status |
|----------|-----------|--------|
| Extract crawler_core/ first, then change platform code | Prevents signing code from being copy-pasted again | Active |
| Run old and new code in parallel behind a feature flag | Keeps live crawlers running during the refactor | Active |
| jobs_spider/ to be deprecated eventually | Migrate to spiderJobs/ + crawler_core | Active |
| Synchronous core + asyncio.to_thread() bridge | requests_go has no async support; preserves the TLS fingerprint | Active |
| 30-day time window for deduplication | Avoids full-table scans in ClickHouse | Active |
| pyproject.toml uses where=['..'] | pip install -e ./crawler_core resolves from repo root | Decided (01-01) |
| tenacity min=10s wait | STACK.md mandatory 10s anti-detection delay preserved | Decided (01-01) |
| stdlib logging in crawler_core | No loguru dependency; loguru bridge deferred to Phase 5 | Decided (01-01) |
| Result[T] replaces ApiResult | Typed generic eliminates untyped Any in return values | Decided (01-01) |

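The `Result[T]` decision can be illustrated with a minimal sketch. The field names (`ok`, `value`, `error`), the `success`/`failure` constructors, and the `parse_job_count` caller are all hypothetical; the actual `crawler_core` definition may differ.

```python
from dataclasses import dataclass
from typing import Generic, Optional, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Result(Generic[T]):
    """Hypothetical shape: either a typed value or an error message."""

    ok: bool
    value: Optional[T] = None
    error: Optional[str] = None

    @classmethod
    def success(cls, value: T) -> "Result[T]":
        return cls(ok=True, value=value)

    @classmethod
    def failure(cls, error: str) -> "Result[T]":
        return cls(ok=False, error=error)


def parse_job_count(raw: str) -> Result[int]:
    """Illustrative caller: the annotation states what `value` holds."""
    if raw.isdigit():
        return Result.success(int(raw))
    return Result.failure(f"not a number: {raw!r}")
```

With a signature like `Result[int]`, a type checker can flag callers that treat `value` as anything other than an optional integer, which is the gap an untyped `Any` return leaves open.
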
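### Architecture Notes

- `crawler_core/` lives at the project root and is locally installable via `pip install -e ./crawler_core`
- The core stays synchronous (requests_go); the backend calls it through asyncio.to_thread()
- SmartIPManager must be a singleton in FastAPI to avoid losing session state
- ClickHouse table schemas stay unchanged; MySQL model changes go through Aerich
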
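The sync-core-plus-bridge pattern from the notes above might look roughly like this. `BlockingFetcher` is a hypothetical stand-in for a synchronous `crawler_core` class, and the module-level `_fetcher` instance only illustrates the single-shared-instance idea; neither reflects the real package API.

```python
import asyncio
import threading
import time


class BlockingFetcher:
    """Hypothetical stand-in for a synchronous crawler_core fetcher."""

    def fetch(self, keyword: str) -> dict:
        time.sleep(0.05)  # simulates a blocking HTTP call
        return {"keyword": keyword, "thread": threading.current_thread().name}


# One module-level instance shared by every request, so state kept on the
# object survives between calls (the singleton idea, in its simplest form).
_fetcher = BlockingFetcher()


async def fetch_async(keyword: str) -> dict:
    # asyncio.to_thread runs the blocking call in a worker thread,
    # so the event loop stays free while the call waits.
    return await asyncio.to_thread(_fetcher.fetch, keyword)


async def main() -> list:
    # Both calls run concurrently in separate worker threads.
    return await asyncio.gather(fetch_async("python"), fetch_async("golang"))


if __name__ == "__main__":
    print(asyncio.run(main()))
```

`asyncio.to_thread()` requires Python 3.9 or later. Because the shared instance is reached from several worker threads, any mutable state it holds would need its own locking.
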
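### Known Risks

| Risk | Mitigation |
|------|-----------|
| Breaking live crawlers during the refactor | Run old and new code in parallel behind a feature flag; keep the old path until Phase 4 completes |
| Loss of anti-scraping session state | Make SmartIPManager a singleton; verify thoroughly in Phase 4 |
| No test baseline | Phase 1 writes pure-function tests for signing first; Phase 2 adds HTTP mock tests alongside |
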
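### Open TODOs

- [ ] Confirm the Python package name for crawler_core/ (suggested: `crawler_core`, to avoid clashing with system packages)
- [ ] Confirm how feature flags are implemented (environment variables vs. config file)
- [ ] Phase 3 can run in parallel with Phase 2 (the two platforms are independent); decide based on progress
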
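For the feature-flag TODO, the environment-variable option could be sketched as follows. The variable name `USE_CRAWLER_CORE` and the two search functions are hypothetical placeholders, and the choice between environment variables and a config file is still open.

```python
import os
from typing import Mapping, Optional

# Values treated as "enabled"; anything else falls back to the legacy path.
_TRUTHY = {"1", "true", "yes", "on"}


def use_new_crawler(env: Optional[Mapping[str, str]] = None) -> bool:
    """Read the hypothetical USE_CRAWLER_CORE flag (defaults to off)."""
    env = os.environ if env is None else env
    return env.get("USE_CRAWLER_CORE", "").strip().lower() in _TRUTHY


def legacy_search(keyword: str) -> str:
    return f"legacy:{keyword}"  # placeholder for the old code path


def core_search(keyword: str) -> str:
    return f"core:{keyword}"  # placeholder for the crawler_core path


def search(keyword: str, env: Optional[Mapping[str, str]] = None) -> str:
    # Both paths stay importable; the flag only picks which one runs.
    impl = core_search if use_new_crawler(env) else legacy_search
    return impl(keyword)
```

Defaulting to the legacy path means an unset or mistyped flag cannot switch live crawlers onto the new code by accident.
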
### Blockers

None currently.

---

## Session Continuity

**To resume this project:**

1. Check current phase in this file
2. Read `.planning/ROADMAP.md` for phase goals and success criteria
3. Read `.planning/REQUIREMENTS.md` for requirement coverage
4. Run `git status` to see uncommitted work
5. Start with `/gsd:plan-phase {current_phase}` if no active plan

**Key files:**

- `.planning/ROADMAP.md`: Phase structure and success criteria
- `.planning/REQUIREMENTS.md`: All v1 requirements with traceability
- `.planning/PROJECT.md`: Project context and constraints
- `crawler_core/`: CREATED, shared package with HTTPClient, BaseFetcher, BaseSearcher, Result[T]
- `spiderJobs/`: external scripts using crawler_core
- `app/services/crawler/`: backend facade layer

---

*State initialized: 2026-03-21*