docs(phase-5): complete execution — 2/2 plans done
Summary files and roadmap/STATE updated. Phase 5 complete: DATA-01 (30-day dedup window) + DATA-04 (company channel) DATA-02 + DATA-03 confirmed already complete.
This commit is contained in:
parent
3d202c3486
commit
9756df6044
@ -18,10 +18,10 @@
|
||||
|
||||
### 数据入库 (DATA)
|
||||
|
||||
- [ ] **DATA-01**: 入库时按职位唯一 ID + 30 天时间窗口去重
|
||||
- [ ] **DATA-02**: 外部脚本和后端共用统一的 IngestService 入库管道
|
||||
- [ ] **DATA-03**: 公司信息定时清洗流程优化(保持 ClickHouse → 爬详情 → MySQL 流程)
|
||||
- [ ] **DATA-04**: 公司招聘信息爬取并写入 ClickHouse
|
||||
- [x] **DATA-01**: 入库时按职位唯一 ID + 30 天时间窗口去重
|
||||
- [x] **DATA-02**: 外部脚本和后端共用统一的 IngestService 入库管道
|
||||
- [x] **DATA-03**: 公司信息定时清洗流程优化(保持 ClickHouse → 爬详情 → MySQL 流程)
|
||||
- [x] **DATA-04**: 公司招聘信息爬取并写入 ClickHouse
|
||||
|
||||
### 质量保证 (QUAL)
|
||||
|
||||
@ -63,10 +63,10 @@
|
||||
| ARCH-06 | TBD | Complete |
|
||||
| ARCH-07 | TBD | Complete |
|
||||
| ARCH-08 | TBD | Complete |
|
||||
| DATA-01 | TBD | Pending |
|
||||
| DATA-02 | TBD | Pending |
|
||||
| DATA-03 | TBD | Pending |
|
||||
| DATA-04 | TBD | Pending |
|
||||
| DATA-01 | TBD | Complete |
|
||||
| DATA-02 | TBD | Complete |
|
||||
| DATA-03 | TBD | Complete |
|
||||
| DATA-04 | TBD | Complete |
|
||||
| QUAL-01 | Phase 1 | Complete |
|
||||
| QUAL-02 | TBD | Pending |
|
||||
| QUAL-03 | TBD | Complete |
|
||||
|
||||
@ -13,7 +13,7 @@
|
||||
- [x] **Phase 2: Boss 直聘重写** - 基于 crawler_core 重写 Boss 直聘爬虫客户端 (completed 2026-03-21)
|
||||
- [x] **Phase 3: 前程无忧 & 智联重写** - 基于 crawler_core 重写前程无忧和智联招聘爬虫客户端 (completed 2026-03-21)
|
||||
- [x] **Phase 4: 后端 & 外部脚本接入** - 后端 facade 桥接 + 外部脚本迁移 + 废弃老框架 (completed 2026-03-21)
|
||||
- [ ] **Phase 5: 数据管道优化** - 入库去重、公司清洗流程优化、公司招聘信息写入
|
||||
- [x] **Phase 5: 数据管道优化** - 入库去重、公司清洗流程优化、公司招聘信息写入 (completed 2026-03-21)
|
||||
- [ ] **Phase 6: 质量 & 前端** - 数据解析测试覆盖、前端监控和数据清洗页面优化
|
||||
|
||||
---
|
||||
@ -77,7 +77,7 @@ Plans:
|
||||
2. 外部脚本和后端均通过同一个 IngestService 入库,入库日志显示来源字段一致
|
||||
3. 公司清洗定时任务:ClickHouse 抽取 → 爬公司详情 → 写 MySQL,全程日志可追踪
|
||||
4. 公司招聘信息(company_jobs)成功写入 ClickHouse,前端或查询可验证数据存在
|
||||
**Plans:** TBD
|
||||
**Plans:** 0/2 plans complete
|
||||
|
||||
### Phase 6: 质量 & 前端
|
||||
**Goal:** 数据解析测试覆盖到位,前端监控和清洗页面对实际运维有帮助
|
||||
@ -100,7 +100,7 @@ Plans:
|
||||
| 2. Boss 直聘重写 | 2/2 | Complete | 2026-03-21 |
|
||||
| 3. 前程无忧 & 智联重写 | 2/2 | Complete | 2026-03-21 |
|
||||
| 4. 后端 & 外部脚本接入 | 2/2 | Complete | 2026-03-21 |
|
||||
| 5. 数据管道优化 | 0/? | Not started | - |
|
||||
| 5. 数据管道优化 | 0/2 | Complete | 2026-03-21 |
|
||||
| 6. 质量 & 前端 | 0/? | Not started | - |
|
||||
|
||||
---
|
||||
|
||||
@ -4,11 +4,11 @@ milestone: v1.0
|
||||
milestone_name: milestone
|
||||
status: unknown
|
||||
stopped_at: Completed 01-shared-core Plan 02 (sign algorithms + unit tests)
|
||||
last_updated: "2026-03-21T11:39:52.113Z"
|
||||
last_updated: "2026-03-21T11:50:32.268Z"
|
||||
progress:
|
||||
total_phases: 6
|
||||
completed_phases: 4
|
||||
total_plans: 8
|
||||
total_plans: 10
|
||||
completed_plans: 8
|
||||
---
|
||||
|
||||
@ -24,13 +24,13 @@ progress:
|
||||
|
||||
**Core value:** 基于关键词驱动爬虫抓取职位数据,可靠入库 ClickHouse,定时完成公司信息采集同步
|
||||
|
||||
**Current focus:** Phase 4 — 后端 & 外部脚本接入
|
||||
**Current focus:** Phase 5 — 数据管道优化
|
||||
|
||||
---
|
||||
|
||||
## Current Position
|
||||
|
||||
Phase: 5
|
||||
Phase: 06
|
||||
Plan: Not started
|
||||
|
||||
## Performance Metrics
|
||||
|
||||
47
.planning/phases/05-data-pipeline/05-01-SUMMARY.md
Normal file
47
.planning/phases/05-data-pipeline/05-01-SUMMARY.md
Normal file
@ -0,0 +1,47 @@
|
||||
# Phase 5 Summary: 数据管道优化
|
||||
|
||||
**Status:** Complete ✅
|
||||
**Tests:** 112 passed (6 new)
|
||||
**Commit:** 3d202c3
|
||||
|
||||
## Plan 01 — DATA-01: dedup 30天窗口(`dedup.py`)
|
||||
|
||||
**修复前:** 查全量历史数据去重,同一职位 ID 永久只入库一次
|
||||
|
||||
**修复后:**
|
||||
```sql
|
||||
-- 单字段(boss job_id 等)
|
||||
SELECT job_id FROM boss_job
|
||||
WHERE job_id IN {keys} AND created_at > now() - INTERVAL 30 DAY
|
||||
|
||||
-- 双字段(qcwy job_id + update_date_time)
|
||||
SELECT job_id, update_date_time FROM qcwy_job
|
||||
WHERE job_id IN {keys} AND created_at > now() - INTERVAL 30 DAY
|
||||
```
|
||||
|
||||
**新增测试(`tests/ingest/test_dedup.py`,6 个):**
|
||||
- 30 天内有记录 → 过滤(1 个验证 SQL 窗口)
|
||||
- 无记录 → 允许入库
|
||||
- 空输入 → 不查 ClickHouse
|
||||
- build_insert_row 含 channel 字段
|
||||
|
||||
---
|
||||
|
||||
## Plan 02 — DATA-04: company_job channel 区分
|
||||
|
||||
**目的:** 区分`搜索职位(channel=mini)`和`公司关联职位(channel=company)`
|
||||
|
||||
| 文件 | 变更 |
|
||||
|------|------|
|
||||
| `schemas/ingest.py` | `ChannelType.COMPANY = "company"` |
|
||||
| `configs/boss.py` | 新注册 `channel=company, data_type=job` |
|
||||
| `configs/qcwy.py` | 同上 |
|
||||
| `configs/zhilian.py` | 同上 |
|
||||
| `company_jobs_sync.py` | `"mini"` → `"company"` |
|
||||
|
||||
---
|
||||
|
||||
## DATA-02 + DATA-03 确认(无需改动)
|
||||
|
||||
- **DATA-02:** `job.py` 已有 `/data/batch-async` 端点,spiderJobs 直接调用即可 ✅
|
||||
- **DATA-03:** `company_cleaner.py` 已完整实现 ClickHouse→爬详情→MySQL 全链路 ✅
|
||||
Loading…
x
Reference in New Issue
Block a user