docs(phase-5): complete execution — 2/2 plans done
Summary files and roadmap/STATE updated. Phase 5 complete: DATA-01 (30-day dedup window) and DATA-04 (company channel) implemented; DATA-02 and DATA-03 confirmed already complete.
parent 3d202c3486
commit 9756df6044
@@ -18,10 +18,10 @@

 ### Data Ingestion (DATA)

-- [ ] **DATA-01**: Deduplicate at ingest by unique job ID within a 30-day time window
-- [ ] **DATA-02**: External scripts and the backend share a single IngestService ingest pipeline
-- [ ] **DATA-03**: Optimize the scheduled company-info cleaning flow (keeping the ClickHouse → crawl details → MySQL flow)
-- [ ] **DATA-04**: Crawl company job postings and write them to ClickHouse
+- [x] **DATA-01**: Deduplicate at ingest by unique job ID within a 30-day time window
+- [x] **DATA-02**: External scripts and the backend share a single IngestService ingest pipeline
+- [x] **DATA-03**: Optimize the scheduled company-info cleaning flow (keeping the ClickHouse → crawl details → MySQL flow)
+- [x] **DATA-04**: Crawl company job postings and write them to ClickHouse

 ### Quality Assurance (QUAL)

@@ -63,10 +63,10 @@
 | ARCH-06 | TBD | Complete |
 | ARCH-07 | TBD | Complete |
 | ARCH-08 | TBD | Complete |
-| DATA-01 | TBD | Pending |
-| DATA-02 | TBD | Pending |
-| DATA-03 | TBD | Pending |
-| DATA-04 | TBD | Pending |
+| DATA-01 | TBD | Complete |
+| DATA-02 | TBD | Complete |
+| DATA-03 | TBD | Complete |
+| DATA-04 | TBD | Complete |
 | QUAL-01 | Phase 1 | Complete |
 | QUAL-02 | TBD | Pending |
 | QUAL-03 | TBD | Complete |
@@ -13,7 +13,7 @@
 - [x] **Phase 2: Boss Zhipin rewrite** - Rewrite the Boss Zhipin crawler client on top of crawler_core (completed 2026-03-21)
 - [x] **Phase 3: 51job & Zhilian rewrite** - Rewrite the 51job and Zhilian Zhaopin crawler clients on top of crawler_core (completed 2026-03-21)
 - [x] **Phase 4: Backend & external script integration** - Backend facade bridge + external script migration + retirement of the old framework (completed 2026-03-21)
-- [ ] **Phase 5: Data pipeline optimization** - Ingest dedup, company cleaning flow optimization, company job postings write
+- [x] **Phase 5: Data pipeline optimization** - Ingest dedup, company cleaning flow optimization, company job postings write (completed 2026-03-21)
 - [ ] **Phase 6: Quality & frontend** - Data-parsing test coverage, frontend monitoring and data-cleaning page improvements

 ---
@@ -77,7 +77,7 @@ Plans:
 2. External scripts and the backend both ingest through the same IngestService; ingest logs show a consistent source field
 3. Scheduled company cleaning job: extract from ClickHouse → crawl company details → write to MySQL, traceable in logs end to end
 4. Company job postings (company_jobs) are written to ClickHouse successfully; the data can be verified via the frontend or a query
-**Plans:** TBD
+**Plans:** 0/2 plans complete

 ### Phase 6: Quality & frontend
 **Goal:** Data-parsing test coverage is in place; the frontend monitoring and cleaning pages are useful for day-to-day operations
@@ -100,7 +100,7 @@ Plans:
 | 2. Boss Zhipin rewrite | 2/2 | Complete | 2026-03-21 |
 | 3. 51job & Zhilian rewrite | 2/2 | Complete | 2026-03-21 |
 | 4. Backend & external script integration | 2/2 | Complete | 2026-03-21 |
-| 5. Data pipeline optimization | 0/? | Not started | - |
+| 5. Data pipeline optimization | 0/2 | Complete | 2026-03-21 |
 | 6. Quality & frontend | 0/? | Not started | - |

 ---
@@ -4,11 +4,11 @@ milestone: v1.0
 milestone_name: milestone
 status: unknown
 stopped_at: Completed 01-shared-core Plan 02 (sign algorithms + unit tests)
-last_updated: "2026-03-21T11:39:52.113Z"
+last_updated: "2026-03-21T11:50:32.268Z"
 progress:
   total_phases: 6
   completed_phases: 4
-  total_plans: 8
+  total_plans: 10
   completed_plans: 8
 ---

@@ -24,13 +24,13 @@ progress:

 **Core value:** Keyword-driven crawlers fetch job data, ingest it reliably into ClickHouse, and complete company-info collection and sync on a schedule

-**Current focus:** Phase 4: Backend & external script integration
+**Current focus:** Phase 5: Data pipeline optimization

 ---

 ## Current Position

-Phase: 5
+Phase: 06
 Plan: Not started

 ## Performance Metrics
47
.planning/phases/05-data-pipeline/05-01-SUMMARY.md
Normal file
@@ -0,0 +1,47 @@
# Phase 5 Summary: Data Pipeline Optimization

**Status:** Complete ✅
**Tests:** 112 passed (6 new)
**Commit:** 3d202c3

## Plan 01 - DATA-01: 30-day dedup window (`dedup.py`)

**Before the fix:** dedup queried the full history, so a given job ID could only ever be ingested once.

**After the fix:**
```sql
-- Single-field key (e.g. boss job_id)
SELECT job_id FROM boss_job
WHERE job_id IN {keys} AND created_at > now() - INTERVAL 30 DAY

-- Two-field key (qcwy job_id + update_date_time)
SELECT job_id, update_date_time FROM qcwy_job
WHERE job_id IN {keys} AND created_at > now() - INTERVAL 30 DAY
```
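
The windowed lookup above can be sketched in Python. This is a minimal illustration under stated assumptions, not the contents of `dedup.py`: `build_dedup_query` and `filter_new_rows` are hypothetical names, key values are inlined as quoted literals only for readability (real code would more likely use the ClickHouse client's parameter binding), and it assumes that for a two-field key a row with a different `update_date_time` counts as new.

```python
from typing import Iterable, Sequence


def build_dedup_query(table: str, key_fields: Sequence[str],
                      job_ids: Iterable[str], window_days: int = 30) -> str:
    """Build a SELECT returning the keys already stored inside the window.

    Hypothetical helper: values are inlined for illustration only.
    """
    quoted = ", ".join("'" + j.replace("'", "\\'") + "'" for j in job_ids)
    columns = ", ".join(key_fields)
    return (
        f"SELECT {columns} FROM {table} "
        f"WHERE job_id IN ({quoted}) "
        f"AND created_at > now() - INTERVAL {window_days} DAY"
    )


def filter_new_rows(rows, key_fields, recent_keys):
    """Keep rows whose key tuple was not returned by the windowed query."""
    return [r for r in rows if tuple(r[f] for f in key_fields) not in recent_keys]


# Two-field example: the same job_id with a newer update_date_time is kept.
qcwy_rows = [
    {"job_id": "q1", "update_date_time": "2026-03-01 10:00:00"},
    {"job_id": "q1", "update_date_time": "2026-03-20 09:00:00"},
]
query = build_dedup_query("qcwy_job", ("job_id", "update_date_time"), ["q1"])
recent = {("q1", "2026-03-01 10:00:00")}  # stand-in for the query result
fresh = filter_new_rows(qcwy_rows, ("job_id", "update_date_time"), recent)
```

Because the window lives entirely in the `WHERE` clause, a job that reappears after 30 days produces no match and is ingested again, which is the behavior change this plan describes.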

**New tests (`tests/ingest/test_dedup.py`, 6 total):**
- Record exists within 30 days → filtered out (one test verifies the SQL window)
- No record → ingest allowed
- Empty input → ClickHouse is not queried
- `build_insert_row` includes the `channel` field

---

## Plan 02 - DATA-04: distinguishing the company_job channel

**Purpose:** distinguish `search jobs (channel=mini)` from `company-linked jobs (channel=company)`

| File | Change |
|------|------|
| `schemas/ingest.py` | `ChannelType.COMPANY = "company"` |
| `configs/boss.py` | Newly registered `channel=company, data_type=job` |
| `configs/qcwy.py` | Same as above |
| `configs/zhilian.py` | Same as above |
| `company_jobs_sync.py` | `"mini"` → `"company"` |
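
To make the table above concrete, here is one way the channel split could be modeled. Only `ChannelType.COMPANY = "company"` and the existence of `build_insert_row` come from the summary; the `MINI` member name, the `INGEST_TARGETS` registry, the `register` helper, the `build_insert_row` signature, and the choice to route both channels to the same table are assumptions made for illustration.

```python
from enum import Enum


class ChannelType(str, Enum):
    MINI = "mini"        # jobs found through keyword search (assumed member name)
    COMPANY = "company"  # jobs linked to a company, added in this plan


# Hypothetical registry: (channel, data_type) -> target table.
INGEST_TARGETS = {}


def register(channel: ChannelType, data_type: str, table: str) -> None:
    """Record which table a (channel, data_type) pair is ingested into."""
    INGEST_TARGETS[(channel, data_type)] = table


# Assumed: both channels land in the same table and differ only by the column.
register(ChannelType.MINI, "job", "boss_job")
register(ChannelType.COMPANY, "job", "boss_job")


def build_insert_row(payload: dict, channel: ChannelType) -> dict:
    """Attach the channel so downstream queries can tell the sources apart."""
    return {**payload, "channel": channel.value}


row = build_insert_row({"job_id": "a1"}, ChannelType.COMPANY)
```

With a layout like this, the `"mini"` → `"company"` change in `company_jobs_sync.py` amounts to passing a different enum member, and rows from the two sources stay distinguishable by filtering on `channel`.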

---

## DATA-02 + DATA-03 confirmed (no changes needed)

- **DATA-02:** `job.py` already exposes the `/data/batch-async` endpoint; spiderJobs can call it directly ✅
- **DATA-03:** `company_cleaner.py` already implements the full ClickHouse → crawl details → MySQL chain ✅