- ARCH-03: Boss crawler migrated to crawler_core (no inline signatures or HTTP boilerplate)
- QUAL-03: 22 mock tests pass covering all Boss API classes
- Anti-crawl mechanisms preserved (TLS fingerprint, proxy rotation, 10s delay)
- Phase 1 regression: 41 tests still passing
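# Requirements: JobData Crawler Interaction Refactor

**Defined:** 2026-03-21
**Core Value:** Keyword-driven crawlers fetch job data, load it reliably into ClickHouse, and collect and sync company information on a schedule

## v1 Requirements

### Core Architecture (ARCH)

- [x] **ARCH-01**: Extract `crawler_core/` as a standalone installable package containing the core logic for signing, the HTTP client, and response parsing
- [x] **ARCH-02**: Unify the BaseFetcher/BaseSearcher base classes; all three platforms implement the template method pattern
- [x] **ARCH-03**: Rewrite the Boss Zhipin (Boss 直聘) crawler client on top of crawler_core
- [ ] **ARCH-04**: Rewrite the 51job (前程无忧) crawler client on top of crawler_core
- [ ] **ARCH-05**: Rewrite the Zhilian Zhaopin (智联招聘) crawler client on top of crawler_core
- [ ] **ARCH-06**: The backend app/services/crawler/ facade layer uses asyncio.to_thread() to bridge to the synchronous core
- [ ] **ARCH-07**: The external scripts in spiderJobs/ use crawler_core instead of inline code
- [ ] **ARCH-08**: Deprecate the legacy jobs_spider/ framework and migrate to spiderJobs/ + crawler_core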
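
ARCH-02 and ARCH-06 together describe a synchronous core built on the template method pattern, called from async backend code through `asyncio.to_thread()`. The sketch below illustrates that shape under stated assumptions: only the `BaseFetcher` name comes from this document, while the step methods (`build_params`, `send`, `parse`), the `FakeBossFetcher` class, the `fetch_async` helper, and the record fields are hypothetical and do not reflect the actual crawler_core API.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Any


class BaseFetcher(ABC):
    """Template method: fetch() fixes the step order, subclasses supply the steps."""

    def fetch(self, keyword: str) -> list[dict[str, Any]]:
        params = self.build_params(keyword)  # platform-specific request parameters
        raw = self.send(params)              # platform-specific signing and HTTP call
        return self.parse(raw)               # platform-specific response parsing

    @abstractmethod
    def build_params(self, keyword: str) -> dict[str, Any]: ...

    @abstractmethod
    def send(self, params: dict[str, Any]) -> dict[str, Any]: ...

    @abstractmethod
    def parse(self, raw: dict[str, Any]) -> list[dict[str, Any]]: ...


class FakeBossFetcher(BaseFetcher):
    """Hypothetical stand-in that returns canned data instead of calling a real API."""

    def build_params(self, keyword):
        return {"query": keyword, "page": 1}

    def send(self, params):
        return {"jobs": [{"id": "j1", "title": f"{params['query']} engineer"}]}

    def parse(self, raw):
        return [{"job_id": j["id"], "title": j["title"]} for j in raw["jobs"]]


async def fetch_async(fetcher: BaseFetcher, keyword: str) -> list[dict[str, Any]]:
    """Facade: run the blocking fetch() in a worker thread so the event loop stays free."""
    return await asyncio.to_thread(fetcher.fetch, keyword)


if __name__ == "__main__":
    print(asyncio.run(fetch_async(FakeBossFetcher(), "python")))
```

Keeping the core synchronous means the external scripts can call `fetch()` directly, while the backend pays only the cost of a thread hop per call.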
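
### Data Ingestion (DATA)

- [ ] **DATA-01**: Deduplicate on ingest by unique job ID within a 30-day time window
- [ ] **DATA-02**: The external scripts and the backend share a single IngestService ingestion pipeline
- [ ] **DATA-03**: Optimize the scheduled company-information cleaning flow (keep the ClickHouse → crawl details → MySQL flow)
- [ ] **DATA-04**: Crawl company job postings and write them to ClickHouse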
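
One possible reading of DATA-01 is sketched below: a record is dropped when the same job ID was already kept less than 30 days earlier. The field names `job_id` and `crawled_at`, the `dedupe` function, and the choice to measure the window from the most recently kept record are all assumptions for illustration. A real pipeline would more likely check against ClickHouse than against an in-memory dict.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(days=30)


def dedupe(records, last_seen=None):
    """Keep a record unless the same job_id was kept less than 30 days earlier.

    `records` are dicts with the hypothetical keys `job_id` and `crawled_at`.
    `last_seen` maps job_id to the timestamp of the most recently kept record.
    """
    last_seen = {} if last_seen is None else last_seen
    kept = []
    # Process in time order so the window is always measured from an earlier record.
    for rec in sorted(records, key=lambda r: r["crawled_at"]):
        prev = last_seen.get(rec["job_id"])
        if prev is not None and rec["crawled_at"] - prev < WINDOW:
            continue  # duplicate inside the 30-day window
        last_seen[rec["job_id"]] = rec["crawled_at"]
        kept.append(rec)
    return kept
```

Passing the same `last_seen` dict across batches would let the window span several ingestion runs.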
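
### Quality Assurance (QUAL)

- [x] **QUAL-01**: Unit test coverage for the core signing algorithms
- [ ] **QUAL-02**: Unit test coverage for the data parsing and deduplication logic
- [x] **QUAL-03**: The HTTP request layer is tested with mock/respx
- [x] **QUAL-04**: Improve structured logging (unified loguru format)
- [x] **QUAL-05**: Error retry mechanism (tenacity integration)
- [ ] **QUAL-06**: Optimize the frontend crawler monitoring page
- [ ] **QUAL-07**: Optimize the frontend data cleaning management page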
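
QUAL-05 names tenacity for retries. The sketch below is a standard-library stand-in, not tenacity's API, and only illustrates the behaviour such a policy typically provides: a bounded number of attempts with exponential backoff. The `retry` function, its parameters, and the choice of retryable exceptions are hypothetical.

```python
import time


def retry(func, attempts=3, base_delay=1.0, sleep=time.sleep,
          retry_on=(ConnectionError, TimeoutError)):
    """Call func(); on a retryable error wait base_delay * 2**n, then try again."""
    for n in range(attempts):
        try:
            return func()
        except retry_on:
            if n == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            sleep(base_delay * 2 ** n)
```

The `sleep` parameter is injectable so that tests can record the delays instead of actually waiting.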

## v2 Requirements

### Advanced Features

- **ADV-01**: Data-volume monitoring and alerting (automatic alert when a keyword yields 0 records in a cycle)
- **ADV-02**: Crawler health check API
- **ADV-03**: Guide for adding new platforms
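
## Out of Scope

| Feature | Reason |
|---------|--------|
| ECS batch deployment refactor | Explicitly excluded by the user; to be handled separately later |
| New crawler platforms | Finish refactoring the three existing platforms first |
| Data analysis/statistics module | Not in scope for this effort |
| RBAC permission system changes | Keep as-is |
| Real-time push notifications | High complexity; not needed for v1 |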

## Traceability

| Requirement | Phase | Status |
|-------------|-------|--------|
| ARCH-01 | TBD | Complete |
| ARCH-02 | TBD | Complete |
| ARCH-03 | TBD | Complete |
| ARCH-04 | TBD | Pending |
| ARCH-05 | TBD | Pending |
| ARCH-06 | TBD | Pending |
| ARCH-07 | TBD | Pending |
| ARCH-08 | TBD | Pending |
| DATA-01 | TBD | Pending |
| DATA-02 | TBD | Pending |
| DATA-03 | TBD | Pending |
| DATA-04 | TBD | Pending |
| QUAL-01 | Phase 1 | Complete |
| QUAL-02 | TBD | Pending |
| QUAL-03 | TBD | Complete |
| QUAL-04 | TBD | Complete |
| QUAL-05 | TBD | Complete |
| QUAL-06 | TBD | Pending |
| QUAL-07 | TBD | Pending |

**Coverage:**
- v1 requirements: 19 total
- Mapped to phases: 1 (QUAL-01)
- Unmapped: 18 ⚠️

---
*Requirements defined: 2026-03-21*
*Last updated: 2026-03-21 after initial definition*