docs: initialize project
parent 3d7e96845d
commit 9166c4f7bc
100
.planning/PROJECT.md
Normal file
@@ -0,0 +1,100 @@
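# JobData Crawler Interaction Refactor

## What This Is

A refactor of the crawler interaction layer of a recruitment-data collection platform. It rewrites the crawler code for three platforms (Boss直聘, 前程无忧, and 智联招聘), unifies the base classes and shared core logic, and improves ingestion deduplication, the company-information cleaning workflow, and the frontend monitoring pages.

## Core Value

Crawl job postings driven by keywords, ingest them reliably into ClickHouse, and collect and synchronize company information through scheduled tasks.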

## Requirements

### Validated

<!-- Capabilities that already exist and work correctly -->

- ✓ FastAPI backend REST API framework - existing
- ✓ Dual-database architecture (MySQL + ClickHouse) - existing
- ✓ RBAC permission management system - existing
- ✓ Keyword management (CRUD, plus crawler pull via /keyword/available) - existing
- ✓ Data ingestion pipeline (IngestService + Registry pattern) - existing
- ✓ Scheduled task scheduling (APScheduler) - existing
- ✓ Scheduled company-information cleaning workflow - existing
- ✓ Vue3 frontend admin interface - existing
- ✓ Data analysis and statistics API - existing
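
### Active

<!-- Goals of this refactor -->

- [ ] Unified crawler base class: the three platforms share the core signing, request, and parsing logic
- [ ] Rewrite the Boss直聘 crawler client
- [ ] Rewrite the 前程无忧 crawler client
- [ ] Rewrite the 智联招聘 crawler client
- [ ] External scripts (jobs_spider/) and the backend (app/services/crawler/) share core code
- [ ] Deduplicate by unique job ID at ingestion time
- [ ] Optimize the company-information crawl flow (extract from ClickHouse → crawl details → MySQL)
- [ ] Crawl company job postings and write them to ClickHouse
- [ ] Improve the frontend crawler-monitoring and data-cleaning pages
- [ ] Cover the core logic with unit tests
- [ ] Improve logging and the error-retry mechanism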
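
To make the "unified base class" goal concrete, here is a minimal sketch of how signing, request, and parsing could be split between a shared base and per-platform hooks. Every name in it (`BaseCrawlerClient`, `JobRecord`, `FakePlatformClient`, the `fetch`/`parse` hooks) is hypothetical and is not the project's existing `BaseFetcher`/`BaseSearcher` API. The MD5 signature is only a placeholder for whatever each platform actually requires.

```python
import hashlib
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class JobRecord:
    """Normalized job posting shared by every platform (hypothetical shape)."""
    platform: str
    job_id: str
    title: str


class BaseCrawlerClient(ABC):
    """Hypothetical shared base: one search flow, platform-specific hooks."""

    platform: str = ""

    def sign(self, params: dict[str, Any]) -> str:
        # Placeholder signature: MD5 over the sorted query string.
        # Real platforms use their own algorithms; subclasses would override this.
        canonical = "&".join(f"{key}={params[key]}" for key in sorted(params))
        return hashlib.md5(canonical.encode("utf-8")).hexdigest()

    @abstractmethod
    def fetch(self, params: dict[str, Any], signature: str) -> dict[str, Any]:
        """Send the platform request and return the decoded payload."""

    @abstractmethod
    def parse(self, payload: dict[str, Any]) -> list[JobRecord]:
        """Turn a platform payload into normalized records."""

    def search(self, keyword: str, page: int = 1) -> list[JobRecord]:
        # The shared flow: build params, sign, fetch, parse.
        params = {"keyword": keyword, "page": page}
        payload = self.fetch(params, self.sign(params))
        return self.parse(payload)


class FakePlatformClient(BaseCrawlerClient):
    """Stand-in platform that returns canned data instead of calling a site."""

    platform = "fake"

    def fetch(self, params, signature):
        return {"items": [{"id": "j-1", "name": f"{params['keyword']} engineer"}]}

    def parse(self, payload):
        return [
            JobRecord(self.platform, str(item["id"]), item["name"])
            for item in payload["items"]
        ]
```

With this shape, both the external scripts and the backend could import the same subclasses, and a unit test only needs a canned `fetch` to exercise `search` end to end.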
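
### Out of Scope

- ECS batch deployment (ecs_full_pipeline.py): this round refactors only the crawler logic itself; deployment will be handled separately later
- New crawler platforms: get the existing three platforms refactored first
- Refactoring the data analysis/statistics module: not in scope for this round
- Changes to the RBAC permission system: keep it as is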
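
## Context

**Current state of the codebase:**
- The crawler scripts for the three platforms under `jobs_spider/` contain a lot of duplicated code (signing algorithms, HTTP requests, data parsing)
- `app/services/crawler/` already has unified base classes `BaseFetcher`/`BaseSearcher` (`_base.py`), but the implementation is incomplete
- `spiderJobs/` is a separate crawler framework (with core/, platforms/, runner/) whose functionality overlaps with jobs_spider/
- Data ingestion already has a reusable Registry pattern (`app/services/ingest/registry.py`)
- Deduplication logic is scattered across `app/services/ingest/dedup.py` and the individual crawler scripts
- Company cleaning is handled jointly by `CompanyCleaner` (`app/services/company_cleaner.py`) and `CompanyJobsSyncService`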
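
For readers unfamiliar with the Registry pattern mentioned above, the following is a generic sketch of the idea: each platform registers a normalizer under its name, and the ingest path looks the handler up at runtime. It does not reproduce `app/services/ingest/registry.py`; the names `register`, `get_handler`, and `normalize_example`, and the raw field names `id` and `name`, are assumptions made for illustration.

```python
from typing import Callable

Handler = Callable[[dict], dict]

# Maps a platform name to the function that normalizes its raw rows.
_REGISTRY: dict[str, Handler] = {}


def register(platform: str) -> Callable[[Handler], Handler]:
    """Decorator that registers a normalizer for one platform."""

    def decorator(func: Handler) -> Handler:
        if platform in _REGISTRY:
            raise ValueError(f"handler already registered for {platform!r}")
        _REGISTRY[platform] = func
        return func

    return decorator


def get_handler(platform: str) -> Handler:
    """Return the registered normalizer, or fail loudly for unknown platforms."""
    try:
        return _REGISTRY[platform]
    except KeyError:
        raise LookupError(f"no ingest handler for {platform!r}") from None


@register("example")
def normalize_example(raw: dict) -> dict:
    # Hypothetical raw field names; real platform payloads differ.
    return {"platform": "example", "job_id": str(raw["id"]), "title": raw["name"]}
```

A caller would then run `get_handler(row_platform)(raw_row)` without knowing which platform module supplied the handler, which is what makes the pattern reusable across the external scripts and the backend.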

**Tech stack:**
- Python 3.13, FastAPI 0.111, Tortoise-ORM, clickhouse-connect
- requests/httpx for HTTP requests
- Vue 3.3 + Naive UI frontend
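
## Constraints

- **Compatibility**: existing crawler data collection must not be interrupted during the refactor
- **Shared core**: external scripts and the backend must use the same signing/request/parsing code to avoid duplication
- **Database**: the ClickHouse table schema stays unchanged; MySQL model changes go through Aerich migrations
- **Anti-bot handling**: the existing countermeasures must be preserved (IP rotation, random delays, proxy management)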
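
The anti-bot constraint and the retry goal interact, so a small sketch may help. The helper below rotates through a proxy list and waits a randomized, growing delay between failed attempts. `fetch_with_retry` and its parameters are hypothetical, not existing project code; the actual HTTP call is injected as `send`, so the sketch does not assume requests or httpx specifics.

```python
import itertools
import random
import time
from typing import Callable, Optional, Sequence


def fetch_with_retry(
    send: Callable[[str], str],
    proxies: Sequence[str],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    jitter: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> str:
    """Call send(proxy), switching proxy and backing off after each failure."""
    if not proxies:
        raise ValueError("at least one proxy is required")
    rng = rng or random.Random()
    pool = itertools.cycle(proxies)  # simple round-robin rotation
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        proxy = next(pool)
        try:
            return send(proxy)
        except ConnectionError as exc:
            last_error = exc
            if attempt < max_attempts:
                # Linear backoff plus random jitter so requests are not evenly spaced.
                sleep(base_delay * attempt + rng.uniform(0, jitter))
    raise RuntimeError(f"all {max_attempts} attempts failed") from last_error
```

Injecting `sleep` and `rng` keeps the helper unit-testable without real waiting, which also fits the goal of covering core logic with tests.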
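
## Key Decisions

| Decision | Rationale | Outcome |
|----------|-----------|---------|
| Keep two crawler sets (external + backend) but share the core | External scripts run standalone on ECS; the backend embeds crawlers for cleaning tasks | Pending |
| Rewrite all three platforms | The existing code is too duplicated; a unified rewrite beats piecemeal patching | Pending |
| Deduplicate by unique ID at ingestion | Simpler and more efficient than querying before crawling | Pending |
| Keep the keyword pull model | Crawlers requesting keywords themselves is more flexible and needs no backend push infrastructure | Pending |
| Optimize the frontend alongside | The monitoring pages are an important entry point for verifying the refactor | Pending |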
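
The "deduplicate by unique ID at ingestion" decision can be sketched as a filter applied to each batch just before it is written. This is an illustration only: `dedupe_by_job_id` and the `platform`/`job_id` row keys are assumed names, and the in-memory `seen` set covers a single process. How duplicates are detected across runs, for example by checking existing IDs in ClickHouse, is left open here.

```python
from typing import Iterable, Iterator, Optional


def dedupe_by_job_id(
    rows: Iterable[dict],
    seen: Optional[set] = None,
) -> Iterator[dict]:
    """Yield each row once, keyed by (platform, job_id); the first occurrence wins."""
    seen = set() if seen is None else seen
    for row in rows:
        key = (row["platform"], row["job_id"])
        if key in seen:
            continue  # same posting already queued for insert
        seen.add(key)
        yield row
```

Passing the same `seen` set across several batches extends the guarantee from one batch to the whole crawl session, while the crawlers themselves stay free of any pre-crawl lookups.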

## Evolution

This document evolves at phase transitions and milestone boundaries.

**After each phase transition** (via `/gsd:transition`):
1. Requirements invalidated? → Move to Out of Scope with reason
2. Requirements validated? → Move to Validated with phase reference
3. New requirements emerged? → Add to Active
4. Decisions to log? → Add to Key Decisions
5. "What This Is" still accurate? → Update if drifted

**After each milestone** (via `/gsd:complete-milestone`):
1. Full review of all sections
2. Core Value check: still the right priority?
3. Audit Out of Scope: reasons still valid?
4. Update Context with current state

---
*Last updated: 2026-03-21 after initialization*