# Phase 3: 前程无忧 (job51) & 智联招聘 (zhilian) Rewrite (Technical Research)

**Research date:** 2026-03-21

**Phase goal:** The 前程无忧 and 智联招聘 crawlers run entirely on `crawler_core`, so that all three platforms use the new base classes.

---

## 1. Current State

### 1.1 Existing `crawler_core` foundation (completed in Phase 1)

| File | Contents |
|------|----------|
| `crawler_core/qcwy/sign.py` | `Job51Sign.build_sign_path()`: HMAC-SHA256 signing |
| `crawler_core/zhilian/sign.py` | `ZhilianSign.sign_headers()/sign_params()`: Zhilian's multi-type signing |
| `crawler_core/http_client.py` | `HTTPClient`: TLS impersonation + proxy support + tenacity retries |
| `crawler_core/base.py` | `Result[T]`, `BaseFetcher`, `BaseSearcher` |

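
To make the role of `crawler_core/base.py` more concrete, here is a minimal sketch of what `Result[T]` and `BaseFetcher` might look like. Only the class names and the `http_client` attribute come from these notes; the field names (`ok`, `data`, `error`), the constructor signature, and the `EchoFetcher` subclass are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Generic, Optional, TypeVar

T = TypeVar("T")


@dataclass
class Result(Generic[T]):
    """Hypothetical stand-in for crawler_core.base.Result[T]."""

    ok: bool
    data: Optional[T] = None
    error: Optional[str] = None


class BaseFetcher:
    """Hypothetical stand-in for crawler_core.base.BaseFetcher."""

    def __init__(self, http_client: Any) -> None:
        # The migration notes refer to `self.http_client`, so the client is
        # assumed to be stored under that attribute name.
        self.http_client = http_client

    def fetch(self, *args: Any, **kwargs: Any) -> Result:
        raise NotImplementedError


class EchoFetcher(BaseFetcher):
    """Toy subclass: wraps whatever the client returns in a Result."""

    def fetch(self, endpoint: str) -> Result:
        try:
            return Result(ok=True, data=self.http_client.get(endpoint))
        except Exception as exc:
            # Transport errors are reported as a failed Result, not raised.
            return Result(ok=False, error=str(exc))
```

With a shape like this, a platform class only needs to subclass `BaseFetcher` and call `self.http_client`, which is the pattern the migration tables in this section rely on.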
### 1.2 前程无忧 (job51): layer to migrate

All files already exist under `spiderJobs/platforms/job51/`:

| File | Old dependency | Migration target |
|------|----------------|------------------|
| `client.py` | `spiderJobs.core.http_client.HTTPClient` + `spiderJobs.platforms.job51.sign.Job51Sign` | `crawler_core.http_client.HTTPClient` + `crawler_core.qcwy.sign.Job51Sign` |
| `api.py` | `spiderJobs.core.base.ApiResult/BaseFetcher/BaseSearcher` | `crawler_core.base.Result/BaseFetcher/BaseSearcher` |
| `main.py` | `spiderJobs.core.base.BaseFetcher/BaseSearcher` | `crawler_core.base.BaseFetcher/BaseSearcher` |
| `sign.py` | Standalone implementation (identical to `crawler_core/qcwy/sign.py`) | Backward-compatibility stub that re-exports `Job51Sign` |

**Specific changes in `job51/api.py`:**

- Line 14: replace the import
- Replace every `ApiResult` with `Result` (11 occurrences)
- Line 164: `self._http.get(endpoint)` → `self.http_client.get(endpoint)`
- Line 208: `self._http.get(self.ENDPOINT, ...)` → `self.http_client.get(self.ENDPOINT, ...)`

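
The `sign.py` row in the table above turns the old module into a stub that only re-exports `Job51Sign`. The sketch below simulates that arrangement with in-memory modules so it can run on its own. The module names `crawler_core_qcwy_sign_demo` and `legacy_job51_sign_demo` are placeholders invented for the demo, and the empty `Job51Sign` class stands in for the real signer.

```python
import sys
import types

# Simulated canonical module (stand-in for crawler_core/qcwy/sign.py).
canonical = types.ModuleType("crawler_core_qcwy_sign_demo")


class Job51Sign:
    """Placeholder class; the real signer lives in crawler_core."""


canonical.Job51Sign = Job51Sign
sys.modules[canonical.__name__] = canonical

# What the stub file would contain. In the real project this would
# presumably be: from crawler_core.qcwy.sign import Job51Sign
STUB_SOURCE = (
    "from crawler_core_qcwy_sign_demo import Job51Sign\n"
    "__all__ = ['Job51Sign']\n"
)

# Simulated legacy module (stand-in for spiderJobs/platforms/job51/sign.py).
legacy = types.ModuleType("legacy_job51_sign_demo")
exec(STUB_SOURCE, legacy.__dict__)
sys.modules[legacy.__name__] = legacy

# Old import paths keep working and resolve to the very same class object.
from legacy_job51_sign_demo import Job51Sign as LegacyJob51Sign

same_class = LegacyJob51Sign is canonical.Job51Sign
```

Because the stub re-exports rather than copies, `same_class` is true, so `isinstance` checks and mocks that target either import path see one class.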
### 1.3 智联招聘 (zhilian): layer to migrate

All files already exist under `spiderJobs/platforms/zhilian/`:

| File | Old dependency | Migration target |
|------|----------------|------------------|
| `client.py` | `spiderJobs.core.http_client.HTTPClient` + `spiderJobs.platforms.zhilian.sign.ZhilianSign` | `crawler_core.http_client.HTTPClient` + `crawler_core.zhilian.sign.ZhilianSign` |
| `api.py` | `spiderJobs.core.base.BaseFetcher/BaseSearcher` | `crawler_core.base.BaseFetcher/BaseSearcher` |
| `main.py` | `spiderJobs.core.base.BaseFetcher/BaseSearcher` | `crawler_core.base.BaseFetcher/BaseSearcher` |
| `sign.py` | Standalone implementation (identical to `crawler_core/zhilian/sign.py`) | Backward-compatibility stub that re-exports `ZhilianSign` |

**Specific changes in `zhilian/api.py`:**

- Line 10: replace the import (there is no `ApiResult`; zhilian uses the default parser from `crawler_core`, so no custom `_parse_response` is needed)
- Line 200: `return self._http.get(` → `return self.http_client.get(`

**Important difference:** In the Zhilian `api.py`, line 184 of `SearchCompanyPositions._build_params()` calls `self._client.signer.sign_params()`. It reaches the signer indirectly through `self._client` (set to the `ZhilianClient` that was passed in), so the migration does not affect it: the attribute name is unchanged.

---

## 2. Workload Compared with Phase 2 (Boss)

| Dimension | Phase 2 Boss | Phase 3 job51 | Phase 3 zhilian |
|-----------|--------------|---------------|-----------------|
| `ApiResult` replacements | 11 | 11 | 0 (no custom parser) |
| `self._http` replacements | 3 | 2 | 1 |
| `sign.py` → stub | ✓ | ✓ | ✓ |
| `client.py` import | ✓ | ✓ | ✓ |
| `api.py` import | ✓ | ✓ | ✓ |
| `main.py` import | ✓ | ✓ | ✓ |

---

## 3. Mock Testing Strategy

### 3.1 job51 tests (`tests/job51/test_job51_client.py`)

The job51 mock strategy is identical to the one used for Boss: mock `http_client` with `MagicMock()` and test the following:

- `_parse_job51_response` (pure function; covers success when `status` is 1, failure when it is not 1, and HTTP errors)
- `SearchRecommendJobs.search()` (normal case, HTTP error)
- `GetJobDetail.fetch()` (success, exception handling)
- `GetCompanyDetail.fetch()` (success)
- `Job51Client._job51_headers()` (sign injection)

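
The cases listed above could be sketched as follows. This is an illustration under stated assumptions, not the project's actual code: the parser signature, the dict-shaped return value, the `data` payload key, and the `/job/detail/...` path are all hypothetical. Only the `MagicMock()` approach and the `status == 1` success rule come from these notes.

```python
from unittest.mock import MagicMock


def parse_job51_response(http_status: int, payload: dict) -> dict:
    """Hypothetical parser: success only for HTTP 200 with status == 1."""
    if http_status != 200:
        return {"ok": False, "error": f"HTTP {http_status}"}
    if str(payload.get("status")) != "1":
        return {"ok": False, "error": payload.get("message", "unknown error")}
    return {"ok": True, "data": payload.get("data")}


class GetJobDetail:
    """Hypothetical fetcher that depends only on an injected http_client."""

    def __init__(self, http_client):
        self.http_client = http_client

    def fetch(self, job_id: str) -> dict:
        try:
            resp = self.http_client.get(f"/job/detail/{job_id}")  # placeholder path
            return parse_job51_response(resp.status_code, resp.json())
        except Exception as exc:
            # Exceptions from the HTTP layer become a failed result.
            return {"ok": False, "error": str(exc)}


def test_fetch_success():
    client = MagicMock()
    client.get.return_value.status_code = 200
    client.get.return_value.json.return_value = {"status": "1", "data": {"id": "42"}}

    result = GetJobDetail(client).fetch("42")

    assert result == {"ok": True, "data": {"id": "42"}}
    client.get.assert_called_once_with("/job/detail/42")


def test_fetch_exception_is_caught():
    client = MagicMock()
    client.get.side_effect = TimeoutError("timed out")

    result = GetJobDetail(client).fetch("42")

    assert result == {"ok": False, "error": "timed out"}
```

Keeping the parser a pure function means the status and HTTP-error branches can be tested without any mock at all, while the fetcher tests only need to stub `get`.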
### 3.2 zhilian tests (`tests/zhilian/test_zhilian_client.py`)

Zhilian requests go through POST (cgate) or GET (capi); the mocking approach is the same:

- `SearchPositions.search()` (normal case, error)
- `GetPositionDetail.fetch()` (success)
- `SearchCompanyPositions.search()` (success; specifically verify that `sign_params` is called)
- `ZhilianClient.post/get` (verify that the signature headers are injected)

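
The `sign_params` check for `SearchCompanyPositions.search()` might look like the sketch below. The class body is a hypothetical reconstruction of the attribute layout described in section 1.3 (`self._client.signer` for signing, `self.http_client` for transport). The constructor arguments, the `ENDPOINT` value, and the `companyId` parameter name are assumptions.

```python
from unittest.mock import MagicMock


class SearchCompanyPositions:
    """Hypothetical searcher mirroring the attribute layout in section 1.3."""

    ENDPOINT = "/company/positions"  # placeholder path, not the real endpoint

    def __init__(self, client, http_client):
        self._client = client           # exposes `.signer`
        self.http_client = http_client  # attribute name used after migration

    def _build_params(self, company_id: str) -> dict:
        params = {"companyId": company_id}
        # Signing goes through the client, not through the HTTP layer, so
        # renaming `_http` to `http_client` leaves this line untouched.
        return self._client.signer.sign_params(params)

    def search(self, company_id: str):
        return self.http_client.get(
            self.ENDPOINT, params=self._build_params(company_id)
        )


def test_search_signs_params():
    client = MagicMock()
    # Return a new dict so the unsigned input can still be asserted on.
    client.signer.sign_params.side_effect = lambda p: {**p, "sign": "stub"}
    http_client = MagicMock()

    SearchCompanyPositions(client, http_client).search("c1")

    client.signer.sign_params.assert_called_once_with({"companyId": "c1"})
    http_client.get.assert_called_once_with(
        "/company/positions", params={"companyId": "c1", "sign": "stub"}
    )
```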
---

## 4. Verifying the Success Criteria

| Criterion | How to verify |
|-----------|---------------|
| job51 inherits from `BaseFetcher`/`BaseSearcher` | `issubclass()` assertions |
| zhilian inherits from `BaseFetcher`/`BaseSearcher` | `issubclass()` assertions |
| Neither platform contains inline signing or HTTP boilerplate | `grep` finds no `requests` import and no `hmac` in client/api |
| Mock tests pass | `pytest tests/job51/ tests/zhilian/` |
| All three platforms share the same code structure | Code review: the four-file layout client/api/sign/main |

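
The table specifies `grep` for the "no inline signing or HTTP boilerplate" check. As a possible complement, and not something the plan prescribes, the sketch below does the same check with Python's `ast` module, so mentions of `requests` or `hmac` in comments and strings do not produce false positives. The helper name `forbidden_imports` and the `FORBIDDEN` set are illustrative. It takes source text, so a test could feed it the contents of each platform's `client.py` and `api.py` and assert that the result is empty.

```python
import ast

# Modules that should no longer be imported directly by client.py / api.py.
FORBIDDEN = {"requests", "hmac"}


def forbidden_imports(source: str) -> set:
    """Return the forbidden top-level module names imported by `source`."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # `node.module` is None for relative imports like `from . import x`.
            names = [node.module or ""]
        else:
            continue
        for name in names:
            root = name.split(".")[0]
            if root in FORBIDDEN:
                found.add(root)
    return found
```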
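---

## RESEARCH COMPLETE

**Phase 3 is ready to be planned, split into 2 PLANs:**

- **Plan 01**: migrate the 前程无忧 (job51) layer + mock tests
- **Plan 02**: migrate the 智联招聘 (zhilian) layer + mock tests

The two plans do not depend on each other and could in theory run in parallel, but executing them sequentially is safer.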