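# Phase 3: 51job (前程无忧) and Zhilian (智联) Rewrite: Technical Research

**Research date:** 2026-03-21

**Phase goal:** The 51job and Zhilian Zhaopin (智联招聘) crawlers run entirely on crawler_core, and all three platforms share the new base classes.

---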
## 1. Current State

### 1.1 Existing crawler_core foundation (completed in Phase 1)

| File | Contents |
|------|----------|
| `crawler_core/qcwy/sign.py` | `Job51Sign.build_sign_path()`: HMAC-SHA256 signing |
| `crawler_core/zhilian/sign.py` | `ZhilianSign.sign_headers()/sign_params()`: Zhilian's multi-type signing |
| `crawler_core/http_client.py` | `HTTPClient`: TLS impersonation + proxy + tenacity retries |
| `crawler_core/base.py` | `Result[T]`, `BaseFetcher`, `BaseSearcher` |
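
To make the HMAC-SHA256 signing row concrete, here is a minimal sketch built only on Python's standard `hmac` and `hashlib` modules. The secret, the `sign` query parameter name, and the choice to sign the raw path are all assumptions made for illustration; the real `Job51Sign.build_sign_path()` may differ on each point.

```python
import hashlib
import hmac


def build_sign_path(path: str, secret: bytes) -> str:
    """Append an HMAC-SHA256 hex digest of `path` as a `sign` query parameter.

    The parameter name and the decision to sign the raw path are
    assumptions for this sketch, not the real Job51Sign behaviour.
    """
    digest = hmac.new(secret, path.encode("utf-8"), hashlib.sha256).hexdigest()
    separator = "&" if "?" in path else "?"
    return f"{path}{separator}sign={digest}"


def verify_sign_path(signed_path: str, secret: bytes) -> bool:
    """Recompute the digest and compare it in constant time."""
    prefix, _, digest = signed_path.rpartition("sign=")
    original = prefix[:-1]  # drop the trailing "?" or "&"
    expected = hmac.new(secret, original.encode("utf-8"), hashlib.sha256).hexdigest()
    return hmac.compare_digest(digest, expected)


# Hypothetical path and secret, for demonstration only.
signed = build_sign_path("/api/job/search?keyword=python", b"demo-secret")
```

Keeping the signer a pure function of path and secret is what lets the mock tests exercise it without any network access.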
### 1.2 51job (job51) layer to migrate

All files already exist under `spiderJobs/platforms/job51/`:

| File | Old dependency | Migration target |
|------|--------|---------|
| `client.py` | `spiderJobs.core.http_client.HTTPClient` + `spiderJobs.platforms.job51.sign.Job51Sign` | → `crawler_core.http_client.HTTPClient` + `crawler_core.qcwy.sign.Job51Sign` |
| `api.py` | `spiderJobs.core.base.ApiResult/BaseFetcher/BaseSearcher` | → `crawler_core.base.Result/BaseFetcher/BaseSearcher` |
| `main.py` | `spiderJobs.core.base.BaseFetcher/BaseSearcher` | → `crawler_core.base.BaseFetcher/BaseSearcher` |
| `sign.py` | Standalone implementation (identical to `crawler_core/qcwy/sign.py`) | → Backward-compatible stub that re-exports `Job51Sign` |
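
The backward-compatible stub in the last row can be as small as a single re-export. The sketch below shows one possible shape of that stub and checks that old import paths would resolve to the same class. Because `crawler_core` is not importable outside the project, the block registers a placeholder module under that name; the placeholder class body and the `STUB_SOURCE` variable are illustrative assumptions, not project code.

```python
import sys
import types


class Job51Sign:
    """Placeholder for the real signer class in crawler_core/qcwy/sign.py."""


# Register stand-in modules so the import inside the stub resolves here.
for name in ("crawler_core", "crawler_core.qcwy"):
    sys.modules.setdefault(name, types.ModuleType(name))
core_sign = types.ModuleType("crawler_core.qcwy.sign")
core_sign.Job51Sign = Job51Sign
sys.modules["crawler_core.qcwy.sign"] = core_sign

# What spiderJobs/platforms/job51/sign.py could contain after migration.
STUB_SOURCE = '''
from crawler_core.qcwy.sign import Job51Sign

__all__ = ["Job51Sign"]
'''

# Execute the stub as if it were the legacy module.
legacy_sign = types.ModuleType("spiderJobs.platforms.job51.sign")
exec(STUB_SOURCE, legacy_sign.__dict__)
```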
**Changes in job51/api.py:**

- Line 14: replace the import
- Replace every `ApiResult` with `Result` (11 occurrences)
- Line 164: `self._http.get(endpoint)` → `self.http_client.get(endpoint)`
- Line 208: `self._http.get(self.ENDPOINT, ...)` → `self.http_client.get(self.ENDPOINT, ...)`
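
The renames listed above can be pictured with a small stand-in fetcher. This is a sketch under stated assumptions: the `Result` fields (`ok`, `data`, `error`), the `BaseFetcher` constructor, the endpoint path, and the `jobId` parameter are all hypothetical, since the research note only names the classes and the attribute rename.

```python
from dataclasses import dataclass
from typing import Any, Generic, Optional, TypeVar
from unittest.mock import MagicMock

T = TypeVar("T")


@dataclass
class Result(Generic[T]):
    """Stand-in for crawler_core.base.Result; field names are assumed."""
    ok: bool
    data: Optional[T] = None
    error: Optional[str] = None


class BaseFetcher:
    """Stand-in for crawler_core.base.BaseFetcher."""

    def __init__(self, http_client: Any) -> None:
        self.http_client = http_client  # previously reached as self._http


class GetJobDetail(BaseFetcher):
    ENDPOINT = "/job/detail"  # hypothetical path

    def fetch(self, job_id: str) -> Result[dict]:
        try:
            # Before the migration: self._http.get(...) wrapped in ApiResult
            payload = self.http_client.get(self.ENDPOINT, params={"jobId": job_id})
        except Exception as exc:  # surface transport failures as a failed Result
            return Result(ok=False, error=str(exc))
        return Result(ok=True, data=payload)


# Usage with a mocked HTTP client.
http = MagicMock()
http.get.return_value = {"jobId": "1"}
result = GetJobDetail(http).fetch("1")
```

Because the HTTP client is injected, the same class works unchanged against the real `HTTPClient` or a `MagicMock()`.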
### 1.3 Zhilian Zhaopin (zhilian) layer to migrate

All files already exist under `spiderJobs/platforms/zhilian/`:

| File | Old dependency | Migration target |
|------|--------|---------|
| `client.py` | `spiderJobs.core.http_client.HTTPClient` + `spiderJobs.platforms.zhilian.sign.ZhilianSign` | → `crawler_core.http_client.HTTPClient` + `crawler_core.zhilian.sign.ZhilianSign` |
| `api.py` | `spiderJobs.core.base.BaseFetcher/BaseSearcher` | → `crawler_core.base.BaseFetcher/BaseSearcher` |
| `main.py` | `spiderJobs.core.base.BaseFetcher/BaseSearcher` | → `crawler_core.base.BaseFetcher/BaseSearcher` |
| `sign.py` | Standalone implementation (identical to `crawler_core/zhilian/sign.py`) | → Backward-compatible stub that re-exports `ZhilianSign` |
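**Changes in zhilian/api.py:**

- Line 10: replace the import (there is no `ApiResult`; zhilian uses crawler_core's default parser, so no custom `_parse_response` is needed)
- Line 200: `return self._http.get(` → `return self.http_client.get(`

**Important difference:** In the Zhilian api.py, line 184 of `SearchCompanyPositions._build_params()` calls `self._client.signer.sign_params()`. The signer is reached indirectly through `self._client` (set to the `ZhilianClient` passed in), so the migration does not affect it (the attribute name is unchanged).
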
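
A trimmed sketch of why the signer path survives the rename: the signing call goes through `self._client`, which the migration leaves alone, while only the HTTP attribute changes name. The constructor shape, the `companyId` parameter, and the way `http_client` is obtained from the client are assumptions for illustration.

```python
from unittest.mock import MagicMock


class SearchCompanyPositions:
    """Trimmed stand-in: only the two attributes discussed above are modelled."""

    def __init__(self, client) -> None:
        self._client = client                  # the ZhilianClient passed in; name unchanged
        self.http_client = client.http_client  # formerly self._http (layout assumed)

    def _build_params(self, company_id: str) -> dict:
        params = {"companyId": company_id}     # hypothetical parameter name
        # Signing goes through the client, not through the renamed HTTP attribute.
        return self._client.signer.sign_params(params)


# A mocked client whose signer adds a placeholder signature.
client = MagicMock()
client.signer.sign_params.side_effect = lambda p: {**p, "sign": "stub"}
params = SearchCompanyPositions(client)._build_params("42")
```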
---
## 2. Workload Compared with Phase 2 (Boss)

| Dimension | Phase 2 Boss | Phase 3 job51 | Phase 3 zhilian |
|------|-------------|---------------|-----------------|
| `ApiResult` replacements | 11 | 11 | 0 (no custom parser) |
| `self._http` replacements | 3 | 2 | 1 |
| sign.py → stub | ✓ | ✓ | ✓ |
| client.py import | ✓ | ✓ | ✓ |
| api.py import | ✓ | ✓ | ✓ |
| main.py import | ✓ | ✓ | ✓ |
---
## 3. Mock Testing Strategy

### 3.1 job51 tests (tests/job51/test_job51_client.py)

The job51 mock strategy is identical to the Boss one: mock `http_client` with `MagicMock()` and test:

- `_parse_job51_response` (pure function; cover success when status is 1, failure when status is not 1, and HTTP errors)
- `SearchRecommendJobs.search()` (normal case, HTTP error)
- `GetJobDetail.fetch()` (success, exception handling)
- `GetCompanyDetail.fetch()` (success)
- `Job51Client._job51_headers()` (sign injection)
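### 3.2 zhilian tests (tests/zhilian/test_zhilian_client.py)

Zhilian requests go through POST (cgate) or GET (capi); the mocking approach is the same:

- `SearchPositions.search()` (normal case, error)
- `GetPositionDetail.fetch()` (success)
- `SearchCompanyPositions.search()` (success; specifically verify that `sign_params` is called)
- `ZhilianClient.post/get` (verify that the signature headers are injected)
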
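
For the last bullet, a header-injection test might look like the sketch below. The `ZhilianClient` constructor, the argument passed to `sign_headers()`, the `x-zp-sign` header name, and the `/cgate/search` path are assumptions; the real client may build and merge headers differently.

```python
from unittest.mock import MagicMock


class ZhilianClient:
    """Stand-in client; constructor arguments and header handling are assumed."""

    def __init__(self, http_client, signer) -> None:
        self.http_client = http_client
        self.signer = signer

    def post(self, url: str, json: dict):
        headers = self.signer.sign_headers(url)  # assumed call shape
        return self.http_client.post(url, json=json, headers=headers)


def test_post_injects_signature_headers() -> None:
    http = MagicMock()
    signer = MagicMock()
    signer.sign_headers.return_value = {"x-zp-sign": "stub"}  # hypothetical header name

    ZhilianClient(http, signer).post("/cgate/search", json={"kw": "python"})

    # The signer was consulted and its headers reached the HTTP layer.
    signer.sign_headers.assert_called_once_with("/cgate/search")
    _, kwargs = http.post.call_args
    assert kwargs["headers"] == {"x-zp-sign": "stub"}
    assert kwargs["json"] == {"kw": "python"}
```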
---
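## 4. Verifying the Success Criteria

| Criterion | How to verify |
|------|---------|
| job51 inherits from BaseFetcher/BaseSearcher | `issubclass()` assertion |
| zhilian inherits from BaseFetcher/BaseSearcher | `issubclass()` assertion |
| Neither platform has inline signing or HTTP boilerplate | `grep` finds no `requests` import and no `hmac` in client/api |
| Mock tests pass | `pytest tests/job51/ tests/zhilian/` |
| All three platforms share the same code structure | Code review (four-file layout: client/api/sign/main) |
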
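
The first three rows can also be checked from Python. The sketch below pairs an `issubclass()` assertion with an `ast`-based import scan, offered as an alternative to the `grep` named in the table rather than a replacement for it. The stand-in classes and the two sample source strings are illustrative; in the project the scan would read the real `client.py` and `api.py` files.

```python
import ast


def forbidden_imports(source: str, banned=("requests", "hmac")) -> set:
    """Return the banned top-level module names imported by `source`."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]  # relative imports have no module name
        else:
            continue
        found.update(n.split(".")[0] for n in names if n.split(".")[0] in banned)
    return found


class BaseFetcher:  # stand-in for crawler_core.base.BaseFetcher
    pass


class GetJobDetail(BaseFetcher):  # stand-in for a migrated job51 fetcher
    pass


# Sample sources: one migrated file and one with leftover boilerplate.
clean = "from crawler_core.http_client import HTTPClient\n"
dirty = "import hmac\nimport requests.adapters\n"
```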
---
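## RESEARCH COMPLETE

**Phase 3 is ready for planning, split into 2 PLANs:**

- **Plan 01:** Migrate the 51job (job51) layer + mock tests
- **Plan 02:** Migrate the Zhilian Zhaopin (zhilian) layer + mock tests

The two plans do not depend on each other and could in theory run in parallel, but running them sequentially is safer.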