docs(01-02): complete sign algorithms plan — SUMMARY, STATE, ROADMAP updates

- Create 01-02-SUMMARY.md with implementation details, test counts, deviation docs
- STATE.md: mark plan 02 complete, add decisions from plan 02, update position
- ROADMAP.md: mark phase 1 plans 2/2 complete, update progress table
- REQUIREMENTS.md: mark QUAL-01 complete (41 sign algorithm unit tests)
win 2026-03-21 18:27:35 +08:00
parent 333a6d155e
commit 76085ac403
4 changed files with 175 additions and 15 deletions

@ -25,7 +25,7 @@
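### Quality Assurance (QUAL)
- [ ] **QUAL-01**: Unit test coverage for core sign algorithms
- [x] **QUAL-01**: Unit test coverage for core sign algorithms
- [ ] **QUAL-02**: Unit test coverage for data parsing and deduplication logic
- [ ] **QUAL-03**: HTTP request layer tested with mock/respx
- [x] **QUAL-04**: Improve structured logging (unified loguru format)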
@ -67,7 +67,7 @@
| DATA-02 | TBD | Pending |
| DATA-03 | TBD | Pending |
| DATA-04 | TBD | Pending |
| QUAL-01 | TBD | Pending |
| QUAL-01 | Phase 1 | Complete |
| QUAL-02 | TBD | Pending |
| QUAL-03 | TBD | Pending |
| QUAL-04 | TBD | Complete |

@ -30,10 +30,10 @@
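3. BaseFetcher/BaseSearcher base classes are fully defined and provide template methods for subclasses to implement
4. loguru log format is unified; the tenacity retry decorator works out of the box
5. The legacy crawler (jobs_spider/) still runs normally and is unchanged (isolated by feature flag)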
**Plans:** 1/2 plans executed
**Plans:** 2/2 plans executed
Plans:
- [x] 01-01-PLAN.md — Package scaffold: pyproject.toml + HTTPClient (tenacity + logging) + BaseFetcher/BaseSearcher + Pipfile deps
- [ ] 01-02-PLAN.md — Sign algorithms: port BossSign/Job51Sign/ZhilianSign to crawler_core/ + unit tests (pytest)
- [x] 01-02-PLAN.md — Sign algorithms: port BossSign/Job51Sign/ZhilianSign to crawler_core/ + unit tests (pytest)
### Phase 2: Boss Zhipin rewrite
**Goal:** The Boss Zhipin crawler runs entirely on crawler_core, and the old implementation can be safely retired
@ -96,7 +96,7 @@ Plans:
| Phase | Plans Complete | Status | Completed |
|-------|----------------|--------|-----------|
| 1. Shared core package | 1/2 | In Progress | |
| 1. Shared core package | 2/2 | Complete | 2026-03-21 |
| 2. Boss Zhipin rewrite | 0/? | Not started | - |
| 3. 51Job & Zhilian rewrite | 0/? | Not started | - |
| 4. Backend & external script integration | 0/? | Not started | - |
@ -134,4 +134,4 @@ Plans:
---
*Roadmap created: 2026-03-21*
*Last updated: 2026-03-21 — Phase 1 plans created (01-01, 01-02)*
*Last updated: 2026-03-21 — Phase 1 complete (01-01, 01-02 done); 41 sign algorithm unit tests passing*

@ -3,19 +3,19 @@ gsd_state_version: 1.0
milestone: v1.0
milestone_name: milestone
status: unknown
last_updated: "2026-03-21T10:12:08.079Z"
last_updated: "2026-03-21T10:23:00.000Z"
progress:
total_phases: 6
completed_phases: 0
total_plans: 2
completed_plans: 1
completed_plans: 2
---
# STATE: JobData crawler interaction refactor
**Last updated:** 2026-03-21T10:12Z
**Last updated:** 2026-03-21T10:23Z
**Milestone:** Crawler interaction refactor v1
**Stopped At:** Completed 01-shared-core Plan 01 (crawler_core package)
**Stopped At:** Completed 01-shared-core Plan 02 (sign algorithms + unit tests)
---
@ -23,26 +23,27 @@ progress:
**Core value:** Keyword-driven crawlers fetch job data and store it reliably in ClickHouse; company information collection and sync run on a schedule
**Current focus:** Phase 01 — shared-core
**Current focus:** Phase 01 — shared-core (COMPLETE)
---
## Current Position
Phase: 01 (shared-core) — EXECUTING
Plan: 2 of 2
Phase: 01 (shared-core) — COMPLETE
Plan: 2 of 2 (all plans done)
## Performance Metrics
| Metric | Value |
|--------|-------|
| Plans completed | 0 |
| Plans completed | 2 |
| Plans total (estimated) | TBD |
| Phases completed | 0/6 |
| Requirements mapped | 19/19 |
---
| Phase 01-shared-core P01 | 15min | 3 tasks | 8 files |
| Phase 01-shared-core P02 | 15min | 2 tasks | 7 files |
## Accumulated Context
@ -59,6 +60,9 @@ Plan: 2 of 2
| tenacity min=10s wait | STACK.md mandatory 10s anti-detection delay preserved | Decided (01-01) |
| stdlib logging in crawler_core | No loguru dependency; loguru bridge deferred to Phase 5 | Decided (01-01) |
| Result[T] replaces ApiResult | Typed generic eliminates untyped Any in return values | Decided (01-01) |
| Sign algorithms ported verbatim | Behavioral parity with originals; no regression risk | Decided (01-02) |
| No tests/crawler_core/__init__.py | Would shadow crawler_core package causing import failures | Decided (01-02) |
| conftest.py at project root | Reliable sys.path setup for pytest crawler_core imports | Decided (01-02) |
### Architecture Notes
@ -66,6 +70,7 @@ Plan: 2 of 2
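- The core stays synchronous (requests_go); the backend calls it via asyncio.to_thread()
- SmartIPManager must be a singleton in FastAPI to avoid losing session state
- ClickHouse table schema stays unchanged; MySQL model changes go through Aerich
- Sign algorithm tests: 41 pure-function unit tests with no network dependency; run with `pytest tests/crawler_core/`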
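
The first architecture note (a synchronous core called from the backend through `asyncio.to_thread()`) might look roughly like the sketch below. `fetch_jobs_sync` is a hypothetical stand-in for a blocking crawler_core call, not a function from this repository.

```python
import asyncio
import time


def fetch_jobs_sync(keyword: str) -> list[str]:
    """Hypothetical blocking call standing in for a synchronous crawler_core fetch."""
    time.sleep(0.01)  # simulate blocking network I/O
    return [f"{keyword}-job-1", f"{keyword}-job-2"]


async def fetch_jobs(keyword: str) -> list[str]:
    # Run the blocking function in a worker thread so the event loop stays responsive.
    return await asyncio.to_thread(fetch_jobs_sync, keyword)


if __name__ == "__main__":
    print(asyncio.run(fetch_jobs("python")))
```

`asyncio.to_thread()` requires Python 3.9 or later.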
### Known Risks
@ -102,7 +107,8 @@ None currently.
- `.planning/ROADMAP.md` — Phase structure and success criteria
- `.planning/REQUIREMENTS.md` — All v1 requirements with traceability
- `.planning/PROJECT.md` — Project context and constraints
- `crawler_core/` — CREATED: shared package with HTTPClient, BaseFetcher, BaseSearcher, Result[T]
- `crawler_core/` — CREATED: shared package with HTTPClient, BaseFetcher, BaseSearcher, Result[T], plus sign algorithm subpackages
- `tests/crawler_core/` — CREATED: 41 pure-function sign algorithm unit tests
- `spiderJobs/` — external scripts using crawler_core
- `app/services/crawler/` — backend facade layer

@ -0,0 +1,154 @@
---
phase: 01-shared-core
plan: "02"
subsystem: crawler_core
tags: [python, sign-algorithms, unit-tests, pure-functions, boss, qcwy, zhilian]
dependency_graph:
requires: [01-01]
provides: [crawler_core/boss/sign.py, crawler_core/qcwy/sign.py, crawler_core/zhilian/sign.py]
affects: [spiderJobs, app/services/crawler, tests/crawler_core]
tech_stack:
added: [pytest-for-sign-tests]
patterns: [pure-function-testing, sys-path-conftest, try-except-import-guard]
key_files:
created:
- crawler_core/boss/sign.py
- crawler_core/qcwy/sign.py
- crawler_core/zhilian/sign.py
- tests/crawler_core/test_boss_sign.py
- tests/crawler_core/test_qcwy_sign.py
- tests/crawler_core/test_zhilian_sign.py
- conftest.py
modified:
- pyproject.toml
- crawler_core/__init__.py
decisions:
- "Sign algorithm files ported verbatim from spiderJobs/platforms/ — only module docstring updated to reference crawler_core"
- "tests/crawler_core/__init__.py NOT created (would shadow crawler_core package as namespace package, breaking imports)"
- "conftest.py at project root adds project root to sys.path for reliable crawler_core subpackage imports"
- "pyproject.toml [tool.pytest.ini_options] pythonpath=['.'] added for pytest runner path setup"
- "crawler_core/__init__.py wrapped in try/except ImportError to allow sign subpackages to import in envs without requests_go"
metrics:
duration: "~15 minutes"
completed: "2026-03-21T10:22:46Z"
tasks_completed: 2
files_created: 7
files_modified: 2
---
# Phase 01 Plan 02: Sign Algorithm Port + Unit Tests — Summary
**One-liner:** Ported Boss/51Job/Zhilian sign algorithms verbatim from `spiderJobs/platforms/` to `crawler_core/{boss,qcwy,zhilian}/sign.py` and wrote 41 pure-function unit tests (no HTTP, no mocks) covering all exported classes and helpers.
## What Was Created
| File | Lines | Purpose |
|------|-------|---------|
| `crawler_core/boss/sign.py` | 108 | BossSign traceid generator — pure stdlib (random, time), ported from spiderJobs |
| `crawler_core/qcwy/sign.py` | 89 | Job51Sign HMAC-SHA256 signing — pure stdlib (hmac, hashlib), ported from spiderJobs |
| `crawler_core/zhilian/sign.py` | 87 | ZhilianSign header/param signing — pure stdlib (math, random), ported from spiderJobs |
| `tests/crawler_core/test_boss_sign.py` | 85 | 13 tests: BossSign, _compute_checksum, _generate_uuid, _CHARS |
| `tests/crawler_core/test_qcwy_sign.py` | 73 | 12 tests: Job51Sign init, build_sign_path, generate_uuid |
| `tests/crawler_core/test_zhilian_sign.py` | 91 | 16 tests: ZhilianSign init, sign_headers, sign_params, generate_uuid |
| `conftest.py` | 4 | Project root sys.path setup for pytest import resolution |
## Exported Interfaces
**crawler_core/boss/sign.py:**
```python
from crawler_core.boss.sign import BossSign, _compute_checksum, _generate_uuid, _CHARS
BossSign.generate_traceid(prefix="M-W") -> str # "M-W" + 19-char uuid + 3-char checksum = 25 chars
BossSign(mpt="", wt2="") # token holders, no HTTP
_compute_checksum(uuid_str: str) -> str # 3-char base62 deterministic checksum
_generate_uuid() -> str # 13-char hex timestamp + 6-char base62 random = 19 chars
```
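
The documented traceid layout (prefix, then a 19-char uuid, then a 3-char checksum) can be checked structurally without knowing the checksum algorithm. The helper below is a hypothetical illustration of that layout only; `split_traceid` is not part of `crawler_core`.

```python
def split_traceid(traceid: str, prefix: str = "M-W") -> tuple[str, str, str]:
    """Split a traceid into (prefix, uuid, checksum) using the documented lengths."""
    uuid_len, checksum_len = 19, 3
    expected_len = len(prefix) + uuid_len + checksum_len
    if not traceid.startswith(prefix) or len(traceid) != expected_len:
        raise ValueError(f"expected {expected_len} chars starting with {prefix!r}")
    body = traceid[len(prefix):]
    return prefix, body[:uuid_len], body[uuid_len:]


# Made-up sample value with the documented shape (not a real traceid).
sample = "M-W" + "0123456789abcABCxyz" + "Q1z"
parts = split_traceid(sample)
```

A test could then feed the uuid part back into `_compute_checksum` and compare it with the checksum part, assuming the checksum is derived from the uuid alone as the signature suggests.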
**crawler_core/qcwy/sign.py:**
```python
from crawler_core.qcwy.sign import Job51Sign, SIGN_KEY
SIGN_KEY = "abfc8f9d..." # 64-char HMAC key
Job51Sign(sign_key=SIGN_KEY)
Job51Sign.generate_uuid() -> str # 13-char ms timestamp + 10-char random int = 23 digits
Job51Sign.build_sign_path(endpoint, method, params, body) -> tuple[str, str] # (url_path, sign_hex)
```
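
For readers unfamiliar with the stdlib pieces named above, a generic HMAC-SHA256 hex signature can be sketched as follows. The key, the message, and the `hmac_sha256_hex` helper are illustrative assumptions; the actual string that `build_sign_path` signs is not described in this summary and is not reproduced here.

```python
import hashlib
import hmac


def hmac_sha256_hex(key: str, message: str) -> str:
    """Return the hex-encoded HMAC-SHA256 of message under key."""
    return hmac.new(
        key.encode("utf-8"), message.encode("utf-8"), hashlib.sha256
    ).hexdigest()


# Hypothetical key and path, for illustration only.
demo_sign = hmac_sha256_hex("demo-key", "/demo/path?keyword=python")
```

A SHA-256 digest is 32 bytes, so the hex signature is always 64 characters, and the same key and message always yield the same signature. That determinism is what makes this kind of code testable without HTTP or mocks.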
**crawler_core/zhilian/sign.py:**
```python
from crawler_core.zhilian.sign import ZhilianSign
ZhilianSign(at="", rt="", device_id=None, version="4.1.259", channel="wxxiaochengxu", platform="12")
ZhilianSign.generate_uuid() -> str # UUID4-format uppercase hex, 36 chars
ZhilianSign.sign_headers(page_code="0") -> dict # 9 headers including x-zp-device-id, x-zp-business-system="73"
ZhilianSign.sign_params() -> dict # 6 params: at, rt, channel, platform, version, d
```
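
The documented output shape of `generate_uuid()` (36 characters, uppercase hex in 8-4-4-4-12 groups) can be asserted with a small format check. This is a sketch of how a test might verify the shape; `looks_like_upper_uuid` is a hypothetical helper, and the sample value is made up rather than produced by `ZhilianSign`.

```python
import re

# Shape only: 8-4-4-4-12 uppercase hex groups. Version and variant bits are
# deliberately not checked, since the summary does not state them.
_UPPER_UUID_RE = re.compile(
    r"[0-9A-F]{8}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{12}"
)


def looks_like_upper_uuid(value: str) -> bool:
    """Return True if value has the 36-char uppercase UUID shape."""
    return _UPPER_UUID_RE.fullmatch(value) is not None


sample_id = "123E4567-E89B-42D3-A456-426614174000"  # made-up example value
```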
## Test Counts
| File | Tests | Classes |
|------|-------|---------|
| `test_boss_sign.py` | 13 | TestBossSignGenerateTraceid (6), TestComputeChecksum (4), TestGenerateUuid (3) |
| `test_qcwy_sign.py` | 12 | TestJob51SignInit (2), TestJob51SignBuildSignPath (7), TestJob51SignGenerateUuid (3) |
| `test_zhilian_sign.py` | 16 | TestZhilianSignInit (4), TestZhilianSignHeaders (6), TestZhilianSignParams (3), TestZhilianGenerateUuid (3) |
| **Total** | **41** | **10 test classes** |
## Decisions Made
| Decision | Rationale |
|----------|-----------|
| Port verbatim, change only docstring | Ensures behavioral parity with originals; no regression risk |
| No `tests/crawler_core/__init__.py` | Creating it would make `tests/crawler_core/` a Python package named `crawler_core`, shadowing the real package and causing `ModuleNotFoundError: No module named 'crawler_core.boss'` |
| `conftest.py` at project root | Pytest loads conftest.py before test collection; sys.path.insert here ensures crawler_core is findable |
| `try/except ImportError` in crawler_core/__init__.py | `requests_go` is not installed in the lightweight pyenv Python used for tests; wrapping allows sign subpackages to be imported without the heavy HTTP client |
| `pyproject.toml pythonpath=["."]` | Belt-and-suspenders path setup for pytest in addition to conftest.py |
| Old spiderJobs files preserved | Plan explicitly states not to delete/modify until Phase 4 cleanup |
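
The `try/except ImportError` decision follows a common optional-dependency pattern. A minimal sketch, assuming a hypothetical `optional_import` helper and `HTTP_AVAILABLE` flag (the real `crawler_core/__init__.py` is not shown in this summary):

```python
import importlib
from types import ModuleType
from typing import Optional


def optional_import(name: str) -> Optional[ModuleType]:
    """Import a module by name, returning None if it is not installed."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


# Where requests_go is not installed this is None, and code that only needs
# the pure-stdlib sign modules can still be imported and tested.
requests_go = optional_import("requests_go")
HTTP_AVAILABLE = requests_go is not None
```

The trade-off is that a missing dependency surfaces later, when the HTTP client is first used, rather than at import time.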
## Commits
| Hash | Task | Description |
|------|------|-------------|
| `bd1e50e` | Task 1 | Port sign algorithms to crawler_core/ platform directories |
| `333a6d1` | Task 2 | Write sign algorithm unit tests for crawler_core |
## Deviations from Plan
### Auto-fixed Issues
**1. [Rule 2 - Missing Critical Functionality] Removed tests/crawler_core/__init__.py**
- **Found during:** Task 2 verification
- **Issue:** Creating `tests/crawler_core/__init__.py` (as the plan specified) caused pytest to treat `tests/crawler_core/` as the `crawler_core` package, shadowing the actual `crawler_core/` package at project root. Result: `ModuleNotFoundError: No module named 'crawler_core.boss'`
- **Fix:** Removed `tests/crawler_core/__init__.py`; added `conftest.py` at project root; added `[tool.pytest.ini_options] pythonpath=["."]` to `pyproject.toml`
- **Files modified:** `conftest.py` (new), `pyproject.toml` (modified), `crawler_core/__init__.py` (try/except guard, which the 01-01 executor had already added)
- **Commits:** `333a6d1`
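
The path setup described in the fix can be sketched as below. This is an assumption about what a four-line root `conftest.py` of this kind typically contains, not a copy of the committed file; `ensure_on_sys_path` is a hypothetical helper.

```python
import sys
from pathlib import Path


def ensure_on_sys_path(root: Path) -> str:
    """Put root at the front of sys.path unless it is already listed."""
    root_str = str(root.resolve())
    if root_str not in sys.path:
        sys.path.insert(0, root_str)
    return root_str


# In a real conftest.py this would typically be Path(__file__).parent;
# Path.cwd() is used here so the sketch also runs outside a file.
PROJECT_ROOT = ensure_on_sys_path(Path.cwd())
```

With the project root on `sys.path` and no `__init__.py` under `tests/crawler_core/`, `import crawler_core.boss.sign` resolves to the real package instead of the test directory.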
## Known Stubs
None — all three sign algorithm files are fully implemented pure functions. The test files are complete with all specified test cases.
## Self-Check: PASSED
All created files exist:
- [x] `crawler_core/boss/sign.py` — FOUND
- [x] `crawler_core/qcwy/sign.py` — FOUND
- [x] `crawler_core/zhilian/sign.py` — FOUND
- [x] `tests/crawler_core/test_boss_sign.py` — FOUND
- [x] `tests/crawler_core/test_qcwy_sign.py` — FOUND
- [x] `tests/crawler_core/test_zhilian_sign.py` — FOUND
- [x] `conftest.py` — FOUND
All commits exist:
- [x] `bd1e50e` — FOUND (feat(01-02): port sign algorithms)
- [x] `333a6d1` — FOUND (test(01-02): write sign algorithm unit tests)
Cross-contamination checks:
- `grep -r "from spiderJobs" crawler_core/` → PASSED: no spiderJobs imports
- `grep -r "import requests" crawler_core/boss/sign.py crawler_core/qcwy/sign.py crawler_core/zhilian/sign.py` → PASSED: no HTTP imports
- `git diff --name-only spiderJobs/` → PASSED: spiderJobs files untouched
Test count: 41 tests across 3 files (requirement: 24+ tests)
- test_boss_sign.py: 13 tests (requirement: 8+)
- test_qcwy_sign.py: 12 tests (requirement: 7+; note that the plan specifies 5+ but the acceptance criteria say 7+)
- test_zhilian_sign.py: 16 tests (requirement: 9+)
Import verification: All three sign modules import successfully from crawler_core with project root on sys.path.