docs(01-02): complete sign algorithms plan — SUMMARY, STATE, ROADMAP updates
- Create 01-02-SUMMARY.md with implementation details, test counts, deviation docs
- STATE.md: mark plan 02 complete, add decisions from plan 02, update position
- ROADMAP.md: mark phase 1 plans 2/2 complete, update progress table
- REQUIREMENTS.md: mark QUAL-01 complete (41 sign algorithm unit tests)
parent 333a6d155e
commit 76085ac403
@@ -25,7 +25,7 @@

 ### Quality Assurance (QUAL)

-- [ ] **QUAL-01**: Unit test coverage for the core sign algorithms
+- [x] **QUAL-01**: Unit test coverage for the core sign algorithms
 - [ ] **QUAL-02**: Unit test coverage for data parsing and deduplication logic
 - [ ] **QUAL-03**: HTTP request layer tested with mock/respx
 - [x] **QUAL-04**: Improve structured logging (unified loguru format)
@@ -67,7 +67,7 @@
 | DATA-02 | TBD | Pending |
 | DATA-03 | TBD | Pending |
 | DATA-04 | TBD | Pending |
-| QUAL-01 | TBD | Pending |
+| QUAL-01 | Phase 1 | Complete |
 | QUAL-02 | TBD | Pending |
 | QUAL-03 | TBD | Pending |
 | QUAL-04 | TBD | Complete |

@@ -30,10 +30,10 @@
 3. The BaseFetcher/BaseSearcher base classes are fully defined and provide template methods for subclasses to implement
 4. The loguru log format is unified, and the tenacity retry decorator works out of the box
 5. The old crawler (jobs_spider/) still runs normally and is unchanged (isolated by a feature flag)
-**Plans:** 1/2 plans executed
+**Plans:** 2/2 plans executed
 Plans:
 - [x] 01-01-PLAN.md - Package scaffold: pyproject.toml + HTTPClient (tenacity + logging) + BaseFetcher/BaseSearcher + Pipfile deps
-- [ ] 01-02-PLAN.md - Sign algorithms: port BossSign/Job51Sign/ZhilianSign to crawler_core/ + unit tests (pytest)
+- [x] 01-02-PLAN.md - Sign algorithms: port BossSign/Job51Sign/ZhilianSign to crawler_core/ + unit tests (pytest)

 ### Phase 2: Boss Zhipin rewrite
 **Goal:** The Boss Zhipin crawler runs entirely on crawler_core, and the old implementation can be safely retired
@@ -96,7 +96,7 @@ Plans:

 | Phase | Plans Complete | Status | Completed |
 |-------|----------------|--------|-----------|
-| 1. Shared core package | 1/2 | In Progress | |
+| 1. Shared core package | 2/2 | Complete | 2026-03-21 |
 | 2. Boss Zhipin rewrite | 0/? | Not started | - |
 | 3. 51Job & Zhilian rewrite | 0/? | Not started | - |
 | 4. Backend & external script integration | 0/? | Not started | - |
@@ -134,4 +134,4 @@ Plans:
 ---

 *Roadmap created: 2026-03-21*
-*Last updated: 2026-03-21 - Phase 1 plans created (01-01, 01-02)*
+*Last updated: 2026-03-21 - Phase 1 complete (01-01, 01-02 done); 41 sign algorithm unit tests passing*

@@ -3,19 +3,19 @@ gsd_state_version: 1.0
 milestone: v1.0
 milestone_name: milestone
 status: unknown
-last_updated: "2026-03-21T10:12:08.079Z"
+last_updated: "2026-03-21T10:23:00.000Z"
 progress:
   total_phases: 6
   completed_phases: 0
   total_plans: 2
-  completed_plans: 1
+  completed_plans: 2
 ---

 # STATE: JobData crawler interaction refactor

-**Last updated:** 2026-03-21T10:12Z
+**Last updated:** 2026-03-21T10:23Z
 **Milestone:** Crawler interaction refactor v1
-**Stopped At:** Completed 01-shared-core Plan 01 (crawler_core package)
+**Stopped At:** Completed 01-shared-core Plan 02 (sign algorithms + unit tests)

 ---

@@ -23,26 +23,27 @@ progress:

 **Core value:** Keyword-driven crawlers fetch job data, store it reliably in ClickHouse, and collect and sync company information on a schedule

-**Current focus:** Phase 01 - shared-core
+**Current focus:** Phase 01 - shared-core (COMPLETE)

 ---

 ## Current Position

-Phase: 01 (shared-core) - EXECUTING
-Plan: 2 of 2
+Phase: 01 (shared-core) - COMPLETE
+Plan: 2 of 2 (all plans done)

 ## Performance Metrics

 | Metric | Value |
 |--------|-------|
-| Plans completed | 0 |
+| Plans completed | 2 |
 | Plans total (estimated) | TBD |
 | Phases completed | 0/6 |
 | Requirements mapped | 19/19 |

 ---
 | Phase 01-shared-core P01 | 15min | 3 tasks | 8 files |
+| Phase 01-shared-core P02 | 15min | 2 tasks | 7 files |

 ## Accumulated Context

@@ -59,6 +60,9 @@ Plan: 2 of 2
 | tenacity min=10s wait | STACK.md mandatory 10s anti-detection delay preserved | Decided (01-01) |
 | stdlib logging in crawler_core | No loguru dependency; loguru bridge deferred to Phase 5 | Decided (01-01) |
 | Result[T] replaces ApiResult | Typed generic eliminates untyped Any in return values | Decided (01-01) |
+| Sign algorithms ported verbatim | Behavioral parity with originals; no regression risk | Decided (01-02) |
+| No tests/crawler_core/__init__.py | Would shadow crawler_core package causing import failures | Decided (01-02) |
+| conftest.py at project root | Reliable sys.path setup for pytest crawler_core imports | Decided (01-02) |

 ### Architecture Notes

@@ -66,6 +70,7 @@ Plan: 2 of 2
 - The core stays synchronous (requests_go); the backend calls it via asyncio.to_thread()
 - SmartIPManager must be a singleton in FastAPI to avoid losing session state
 - ClickHouse table schemas are unchanged; MySQL model changes go through Aerich
+- Sign algorithm tests: 41 pure-function unit tests with no network dependency; run them with `pytest tests/crawler_core/`

 ### Known Risks

@@ -102,7 +107,8 @@ None currently.
 - `.planning/ROADMAP.md` - Phase structure and success criteria
 - `.planning/REQUIREMENTS.md` - All v1 requirements with traceability
 - `.planning/PROJECT.md` - Project context and constraints
-- `crawler_core/` - CREATED: shared package with HTTPClient, BaseFetcher, BaseSearcher, Result[T]
+- `crawler_core/` - CREATED: shared package with HTTPClient, BaseFetcher, BaseSearcher, Result[T], plus sign algorithm subpackages
+- `tests/crawler_core/` - CREATED: 41 pure-function sign algorithm unit tests
 - `spiderJobs/` - external scripts using crawler_core
 - `app/services/crawler/` - backend facade layer

154
.planning/phases/01-shared-core/01-02-SUMMARY.md
Normal file
@@ -0,0 +1,154 @@
---
phase: 01-shared-core
plan: "02"
subsystem: crawler_core
tags: [python, sign-algorithms, unit-tests, pure-functions, boss, qcwy, zhilian]
dependency_graph:
  requires: [01-01]
  provides: [crawler_core/boss/sign.py, crawler_core/qcwy/sign.py, crawler_core/zhilian/sign.py]
  affects: [spiderJobs, app/services/crawler, tests/crawler_core]
tech_stack:
  added: [pytest-for-sign-tests]
  patterns: [pure-function-testing, sys-path-conftest, try-except-import-guard]
key_files:
  created:
    - crawler_core/boss/sign.py
    - crawler_core/qcwy/sign.py
    - crawler_core/zhilian/sign.py
    - tests/crawler_core/test_boss_sign.py
    - tests/crawler_core/test_qcwy_sign.py
    - tests/crawler_core/test_zhilian_sign.py
    - conftest.py
  modified:
    - pyproject.toml
    - crawler_core/__init__.py
decisions:
  - "Sign algorithm files ported verbatim from spiderJobs/platforms/; only module docstring updated to reference crawler_core"
  - "tests/crawler_core/__init__.py NOT created (would shadow crawler_core package as namespace package, breaking imports)"
  - "conftest.py at project root adds project root to sys.path for reliable crawler_core subpackage imports"
  - "pyproject.toml [tool.pytest.ini_options] pythonpath=['.'] added for pytest runner path setup"
  - "crawler_core/__init__.py wrapped in try/except ImportError to allow sign subpackages to import in envs without requests_go"
metrics:
  duration: "~15 minutes"
  completed: "2026-03-21T10:22:46Z"
  tasks_completed: 2
  files_created: 7
  files_modified: 2
---

# Phase 01 Plan 02: Sign Algorithm Port + Unit Tests - Summary

**One-liner:** Ported Boss/51Job/Zhilian sign algorithms verbatim from `spiderJobs/platforms/` to `crawler_core/{boss,qcwy,zhilian}/sign.py` and wrote 41 pure-function unit tests (no HTTP, no mocks) covering all exported classes and helpers.

## What Was Created

| File | Lines | Purpose |
|------|-------|---------|
| `crawler_core/boss/sign.py` | 108 | BossSign traceid generator, pure stdlib (random, time), ported from spiderJobs |
| `crawler_core/qcwy/sign.py` | 89 | Job51Sign HMAC-SHA256 signing, pure stdlib (hmac, hashlib), ported from spiderJobs |
| `crawler_core/zhilian/sign.py` | 87 | ZhilianSign header/param signing, pure stdlib (math, random), ported from spiderJobs |
| `tests/crawler_core/test_boss_sign.py` | 85 | 13 tests: BossSign, _compute_checksum, _generate_uuid, _CHARS |
| `tests/crawler_core/test_qcwy_sign.py` | 73 | 12 tests: Job51Sign init, build_sign_path, generate_uuid |
| `tests/crawler_core/test_zhilian_sign.py` | 91 | 16 tests: ZhilianSign init, sign_headers, sign_params, generate_uuid |
| `conftest.py` | 4 | Project root sys.path setup for pytest import resolution |

## Exported Interfaces

**crawler_core/boss/sign.py:**
```python
from crawler_core.boss.sign import BossSign, _compute_checksum, _generate_uuid, _CHARS

BossSign.generate_traceid(prefix="M-W") -> str  # "M-W" + 19-char uuid + 3-char checksum = 25 chars
BossSign(mpt="", wt2="")                         # token holders, no HTTP
_compute_checksum(uuid_str: str) -> str          # 3-char base62 deterministic checksum
_generate_uuid() -> str                          # 13-char hex timestamp + 6-char base62 random = 19 chars
```

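The traceid layout described above (prefix + 19-char uuid + 3-char checksum) can be illustrated with a minimal sketch. This is not the ported code: the alphabet ordering in `CHARS`, the zero-padded hex timestamp, and especially the checksum formula are assumptions made for illustration, since this summary documents only the lengths and the determinism of the real `_compute_checksum`.

```python
import random
import string
import time

# Assumed base62 alphabet; the real _CHARS ordering is not shown in this summary.
CHARS = string.digits + string.ascii_uppercase + string.ascii_lowercase


def generate_uuid() -> str:
    """13-char zero-padded hex millisecond timestamp + 6 random base62 chars."""
    ts_hex = format(int(time.time() * 1000), "013x")
    rand = "".join(random.choice(CHARS) for _ in range(6))
    return ts_hex + rand


def compute_checksum(uuid_str: str) -> str:
    """Hypothetical deterministic checksum: 3 base62 digits of a weighted char sum."""
    total = sum((i + 1) * ord(ch) for i, ch in enumerate(uuid_str))
    digits = []
    for _ in range(3):
        total, rem = divmod(total, 62)
        digits.append(CHARS[rem])
    return "".join(digits)


def generate_traceid(prefix: str = "M-W") -> str:
    """prefix + 19-char uuid + 3-char checksum (25 chars with the default prefix)."""
    uuid_str = generate_uuid()
    return prefix + uuid_str + compute_checksum(uuid_str)
```

Because the checksum depends only on the uuid part, a pure-function test can recompute it from a generated traceid without any network access or mocks.
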
**crawler_core/qcwy/sign.py:**
```python
from crawler_core.qcwy.sign import Job51Sign, SIGN_KEY

SIGN_KEY = "abfc8f9d..."  # 64-char HMAC key
Job51Sign(sign_key=SIGN_KEY)
Job51Sign.generate_uuid() -> str  # 13-char ms timestamp + 10-char random int = 23 digits
Job51Sign.build_sign_path(endpoint, method, params, body) -> tuple[str, str]  # (url_path, sign_hex)
```

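As a rough illustration of what an HMAC-SHA256 path signer with this shape might look like, here is a minimal sketch. The message layout (sorted query string, body appended only for non-GET requests), the free-function form, and the example endpoint are all assumptions; the summary only states that `build_sign_path` returns `(url_path, sign_hex)` and uses `hmac` and `hashlib`.

```python
import hashlib
import hmac
from typing import Optional, Tuple
from urllib.parse import urlencode


def build_sign_path(
    sign_key: str,
    endpoint: str,
    method: str = "GET",
    params: Optional[dict] = None,
    body: str = "",
) -> Tuple[str, str]:
    """Return (url_path, sign_hex); the message layout here is hypothetical."""
    query = urlencode(sorted((params or {}).items()))
    url_path = f"{endpoint}?{query}" if query else endpoint
    # Assumption: only non-GET requests include the body in the signed message.
    message = url_path if method.upper() == "GET" else url_path + body
    sign_hex = hmac.new(
        sign_key.encode("utf-8"), message.encode("utf-8"), hashlib.sha256
    ).hexdigest()
    return url_path, sign_hex
```

Sorting the parameters makes the signature independent of dict insertion order, which keeps tests of the signer deterministic.
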
**crawler_core/zhilian/sign.py:**
```python
from crawler_core.zhilian.sign import ZhilianSign

ZhilianSign(at="", rt="", device_id=None, version="4.1.259", channel="wxxiaochengxu", platform="12")
ZhilianSign.generate_uuid() -> str               # UUID4-format uppercase hex, 36 chars
ZhilianSign.sign_headers(page_code="0") -> dict  # 9 headers including x-zp-device-id, x-zp-business-system="73"
ZhilianSign.sign_params() -> dict                # 6 params: at, rt, channel, platform, version, d
```

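For readers unfamiliar with the "UUID4-format uppercase hex, 36 chars" shape, the sketch below builds such a string with only the `random` module. It is an illustrative stand-in, not the ported `ZhilianSign.generate_uuid`; the template approach and the helper `is_uuid4_upper` are assumptions introduced here.

```python
import random
import re

# Version nibble fixed to 4; "y" is the variant nibble (8, 9, A, or B).
_TEMPLATE = "xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx"
_UUID4_UPPER = re.compile(
    r"[0-9A-F]{8}-[0-9A-F]{4}-4[0-9A-F]{3}-[89AB][0-9A-F]{3}-[0-9A-F]{12}"
)


def generate_uuid() -> str:
    """Build a 36-char uppercase UUID4-shaped string from random hex digits."""
    out = []
    for ch in _TEMPLATE:
        if ch == "x":
            out.append(format(random.randrange(16), "X"))
        elif ch == "y":
            out.append(format(random.randrange(8, 12), "X"))
        else:
            out.append(ch)  # keep the dashes and the literal version digit
    return "".join(out)


def is_uuid4_upper(value: str) -> bool:
    """Hypothetical helper a pure-function test could use to check the format."""
    return _UUID4_UPPER.fullmatch(value) is not None
```
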
## Test Counts

| File | Tests | Classes |
|------|-------|---------|
| `test_boss_sign.py` | 13 | TestBossSignGenerateTraceid (6), TestComputeChecksum (4), TestGenerateUuid (3) |
| `test_qcwy_sign.py` | 12 | TestJob51SignInit (2), TestJob51SignBuildSignPath (7), TestJob51SignGenerateUuid (3) |
| `test_zhilian_sign.py` | 16 | TestZhilianSignInit (4), TestZhilianSignHeaders (6), TestZhilianSignParams (3), TestZhilianGenerateUuid (3) |
| **Total** | **41** | **10 test classes** |

## Decisions Made

| Decision | Rationale |
|----------|-----------|
| Port verbatim, change only docstring | Ensures behavioral parity with originals; no regression risk |
| No `tests/crawler_core/__init__.py` | Creating it would make `tests/crawler_core/` a Python package named `crawler_core`, shadowing the real package and causing `ModuleNotFoundError: No module named 'crawler_core.boss'` |
| `conftest.py` at project root | Pytest loads conftest.py before test collection; sys.path.insert here ensures crawler_core is findable |
| `try/except ImportError` in crawler_core/__init__.py | `requests_go` is not installed in the lightweight pyenv Python used for tests; wrapping allows sign subpackages to be imported without the heavy HTTP client |
| `pyproject.toml pythonpath=["."]` | Belt-and-suspenders path setup for pytest in addition to conftest.py |
| Old spiderJobs files preserved | Plan explicitly states not to delete/modify until Phase 4 cleanup |

||||
## Commits
|
||||
|
||||
| Hash | Task | Description |
|
||||
|------|------|-------------|
|
||||
| `bd1e50e` | Task 1 | Port sign algorithms to crawler_core/ platform directories |
|
||||
| `333a6d1` | Task 2 | Write sign algorithm unit tests for crawler_core |
|
||||
|
||||
## Deviations from Plan

### Auto-fixed Issues

**1. [Rule 2 - Missing Critical Functionality] Removed tests/crawler_core/__init__.py**
- **Found during:** Task 2 verification
- **Issue:** Creating `tests/crawler_core/__init__.py` (as the plan specified) caused pytest to treat `tests/crawler_core/` as the `crawler_core` package, shadowing the actual `crawler_core/` package at project root. Result: `ModuleNotFoundError: No module named 'crawler_core.boss'`
- **Fix:** Removed `tests/crawler_core/__init__.py`; added `conftest.py` at project root; added `[tool.pytest.ini_options] pythonpath=["."]` to `pyproject.toml`
- **Files modified:** `conftest.py` (new), `pyproject.toml` (modified), `crawler_core/__init__.py` (try/except guard; this had already been added by the 01-01 executor)
- **Commits:** `333a6d1`

## Known Stubs

None. All three sign algorithm files are fully implemented as pure functions, and the test files contain all specified test cases.

## Self-Check: PASSED

All created files exist:
- [x] `crawler_core/boss/sign.py`: FOUND
- [x] `crawler_core/qcwy/sign.py`: FOUND
- [x] `crawler_core/zhilian/sign.py`: FOUND
- [x] `tests/crawler_core/test_boss_sign.py`: FOUND
- [x] `tests/crawler_core/test_qcwy_sign.py`: FOUND
- [x] `tests/crawler_core/test_zhilian_sign.py`: FOUND
- [x] `conftest.py`: FOUND

All commits exist:
- [x] `bd1e50e`: FOUND (feat(01-02): port sign algorithms)
- [x] `333a6d1`: FOUND (test(01-02): write sign algorithm unit tests)

Cross-contamination checks:
- `grep -r "from spiderJobs" crawler_core/` → PASSED: no spiderJobs imports
- `grep -r "import requests" crawler_core/boss/sign.py crawler_core/qcwy/sign.py crawler_core/zhilian/sign.py` → PASSED: no HTTP imports
- `git diff --name-only spiderJobs/` → PASSED: spiderJobs files untouched

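The grep-based cross-contamination checks above could also be expressed as a small Python helper, for example inside a test. This is only a sketch: the function name `find_forbidden_lines` is hypothetical, and it does plain substring matching, as `grep` does with these fixed patterns, not import analysis.

```python
from pathlib import Path
from typing import List, Sequence


def find_forbidden_lines(root: Path, needles: Sequence[str]) -> List[str]:
    """Return 'path:lineno: line' entries for every .py line containing a needle."""
    hits = []
    for path in sorted(root.rglob("*.py")):
        lines = path.read_text(encoding="utf-8").splitlines()
        for lineno, line in enumerate(lines, start=1):
            if any(needle in line for needle in needles):
                hits.append(f"{path}:{lineno}: {line.strip()}")
    return hits
```

A check such as `assert find_forbidden_lines(Path("crawler_core"), ["from spiderJobs"]) == []` would then mirror the first grep command.
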
Test count: 41 tests across 3 files (requirement: 24+ tests)
- test_boss_sign.py: 13 tests (requirement: 8+)
- test_qcwy_sign.py: 12 tests (requirement: 7+). Note: the plan specifies 5+, but the acceptance criteria say 7+
- test_zhilian_sign.py: 16 tests (requirement: 9+)

Import verification: All three sign modules import successfully from crawler_core with project root on sys.path.