diff --git a/.planning/REQUIREMENTS.md b/.planning/REQUIREMENTS.md
index 0b78350..42badc4 100644
--- a/.planning/REQUIREMENTS.md
+++ b/.planning/REQUIREMENTS.md
@@ -25,7 +25,7 @@
 
 ### 质量保证 (QUAL)
 
-- [ ] **QUAL-01**: 核心签名算法单元测试覆盖
+- [x] **QUAL-01**: 核心签名算法单元测试覆盖
 - [ ] **QUAL-02**: 数据解析和去重逻辑单元测试覆盖
 - [ ] **QUAL-03**: HTTP 请求层使用 mock/respx 测试
 - [x] **QUAL-04**: 完善结构化日志(loguru 统一格式)
@@ -67,7 +67,7 @@
 | DATA-02 | TBD | Pending |
 | DATA-03 | TBD | Pending |
 | DATA-04 | TBD | Pending |
-| QUAL-01 | TBD | Pending |
+| QUAL-01 | Phase 1 | Complete |
 | QUAL-02 | TBD | Pending |
 | QUAL-03 | TBD | Pending |
 | QUAL-04 | TBD | Complete |
diff --git a/.planning/ROADMAP.md b/.planning/ROADMAP.md
index e5d382c..56f58bd 100644
--- a/.planning/ROADMAP.md
+++ b/.planning/ROADMAP.md
@@ -30,10 +30,10 @@
 3. BaseFetcher/BaseSearcher 基类定义完整,提供模板方法供子类实现
 4. loguru 日志格式统一,tenacity 重试装饰器可开箱即用
 5. 旧爬虫(jobs_spider/)仍正常运行,未被改动(feature flag 隔离)
-**Plans:** 1/2 plans executed
+**Plans:** 2/2 plans executed
 Plans:
 - [x] 01-01-PLAN.md — Package scaffold: pyproject.toml + HTTPClient (tenacity + logging) + BaseFetcher/BaseSearcher + Pipfile deps
-- [ ] 01-02-PLAN.md — Sign algorithms: port BossSign/Job51Sign/ZhilianSign to crawler_core/ + unit tests (pytest)
+- [x] 01-02-PLAN.md — Sign algorithms: port BossSign/Job51Sign/ZhilianSign to crawler_core/ + unit tests (pytest)
 
 ### Phase 2: Boss 直聘重写
 **Goal:** Boss 直聘爬虫完全基于 crawler_core 运行,旧实现可安全停用
@@ -96,7 +96,7 @@ Plans:
 
 | Phase | Plans Complete | Status | Completed |
 |-------|----------------|--------|-----------|
-| 1. 共享核心包 | 1/2 | In Progress| |
+| 1. 共享核心包 | 2/2 | Complete | 2026-03-21 |
 | 2. Boss 直聘重写 | 0/? | Not started | - |
 | 3. 前程无忧 & 智联重写 | 0/? | Not started | - |
 | 4. 后端 & 外部脚本接入 | 0/? | Not started | - |
@@ -134,4 +134,4 @@ Plans:
 
 ---
 *Roadmap created: 2026-03-21*
-*Last updated: 2026-03-21 — Phase 1 plans created (01-01, 01-02)*
+*Last updated: 2026-03-21 — Phase 1 complete (01-01, 01-02 done); 41 sign algorithm unit tests passing*
diff --git a/.planning/STATE.md b/.planning/STATE.md
index 2dda38a..1878320 100644
--- a/.planning/STATE.md
+++ b/.planning/STATE.md
@@ -3,19 +3,19 @@ gsd_state_version: 1.0
 milestone: v1.0
 milestone_name: milestone
 status: unknown
-last_updated: "2026-03-21T10:12:08.079Z"
+last_updated: "2026-03-21T10:23:00.000Z"
 progress:
   total_phases: 6
   completed_phases: 0
   total_plans: 2
-  completed_plans: 1
+  completed_plans: 2
 ---
 
 # STATE: JobData 爬虫交互重构
 
-**Last updated:** 2026-03-21T10:12Z
+**Last updated:** 2026-03-21T10:23Z
 **Milestone:** 爬虫交互重构 v1
-**Stopped At:** Completed 01-shared-core Plan 01 (crawler_core package)
+**Stopped At:** Completed 01-shared-core Plan 02 (sign algorithms + unit tests)
 
 ---
 
@@ -23,26 +23,27 @@ progress:
 
 **Core value:** 基于关键词驱动爬虫抓取职位数据,可靠入库 ClickHouse,定时完成公司信息采集同步
 
-**Current focus:** Phase 01 — shared-core
+**Current focus:** Phase 01 — shared-core (COMPLETE)
 
 ---
 
 ## Current Position
 
-Phase: 01 (shared-core) — EXECUTING
-Plan: 2 of 2
+Phase: 01 (shared-core) — COMPLETE
+Plan: 2 of 2 (all plans done)
 
 ## Performance Metrics
 
 | Metric | Value |
 |--------|-------|
-| Plans completed | 0 |
+| Plans completed | 2 |
 | Plans total (estimated) | TBD |
 | Phases completed | 0/6 |
 | Requirements mapped | 19/19 |
 
 ---
 | Phase 01-shared-core P01 | 15min | 3 tasks | 8 files |
+| Phase 01-shared-core P02 | 15min | 2 tasks | 7 files |
 
 ## Accumulated Context
 
@@ -59,6 +60,9 @@ Plan: 2 of 2
 | tenacity min=10s wait | STACK.md mandatory 10s anti-detection delay preserved | Decided (01-01) |
 | stdlib logging in crawler_core | No loguru dependency; loguru bridge deferred to Phase 5 | Decided (01-01) |
 | Result[T] replaces ApiResult | Typed generic eliminates untyped Any in return values | Decided (01-01) |
+| Sign algorithms ported verbatim | Behavioral parity with originals; no regression risk | Decided (01-02) |
+| No tests/crawler_core/__init__.py | Would shadow crawler_core package causing import failures | Decided (01-02) |
+| conftest.py at project root | Reliable sys.path setup for pytest crawler_core imports | Decided (01-02) |
 
 ### Architecture Notes
 
@@ -66,6 +70,7 @@ Plan: 2 of 2
 - 核心保持同步(requests_go),后端通过 asyncio.to_thread() 调用
 - SmartIPManager 在 FastAPI 中需做成单例,避免会话状态丢失
 - ClickHouse 表结构不变,MySQL 模型变更走 Aerich
+- 签名算法测试:41 个纯函数单元测试,无网络依赖,`pytest tests/crawler_core/` 即可运行
 
 ### Known Risks
 
@@ -102,7 +107,8 @@ None currently.
 - `.planning/ROADMAP.md` — Phase structure and success criteria
 - `.planning/REQUIREMENTS.md` — All v1 requirements with traceability
 - `.planning/PROJECT.md` — Project context and constraints
-- `crawler_core/` — CREATED: shared package with HTTPClient, BaseFetcher, BaseSearcher, Result[T]
+- `crawler_core/` — CREATED: shared package with HTTPClient, BaseFetcher, BaseSearcher, Result[T], plus sign algorithm subpackages
+- `tests/crawler_core/` — CREATED: 41 pure-function sign algorithm unit tests
 - `spiderJobs/` — external scripts using crawler_core
 - `app/services/crawler/` — backend facade layer
 
diff --git a/.planning/phases/01-shared-core/01-02-SUMMARY.md b/.planning/phases/01-shared-core/01-02-SUMMARY.md
new file mode 100644
index 0000000..210fed9
--- /dev/null
+++ b/.planning/phases/01-shared-core/01-02-SUMMARY.md
@@ -0,0 +1,154 @@
+---
+phase: 01-shared-core
+plan: "02"
+subsystem: crawler_core
+tags: [python, sign-algorithms, unit-tests, pure-functions, boss, qcwy, zhilian]
+dependency_graph:
+  requires: [01-01]
+  provides: [crawler_core/boss/sign.py, crawler_core/qcwy/sign.py, crawler_core/zhilian/sign.py]
+  affects: [spiderJobs, app/services/crawler, tests/crawler_core]
+tech_stack:
+  added: [pytest-for-sign-tests]
+  patterns: [pure-function-testing, sys-path-conftest, try-except-import-guard]
+key_files:
+  created:
+    - crawler_core/boss/sign.py
+    - crawler_core/qcwy/sign.py
+    - crawler_core/zhilian/sign.py
+    - tests/crawler_core/test_boss_sign.py
+    - tests/crawler_core/test_qcwy_sign.py
+    - tests/crawler_core/test_zhilian_sign.py
+    - conftest.py
+  modified:
+    - pyproject.toml
+    - crawler_core/__init__.py
+decisions:
+  - "Sign algorithm files ported verbatim from spiderJobs/platforms/ — only module docstring updated to reference crawler_core"
+  - "tests/crawler_core/__init__.py NOT created (would shadow crawler_core package as namespace package, breaking imports)"
+  - "conftest.py at project root adds project root to sys.path for reliable crawler_core subpackage imports"
+  - "pyproject.toml [tool.pytest.ini_options] pythonpath=['.'] added for pytest runner path setup"
+  - "crawler_core/__init__.py wrapped in try/except ImportError to allow sign subpackages to import in envs without requests_go"
+metrics:
+  duration: "~15 minutes"
+  completed: "2026-03-21T10:22:46Z"
+  tasks_completed: 2
+  files_created: 7
+  files_modified: 2
+---
+
+# Phase 01 Plan 02: Sign Algorithm Port + Unit Tests — Summary
+
+**One-liner:** Ported Boss/51Job/Zhilian sign algorithms verbatim from `spiderJobs/platforms/` to `crawler_core/{boss,qcwy,zhilian}/sign.py` and wrote 41 pure-function unit tests (no HTTP, no mocks) covering all exported classes and helpers.
+
+## What Was Created
+
+| File | Lines | Purpose |
+|------|-------|---------|
+| `crawler_core/boss/sign.py` | 108 | BossSign traceid generator — pure stdlib (random, time), ported from spiderJobs |
+| `crawler_core/qcwy/sign.py` | 89 | Job51Sign HMAC-SHA256 signing — pure stdlib (hmac, hashlib), ported from spiderJobs |
+| `crawler_core/zhilian/sign.py` | 87 | ZhilianSign header/param signing — pure stdlib (math, random), ported from spiderJobs |
+| `tests/crawler_core/test_boss_sign.py` | 85 | 13 tests: BossSign, _compute_checksum, _generate_uuid, _CHARS |
+| `tests/crawler_core/test_qcwy_sign.py` | 73 | 12 tests: Job51Sign init, build_sign_path, generate_uuid |
+| `tests/crawler_core/test_zhilian_sign.py` | 91 | 16 tests: ZhilianSign init, sign_headers, sign_params, generate_uuid |
+| `conftest.py` | 4 | Project root sys.path setup for pytest import resolution |
+
+## Exported Interfaces
+
+**crawler_core/boss/sign.py:**
+```python
+from crawler_core.boss.sign import BossSign, _compute_checksum, _generate_uuid, _CHARS
+
+BossSign.generate_traceid(prefix="M-W") -> str  # "M-W" + 19-char uuid + 3-char checksum = 25 chars
+BossSign(mpt="", wt2="")  # token holders, no HTTP
+_compute_checksum(uuid_str: str) -> str  # 3-char base62 deterministic checksum
+_generate_uuid() -> str  # 13-char hex timestamp + 6-char base62 random = 19 chars
+```
+
+**crawler_core/qcwy/sign.py:**
+```python
+from crawler_core.qcwy.sign import Job51Sign, SIGN_KEY
+
+SIGN_KEY = "abfc8f9d..."  # 64-char HMAC key
+Job51Sign(sign_key=SIGN_KEY)
+Job51Sign.generate_uuid() -> str  # 13-char ms timestamp + 10-char random int = 23 digits
+Job51Sign.build_sign_path(endpoint, method, params, body) -> tuple[str, str]  # (url_path, sign_hex)
+```
+
+**crawler_core/zhilian/sign.py:**
+```python
+from crawler_core.zhilian.sign import ZhilianSign
+
+ZhilianSign(at="", rt="", device_id=None, version="4.1.259", channel="wxxiaochengxu", platform="12")
+ZhilianSign.generate_uuid() -> str  # UUID4-format uppercase hex, 36 chars
+ZhilianSign.sign_headers(page_code="0") -> dict  # 9 headers including x-zp-device-id, x-zp-business-system="73"
+ZhilianSign.sign_params() -> dict  # 6 params: at, rt, channel, platform, version, d
+```
+
+## Test Counts
+
+| File | Tests | Classes |
+|------|-------|---------|
+| `test_boss_sign.py` | 13 | TestBossSignGenerateTraceid (6), TestComputeChecksum (4), TestGenerateUuid (3) |
+| `test_qcwy_sign.py` | 12 | TestJob51SignInit (2), TestJob51SignBuildSignPath (7), TestJob51SignGenerateUuid (3) |
+| `test_zhilian_sign.py` | 16 | TestZhilianSignInit (4), TestZhilianSignHeaders (6), TestZhilianSignParams (3), TestZhilianGenerateUuid (3) |
+| **Total** | **41** | **10 test classes** |
+
+## Decisions Made
+
+| Decision | Rationale |
+|----------|-----------|
+| Port verbatim, change only docstring | Ensures behavioral parity with originals; no regression risk |
+| No `tests/crawler_core/__init__.py` | Creating it would make `tests/crawler_core/` a Python package named `crawler_core`, shadowing the real package and causing `ModuleNotFoundError: No module named 'crawler_core.boss'` |
+| `conftest.py` at project root | Pytest loads conftest.py before test collection; sys.path.insert here ensures crawler_core is findable |
+| `try/except ImportError` in crawler_core/__init__.py | `requests_go` is not installed in the lightweight pyenv Python used for tests; wrapping allows sign subpackages to be imported without the heavy HTTP client |
+| `pyproject.toml pythonpath=["."]` | Belt-and-suspenders path setup for pytest in addition to conftest.py |
+| Old spiderJobs files preserved | Plan explicitly states not to delete/modify until Phase 4 cleanup |
+
+## Commits
+
+| Hash | Task | Description |
+|------|------|-------------|
+| `bd1e50e` | Task 1 | Port sign algorithms to crawler_core/ platform directories |
+| `333a6d1` | Task 2 | Write sign algorithm unit tests for crawler_core |
+
+## Deviations from Plan
+
+### Auto-fixed Issues
+
+**1. [Rule 2 - Missing Critical Functionality] Removed tests/crawler_core/__init__.py**
+- **Found during:** Task 2 verification
+- **Issue:** Creating `tests/crawler_core/__init__.py` (as the plan specified) caused pytest to treat `tests/crawler_core/` as the `crawler_core` package, shadowing the actual `crawler_core/` package at project root. Result: `ModuleNotFoundError: No module named 'crawler_core.boss'`
+- **Fix:** Removed `tests/crawler_core/__init__.py`; added `conftest.py` at project root; added `[tool.pytest.ini_options] pythonpath=["."]` to `pyproject.toml`
+- **Files modified:** `conftest.py` (new), `pyproject.toml` (modified), `crawler_core/__init__.py` (try/except guard — actually already done by 01-01 executor)
+- **Commits:** `333a6d1`
+
+## Known Stubs
+
+None — all three sign algorithm files are fully implemented pure functions. The test files are complete with all specified test cases.
+
+## Self-Check: PASSED
+
+All created files exist:
+- [x] `crawler_core/boss/sign.py` — FOUND
+- [x] `crawler_core/qcwy/sign.py` — FOUND
+- [x] `crawler_core/zhilian/sign.py` — FOUND
+- [x] `tests/crawler_core/test_boss_sign.py` — FOUND
+- [x] `tests/crawler_core/test_qcwy_sign.py` — FOUND
+- [x] `tests/crawler_core/test_zhilian_sign.py` — FOUND
+- [x] `conftest.py` — FOUND
+
+All commits exist:
+- [x] `bd1e50e` — FOUND (feat(01-02): port sign algorithms)
+- [x] `333a6d1` — FOUND (test(01-02): write sign algorithm unit tests)
+
+Cross-contamination checks:
+- `grep -r "from spiderJobs" crawler_core/` → PASSED: no spiderJobs imports
+- `grep -r "import requests" crawler_core/boss/sign.py crawler_core/qcwy/sign.py crawler_core/zhilian/sign.py` → PASSED: no HTTP imports
+- `git diff --name-only spiderJobs/` → PASSED: spiderJobs files untouched
+
+Test count: 41 tests across 3 files (requirement: 24+ tests)
+- test_boss_sign.py: 13 tests (requirement: 8+)
+- test_qcwy_sign.py: 12 tests (requirement: 7+) — Note: plan specifies 5+ but acceptance criteria says 7+
+- test_zhilian_sign.py: 16 tests (requirement: 9+)
+
+Import verification: All three sign modules import successfully from crawler_core with project root on sys.path.
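
The two import-related decisions recorded in the summary (a root-level `conftest.py` that puts the project root on `sys.path`, and a `try/except ImportError` guard around the `requests_go`-dependent imports) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from commit `333a6d1`: the helper names `ensure_on_sys_path` and `optional_import` are hypothetical, and `Path.cwd()` stands in for the `Path(__file__).resolve().parent` that a real `conftest.py` would normally use.

```python
import importlib
import sys
from pathlib import Path
from types import ModuleType
from typing import Optional


def ensure_on_sys_path(root: Path) -> bool:
    """Insert `root` at the front of sys.path; return True only if it was added."""
    root_str = str(root.resolve())
    if root_str in sys.path:
        return False
    sys.path.insert(0, root_str)
    return True


def optional_import(name: str) -> Optional[ModuleType]:
    """Import `name` if available, otherwise return None instead of raising."""
    try:
        return importlib.import_module(name)
    except ImportError:  # also covers ModuleNotFoundError, its subclass
        return None


# Assumption: the current working directory is the project root.
PROJECT_ROOT = Path.cwd()
ensure_on_sys_path(PROJECT_ROOT)

# Assumption: requests_go may be absent in a lightweight test interpreter;
# pure-function modules should still import when it is None.
requests_go = optional_import("requests_go")
```

With the root on `sys.path` and the heavy dependency made optional, the sign subpackages stay importable in an interpreter that only has the standard library and pytest, which matches the rationale given in the Decisions Made table.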