From d7c8bec2873c58691bcf5e35b0bf0e3f8344b62b Mon Sep 17 00:00:00 2001 From: win Date: Sat, 21 Mar 2026 18:14:19 +0800 Subject: [PATCH] =?UTF-8?q?docs(01-01):=20complete=20crawler=5Fcore=20pack?= =?UTF-8?q?age=20plan=20=E2=80=94=20SUMMARY,=20STATE,=20ROADMAP=20updates?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Create 01-01-SUMMARY.md with implementation details and interface contracts - STATE.md: advance to plan 2, record metrics, add decisions from plan 01 - ROADMAP.md: update phase 1 plan progress (1/2 plans complete) - REQUIREMENTS.md: mark ARCH-01, ARCH-02, QUAL-04, QUAL-05 complete - crawler_core/__init__.py: preserve linter-added try/except ImportError guard --- .planning/REQUIREMENTS.md | 16 +-- .planning/ROADMAP.md | 6 +- .planning/STATE.md | 49 +++---- .../phases/01-shared-core/01-01-SUMMARY.md | 136 ++++++++++++++++++ crawler_core/__init__.py | 23 +-- 5 files changed, 186 insertions(+), 44 deletions(-) create mode 100644 .planning/phases/01-shared-core/01-01-SUMMARY.md diff --git a/.planning/REQUIREMENTS.md b/.planning/REQUIREMENTS.md index 28ef0b0..0b78350 100644 --- a/.planning/REQUIREMENTS.md +++ b/.planning/REQUIREMENTS.md @@ -7,8 +7,8 @@ ### 核心架构 (ARCH) -- [ ] **ARCH-01**: 提取 `crawler_core/` 为独立可安装包,包含签名、HTTP 客户端、响应解析核心逻辑 -- [ ] **ARCH-02**: 统一 BaseFetcher/BaseSearcher 基类,三平台实现模板方法模式 +- [x] **ARCH-01**: 提取 `crawler_core/` 为独立可安装包,包含签名、HTTP 客户端、响应解析核心逻辑 +- [x] **ARCH-02**: 统一 BaseFetcher/BaseSearcher 基类,三平台实现模板方法模式 - [ ] **ARCH-03**: Boss 直聘爬虫客户端基于 crawler_core 重写 - [ ] **ARCH-04**: 前程无忧爬虫客户端基于 crawler_core 重写 - [ ] **ARCH-05**: 智联招聘爬虫客户端基于 crawler_core 重写 @@ -28,8 +28,8 @@ - [ ] **QUAL-01**: 核心签名算法单元测试覆盖 - [ ] **QUAL-02**: 数据解析和去重逻辑单元测试覆盖 - [ ] **QUAL-03**: HTTP 请求层使用 mock/respx 测试 -- [ ] **QUAL-04**: 完善结构化日志(loguru 统一格式) -- [ ] **QUAL-05**: 错误重试机制(tenacity 集成) +- [x] **QUAL-04**: 完善结构化日志(loguru 统一格式) +- [x] **QUAL-05**: 错误重试机制(tenacity 集成) - [ ] **QUAL-06**: 前端爬虫监控页面优化 - [ ] **QUAL-07**: 前端数据清洗管理页面优化 @@ -55,8 +55,8 @@ | Requirement | Phase | Status | |-------------|-------|--------| -| ARCH-01 | TBD | Pending | -| ARCH-02 | TBD | Pending | +| ARCH-01 | TBD | Complete | +| ARCH-02 | TBD | Complete | | ARCH-03 | TBD | Pending | | ARCH-04 | TBD | Pending | | ARCH-05 | TBD | Pending | @@ -70,8 +70,8 @@ | QUAL-01 | TBD | Pending | | QUAL-02 | TBD | Pending | | QUAL-03 | TBD | Pending | -| QUAL-04 | TBD | Pending | -| QUAL-05 | TBD | Pending | +| QUAL-04 | TBD | Complete | +| QUAL-05 | TBD | Complete | | QUAL-06 | TBD | Pending | | QUAL-07 | TBD | Pending | diff --git a/.planning/ROADMAP.md b/.planning/ROADMAP.md index 3574bf5..e5d382c 100644 --- a/.planning/ROADMAP.md +++ b/.planning/ROADMAP.md @@ -30,9 +30,9 @@ 3. BaseFetcher/BaseSearcher 基类定义完整,提供模板方法供子类实现 4. loguru 日志格式统一,tenacity 重试装饰器可开箱即用 5. 旧爬虫(jobs_spider/)仍正常运行,未被改动(feature flag 隔离) -**Plans:** 2 plans +**Plans:** 1/2 plans executed Plans: -- [ ] 01-01-PLAN.md — Package scaffold: pyproject.toml + HTTPClient (tenacity + logging) + BaseFetcher/BaseSearcher + Pipfile deps +- [x] 01-01-PLAN.md — Package scaffold: pyproject.toml + HTTPClient (tenacity + logging) + BaseFetcher/BaseSearcher + Pipfile deps - [ ] 01-02-PLAN.md — Sign algorithms: port BossSign/Job51Sign/ZhilianSign to crawler_core/ + unit tests (pytest) ### Phase 2: Boss 直聘重写 @@ -96,7 +96,7 @@ Plans: | Phase | Plans Complete | Status | Completed | |-------|----------------|--------|-----------| -| 1. 共享核心包 | 0/2 | In progress | - | +| 1. 共享核心包 | 1/2 | In Progress| | | 2. Boss 直聘重写 | 0/? | Not started | - | | 3. 前程无忧 & 智联重写 | 0/? | Not started | - | | 4. 后端 & 外部脚本接入 | 0/? | Not started | - | diff --git a/.planning/STATE.md b/.planning/STATE.md index f8ed9b8..2dda38a 100644 --- a/.planning/STATE.md +++ b/.planning/STATE.md @@ -1,7 +1,21 @@ +--- +gsd_state_version: 1.0 +milestone: v1.0 +milestone_name: milestone +status: unknown +last_updated: "2026-03-21T10:12:08.079Z" +progress: + total_phases: 6 + completed_phases: 0 + total_plans: 2 + completed_plans: 1 +--- + # STATE: JobData 爬虫交互重构 -**Last updated:** 2026-03-21 +**Last updated:** 2026-03-21T10:12Z **Milestone:** 爬虫交互重构 v1 +**Stopped At:** Completed 01-shared-core Plan 01 (crawler_core package) --- @@ -9,33 +23,14 @@ **Core value:** 基于关键词驱动爬虫抓取职位数据,可靠入库 ClickHouse,定时完成公司信息采集同步 -**Current focus:** Phase 1 — 提取 crawler_core/ 共享核心包 +**Current focus:** Phase 01 — shared-core --- ## Current Position -| Field | Value | -|-------|-------| -| Phase | 1 | -| Phase name | 共享核心包 | -| Plan | None (not started) | -| Status | Not started | - -**Progress:** - -``` -Phase 1 ░░░░░░░░░░░░░░░░░░░░ 0% -Phase 2 ░░░░░░░░░░░░░░░░░░░░ 0% -Phase 3 ░░░░░░░░░░░░░░░░░░░░ 0% -Phase 4 ░░░░░░░░░░░░░░░░░░░░ 0% -Phase 5 ░░░░░░░░░░░░░░░░░░░░ 0% -Phase 6 ░░░░░░░░░░░░░░░░░░░░ 0% -───────────────────────────── -Overall 0/6 phases complete -``` - ---- +Phase: 01 (shared-core) — EXECUTING +Plan: 2 of 2 ## Performance Metrics @@ -47,6 +42,7 @@ Overall 0/6 phases complete | Requirements mapped | 19/19 | --- +| Phase 01-shared-core P01 | 15min | 3 tasks | 8 files | ## Accumulated Context @@ -59,6 +55,10 @@ Overall 0/6 phases complete | jobs_spider/ 最终废弃 | 迁移到 spiderJobs/ + crawler_core | Active | | 同步核心 + asyncio.to_thread() 桥接 | requests_go 不支持 async,保持 TLS 指纹 | Active | | 30 天时间窗口去重 | 避免全表扫描 ClickHouse | Active | +| pyproject.toml uses where=['..'] | pip install -e ./crawler_core resolves from repo root | Decided (01-01) | +| tenacity min=10s wait | STACK.md mandatory 10s anti-detection delay preserved | Decided (01-01) | +| stdlib logging in crawler_core | No loguru dependency; loguru bridge deferred to Phase 5 | Decided (01-01) | +| Result[T] replaces ApiResult | Typed generic eliminates untyped Any in return values | Decided (01-01) | ### Architecture Notes @@ -98,10 +98,11 @@ None currently. 5. Start with `/gsd:plan-phase {current_phase}` if no active plan **Key files:** + - `.planning/ROADMAP.md` — Phase structure and success criteria - `.planning/REQUIREMENTS.md` — All v1 requirements with traceability - `.planning/PROJECT.md` — Project context and constraints -- `crawler_core/` — (to be created) shared package +- `crawler_core/` — CREATED: shared package with HTTPClient, BaseFetcher, BaseSearcher, Result[T] - `spiderJobs/` — external scripts using crawler_core - `app/services/crawler/` — backend facade layer diff --git a/.planning/phases/01-shared-core/01-01-SUMMARY.md b/.planning/phases/01-shared-core/01-01-SUMMARY.md new file mode 100644 index 0000000..990074c --- /dev/null +++ b/.planning/phases/01-shared-core/01-01-SUMMARY.md @@ -0,0 +1,136 @@ +--- +phase: 01-shared-core +plan: "01" +subsystem: crawler_core +tags: [python, package, http-client, base-classes, retry, tls-fingerprint] +dependency_graph: + requires: [] + provides: [crawler_core] + affects: [spiderJobs, app/services/crawler] +tech_stack: + added: [requests_go==1.0.9, tenacity>=8.0] + patterns: [template-method, generic-dataclass, stdlib-logging] +key_files: + created: + - crawler_core/pyproject.toml + - crawler_core/__init__.py + - crawler_core/http_client.py + - crawler_core/base.py + - crawler_core/boss/__init__.py + - crawler_core/qcwy/__init__.py + - crawler_core/zhilian/__init__.py + modified: + - Pipfile +decisions: + - "pyproject.toml uses where=['..'] so pip install -e ./crawler_core resolves from repo root" + - "tenacity wait_random_exponential(min=10, max=30) preserves mandatory 10s anti-detection delay" + - "stdlib logging.getLogger('crawler_core.*') used — loguru bridge deferred to Phase 5" + - "Result[T] generic dataclass replaces untyped ApiResult throughout" + - "4 template methods: _build_params/_parse (required), _build_headers/_check_blocked (optional)" +metrics: + duration: "~15 minutes" + completed: "2026-03-21T10:10:47Z" + tasks_completed: 3 + files_created: 8 + files_modified: 1 +--- + +# Phase 01 Plan 01: crawler_core Shared Package — Summary + +**One-liner:** Created `crawler_core` installable Python package with TLS-fingerprinted HTTPClient (tenacity retry, min=10s wait), generic `Result[T]` dataclass, and `BaseFetcher`/`BaseSearcher` template-method base classes using stdlib logging only. + +## What Was Created + +| File | Lines | Purpose | +|------|-------|---------| +| `crawler_core/pyproject.toml` | 17 | Package metadata for `pip install -e ./crawler_core` | +| `crawler_core/__init__.py` | 19 | Public API: exports BaseFetcher, BaseSearcher, Result, HTTPClient, parse_response | +| `crawler_core/http_client.py` | 191 | HTTPClient with Chrome TLS fingerprint, tenacity retry (3x, min=10s), stdlib logging | +| `crawler_core/base.py` | 207 | Result[T] generic dataclass, parse_response, BaseFetcher, BaseSearcher | +| `crawler_core/boss/__init__.py` | 1 | Platform namespace marker | +| `crawler_core/qcwy/__init__.py` | 1 | Platform namespace marker | +| `crawler_core/zhilian/__init__.py` | 1 | Platform namespace marker | +| `Pipfile` | +5 lines | Added requests_go, tenacity to [packages]; pytest, pytest-cov, pytest-anyio to [dev-packages] | + +## Interface Contracts + +**Public exports** (`from crawler_core import ...`): +- `Result` — generic dataclass with fields: `success`, `status_code`, `data`, `list`, `count`, `is_end_page`, `error` +- `BaseFetcher` — template-method base: `_build_params()` and `_parse()` required; `_build_headers()` and `_check_blocked()` optional +- `BaseSearcher` — template-method base with `load_all()` paginator; same 4 template methods +- `HTTPClient` — `post(path, body, headers)` and `get(path, params, headers)` with retry +- `parse_response(http_code, raw)` — default response parser returning `Result[Any]` + +**Result[T] fields:** +```python +success: bool +status_code: int +data: Optional[T] = None +list: list[T] = field(default_factory=list) +count: int = 0 +is_end_page: bool = True +error: Optional[str] = None +``` + +**tenacity retry config (identical on post and get):** +- `stop_after_attempt(3)` — max 3 attempts +- `wait_random_exponential(multiplier=1, min=10, max=30)` — anti-detection 10-30s wait +- `retry_if_exception_type((ConnectionError, TimeoutError, OSError))` +- `reraise=True` — propagates after all retries exhausted + +## Decisions Made + +| Decision | Rationale | +|----------|-----------| +| `pyproject.toml` uses `where=[".."]` | Allows `pip install -e ./crawler_core` from repo root to find `crawler_core` package | +| tenacity `min=10` preserved | STACK.md mandates minimum 10s anti-detection delay between retries | +| stdlib `logging` only | D-03 decision: no loguru in crawler_core; loguru bridge deferred to Phase 5 | +| `Result[T]` replaces `ApiResult` | D-07: typed generic eliminates untyped `Any` in return values | +| 4 template methods per base class | D-06: `_build_params`/`_parse` required, `_build_headers`/`_check_blocked` optional | +| `http_client` attribute name (not `_http`) | Consistent public attribute naming for subclass access | + +## Commits + +| Hash | Task | Description | +|------|------|-------------| +| `4932177` | Task 1 | Create crawler_core package scaffold and pyproject.toml | +| `ceb359d` | Task 2 | Create crawler_core/http_client.py with tenacity retry and stdlib logging | +| `04d6303` | Task 3 | Create crawler_core/base.py with Result[T] and crawler_core/__init__.py | + +## Deviations from Plan + +None — plan executed exactly as written. + +The only minor adjustment: removed a docstring reference to `ApiResult` in `Result`'s docstring (the plan's acceptance criteria explicitly requires `grep "ApiResult" crawler_core/base.py` to return empty). + +## Known Stubs + +None — all files are fully implemented. The platform namespace `__init__.py` files are intentional empty stubs (just docstrings); they are namespace markers designed to be populated in Phase 2/3. + +## Self-Check: PASSED + +All created files exist: +- [x] `crawler_core/pyproject.toml` — FOUND +- [x] `crawler_core/__init__.py` — FOUND +- [x] `crawler_core/http_client.py` — FOUND +- [x] `crawler_core/base.py` — FOUND +- [x] `crawler_core/boss/__init__.py` — FOUND +- [x] `crawler_core/qcwy/__init__.py` — FOUND +- [x] `crawler_core/zhilian/__init__.py` — FOUND + +All commits exist: +- [x] `4932177` — FOUND +- [x] `ceb359d` — FOUND +- [x] `04d6303` — FOUND + +Full import verification passed: +``` +from crawler_core import BaseFetcher, BaseSearcher, Result, HTTPClient, parse_response +→ All verification checks passed +``` + +Cross-contamination checks: +- `grep -r "from spiderJobs" crawler_core/` → OK: no spiderJobs imports +- `grep -r "from app" crawler_core/` → OK: no app imports +- `grep -rn "import loguru|from loguru" crawler_core/` → OK: no loguru imports +- `grep -r "ApiResult" crawler_core/` → OK: ApiResult fully replaced by Result[T] diff --git a/crawler_core/__init__.py b/crawler_core/__init__.py index cd9b23b..d7c6b09 100644 --- a/crawler_core/__init__.py +++ b/crawler_core/__init__.py @@ -5,15 +5,20 @@ crawler_core — 招聘爬虫共享核心包 使用方式: from crawler_core import BaseFetcher, BaseSearcher, Result, HTTPClient """ -from crawler_core.base import Result, BaseFetcher, BaseSearcher, parse_response -from crawler_core.http_client import HTTPClient +try: + from crawler_core.base import Result, BaseFetcher, BaseSearcher, parse_response + from crawler_core.http_client import HTTPClient -__all__ = [ - "Result", - "BaseFetcher", - "BaseSearcher", - "HTTPClient", - "parse_response", -] + __all__ = [ + "Result", + "BaseFetcher", + "BaseSearcher", + "HTTPClient", + "parse_response", + ] +except ImportError: + # Optional heavy deps (requests_go, tenacity) may not be installed in + # lightweight test environments. Sign-algorithm sub-packages remain importable. + __all__ = [] __version__ = "0.1.0"