win d7c8bec287 docs(01-01): complete crawler_core package plan — SUMMARY, STATE, ROADMAP updates

- Create 01-01-SUMMARY.md with implementation details and interface contracts
- STATE.md: advance to plan 2, record metrics, add decisions from plan 01
- ROADMAP.md: update phase 1 plan progress (1/2 plans complete)
- REQUIREMENTS.md: mark ARCH-01, ARCH-02, QUAL-04, QUAL-05 complete
- crawler_core/__init__.py: preserve linter-added try/except ImportError guard

2026-03-21 18:14:19 +08:00

5.9 KiB

Raw Blame History

phase, plan, subsystem, tags, dependency_graph, tech_stack, key_files, decisions, metrics

phase

plan

subsystem

Phase 01 Plan 01: crawler_core Shared Package — Summary

One-liner: Created crawler_core installable Python package with TLS-fingerprinted HTTPClient (tenacity retry, min=10s wait), generic Result[T] dataclass, and BaseFetcher/BaseSearcher template-method base classes using stdlib logging only.

What Was Created

File	Lines	Purpose
`crawler_core/pyproject.toml`	17	Package metadata for `pip install -e ./crawler_core`
`crawler_core/__init__.py`	19	Public API: exports BaseFetcher, BaseSearcher, Result, HTTPClient, parse_response
`crawler_core/http_client.py`	191	HTTPClient with Chrome TLS fingerprint, tenacity retry (3x, min=10s), stdlib logging
`crawler_core/base.py`	207	Result[T] generic dataclass, parse_response, BaseFetcher, BaseSearcher
`crawler_core/boss/__init__.py`	1	Platform namespace marker
`crawler_core/qcwy/__init__.py`	1	Platform namespace marker
`crawler_core/zhilian/__init__.py`	1	Platform namespace marker
`Pipfile`	+5 lines	Added requests_go, tenacity to [packages]; pytest, pytest-cov, pytest-anyio to [dev-packages]

Interface Contracts

Public exports (from crawler_core import ...):

Result — generic dataclass with fields: success, status_code, data, list, count, is_end_page, error
BaseFetcher — template-method base: _build_params() and _parse() required; _build_headers() and _check_blocked() optional
BaseSearcher — template-method base with load_all() paginator; same 4 template methods
HTTPClient — post(path, body, headers) and get(path, params, headers) with retry
parse_response(http_code, raw) — default response parser returning Result[Any]

Result[T] fields:

success: bool
status_code: int
data: Optional[T] = None
list: list[T] = field(default_factory=list)
count: int = 0
is_end_page: bool = True
error: Optional[str] = None

tenacity retry config (identical on post and get):

stop_after_attempt(3) — max 3 attempts
wait_random_exponential(multiplier=1, min=10, max=30) — anti-detection 10-30s wait
retry_if_exception_type((ConnectionError, TimeoutError, OSError))
reraise=True — propagates after all retries exhausted

Decisions Made

Decision	Rationale
`pyproject.toml` uses `where=[".."]`	Allows `pip install -e ./crawler_core` from repo root to find `crawler_core` package
tenacity `min=10` preserved	STACK.md mandates minimum 10s anti-detection delay between retries
stdlib `logging` only	D-03 decision: no loguru in crawler_core; loguru bridge deferred to Phase 5
`Result[T]` replaces `ApiResult`	D-07: typed generic eliminates untyped `Any` in return values
4 template methods per base class	D-06: `_build_params`/`_parse` required, `_build_headers`/`_check_blocked` optional
`http_client` attribute name (not `_http`)	Consistent public attribute naming for subclass access

Commits

Hash	Task	Description
`4932177`	Task 1	Create crawler_core package scaffold and pyproject.toml
`ceb359d`	Task 2	Create crawler_core/http_client.py with tenacity retry and stdlib logging
`04d6303`	Task 3	Create crawler_core/base.py with Result[T] and crawler_core/init.py

Deviations from Plan

None — plan executed exactly as written.

The only minor adjustment: removed a docstring reference to ApiResult in Result's docstring (the plan's acceptance criteria explicitly requires grep "ApiResult" crawler_core/base.py to return empty).

Known Stubs

None — all files are fully implemented. The platform namespace __init__.py files are intentional empty stubs (just docstrings); they are namespace markers designed to be populated in Phase 2/3.

Self-Check: PASSED

All created files exist:

crawler_core/pyproject.toml — FOUND
crawler_core/__init__.py — FOUND
crawler_core/http_client.py — FOUND
crawler_core/base.py — FOUND
crawler_core/boss/__init__.py — FOUND
crawler_core/qcwy/__init__.py — FOUND
crawler_core/zhilian/__init__.py — FOUND

All commits exist:

4932177 — FOUND
ceb359d — FOUND
04d6303 — FOUND

Full import verification passed:

from crawler_core import BaseFetcher, BaseSearcher, Result, HTTPClient, parse_response
→ All verification checks passed

Cross-contamination checks:

grep -r "from spiderJobs" crawler_core/ → OK: no spiderJobs imports
grep -r "from app" crawler_core/ → OK: no app imports
grep -rn "import loguru|from loguru" crawler_core/ → OK: no loguru imports
grep -r "ApiResult" crawler_core/ → OK: ApiResult fully replaced by Result[T]

5.9 KiB Raw Blame History