win d7c8bec287 docs(01-01): complete crawler_core package plan — SUMMARY, STATE, ROADMAP updates
- Create 01-01-SUMMARY.md with implementation details and interface contracts
- STATE.md: advance to plan 2, record metrics, add decisions from plan 01
- ROADMAP.md: update phase 1 plan progress (1/2 plans complete)
- REQUIREMENTS.md: mark ARCH-01, ARCH-02, QUAL-04, QUAL-05 complete
- crawler_core/__init__.py: preserve linter-added try/except ImportError guard
2026-03-21 18:14:19 +08:00


phase: 01-shared-core
plan: 01
subsystem: crawler_core
tags: python, package, http-client, base-classes, retry, tls-fingerprint
dependency_graph (requires / provides / affects):
  crawler_core
  spiderJobs
  app/services/crawler
tech_stack:
  added: requests_go==1.0.9, tenacity>=8.0
  patterns: template-method, generic-dataclass, stdlib-logging
key_files:
  created:
    crawler_core/pyproject.toml
    crawler_core/__init__.py
    crawler_core/http_client.py
    crawler_core/base.py
    crawler_core/boss/__init__.py
    crawler_core/qcwy/__init__.py
    crawler_core/zhilian/__init__.py
  modified:
    Pipfile
decisions:
  • pyproject.toml uses where=['..'] so pip install -e ./crawler_core resolves from repo root
  • tenacity wait_random_exponential(min=10, max=30) preserves mandatory 10s anti-detection delay
  • stdlib logging.getLogger('crawler_core.*') used; loguru bridge deferred to Phase 5
  • Result[T] generic dataclass replaces untyped ApiResult throughout
  • 4 template methods: _build_params/_parse (required), _build_headers/_check_blocked (optional)
metrics:
  duration: ~15 minutes
  completed: 2026-03-21T10:10:47Z
  tasks_completed: 3
  files_created: 8
  files_modified: 1

Phase 01 Plan 01: crawler_core Shared Package — Summary

One-liner: Created crawler_core installable Python package with TLS-fingerprinted HTTPClient (tenacity retry, min=10s wait), generic Result[T] dataclass, and BaseFetcher/BaseSearcher template-method base classes using stdlib logging only.

What Was Created

File | Lines | Purpose
crawler_core/pyproject.toml | 17 | Package metadata for pip install -e ./crawler_core
crawler_core/__init__.py | 19 | Public API: exports BaseFetcher, BaseSearcher, Result, HTTPClient, parse_response
crawler_core/http_client.py | 191 | HTTPClient with Chrome TLS fingerprint, tenacity retry (3x, min=10s), stdlib logging
crawler_core/base.py | 207 | Result[T] generic dataclass, parse_response, BaseFetcher, BaseSearcher
crawler_core/boss/__init__.py | 1 | Platform namespace marker
crawler_core/qcwy/__init__.py | 1 | Platform namespace marker
crawler_core/zhilian/__init__.py | 1 | Platform namespace marker
Pipfile | +5 lines | Added requests_go, tenacity to [packages]; pytest, pytest-cov, pytest-anyio to [dev-packages]

Interface Contracts

Public exports (from crawler_core import ...):

  • Result: generic dataclass with fields success, status_code, data, list, count, is_end_page, error
  • BaseFetcher: template-method base; _build_params() and _parse() are required, _build_headers() and _check_blocked() are optional
  • BaseSearcher: template-method base with a load_all() paginator; same 4 template methods
  • HTTPClient: post(path, body, headers) and get(path, params, headers), both with retry
  • parse_response(http_code, raw): default response parser returning Result[Any]
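
The template-method contract above can be sketched roughly as follows. This is a minimal illustration, not the code in crawler_core/base.py: the class names SketchFetcher, ItemFetcher and FakeHTTPClient, the public fetch() method, and the (status, body) return shape of get() are all assumptions made for the example. Only the four hook names and the http_client attribute name come from this summary.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Optional, Tuple


class FakeHTTPClient:
    """Hypothetical stand-in for HTTPClient; returns a canned (status, body) pair."""

    def get(self, path: str, params: Optional[dict] = None,
            headers: Optional[dict] = None) -> Tuple[int, Dict[str, Any]]:
        return 200, {"path": path, "params": params, "items": [1, 2, 3]}


class SketchFetcher(ABC):
    """Template-method base: fetch() fixes the call order, subclasses fill in hooks."""

    def __init__(self, http_client: Any) -> None:
        self.http_client = http_client  # public attribute name, as in this summary

    def fetch(self, path: str, **kwargs: Any) -> Any:  # `fetch` is an assumed name
        params = self._build_params(**kwargs)
        headers = self._build_headers()
        status, raw = self.http_client.get(path, params=params, headers=headers)
        if self._check_blocked(status, raw):
            raise RuntimeError(f"blocked response (status {status})")
        return self._parse(raw)

    @abstractmethod
    def _build_params(self, **kwargs: Any) -> Dict[str, Any]:
        """Required hook: turn caller arguments into request parameters."""

    @abstractmethod
    def _parse(self, raw: Dict[str, Any]) -> Any:
        """Required hook: turn the raw payload into a result."""

    def _build_headers(self) -> Dict[str, str]:
        """Optional hook: extra request headers (none by default)."""
        return {}

    def _check_blocked(self, status: int, raw: Dict[str, Any]) -> bool:
        """Optional hook: detect an anti-bot block (never blocked by default)."""
        return False


class ItemFetcher(SketchFetcher):
    """Example subclass that implements only the two required hooks."""

    def _build_params(self, **kwargs: Any) -> Dict[str, Any]:
        return {"page": kwargs.get("page", 1)}

    def _parse(self, raw: Dict[str, Any]) -> Any:
        return raw["items"]


items = ItemFetcher(FakeHTTPClient()).fetch("/jobs", page=2)
```

Because the two required hooks are abstract, instantiating the base class directly fails with TypeError, which is one way to enforce the required/optional split.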

Result[T] fields:

success: bool
status_code: int
data: Optional[T] = None
list: list[T] = field(default_factory=list)
count: int = 0
is_end_page: bool = True
error: Optional[str] = None
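
A generic dataclass with these fields might look like the sketch below. The field names, types and defaults are taken from the list above; everything else is an assumption, including the free-standing load_all() helper and the fake_page() function, which only illustrate how is_end_page and list could drive a paginator. The sketch annotates the list field with typing.List rather than list[T], since inside the class body the name list is already bound to the field by the time an eagerly evaluated annotation would look it up.

```python
from dataclasses import dataclass, field
from typing import Callable, Generic, List, Optional, TypeVar

T = TypeVar("T")


@dataclass
class Result(Generic[T]):
    success: bool
    status_code: int
    data: Optional[T] = None
    list: List[T] = field(default_factory=list)  # fresh list per instance
    count: int = 0
    is_end_page: bool = True
    error: Optional[str] = None


def load_all(fetch_page: Callable[[int], Result[T]], max_pages: int = 50) -> List[T]:
    """Hypothetical paginator: collect `list` from each page until is_end_page."""
    items: List[T] = []
    for page in range(1, max_pages + 1):
        result = fetch_page(page)
        if not result.success:
            break  # stop on the first failed page
        items.extend(result.list)
        if result.is_end_page:
            break  # the page itself reports that nothing follows
    return items


def fake_page(page: int) -> Result[str]:
    """Illustrative data source: three pages of two items each."""
    return Result(
        success=True,
        status_code=200,
        list=[f"job-{page}-a", f"job-{page}-b"],
        count=6,
        is_end_page=(page >= 3),
    )


all_jobs = load_all(fake_page)
failed = Result(success=False, status_code=500, error="server error")
```

Defaulting is_end_page to True means a result that never sets the flag ends pagination after one page, which is the conservative choice for a crawler.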

tenacity retry config (identical on post and get):

  • stop_after_attempt(3) — max 3 attempts
  • wait_random_exponential(multiplier=1, min=10, max=30) — anti-detection 10-30s wait
  • retry_if_exception_type((ConnectionError, TimeoutError, OSError))
  • reraise=True — propagates after all retries exhausted
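
To make the policy concrete, here is a standard-library sketch of the same behavior: three attempts, a wait between attempts, retry only on the listed exception types, and the last exception re-raised. It deliberately does not use tenacity, and the wait is simplified to a uniform draw from the stated 10-30s window, which is an assumption and not tenacity's exact wait_random_exponential algorithm. The names call_with_retry, RETRYABLE and flaky are hypothetical. If the exception types are the Python builtins, note that ConnectionError and TimeoutError are both subclasses of OSError.

```python
import random
import time
from typing import Callable, List, Tuple, Type, TypeVar

R = TypeVar("R")

# Exception types treated as transient, mirroring the retry filter above.
RETRYABLE: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError, OSError)


def call_with_retry(
    func: Callable[[], R],
    attempts: int = 3,
    wait_min: float = 10.0,
    wait_max: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> R:
    """Call func up to `attempts` times, sleeping between retryable failures."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except RETRYABLE:
            if attempt == attempts:
                raise  # retries exhausted: propagate, like reraise=True
            # Simplified wait: uniform in [wait_min, wait_max] seconds.
            sleep(random.uniform(wait_min, wait_max))
    raise AssertionError("unreachable when attempts >= 1")


calls = {"n": 0}
waits: List[float] = []


def flaky() -> str:
    """Fails twice with a retryable error, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated network failure")
    return "ok"


# Inject a recording sleep so the example does not actually wait.
outcome = call_with_retry(flaky, sleep=waits.append)
```

Passing the sleep function as a parameter keeps the mandatory delay in production while letting tests run instantly.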

Decisions Made

Decision | Rationale
pyproject.toml uses where=[".."] | Allows pip install -e ./crawler_core from repo root to find crawler_core package
tenacity min=10 preserved | STACK.md mandates minimum 10s anti-detection delay between retries
stdlib logging only | D-03 decision: no loguru in crawler_core; loguru bridge deferred to Phase 5
Result[T] replaces ApiResult | D-07: typed generic eliminates untyped Any in return values
4 template methods per base class | D-06: _build_params/_parse required, _build_headers/_check_blocked optional
http_client attribute name (not _http) | Consistent public attribute naming for subclass access

Commits

Hash | Task | Description
4932177 | Task 1 | Create crawler_core package scaffold and pyproject.toml
ceb359d | Task 2 | Create crawler_core/http_client.py with tenacity retry and stdlib logging
04d6303 | Task 3 | Create crawler_core/base.py with Result[T] and crawler_core/__init__.py

Deviations from Plan

None of substance: the plan was executed as written.

One minor adjustment was made: a reference to ApiResult was removed from Result's docstring, because the plan's acceptance criteria explicitly require grep "ApiResult" crawler_core/base.py to return no matches.

Known Stubs

None. All files are fully implemented. The platform namespace __init__.py files intentionally contain only a docstring; they are namespace markers to be populated in Phase 2/3.

Self-Check: PASSED

All created files exist:

  • crawler_core/pyproject.toml — FOUND
  • crawler_core/__init__.py — FOUND
  • crawler_core/http_client.py — FOUND
  • crawler_core/base.py — FOUND
  • crawler_core/boss/__init__.py — FOUND
  • crawler_core/qcwy/__init__.py — FOUND
  • crawler_core/zhilian/__init__.py — FOUND

All commits exist:

  • 4932177 — FOUND
  • ceb359d — FOUND
  • 04d6303 — FOUND

Full import verification passed:

from crawler_core import BaseFetcher, BaseSearcher, Result, HTTPClient, parse_response
→ All verification checks passed

Cross-contamination checks:

  • grep -r "from spiderJobs" crawler_core/ → OK: no spiderJobs imports
  • grep -r "from app" crawler_core/ → OK: no app imports
  • grep -rn "import loguru\|from loguru" crawler_core/ → OK: no loguru imports
  • grep -r "ApiResult" crawler_core/ → OK: ApiResult fully replaced by Result[T]
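
The same checks could also be scripted so they run in CI. The sketch below is one possible approach using only the standard library. It is not part of the plan, the names FORBIDDEN and scan_package are hypothetical, and the import patterns are anchored to the start of a line, so they are slightly stricter than the grep commands above. The demo scans a temporary directory, not the real crawler_core tree.

```python
import re
import tempfile
from pathlib import Path
from typing import Dict, List

# One pattern per check; the three import rules match only at the start of a line.
FORBIDDEN = {
    "spiderJobs import": re.compile(r"^\s*(from|import)\s+spiderJobs\b"),
    "app import": re.compile(r"^\s*(from|import)\s+app\b"),
    "loguru import": re.compile(r"^\s*(from|import)\s+loguru\b"),
    "ApiResult reference": re.compile(r"\bApiResult\b"),
}


def scan_package(root: Path) -> Dict[str, List[str]]:
    """Return, per check, the 'file:line' locations that violate it."""
    hits: Dict[str, List[str]] = {name: [] for name in FORBIDDEN}
    for path in sorted(root.rglob("*.py")):
        lines = path.read_text(encoding="utf-8").splitlines()
        for lineno, line in enumerate(lines, start=1):
            for name, pattern in FORBIDDEN.items():
                if pattern.search(line):
                    hits[name].append(f"{path.name}:{lineno}")
    return hits


# Demo on throwaway files: one clean module, one with two violations.
with tempfile.TemporaryDirectory() as tmp:
    pkg = Path(tmp)
    (pkg / "clean.py").write_text(
        "import logging\nfrom typing import Any\n", encoding="utf-8"
    )
    (pkg / "dirty.py").write_text(
        "from loguru import logger\nx = ApiResult()\n", encoding="utf-8"
    )
    report = scan_package(pkg)

is_clean = not any(report.values())
```

An empty list for every check corresponds to the "OK" results recorded above; the demo deliberately produces two violations, so is_clean is False there.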