| phase | plan | subsystem | tags | dependency_graph | tech_stack | key_files | decisions | metrics |
|---|---|---|---|---|---|---|---|---|
| 01-shared-core | 01 | crawler_core | | | | | | |
Phase 01 Plan 01: crawler_core Shared Package — Summary
One-liner: Created crawler_core installable Python package with TLS-fingerprinted HTTPClient (tenacity retry, min=10s wait), generic Result[T] dataclass, and BaseFetcher/BaseSearcher template-method base classes using stdlib logging only.
What Was Created
| File | Lines | Purpose |
|---|---|---|
| `crawler_core/pyproject.toml` | 17 | Package metadata for `pip install -e ./crawler_core` |
| `crawler_core/__init__.py` | 19 | Public API: exports `BaseFetcher`, `BaseSearcher`, `Result`, `HTTPClient`, `parse_response` |
| `crawler_core/http_client.py` | 191 | `HTTPClient` with Chrome TLS fingerprint, tenacity retry (3x, min=10s), stdlib logging |
| `crawler_core/base.py` | 207 | `Result[T]` generic dataclass, `parse_response`, `BaseFetcher`, `BaseSearcher` |
| `crawler_core/boss/__init__.py` | 1 | Platform namespace marker |
| `crawler_core/qcwy/__init__.py` | 1 | Platform namespace marker |
| `crawler_core/zhilian/__init__.py` | 1 | Platform namespace marker |
| `Pipfile` | +5 lines | Added requests_go, tenacity to [packages]; pytest, pytest-cov, pytest-anyio to [dev-packages] |
Interface Contracts
Public exports (from crawler_core import ...):
- `Result`: generic dataclass with fields `success`, `status_code`, `data`, `list`, `count`, `is_end_page`, `error`
- `BaseFetcher`: template-method base; `_build_params()` and `_parse()` required, `_build_headers()` and `_check_blocked()` optional
- `BaseSearcher`: template-method base with `load_all()` paginator; same 4 template methods
- `HTTPClient`: `post(path, body, headers)` and `get(path, params, headers)` with retry
- `parse_response(http_code, raw)`: default response parser returning `Result[Any]`
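The required/optional hook split described for `BaseFetcher` can be sketched as follows. This is a minimal illustration of the template-method pattern, not the code in `crawler_core/base.py`: the class names `SketchFetcher` and `JobDetailFetcher`, the `fetch()` method, the `path` attribute, and the callable transport are all assumptions made so the example runs offline.

```python
from abc import ABC, abstractmethod
from typing import Any, Callable, Dict, Tuple

# Hypothetical transport signature: (path, params, headers) -> (http_code, decoded body).
Transport = Callable[[str, Dict[str, Any], Dict[str, str]], Tuple[int, Dict[str, Any]]]


class SketchFetcher(ABC):
    """Illustrative stand-in for BaseFetcher; not the real class."""

    path = "/example"  # hypothetical endpoint path

    def __init__(self, transport: Transport) -> None:
        # The summary names the real attribute http_client; here it is a plain callable.
        self.http_client = transport

    @abstractmethod
    def _build_params(self, **kwargs: Any) -> Dict[str, Any]:
        """Required hook: turn caller arguments into request parameters."""

    @abstractmethod
    def _parse(self, raw: Dict[str, Any]) -> Any:
        """Required hook: extract the payload from the decoded response."""

    def _build_headers(self) -> Dict[str, str]:
        """Optional hook: the default adds no extra headers."""
        return {}

    def _check_blocked(self, http_code: int, raw: Dict[str, Any]) -> bool:
        """Optional hook: the default never reports a block."""
        return False

    def fetch(self, **kwargs: Any) -> Any:
        # Fixed algorithm skeleton; subclasses only fill in the hooks.
        params = self._build_params(**kwargs)
        headers = self._build_headers()
        http_code, raw = self.http_client(self.path, params, headers)
        if self._check_blocked(http_code, raw):
            raise RuntimeError("request appears to be blocked")
        return self._parse(raw)


class JobDetailFetcher(SketchFetcher):
    """Hypothetical subclass that overrides only the two required hooks."""

    def _build_params(self, **kwargs: Any) -> Dict[str, Any]:
        return {"id": kwargs["job_id"]}

    def _parse(self, raw: Dict[str, Any]) -> Any:
        return raw["title"]


def fake_transport(path: str, params: Dict[str, Any], headers: Dict[str, str]):
    # Stand-in for a network call so the sketch runs offline.
    return 200, {"title": f"job-{params['id']}"}


title = JobDetailFetcher(fake_transport).fetch(job_id=7)
```

Because the two required hooks are abstract, a subclass that forgets either one fails at instantiation rather than at request time.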
Result[T] fields:
success: bool
status_code: int
data: Optional[T] = None
list: list[T] = field(default_factory=list)
count: int = 0
is_end_page: bool = True
error: Optional[str] = None
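As a runnable sketch, the fields above map onto a generic dataclass as shown below. The `load_all` function and `fake_page` helper are hypothetical and only illustrate how a paginator might consume `is_end_page`; they are not the `BaseSearcher.load_all()` implementation. The sketch annotates the `list` field with `typing.List`, because inside the class body the field name shadows the builtin once it has been assigned.

```python
from dataclasses import dataclass, field
from typing import Callable, Generic, List, Optional, TypeVar

T = TypeVar("T")


@dataclass
class Result(Generic[T]):
    success: bool
    status_code: int
    data: Optional[T] = None
    # typing.List avoids subscripting the name "list" after this field rebinds it.
    list: List[T] = field(default_factory=list)
    count: int = 0
    is_end_page: bool = True
    error: Optional[str] = None


def load_all(fetch_page: Callable[[int], Result[T]], max_pages: int = 50) -> List[T]:
    """Hypothetical paginator: collect items until a page reports is_end_page."""
    items: List[T] = []
    for page in range(1, max_pages + 1):
        result = fetch_page(page)
        if not result.success:
            break  # stop on the first failed page and keep what was collected
        items.extend(result.list)
        if result.is_end_page:
            break
    return items


def fake_page(page: int) -> Result[str]:
    # Three pages of two items each; only the third is flagged as the last page.
    names = [f"job-{page}-{i}" for i in range(2)]
    return Result(success=True, status_code=200, list=names, count=6,
                  is_end_page=(page == 3))


all_jobs = load_all(fake_page)
failure = Result[str](success=False, status_code=500, error="server error")
```

Note that `is_end_page` defaults to `True`, so a result that does not set it explicitly ends pagination.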
tenacity retry config (identical on post and get):
- `stop_after_attempt(3)`: max 3 attempts
- `wait_random_exponential(multiplier=1, min=10, max=30)`: anti-detection 10-30s wait
- `retry_if_exception_type((ConnectionError, TimeoutError, OSError))`
- `reraise=True`: propagates after all retries exhausted
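To make the policy concrete without depending on tenacity, here is a stdlib-only approximation of the same behaviour: three attempts, a wait between 10 and 30 seconds, retry only on the listed exception types, and re-raise of the original exception once attempts run out. The function `call_with_retry`, its parameters, and the jitter formula are assumptions for illustration; tenacity's own wait computation may differ.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

R = TypeVar("R")

# Same exception types as the retry_if_exception_type(...) entry above.
RETRYABLE: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError, OSError)


def call_with_retry(
    func: Callable[[], R],
    attempts: int = 3,
    min_wait: float = 10.0,
    max_wait: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> R:
    """Stdlib approximation of the policy above; this is not tenacity's code."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except RETRYABLE:
            if attempt == attempts:
                # Mirrors reraise=True: the original exception propagates.
                raise
            # Simplified jitter: exponential upper bound clamped to [min_wait, max_wait].
            upper = max(min_wait, min(max_wait, 2.0 ** attempt))
            sleep(random.uniform(min_wait, upper))
    raise AssertionError("unreachable")


# Demo: fail twice, succeed on the third attempt; waits are recorded instead of slept.
waits = []
calls = {"n": 0}


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated network failure")
    return "ok"


outcome = call_with_retry(flaky, sleep=waits.append)
```

Passing `sleep` as a parameter keeps the sketch testable without real delays.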
Decisions Made
| Decision | Rationale |
|---|---|
| `pyproject.toml` uses `where=[".."]` | Allows `pip install -e ./crawler_core` from repo root to find the `crawler_core` package |
| tenacity `min=10` preserved | STACK.md mandates minimum 10s anti-detection delay between retries |
| stdlib logging only | D-03 decision: no loguru in crawler_core; loguru bridge deferred to Phase 5 |
| `Result[T]` replaces `ApiResult` | D-07: typed generic eliminates untyped `Any` in return values |
| 4 template methods per base class | D-06: `_build_params`/`_parse` required, `_build_headers`/`_check_blocked` optional |
| `http_client` attribute name (not `_http`) | Consistent public attribute naming for subclass access |
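The "stdlib logging only" decision (D-03) usually means the library logs through a named `logging` logger and leaves handler setup to the application. The sketch below shows one way that can look. The `log_retry` helper, the `ListHandler` class, and the `NullHandler` on the package logger are assumptions, not a description of what `crawler_core/http_client.py` contains.

```python
import logging

# Library side: a module-level logger and a NullHandler, no third-party logging imports.
logger = logging.getLogger("crawler_core.http_client")
logging.getLogger("crawler_core").addHandler(logging.NullHandler())


def log_retry(attempt: int, path: str) -> None:
    """Hypothetical helper showing a lazily formatted stdlib log call."""
    logger.warning("retry %d for %s", attempt, path)


class ListHandler(logging.Handler):
    """Application side: collects messages; stands in for a real handler or bridge."""

    def __init__(self) -> None:
        super().__init__()
        self.messages = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


# The application attaches its own handler to the package logger.
capture = ListHandler()
logging.getLogger("crawler_core").addHandler(capture)
log_retry(2, "/search")
```

With this layout, a later bridge to another logging library can be added as one more handler on the `crawler_core` logger without touching the package itself.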
Commits
| Hash | Task | Description |
|---|---|---|
| `4932177` | Task 1 | Create crawler_core package scaffold and pyproject.toml |
| `ceb359d` | Task 2 | Create crawler_core/http_client.py with tenacity retry and stdlib logging |
| `04d6303` | Task 3 | Create crawler_core/base.py with Result[T] and crawler_core/\_\_init\_\_.py |
Deviations from Plan
None of substance. The plan was executed as written, with one minor adjustment: a reference to `ApiResult` was removed from the docstring of `Result`, because the plan's acceptance criteria explicitly require `grep "ApiResult" crawler_core/base.py` to return no matches.
Known Stubs
None. All files are fully implemented. The platform namespace `__init__.py` files are intentionally minimal (docstring only); they are namespace markers designed to be populated in Phase 2/3.
Self-Check: PASSED
All created files exist:
- `crawler_core/pyproject.toml`: FOUND
- `crawler_core/__init__.py`: FOUND
- `crawler_core/http_client.py`: FOUND
- `crawler_core/base.py`: FOUND
- `crawler_core/boss/__init__.py`: FOUND
- `crawler_core/qcwy/__init__.py`: FOUND
- `crawler_core/zhilian/__init__.py`: FOUND
All commits exist:
- `4932177`: FOUND
- `ceb359d`: FOUND
- `04d6303`: FOUND
Full import verification passed:
from crawler_core import BaseFetcher, BaseSearcher, Result, HTTPClient, parse_response
→ All verification checks passed
Cross-contamination checks:
- `grep -r "from spiderJobs" crawler_core/` → OK: no spiderJobs imports
- `grep -r "from app" crawler_core/` → OK: no app imports
- `grep -rn "import loguru|from loguru" crawler_core/` → OK: no loguru imports
- `grep -r "ApiResult" crawler_core/` → OK: ApiResult fully replaced by Result[T]
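The same cross-contamination checks can be approximated in Python, for example as a test that runs alongside the package. This is a sketch under assumptions: `scan_package` and the check names are hypothetical, and unlike `grep -r` it only reads `*.py` files.

```python
import re
import tempfile
from pathlib import Path
from typing import Dict, List

# Patterns taken from the grep checks above.
FORBIDDEN = {
    "spiderJobs import": re.compile(r"from spiderJobs"),
    "app import": re.compile(r"from app"),
    "loguru import": re.compile(r"import loguru|from loguru"),
    "ApiResult reference": re.compile(r"ApiResult"),
}


def scan_package(root: Path) -> Dict[str, List[str]]:
    """Return {check name: ["file:line", ...]} for every forbidden match under root."""
    hits: Dict[str, List[str]] = {name: [] for name in FORBIDDEN}
    for path in sorted(root.rglob("*.py")):
        text = path.read_text(encoding="utf-8")
        for lineno, line in enumerate(text.splitlines(), start=1):
            for name, pattern in FORBIDDEN.items():
                if pattern.search(line):
                    hits[name].append(f"{path.relative_to(root).as_posix()}:{lineno}")
    return hits


# Demo on a throwaway directory: one clean file, one file with a forbidden import.
with tempfile.TemporaryDirectory() as tmp:
    pkg = Path(tmp)
    (pkg / "clean.py").write_text("import logging\n", encoding="utf-8")
    (pkg / "dirty.py").write_text("import os\nfrom loguru import logger\n",
                                  encoding="utf-8")
    report = scan_package(pkg)
```

An empty list for every check corresponds to the "OK" results recorded above.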