| phase | plan | type | wave | depends_on | files_modified | autonomous | requirements | must_haves |
|---|---|---|---|---|---|---|---|---|
| 01-shared-core | 01 | execute | 1 | | | true | | |
Purpose: This is the foundation everything else depends on. Once installed with pip install -e ./crawler_core, Phase 2/3 platform rewrites can import from it instead of copying code.
Output: A working Python package at crawler_core/ that installs cleanly and exposes BaseFetcher, BaseSearcher, Result[T], and HTTPClient.
<execution_context>
@/.claude/get-shit-done/workflows/execute-plan.md
@/.claude/get-shit-done/templates/summary.md
</execution_context>
From spiderJobs/core/http_client.py:
class HTTPClient:
    def __init__(self, base_url, default_headers=None, proxy=None,
                 tunnel_proxy=None, proxy_pool=None, timeout=10): ...
    def _new_session(self) -> requests.Session: ...
    def _get_proxies(self) -> Optional[dict]: ...
    def _merge_headers(self, extra=None) -> dict: ...
    def post(self, path, body, headers=None) -> tuple[int, Any]: ...
    def get(self, path, params=None, headers=None) -> tuple[int, Any]: ...
From spiderJobs/core/base.py (reference only — DO NOT copy ApiResult; use Result[T] instead per D-07):
@dataclass
class ApiResult:  # <-- OLD: replaced by Result[T] in crawler_core/base.py
    success: bool
    status_code: int
    data: Any = None
    list: list[dict] = field(default_factory=list)
    count: int = 0
    is_end_page: bool = True
    error: Optional[str] = None

def parse_response(http_code: int, raw: Any) -> ApiResult: ...

class BaseFetcher:
    ENDPOINT: str = ""
    def __init__(self, http_client: HTTPClient): ...
    def _build_params(self) -> dict: raise NotImplementedError  # template method (required)
    def _parse(self, http_code, raw) -> ApiResult: ...
    def fetch(self) -> ApiResult: ...

class BaseSearcher:
    ENDPOINT: str = ""
    def __init__(self, page_size=15, http_client=None): ...
    def _build_params(self, page_index) -> dict: raise NotImplementedError  # template method (required)
    def _request(self, params) -> tuple[int, Any]: ...
    def _parse(self, http_code, raw) -> ApiResult: ...
    def search(self, page_index=1) -> ApiResult: ...
    def load_all(self, max_pages=10, on_page=None) -> list[dict]: ...
Task 1: Create crawler_core package scaffold and pyproject.toml
- /Users/win/2025/AICoding/JobData/pyproject.toml (understand existing project config format)
- /Users/win/2025/AICoding/JobData/Pipfile (understand dependency structure to add entries)
- /Users/win/2025/AICoding/JobData/.planning/phases/01-shared-core/1-CONTEXT.md (decisions D-01 through D-04)
crawler_core/pyproject.toml
crawler_core/boss/__init__.py
crawler_core/qcwy/__init__.py
crawler_core/zhilian/__init__.py
Pipfile
Create the crawler_core/ directory structure and configure it as an installable Python package.
Step 1: Create crawler_core/pyproject.toml
[build-system]
requires = ["setuptools>=68"]
build-backend = "setuptools.build_meta"
[project]
name = "crawler_core"
version = "0.1.0"
description = "Shared crawler core — sign algorithms, HTTP client, base classes"
requires-python = ">=3.11"
dependencies = [
"requests_go==1.0.9",
"tenacity>=8.0",
]
[tool.setuptools.packages.find]
where = [".."]
include = ["crawler_core*"]
NOTE: where = [".."] tells setuptools to search one level above crawler_core/pyproject.toml, that is, the repo root, so the crawler_core/ directory itself is discovered as the package. This is intended to make pip install -e ./crawler_core resolve correctly.
Step 2: Create platform namespace __init__.py files (docstring only)
Create these three files with a single docstring only (NO imports; they are just namespace markers):
- crawler_core/boss/__init__.py: """Boss直聘 platform module."""
- crawler_core/qcwy/__init__.py: """前程无忧 (51Job) platform module."""
- crawler_core/zhilian/__init__.py: """智联招聘 platform module."""
Step 3: Add dependencies to Pipfile
In the [packages] section (before [dev-packages]), add these two lines (after playwright = "==1.57.0"):
requests_go = "==1.0.9"
tenacity = ">=8.0"
In the [dev-packages] section, add:
pytest = ">=8.0"
pytest-cov = ">=4.0"
pytest-anyio = "*"
What NOT to do:
- Do NOT create a crawler_core/__init__.py in this task (Task 2 creates it)
- Do NOT create crawler_core/http_client.py or crawler_core/base.py (Task 2 and 3)
- Do NOT run pip install; just write the files
python -c "import tomllib; d=tomllib.load(open('/Users/win/2025/AICoding/JobData/crawler_core/pyproject.toml','rb')); assert d['project']['name']=='crawler_core'; print('pyproject.toml OK')" && grep -q "requests_go" /Users/win/2025/AICoding/JobData/Pipfile && grep -q "tenacity" /Users/win/2025/AICoding/JobData/Pipfile && grep -q "pytest" /Users/win/2025/AICoding/JobData/Pipfile && echo "Pipfile OK"
<acceptance_criteria>
- crawler_core/pyproject.toml exists and contains name = "crawler_core", requires-python = ">=3.11", requests_go==1.0.9, tenacity>=8.0
- crawler_core/boss/__init__.py, crawler_core/qcwy/__init__.py, crawler_core/zhilian/__init__.py all exist (each may contain only a docstring)
- Pipfile [packages] section contains requests_go = "==1.0.9" and tenacity = ">=8.0"
- Pipfile [dev-packages] section contains pytest, pytest-cov, pytest-anyio
- grep -c "requests_go" /Users/win/2025/AICoding/JobData/Pipfile outputs 1 (no duplicates)
</acceptance_criteria>
Package directory structure created, pyproject.toml valid, dependencies declared in Pipfile.
The file must be exactly crawler_core/http_client.py — no subdirectory.
Imports to use (CRITICAL — per D-03, only requests_go + stdlib + tenacity):
from __future__ import annotations
import logging
import random
from typing import Any, Optional
import requests_go as requests
from requests_go.tls_config import TLS_CHROME_LATEST
from tenacity import (
retry,
retry_if_exception_type,
stop_after_attempt,
wait_random_exponential,
)
Logging setup (module-level, before the class):
logger = logging.getLogger("crawler_core.http_client")
This uses stdlib logging — NOT loguru (per D-03, loguru is excluded from crawler_core). Callers (app/services/crawler/) can configure loguru to bridge stdlib if desired.
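As a minimal sketch of what caller-side configuration could look like, the snippet below attaches a handler to the parent logger crawler_core and relies on standard logger propagation so that records from crawler_core.http_client reach it. ListHandler is a hypothetical stand-in for whatever sink the application uses (for example a loguru bridge); it is not part of crawler_core.

```python
import logging


class ListHandler(logging.Handler):
    """Hypothetical sink that collects formatted records in memory."""

    def __init__(self) -> None:
        super().__init__()
        self.messages: list[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(self.format(record))


# Application side: configure the parent logger once.
handler = ListHandler()
handler.setFormatter(logging.Formatter("%(name)s %(levelname)s %(message)s"))
parent = logging.getLogger("crawler_core")
parent.setLevel(logging.DEBUG)
parent.addHandler(handler)

# Library side: the module-level logger named in this plan.
logger = logging.getLogger("crawler_core.http_client")
logger.debug("POST %s%s", "https://example.invalid", "/search")
```

Because the child logger has no level of its own, it inherits DEBUG from crawler_core, and the record propagates up to the handler attached there.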
Class structure: Copy the full HTTPClient class from spiderJobs/core/http_client.py EXACTLY, then make these targeted changes:
- Keep all existing methods unchanged: __init__, _new_session, _get_proxies, _merge_headers
- Wrap post() with a tenacity retry decorator:
@retry(
    stop=stop_after_attempt(3),
    wait=wait_random_exponential(multiplier=1, min=10, max=30),
    retry=retry_if_exception_type((ConnectionError, TimeoutError, OSError)),
    reraise=True,
    before_sleep=lambda retry_state: logger.warning(
        "HTTP retry attempt=%d url=%s error=%s",
        retry_state.attempt_number,
        # args[0] is self; args[1] is path only when it is passed positionally
        retry_state.args[1] if len(retry_state.args) > 1 else "unknown",
        retry_state.outcome.exception(),
    ),
)
def post(self, path: str, body: dict, headers: Optional[dict] = None) -> tuple[int, Any]:
    """Send a POST request."""
    logger.debug("POST %s%s", self.base_url, path)
    # ... existing body unchanged, including the existing try/finally logic ...
- Wrap get() with the same tenacity retry decorator (identical decorator, same pattern):
@retry(
    stop=stop_after_attempt(3),
    wait=wait_random_exponential(multiplier=1, min=10, max=30),
    retry=retry_if_exception_type((ConnectionError, TimeoutError, OSError)),
    reraise=True,
    before_sleep=lambda retry_state: logger.warning(
        "HTTP retry attempt=%d url=%s error=%s",
        retry_state.attempt_number,
        # args[0] is self; args[1] is path only when it is passed positionally
        retry_state.args[1] if len(retry_state.args) > 1 else "unknown",
        retry_state.outcome.exception(),
    ),
)
def get(self, path: str, params: Optional[dict] = None, headers: Optional[dict] = None) -> tuple[int, Any]:
    """Send a GET request."""
    logger.debug("GET %s%s", self.base_url, path)
    # ... existing body unchanged ...
- Add module docstring at the top:
"""
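crawler_core.http_client: general-purpose HTTP client.
Built on requests-go, with built-in Chrome TLS fingerprint impersonation (TLS_CHROME_LATEST + random_ja3=True).
Supports a fixed proxy IP, a tunnel proxy, and proxy-pool rotation.
Built-in tenacity retry (3 attempts, exponential backoff, minimum 10-second wait).
Uses stdlib logging; callers can configure it via logging.getLogger('crawler_core').
Does not depend on application-level frameworks such as FastAPI or Tortoise-ORM, or on third-party logging libraries.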
"""
Minimum 10 second wait is MANDATORY — min=10 in wait_random_exponential preserves the anti-detection delay requirement from STACK.md.
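To make the intended retry behavior concrete, here is a stdlib-only stand-in for the decorator above. It is not tenacity and not the code that goes into http_client.py. It mirrors three attempts, retries only on the OSError family, re-raises the last error, and never sleeps for less than a floor. The names retry_os_errors and min_wait and the injectable sleep parameter are all hypothetical.

```python
import random
import time
from typing import Callable, List, TypeVar

R = TypeVar("R")


def retry_os_errors(
    func: Callable[[], R],
    attempts: int = 3,
    min_wait: float = 10.0,
    max_wait: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> R:
    """Call func up to `attempts` times, waiting at least min_wait between tries."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except OSError:
            # ConnectionError and TimeoutError are subclasses of OSError,
            # so this clause covers all three types listed in the plan.
            if attempt == attempts:
                raise  # same effect as reraise=True: surface the last error
            # Exponential ceiling, clamped into [min_wait, max_wait].
            ceiling = max(min_wait, min(2.0 ** (attempt - 1), max_wait))
            sleep(random.uniform(min_wait, ceiling))
    raise AssertionError("unreachable when attempts >= 1")


# Usage: a flaky call that succeeds on the third attempt.
calls = {"n": 0}
waits: List[float] = []


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated network failure")
    return "ok"


# Record the waits in a list instead of actually sleeping.
result = retry_os_errors(flaky, sleep=waits.append)
```

With three attempts there are at most two waits, and each is at least the floor. Whether a given tenacity release applies min as a hard floor to the randomized wait is worth confirming against the installed version before relying on it for the anti-detection delay.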
Do NOT:
- Change the proxy logic (keep tunnel_proxy / proxy_pool / fixed proxy logic identical)
- Import loguru
- Import anything from spiderJobs.* or app.*
cd /Users/win/2025/AICoding/JobData && python -c "
import sys
sys.path.insert(0, '.')
from crawler_core.http_client import HTTPClient
import inspect, logging
src = inspect.getsource(HTTPClient.post)
assert 'retry' in src or hasattr(HTTPClient.post, 'retry'), 'tenacity decorator missing on post'
assert 'logger' in src or 'logging' in src, 'logging missing in post'
print('HTTPClient OK')
"
<acceptance_criteria>
- crawler_core/http_client.py exists and is importable: from crawler_core.http_client import HTTPClient succeeds (with the repo root on sys.path)
- File imports retry from tenacity
- File contains logger = logging.getLogger("crawler_core.http_client")
- File contains wait_random_exponential(multiplier=1, min=10, max=30) with exactly these values
- File contains stop_after_attempt(3)
- File does NOT contain import loguru or from loguru anywhere
- File does NOT contain from spiderJobs or from app anywhere
- File is under 200 lines (source is 155 lines + ~30 lines of additions)
- grep -c "from tenacity" /Users/win/2025/AICoding/JobData/crawler_core/http_client.py outputs 1
- grep "min=10" /Users/win/2025/AICoding/JobData/crawler_core/http_client.py has output
</acceptance_criteria>
HTTPClient ported with retry (3 attempts, min=10s wait) and stdlib logging. No loguru. No spiderJobs imports.
crawler_core/base.py:
Add module docstring at the top:
"""
crawler_core.base: shared base classes and data structures.
Provides what all job platforms have in common: Result, BaseFetcher, BaseSearcher, parse_response.
Contains no platform-specific code.
"""
Step 1: Generic Result[T] dataclass (replaces ApiResult — per D-07)
from __future__ import annotations
import logging
from dataclasses import dataclass, field
from typing import Any, Generic, Optional, TypeVar
from crawler_core.http_client import HTTPClient
T = TypeVar("T")
_logger = logging.getLogger("crawler_core.base")
@dataclass
class Result(Generic[T]):
    """Typed result wrapper returned by all BaseFetcher and BaseSearcher methods.
    Replaces the legacy untyped result class. Callers annotate as Result[MyJobModel] etc.
    """
    success: bool
    status_code: int
    data: Optional[T] = None
    list: list[T] = field(default_factory=list)
    count: int = 0
    is_end_page: bool = True
    error: Optional[str] = None
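A short usage sketch of the generic wrapper follows. The Result class is repeated only so the snippet runs on its own; JobItem and first_title are hypothetical names, not something defined in this plan.

```python
from dataclasses import dataclass, field
from typing import Generic, List, Optional, TypeVar

T = TypeVar("T")


@dataclass
class Result(Generic[T]):
    success: bool
    status_code: int
    data: Optional[T] = None
    # typing.List is used here because the field name `list` shadows the
    # builtin inside the class body when annotations are evaluated eagerly;
    # base.py sidesteps this with `from __future__ import annotations`.
    list: List[T] = field(default_factory=list)
    count: int = 0
    is_end_page: bool = True
    error: Optional[str] = None


@dataclass
class JobItem:  # hypothetical model, for illustration only
    job_id: str
    title: str


def first_title(result: "Result[JobItem]") -> Optional[str]:
    """A type checker can see that result.list holds JobItem objects."""
    return result.list[0].title if result.success and result.list else None


page: "Result[JobItem]" = Result(
    success=True,
    status_code=200,
    list=[JobItem(job_id="1", title="Backend Engineer")],
    count=1,
    is_end_page=False,
)
blocked: "Result[JobItem]" = Result(success=False, status_code=403, error="blocked")
```

The type parameter exists only for static checking. At runtime a Result[JobItem] is an ordinary dataclass instance, so the field-name check in the verification script is unaffected.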
Step 2: parse_response — adapt from spiderJobs/core/base.py but return Result[Any]
Port parse_response(http_code, raw) from spiderJobs/core/base.py verbatim, changing only the return type annotation from ApiResult to Result[Any].
Step 3: BaseFetcher — 4 template methods (per D-06)
class BaseFetcher:
    """Template-method base class for single-item fetchers.
    Required overrides: _build_params(), _parse()
    Optional overrides: _build_headers(), _check_blocked()
    """
    ENDPOINT: str = ""
    def __init__(self, http_client: HTTPClient) -> None:
        self.http_client = http_client
    # --- Required template methods ---
    def _build_params(self) -> dict:
        """Build query/body parameters for the request. MUST be overridden."""
        raise NotImplementedError(f"{type(self).__name__} must implement _build_params()")
    def _parse(self, http_code: int, raw: Any) -> Result:
        """Parse the HTTP response into a Result. MUST be overridden."""
        raise NotImplementedError(f"{type(self).__name__} must implement _parse()")
    # --- Optional template methods ---
    def _build_headers(self) -> dict:
        """Build extra request headers. Override to add platform-specific headers.
        Default: returns {} (no extra headers beyond HTTPClient defaults).
        """
        return {}
    def _check_blocked(self, status_code: int, body: str) -> bool:
        """Detect platform-specific anti-crawl blocks.
        Override to inspect response body/status for block signals.
        Default: returns False (assume not blocked).
        """
        return False
    # --- Orchestration ---
    def fetch(self) -> Result:
        """Execute the fetch: build params → request → check blocked → parse."""
        params = self._build_params()
        extra_headers = self._build_headers()
        http_code, raw = self.http_client.get(
            self.ENDPOINT, params=params, headers=extra_headers or None
        )
        raw_str = str(raw) if not isinstance(raw, str) else raw
        if self._check_blocked(http_code, raw_str):
            return Result(success=False, status_code=http_code, error="blocked")
        return self._parse(http_code, raw)
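The following sketch shows how a platform subclass might fill in the four template methods. BaseFetcher and Result are condensed copies so the snippet runs on its own. StubClient, JobDetailFetcher, the endpoint, the header, and the block signal are all hypothetical and do not describe any real platform.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Result:  # trimmed, non-generic copy for this sketch
    success: bool
    status_code: int
    data: Any = None
    error: Optional[str] = None


class BaseFetcher:  # condensed copy of the class above
    ENDPOINT = ""

    def __init__(self, http_client: Any) -> None:
        self.http_client = http_client

    def _build_params(self) -> dict:
        raise NotImplementedError

    def _parse(self, http_code: int, raw: Any) -> Result:
        raise NotImplementedError

    def _build_headers(self) -> dict:
        return {}

    def _check_blocked(self, status_code: int, body: str) -> bool:
        return False

    def fetch(self) -> Result:
        params = self._build_params()
        extra_headers = self._build_headers()
        http_code, raw = self.http_client.get(
            self.ENDPOINT, params=params, headers=extra_headers or None
        )
        raw_str = raw if isinstance(raw, str) else str(raw)
        if self._check_blocked(http_code, raw_str):
            return Result(success=False, status_code=http_code, error="blocked")
        return self._parse(http_code, raw)


class StubClient:
    """Hypothetical stand-in for HTTPClient: returns a canned (status, body)."""

    def __init__(self, status: int, body: Any) -> None:
        self.status, self.body = status, body
        self.last_headers: Optional[dict] = None

    def get(self, path: str, params: Optional[dict] = None,
            headers: Optional[dict] = None) -> tuple:
        self.last_headers = headers
        return self.status, self.body


class JobDetailFetcher(BaseFetcher):  # hypothetical platform fetcher
    ENDPOINT = "/job/detail"

    def __init__(self, http_client: Any, job_id: str) -> None:
        super().__init__(http_client)
        self.job_id = job_id

    def _build_params(self) -> dict:
        return {"id": self.job_id}

    def _build_headers(self) -> dict:
        return {"X-Requested-With": "XMLHttpRequest"}

    def _check_blocked(self, status_code: int, body: str) -> bool:
        # Assumed block signal; real platforms will differ.
        return status_code == 403 or "captcha" in body.lower()

    def _parse(self, http_code: int, raw: Any) -> Result:
        return Result(success=http_code == 200, status_code=http_code, data=raw)


ok_client = StubClient(200, {"title": "Backend Engineer"})
ok = JobDetailFetcher(ok_client, "42").fetch()
blocked = JobDetailFetcher(StubClient(200, "Please solve the CAPTCHA"), "42").fetch()
```

Keeping block detection in its own hook means a blocked response short-circuits before _parse() runs, so parsers never have to handle CAPTCHA pages.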
Step 4: BaseSearcher — 4 template methods (per D-06)
class BaseSearcher:
    """Template-method base class for paginated list searchers.
    Required overrides: _build_params(), _parse()
    Optional overrides: _build_headers(), _check_blocked()
    """
    ENDPOINT: str = ""
    def __init__(self, page_size: int = 15, http_client: Optional[HTTPClient] = None) -> None:
        self.page_size = page_size
        self.http_client = http_client
    # --- Required template methods ---
    def _build_params(self, page_index: int) -> dict:
        """Build pagination query params. MUST be overridden."""
        raise NotImplementedError(f"{type(self).__name__} must implement _build_params()")
    def _parse(self, http_code: int, raw: Any) -> Result:
        """Parse the HTTP response into a Result. MUST be overridden."""
        raise NotImplementedError(f"{type(self).__name__} must implement _parse()")
    # --- Optional template methods ---
    def _build_headers(self) -> dict:
        """Build extra request headers. Override for platform-specific headers.
        Default: returns {} (no extra headers beyond HTTPClient defaults).
        """
        return {}
    def _check_blocked(self, status_code: int, body: str) -> bool:
        """Detect platform-specific anti-crawl blocks.
        Override to inspect response body/status for block signals.
        Default: returns False (assume not blocked).
        """
        return False
    # --- Orchestration ---
    def _request(self, params: dict) -> tuple[int, Any]:
        """Execute a single HTTP request. Uses _build_headers() for extra headers."""
        extra_headers = self._build_headers()
        return self.http_client.get(
            self.ENDPOINT, params=params, headers=extra_headers or None
        )
    def search(self, page_index: int = 1) -> Result:
        """Fetch a single page: build params → request → check blocked → parse."""
        params = self._build_params(page_index)
        http_code, raw = self._request(params)
        raw_str = str(raw) if not isinstance(raw, str) else raw
        if self._check_blocked(http_code, raw_str):
            return Result(success=False, status_code=http_code, error="blocked")
        return self._parse(http_code, raw)
    def load_all(self, max_pages: int = 10, on_page=None) -> list:
        """Iterate pages until is_end_page=True or max_pages reached."""
        all_items: list = []
        for page_index in range(1, max_pages + 1):
            result = self.search(page_index)
            if not result.success:
                _logger.warning("Page %d failed: %s", page_index, result.error)
                break
            all_items.extend(result.list)
            if on_page:
                on_page(page_index, result)
            if result.is_end_page:
                break
        return all_items
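The pagination loop in load_all() can be exercised without any HTTP at all. In this sketch the base class is cut down to search() and load_all(), and InMemorySearcher is a hypothetical subclass that pages over a local list; both exist only for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional


@dataclass
class Result:  # trimmed copy so the sketch runs on its own
    success: bool
    status_code: int
    list: List[Any] = field(default_factory=list)
    is_end_page: bool = True
    error: Optional[str] = None


class BaseSearcher:  # condensed copy: only search() and load_all()
    def search(self, page_index: int = 1) -> Result:
        raise NotImplementedError

    def load_all(
        self,
        max_pages: int = 10,
        on_page: Optional[Callable[[int, Result], None]] = None,
    ) -> List[Any]:
        all_items: List[Any] = []
        for page_index in range(1, max_pages + 1):
            result = self.search(page_index)
            if not result.success:
                break  # stop on the first failed page, keep what we have
            all_items.extend(result.list)
            if on_page:
                on_page(page_index, result)
            if result.is_end_page:
                break
        return all_items


class InMemorySearcher(BaseSearcher):
    """Hypothetical searcher that pages over a local list instead of HTTP."""

    def __init__(self, items: List[int], page_size: int = 2) -> None:
        self.items, self.page_size = items, page_size

    def search(self, page_index: int = 1) -> Result:
        start = (page_index - 1) * self.page_size
        chunk = self.items[start:start + self.page_size]
        return Result(
            success=True,
            status_code=200,
            list=chunk,
            is_end_page=start + self.page_size >= len(self.items),
        )


seen_pages: List[int] = []
searcher = InMemorySearcher([1, 2, 3, 4, 5])
everything = searcher.load_all(on_page=lambda idx, res: seen_pages.append(idx))
capped = searcher.load_all(max_pages=1)
```

Note that load_all() returns whatever was collected before a failure rather than raising, so callers that need to distinguish a partial result from a complete one should track it through on_page.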
crawler_core/__init__.py:
"""
crawler_core: shared core package for the job crawlers.
Install: pip install -e ./crawler_core
Usage: from crawler_core import BaseFetcher, BaseSearcher, Result, HTTPClient
"""
from crawler_core.base import Result, BaseFetcher, BaseSearcher, parse_response
from crawler_core.http_client import HTTPClient
__all__ = [
"Result",
"BaseFetcher",
"BaseSearcher",
"HTTPClient",
"parse_response",
]
__version__ = "0.1.0"
Do NOT:
- Keep the old ApiResult name anywhere in crawler_core (it's fully replaced by Result[T])
- Import from spiderJobs.* or app.*
- Import loguru
- Add any platform-specific code to base.py or __init__.py
cd /Users/win/2025/AICoding/JobData && python -c "
import sys
sys.path.insert(0, '.')
from crawler_core import BaseFetcher, BaseSearcher, Result, HTTPClient, parse_response
import dataclasses, typing
fields = {f.name for f in dataclasses.fields(Result)}
assert fields == {'success','status_code','data','list','count','is_end_page','error'}, f'Result fields wrong: {fields}'
assert hasattr(BaseFetcher, 'fetch'), 'BaseFetcher.fetch missing'
assert hasattr(BaseFetcher, '_build_headers'), 'BaseFetcher._build_headers missing'
assert hasattr(BaseFetcher, '_check_blocked'), 'BaseFetcher._check_blocked missing'
assert BaseFetcher._build_headers(object()) == {}, '_build_headers default must return {}'
assert BaseFetcher._check_blocked(object(), 200, '') == False, '_check_blocked default must return False'
assert hasattr(BaseSearcher, 'load_all'), 'BaseSearcher.load_all missing'
assert hasattr(BaseSearcher, '_build_headers'), 'BaseSearcher._build_headers missing'
assert hasattr(BaseSearcher, '_check_blocked'), 'BaseSearcher._check_blocked missing'
print('All imports OK, Result fields OK, 4 template methods verified')
"
<acceptance_criteria>
- from crawler_core import BaseFetcher, BaseSearcher, Result, HTTPClient succeeds (with repo root on sys.path)
- crawler_core/base.py defines Result as a generic dataclass using TypeVar and Generic[T]
- crawler_core/base.py does NOT contain ApiResult anywhere: grep "ApiResult" crawler_core/base.py returns empty
- crawler_core/base.py does NOT contain from spiderJobs anywhere: grep "from spiderJobs" crawler_core/base.py returns empty
- crawler_core/base.py does NOT contain print( anywhere: grep "print(" crawler_core/base.py returns empty
- BaseFetcher._build_headers(self) exists and returns {} by default
- BaseFetcher._check_blocked(self, status_code, body) exists and returns False by default
- BaseFetcher.fetch() calls _build_headers() and _check_blocked() in its implementation
- BaseSearcher._build_headers(self) exists and returns {} by default
- BaseSearcher._check_blocked(self, status_code, body) exists and returns False by default
- BaseSearcher.search() calls _check_blocked() in its implementation
- crawler_core/__init__.py exports Result (not ApiResult) in __all__
- crawler_core/__init__.py contains __version__ = "0.1.0"
- Result dataclass has exactly 7 fields: success, status_code, data, list, count, is_end_page, error
- BaseFetcher._build_params raises NotImplementedError
- BaseSearcher._build_params raises NotImplementedError
</acceptance_criteria>
base.py uses Result[T] generic (no ApiResult), 4 template methods wired into fetch()/search(), __init__.py exports clean public API.
cd /Users/win/2025/AICoding/JobData
python -c "
import sys
sys.path.insert(0, '.')
from crawler_core import BaseFetcher, BaseSearcher, Result, HTTPClient, parse_response
# Verify Result structure
r = Result(success=True, status_code=200)
assert r.success and r.list == [] and r.error is None
# Verify BaseFetcher requires _build_params
class TestFetcher(BaseFetcher):
    ENDPOINT = '/test'
    def _build_params(self):
        return {'q': 'test'}
    def _parse(self, http_code, raw):
        return Result(success=True, status_code=http_code)
# Verify default template method overrides
tf = TestFetcher(http_client=None)
assert tf._build_headers() == {}, '_build_headers default failed'
assert tf._check_blocked(200, '') == False, '_check_blocked default failed'
# Verify parse_response with dict input
result = parse_response(200, {'statusCode': 200, 'data': {'list': [{'id': 1}], 'count': 1, 'isEndPage': False}})
assert result.success
assert result.list == [{'id': 1}]
assert not result.is_end_page
print('All verification checks passed')
"
Also confirm no cross-contamination:
grep -r "from spiderJobs" /Users/win/2025/AICoding/JobData/crawler_core/ && echo "FAIL: found spiderJobs import" || echo "OK: no spiderJobs imports"
grep -r "from app" /Users/win/2025/AICoding/JobData/crawler_core/ && echo "FAIL: found app import" || echo "OK: no app imports"
grep -r "loguru" /Users/win/2025/AICoding/JobData/crawler_core/ && echo "FAIL: found loguru" || echo "OK: no loguru"
grep -r "ApiResult" /Users/win/2025/AICoding/JobData/crawler_core/ && echo "FAIL: ApiResult still present" || echo "OK: ApiResult fully replaced by Result[T]"
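The same checks could also be expressed in Python, for instance inside a test. This is only a sketch: find_forbidden and FORBIDDEN are hypothetical names, and unlike grep -r it looks only at .py files.

```python
from pathlib import Path
from typing import Dict, List, Tuple

# Same patterns as the grep commands above.
FORBIDDEN: Tuple[str, ...] = ("from spiderJobs", "from app", "loguru", "ApiResult")


def find_forbidden(package_dir: Path) -> Dict[str, List[str]]:
    """Map each forbidden pattern to the .py files under package_dir containing it."""
    hits: Dict[str, List[str]] = {}
    for py_file in sorted(package_dir.rglob("*.py")):
        text = py_file.read_text(encoding="utf-8")
        for pattern in FORBIDDEN:
            if pattern in text:  # plain substring match, like grep
                hits.setdefault(pattern, []).append(py_file.name)
    return hits
```

Like the grep commands, this is a plain substring match, so a docstring or comment that merely mentions one of the names is flagged too. Module docstrings in crawler_core therefore need to avoid those exact strings.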
<success_criteria>
- python -c "from crawler_core import BaseFetcher, BaseSearcher, Result, HTTPClient" exits 0 (with repo root on sys.path)
- crawler_core/pyproject.toml passes python -c "import tomllib; tomllib.load(open('crawler_core/pyproject.toml','rb'))"
- grep "requests_go" Pipfile has output (dependency declared)
- grep "tenacity" Pipfile has output (dependency declared)
- grep "pytest" Pipfile has output (dev dependency declared)
- grep -r "from spiderJobs" crawler_core/ has NO output
- grep -r "loguru" crawler_core/ has NO output
- grep "min=10" crawler_core/http_client.py has output (anti-detection delay preserved)
- grep -r "ApiResult" crawler_core/ has NO output (fully replaced by Result[T])
- BaseFetcher._build_headers and BaseFetcher._check_blocked exist and are wired into fetch()
- spiderJobs/ and jobs_spider/ directories are UNCHANGED (no files modified)
</success_criteria>