41 Commits

Author SHA1 Message Date
win
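b0a755a443 fix: tune MySQL connection pool config to keep scheduled jobs from losing their connection
- config.py: switch from a DSN string to a dict config; add minsize/maxsize/connect_timeout
- scheduler.py: company_cleaning_job runs SELECT 1 before executing to check that the connection is usable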
2026-03-23 10:55:13 +08:00
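The dict-style pool config and the SELECT 1 pre-check from b0a755a443 might look roughly like the sketch below. The commit does not show the actual code: every key other than minsize/maxsize/connect_timeout, the default values, and both helper names (`dsn_to_pool_config`, `connection_alive`) are assumptions for illustration. sqlite3 stands in for MySQL only so the check can be run locally.

```python
import sqlite3
from urllib.parse import urlparse


def dsn_to_pool_config(dsn: str, minsize: int = 1, maxsize: int = 10,
                       connect_timeout: int = 10) -> dict:
    """Turn a mysql://user:pass@host:port/db DSN into a keyword dict.

    Hypothetical helper: the key names and defaults are assumptions.
    """
    parts = urlparse(dsn)
    return {
        "host": parts.hostname,
        "port": parts.port or 3306,
        "user": parts.username,
        "password": parts.password,
        "db": parts.path.lstrip("/"),
        "minsize": minsize,
        "maxsize": maxsize,
        "connect_timeout": connect_timeout,
    }


def connection_alive(conn) -> bool:
    """Run SELECT 1 on a DB-API connection; return False if anything raises."""
    try:
        cur = conn.cursor()
        cur.execute("SELECT 1")
        row = cur.fetchone()
        cur.close()
        return row is not None and row[0] == 1
    except Exception:
        return False


if __name__ == "__main__":
    print(dsn_to_pool_config("mysql://user:secret@127.0.0.1:3306/jobs"))
    print(connection_alive(sqlite3.connect(":memory:")))
```

A scheduled job could call such a check first and reconnect or skip the run when it returns False.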
win
78eee99c2f All done 2026-03-22 23:22:30 +08:00
win
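45fba5697e fix: scheduled-task jobs on the company channel now also go through push_mapper + enrichment
company_cleaning_job → sync_company_jobs → store_batch(channel="company")
Previously the channel="company" job config had no push_mapper, so:
- push_data_list was never generated → push_to_remote was never called
- company_desc completion was never triggered

Added the corresponding push_mapper to the channel="company" config of all three platforms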
2026-03-22 22:01:06 +08:00
win
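24918a272b feat: crawler improvements (company_desc completion, Boss detail fetching, URL fix)
- Add company_enrichment.py: automatically fill in company_desc when a job is stored
  (look up MySQL first; fall back to the platform API, then store the result)
- Boss crawler: after the search list, call the batch detail endpoint per item for complete data
  (jobBaseInfoVO/brandComInfoVO); report each item immediately after fetching it
- Boss push_mapper: support both the old and new API formats (flat / nested VO)
- Boss token: on startup, automatically read mpt/wt2 from the database via the backend API
- Boss client: strip header values so stray whitespace does not break requests
- qcwy URL: build jobs.51job.com-style URLs from jobId/coId
- Default max_pages changed to 100 for all three platforms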
2026-03-22 21:54:19 +08:00
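The "database first, platform API as fallback, then store" flow described for company_enrichment.py can be sketched as follows. This is a minimal illustration under assumptions, not the commit's code: the function name, the `company_id` key, and the three injected callables are all hypothetical.

```python
from typing import Callable, Optional


def enrich_company_desc(
    job: dict,
    db_lookup: Callable[[str], Optional[str]],
    api_fetch: Callable[[str], Optional[str]],
    db_store: Callable[[str, str], None],
) -> dict:
    """Fill job['company_desc'] from the database, else from the platform API.

    All parameter names and the 'company_id' key are assumptions.
    """
    if job.get("company_desc"):
        return job  # already populated, nothing to do

    company_id = job["company_id"]
    desc = db_lookup(company_id)
    if desc is None:
        # Fallback: ask the platform, and persist the answer for next time.
        desc = api_fetch(company_id)
        if desc is not None:
            db_store(company_id, desc)

    if desc is not None:
        job["company_desc"] = desc
    return job
```

Passing the lookups in as callables keeps the flow testable without a live MySQL instance or platform API.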
win
a177516c23 docs(phase-6): complete — QUAL-02 QUAL-06 QUAL-07 done, 146/146 tests 2026-03-21 22:56:38 +08:00
win
6c8eb00a50 feat(06): quality & frontend (QUAL-02, QUAL-06)
Plan 01 - QUAL-02: unit tests for the three platforms' parsing functions:
- tests/ingest/test_configs_boss.py: 10 tests
  (_extract_job_id, _extract_company_name, _build_boss_push)
- tests/ingest/test_configs_qcwy.py: 12 tests
  (_extract_job_id, _extract_update_dt, _extract_company_name, _build_qcwy_push)
- tests/ingest/test_configs_zhilian.py: 12 tests
  (_extract_number, _extract_fpt, _extract_company_name, _build_zhilian_push)

Plan 02 - QUAL-06: crawler ingest statistics API + frontend monitoring section:
- job.py: GET /job/data/stats endpoint (total / today / latest ingest time / 7-day trend)
- web/src/api/index.js: getIngestStats() method
- monitoring.vue: add a crawler job ingest statistics section (three platform cards + trend table)
- job.py: fix Optional import

QUAL-07: confirmed monitor.vue already has full cleaning-queue functionality; no changes needed

Full regression: 146 passed (112 existing + 34 new)
2026-03-21 22:56:24 +08:00
win
c58c7ee5c2 docs(phase-6): add research and 2 plans for quality and frontend 2026-03-21 21:27:11 +08:00
win
6f9d4df3e2 docs(06): UI design contract 2026-03-21 21:25:01 +08:00
win
9756df6044 docs(phase-5): complete execution — 2/2 plans done
Summary files and roadmap/STATE updated.
Phase 5 complete: DATA-01 (30-day dedup window) + DATA-04 (company channel)
DATA-02 + DATA-03 confirmed already complete.
2026-03-21 19:50:41 +08:00
win
3d202c3486 feat(05): data pipeline optimization (DATA-01, DATA-04)
Plan 01 - DATA-01: 30-day window dedup fix:
- dedup.py: both single-field and double-field SQL queries now include
  AND created_at > now() - INTERVAL 30 DAY
- tests/ingest/test_dedup.py: 6 mock tests validating 30-day window

Plan 02 - DATA-04: company vs search job channel separation:
- schemas/ingest.py: ChannelType.COMPANY = 'company'
- configs/boss.py: register channel='company' config
- configs/qcwy.py: register channel='company' config
- configs/zhilian.py: register channel='company' config
- company_jobs_sync.py: store_batch(..., 'mini', ...) → (..., 'company', ...)

DATA-02: confirmed already complete (job.py has /data/batch-async endpoint)
DATA-03: confirmed already complete (company_cleaner.py full pipeline)

Full regression: 112 passed (106 existing + 6 new)
2026-03-21 19:50:06 +08:00
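The 30-day window added to both dedup queries in 3d202c3486 can be illustrated with a small query builder. Only the `created_at > now() - INTERVAL 30 DAY` clause comes from the commit message; the table name, field names, `%s` placeholder style, and the `build_dedup_query` helper are assumptions.

```python
from typing import Sequence

# Clause quoted in the commit message (MySQL syntax).
WINDOW_CLAUSE = "created_at > now() - INTERVAL 30 DAY"


def build_dedup_query(table: str, fields: Sequence[str]) -> str:
    """Build an existence check over one or two key fields within the window.

    Hypothetical helper. `table` and `fields` are interpolated directly, so
    they must come from trusted configuration, never from user input.
    """
    if not fields:
        raise ValueError("at least one dedup field is required")
    conditions = [f"{name} = %s" for name in fields]
    conditions.append(WINDOW_CLAUSE)
    return f"SELECT 1 FROM {table} WHERE " + " AND ".join(conditions) + " LIMIT 1"


if __name__ == "__main__":
    print(build_dedup_query("job_raw", ["job_id"]))
    print(build_dedup_query("job_raw", ["job_id", "platform"]))
```

Values are still bound through placeholders at execution time, so a row older than 30 days with the same key no longer counts as a duplicate.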
win
9ef31cc87e docs(phase-5): add research and 2 plans for data pipeline optimization 2026-03-21 19:44:56 +08:00
win
6a2f0bfb58 docs(phase-4): complete execution — 2/2 plans, architecture corrected
- Plan 01: facade uses private _boss/job51/zhilian_api/client files
  Private files now depend on crawler_core directly (not spiderJobs)
  Added asyncio.to_thread async_* methods for ARCH-06
- Plan 02: 11 private files marked DEPRECATED, jobs_spider/ deleted
- Architecture: facade→private→crawler_core; spiderJobs→crawler_core (independent)
- Full regression: 106 passed
2026-03-21 19:40:03 +08:00
win
2b94f15b56 fix(04): correct architecture — private files use crawler_core directly
Architecture clarification from user: spiderJobs/ is standalone execution,
NOT meant to be imported by app/. Correct dependency graph:

  crawler_core   ← shared base library
    ↑        ↑
spiderJobs  app/services/crawler/
(standalone) (FastAPI backend, private layer)

Changes:
- boss.py/qcwy.py/zhilian.py: revert import back to private _boss_api etc.
- _boss/job51/zhilian_api.py: use crawler_core.base.Result/BaseFetcher/BaseSearcher
  + fix self._http → self.http_client
- _boss/job51/zhilian_client.py: use crawler_core.http_client.HTTPClient
  + _boss_client uses crawler_core.boss.sign.BossSign directly
- _boss/job51/zhilian_sign.py: backward-compat stubs → crawler_core.*.sign

Full regression: 106 passed in 0.68s
2026-03-21 19:39:30 +08:00
win
3aadbd128b feat(04): migrate facade to spiderJobs.platforms.* + asyncio bridge; delete jobs_spider/
Plan 01 - facade migration (ARCH-06/07):
- boss.py: import from spiderJobs.platforms.boss.{api,client,sign}
- qcwy.py: import from spiderJobs.platforms.job51.{api,client}
- zhilian.py: import from spiderJobs.platforms.zhilian.{api,client,sign}
- All 3 Service classes: +4 async_* methods via asyncio.to_thread()

Plan 02 - deprecation + cleanup (ARCH-08):
- 11 private copy files (_base, _http_client, _boss/job51/zhilian *): DEPRECATED header
- jobs_spider/ directory: fully deleted (user request)

Full regression: 106 passed in 0.61s
2026-03-21 19:36:24 +08:00
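The async_* bridge mentioned in 3aadbd128b wraps a blocking method with asyncio.to_thread() so it can be awaited from async request handlers. A minimal sketch, assuming a hypothetical `ExampleService` with one blocking `search` method (the real Service classes and their method names are not shown in the log):

```python
import asyncio
import threading


class ExampleService:
    """Hypothetical facade with a blocking method and its async twin."""

    def search(self, keyword: str) -> dict:
        # Stand-in for a blocking HTTP call made by the crawler.
        return {"keyword": keyword, "thread": threading.current_thread().name}

    async def async_search(self, keyword: str) -> dict:
        # Run the blocking call in a worker thread so the event loop stays free.
        return await asyncio.to_thread(self.search, keyword)


if __name__ == "__main__":
    print(asyncio.run(ExampleService().async_search("python")))
```

asyncio.to_thread() requires Python 3.9 or later; the blocking code itself is unchanged, which keeps the sync and async entry points in step.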
win
2e11edcef8 docs(phase-4): add research and 2 plans for backend facade migration 2026-03-21 19:26:01 +08:00
win
00a727519f docs(phase-3): complete execution — 2/2 plans, 98 tests passing
- ARCH-04: job51 migrated to crawler_core (no old deps)
- ARCH-05: zhilian migrated to crawler_core (no old deps)
- 34 new mock tests (17 job51 + 17 zhilian)
- Added _parse_zhilian_response custom parser for zhilian API format
- Fixed POST Searcher _request() overrides for job51/zhilian
- Full regression: 98 passed in 0.12s
2026-03-21 19:19:17 +08:00
win
8c2c2d29d7 feat(03): migrate job51+zhilian to crawler_core (ARCH-04/05)
job51 (spiderJobs/platforms/job51/):
- client.py: HTTPClient+Job51Sign from crawler_core
- api.py: ApiResult→Result, self._http→self.http_client, _request() POST overrides
- main.py: BaseFetcher/BaseSearcher from crawler_core
- sign.py: backward-compatible stub re-exporting crawler_core.qcwy.sign.Job51Sign

zhilian (spiderJobs/platforms/zhilian/):
- client.py: HTTPClient+ZhilianSign from crawler_core
- api.py: add _parse_zhilian_response (HTTP 200=success), add _parse()/_request()
  to all classes (GET fetchers + POST searcher overrides)
- main.py: BaseFetcher/BaseSearcher from crawler_core
- sign.py: backward-compatible stub re-exporting crawler_core.zhilian.sign.ZhilianSign

tests: 34 new mock tests (17 job51 + 17 zhilian)
Full regression: 98 passed (job51:17 + zhilian:17 + boss:22 + crawler_core:41 + 1)
2026-03-21 19:18:22 +08:00
win
024c2bcd49 docs(phase-3): add research and 2 plans for job51+zhilian migration 2026-03-21 19:10:59 +08:00
win
f6913ffdde docs(phase-2): complete phase execution — 2/2 plans done, verification passed
- ARCH-03: Boss crawler migrated to crawler_core (no inline signatures or HTTP boilerplate)
- QUAL-03: 22 mock tests pass covering all Boss API classes
- Anti-crawl mechanisms preserved (TLS fingerprint, proxy rotation, 10s delay)
- Phase 1 regression: 41 tests still passing
2026-03-21 19:04:55 +08:00
win
5bd44774b9 test(02-02): add Boss HTTP layer mock tests (QUAL-03)
- 22 tests covering SearchRecJobs, GetBrandDetail, SearchBrandJobs, GetJobDetail, BossClient
- Uses MagicMock (requests_go not compatible with respx)
- Covers success responses, HTTP errors, biz errors, Traceid injection
- All 22 tests pass (0.08s)
2026-03-21 19:03:01 +08:00
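Mocking the HTTP layer with MagicMock, as 5bd44774b9 describes, might look like the sketch below. The fetcher class, endpoint path, and response shape (`code`, `message`, `data`) are assumptions for illustration; they are not the Boss API classes named in the commit.

```python
from unittest.mock import MagicMock


class JobDetailFetcher:
    """Hypothetical fetcher that delegates HTTP to an injected client."""

    def __init__(self, http_client):
        self.http_client = http_client

    def fetch(self, job_id: str) -> dict:
        response = self.http_client.get("/job/detail", params={"id": job_id})
        if response.status_code != 200:
            return {"success": False, "error": f"HTTP {response.status_code}"}
        body = response.json()
        if body.get("code") != 0:
            # Business-level error reported inside a 200 response.
            return {"success": False, "error": body.get("message", "biz error")}
        return {"success": True, "data": body.get("data")}


def make_client(status_code: int, body: dict) -> MagicMock:
    """Build a fake client whose get() returns a canned response."""
    client = MagicMock()
    client.get.return_value.status_code = status_code
    client.get.return_value.json.return_value = body
    return client


if __name__ == "__main__":
    ok = JobDetailFetcher(make_client(200, {"code": 0, "data": {"id": "1"}}))
    print(ok.fetch("1"))
```

Because the client is injected, the same fetcher can be exercised for success, HTTP errors, and business errors without any network access.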
win
919ed9f799 docs(02-01): add plan execution summary 2026-03-21 19:00:58 +08:00
win
46883cef8a feat(02-01): migrate Boss spider layer from spiderJobs.core to crawler_core
- client.py: inherit crawler_core.http_client.HTTPClient, use crawler_core.boss.sign.BossSign
- api.py: use crawler_core.base.Result/BaseFetcher/BaseSearcher, fix self._http -> self.http_client
- main.py: import BaseFetcher/BaseSearcher and BossSign from crawler_core
- sign.py: replace with backward-compat stub re-exporting BossSign from crawler_core
Satisfies ARCH-03
2026-03-21 19:00:30 +08:00
win
b20f77fa19 docs(phase-2): add research and 2 plans for Boss crawler migration 2026-03-21 18:48:58 +08:00
win
76085ac403 docs(01-02): complete sign algorithms plan — SUMMARY, STATE, ROADMAP updates
- Create 01-02-SUMMARY.md with implementation details, test counts, deviation docs
- STATE.md: mark plan 02 complete, add decisions from plan 02, update position
- ROADMAP.md: mark phase 1 plans 2/2 complete, update progress table
- REQUIREMENTS.md: mark QUAL-01 complete (41 sign algorithm unit tests)
2026-03-21 18:27:35 +08:00
win
333a6d155e test(01-02): write sign algorithm unit tests for crawler_core
- Add tests/crawler_core/test_boss_sign.py: 13 tests for BossSign, _compute_checksum, _generate_uuid
- Add tests/crawler_core/test_qcwy_sign.py: 10 tests for Job51Sign and SIGN_KEY
- Add tests/crawler_core/test_zhilian_sign.py: 13 tests for ZhilianSign
- Add conftest.py at project root to add project root to sys.path
- Update pyproject.toml with [tool.pytest.ini_options] pythonpath config
- Fix crawler_core/__init__.py: wrap heavy-dep imports in try/except so sign subpackages are importable in lightweight envs without requests_go installed
- Remove tests/crawler_core/__init__.py to prevent namespace shadowing of crawler_core package
2026-03-21 18:20:43 +08:00
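The try/except import guard added to crawler_core/__init__.py in 333a6d155e follows a common pattern for optional heavy dependencies. The sketch below shows the general idea with a hypothetical `optional_import` helper; it is not the package's actual code.

```python
import importlib
from types import ModuleType
from typing import Optional


def optional_import(name: str) -> Optional[ModuleType]:
    """Import a module if it is installed, otherwise return None.

    Hypothetical helper: lets lightweight code paths (such as pure-stdlib
    sign algorithms) load even when a heavy dependency is missing.
    """
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


# The HTTP layer would only be usable when the dependency is present.
requests_go = optional_import("requests_go")
HTTP_AVAILABLE = requests_go is not None
```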
win
d7c8bec287 docs(01-01): complete crawler_core package plan — SUMMARY, STATE, ROADMAP updates
- Create 01-01-SUMMARY.md with implementation details and interface contracts
- STATE.md: advance to plan 2, record metrics, add decisions from plan 01
- ROADMAP.md: update phase 1 plan progress (1/2 plans complete)
- REQUIREMENTS.md: mark ARCH-01, ARCH-02, QUAL-04, QUAL-05 complete
- crawler_core/__init__.py: preserve linter-added try/except ImportError guard
2026-03-21 18:14:19 +08:00
win
04d6303da2 feat(01-01): create crawler_core/base.py with Result[T] and crawler_core/__init__.py
- Define generic Result[T] dataclass (7 fields: success, status_code, data, list, count, is_end_page, error)
- Port parse_response() from spiderJobs/core/base.py returning Result[Any]
- BaseFetcher: 4 template methods (_build_params, _parse required; _build_headers, _check_blocked optional)
- BaseSearcher: 4 template methods with load_all() paginator using stdlib logging
- crawler_core/__init__.py exports BaseFetcher, BaseSearcher, Result, HTTPClient, parse_response
- No ApiResult, no loguru, no spiderJobs/app imports
2026-03-21 18:10:40 +08:00
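A generic Result[T] dataclass with the seven fields listed in 04d6303da2 could be declared as in the sketch below. The field names come from the commit message; the types and default values are assumptions.

```python
from dataclasses import dataclass, field
from typing import Any, Generic, List, Optional, TypeVar

T = TypeVar("T")


@dataclass
class Result(Generic[T]):
    """Uniform return value for fetchers and searchers (types are assumed)."""

    success: bool = False
    status_code: int = 0
    data: Optional[T] = None
    # typing.List is used in the annotation because the field is itself
    # named `list`, which shadows the builtin inside the class body.
    list: List[Any] = field(default_factory=list)
    count: int = 0
    is_end_page: bool = False
    error: Optional[str] = None


if __name__ == "__main__":
    page: Result[dict] = Result(success=True, status_code=200, data={"id": 1})
    print(page)
```

A mutable default such as the `list` field needs `default_factory` so that instances do not share one list object.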
win
ceb359d535 feat(01-01): create crawler_core/http_client.py with tenacity retry and stdlib logging
- Port HTTPClient from spiderJobs/core/http_client.py
- Add tenacity @retry decorator on post() and get() (3 attempts, min=10s wait)
- Use stdlib logging.getLogger('crawler_core.http_client') — no loguru
- No imports from spiderJobs.* or app.*
- TLS fingerprint and proxy logic preserved unchanged
2026-03-21 18:08:59 +08:00
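The retry policy from ceb359d535 (3 attempts, at least 10 s between them) is implemented with tenacity in the commit. The sketch below is a standard-library stand-in that shows the same behavior, not tenacity itself; the decorator name and its parameters are assumptions.

```python
import functools
import time
from typing import Callable


def retry(attempts: int = 3, wait_seconds: float = 10.0,
          sleep: Callable[[float], None] = time.sleep):
    """Retry a function on any exception, waiting between attempts.

    Stand-in for a tenacity @retry decorator. `sleep` is injectable so
    tests do not have to wait for real time to pass.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == attempts:
                        raise  # out of attempts: surface the last error
                    sleep(wait_seconds)
        return wrapper
    return decorator
```

With the defaults, a call that fails twice and then succeeds sleeps twice for 10 seconds each; a call that fails three times re-raises the third exception.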
win
bd1e50e410 feat(01-02): port sign algorithms to crawler_core/ platform directories
- Add crawler_core/boss/sign.py: BossSign traceid generator (pure stdlib)
- Add crawler_core/qcwy/sign.py: Job51Sign HMAC-SHA256 signing (pure stdlib)
- Add crawler_core/zhilian/sign.py: ZhilianSign header/param signing (pure stdlib)
- Add __init__.py for all three crawler_core platform directories
- Updated module docstrings to reference crawler_core; all logic unchanged
- No imports from spiderJobs or app; no HTTP dependencies
2026-03-21 18:08:53 +08:00
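HMAC-SHA256 signing with only the standard library, as bd1e50e410 describes for Job51Sign, generally takes the shape below. This is a generic sketch: the key, the payload layout, and both function names are hypothetical and do not reproduce the platform's actual signing scheme.

```python
import hashlib
import hmac


def sign_payload(secret_key: str, payload: str) -> str:
    """Return the hex HMAC-SHA256 digest of `payload` under `secret_key`."""
    return hmac.new(
        secret_key.encode("utf-8"),
        payload.encode("utf-8"),
        hashlib.sha256,
    ).hexdigest()


def verify_signature(secret_key: str, payload: str, signature: str) -> bool:
    """Constant-time comparison of an expected and a received signature."""
    return hmac.compare_digest(sign_payload(secret_key, payload), signature)


if __name__ == "__main__":
    sig = sign_payload("example-key", "/search?keyword=python&page=1")
    print(sig, verify_signature("example-key", "/search?keyword=python&page=1", sig))
```

Because the function is pure, it can be unit tested without HTTP or mocks, which matches the "pure functions" testing approach planned for the sign algorithms.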
win
4932177f7c feat(01-01): create crawler_core package scaffold and pyproject.toml
- Create crawler_core/pyproject.toml with setuptools build config
- Add platform namespace __init__.py files for boss, qcwy, zhilian
- Add requests_go==1.0.9 and tenacity>=8.0 to Pipfile [packages]
- Add pytest, pytest-cov, pytest-anyio to Pipfile [dev-packages]
2026-03-21 18:07:54 +08:00
win
fe9a6d1403 docs(phase-1): create plans (2 plans, 2 waves) with checker revision 2026-03-21 17:53:13 +08:00
win
b27686a409 docs(01-shared-core): create phase 1 plans for crawler_core shared package
Plan 01-01 (Wave 1): Package scaffold with HTTPClient + tenacity retry (min=10s)
+ stdlib logging + BaseFetcher/BaseSearcher base classes + pyproject.toml.
Covers ARCH-01, ARCH-02, QUAL-04, QUAL-05.

Plan 01-02 (Wave 2): Sign algorithm migration (Boss/Job51/Zhilian) to
crawler_core/ + comprehensive unit tests — no HTTP, no mocks, pure functions.
Covers QUAL-01. 24+ test cases across 3 test files.

ROADMAP updated: Phase 1 now shows 2 concrete plans instead of TBD.
2026-03-21 17:45:14 +08:00
win
81b9305568 docs: gather Phase 1 context (shared core package) 2026-03-21 17:08:26 +08:00
win
44b5f390aa docs: create roadmap (6 phases) 2026-03-21 17:00:12 +08:00
win
5e9102148a docs: define v1 requirements 2026-03-21 16:39:05 +08:00
win
f3005ef525 docs: add research findings (stack, features, architecture, pitfalls, summary) 2026-03-21 16:36:37 +08:00
win
030da1ce53 chore: add project config 2026-03-21 16:19:53 +08:00
win
9166c4f7bc docs: initialize project 2026-03-21 16:17:12 +08:00
zfc
3d7e96845d up 2026-01-24 17:07:34 +08:00
duxin
7285475eb5 add time.sleep > 10 2026-01-20 15:42:47 +08:00
59bfefff0e feat: improve company data dedup logic; widen the check window to 90 days 2026-01-14 22:14:33 +08:00