win
|
78eee99c2f
|
全部完成
|
2026-03-22 23:22:30 +08:00 |
|
win
|
45fba5697e
|
fix: 定时任务 company channel 的 job 也走 push_mapper + enrichment
company_cleaning_job → sync_company_jobs → store_batch(channel="company")
之前 channel="company" 的 job 配置没有 push_mapper,导致:
- 不会生成 push_data_list → 不调 push_to_remote
- 不触发 company_desc 补全
三个平台 channel="company" 配置加上对应的 push_mapper
|
2026-03-22 22:01:06 +08:00 |
|
win
|
24918a272b
|
feat: 爬虫优化 — company_desc 补全、Boss详情获取、URL修复
- 新增 company_enrichment.py: job 入库时自动补全 company_desc
(优先查 MySQL,fallback 调平台 API 获取并入库)
- Boss 爬虫: 搜索列表后逐条调 batch 详情接口拿完整数据
(jobBaseInfoVO/brandComInfoVO),每条获取后立即上报
- Boss push_mapper: 兼容新旧两种 API 格式(扁平/嵌套VO)
- Boss token: 启动时自动从后端 API 读取数据库中的 mpt/wt2
- Boss client: header 值 strip 防止空格导致请求失败
- qcwy URL: 用 jobId/coId 拼接 jobs.51job.com 格式
- 三个平台 max_pages 默认改为 100
|
2026-03-22 21:54:19 +08:00 |
|
win
|
6c8eb00a50
|
feat(06): quality & frontend (QUAL-02, QUAL-06)
Plan 01 - QUAL-02: 三平台解析函数单元测试:
- tests/ingest/test_configs_boss.py: 10 个测试
(_extract_job_id, _extract_company_name, _build_boss_push)
- tests/ingest/test_configs_qcwy.py: 12 个测试
(_extract_job_id, _extract_update_dt, _extract_company_name, _build_qcwy_push)
- tests/ingest/test_configs_zhilian.py: 12 个测试
(_extract_number, _extract_fpt, _extract_company_name, _build_zhilian_push)
Plan 02 - QUAL-06: 爬虫入库统计 API + 前端监控区域:
- job.py: GET /job/data/stats 端点(总量/今日/最近入库时间/近7天趋势)
- web/src/api/index.js: getIngestStats() 方法
- monitoring.vue: 新增爬虫职位入库统计区域(三平台卡片 + 趋势表格)
- job.py: Optional 导入修复
QUAL-07: 确认 monitor.vue 已有完整清洗队列功能,无需改动
Full regression: 146 passed (112 existing + 34 new)
|
2026-03-21 22:56:24 +08:00 |
|
win
|
3d202c3486
|
feat(05): data pipeline optimization (DATA-01, DATA-04)
Plan 01 - DATA-01: 30-day window dedup fix:
- dedup.py: both single-field and double-field SQL queries now include
AND created_at > now() - INTERVAL 30 DAY
- tests/ingest/test_dedup.py: 6 mock tests validating 30-day window
Plan 02 - DATA-04: company vs search job channel separation:
- schemas/ingest.py: ChannelType.COMPANY = 'company'
- configs/boss.py: register channel='company' config
- configs/qcwy.py: register channel='company' config
- configs/zhilian.py: register channel='company' config
- company_jobs_sync.py: store_batch(..., 'mini', ...) → (..., 'company', ...)
DATA-02: confirmed already complete (job.py has /data/batch-async endpoint)
DATA-03: confirmed already complete (company_cleaner.py full pipeline)
Full regression: 112 passed (106 existing + 6 new)
|
2026-03-21 19:50:06 +08:00 |
|
win
|
2b94f15b56
|
fix(04): correct architecture — private files use crawler_core directly
Architecture clarification from user: spiderJobs/ is standalone execution,
NOT meant to be imported by app/. Correct dependency graph:
crawler_core ← shared base library
↑ ↑
spiderJobs app/services/crawler/
(standalone) (FastAPI backend, private layer)
Changes:
- boss.py/qcwy.py/zhilian.py: revert import back to private _boss_api etc.
- _boss/job51/zhilian_api.py: use crawler_core.base.Result/BaseFetcher/BaseSearcher
+ fix self._http → self.http_client
- _boss/job51/zhilian_client.py: use crawler_core.http_client.HTTPClient
+ _boss_client uses crawler_core.boss.sign.BossSign directly
- _boss/job51/zhilian_sign.py: backward-compat stubs → crawler_core.*.sign
Full regression: 106 passed in 0.68s
|
2026-03-21 19:39:30 +08:00 |
|
win
|
3aadbd128b
|
feat(04): migrate facade to spiderJobs.platforms.* + asyncio bridge; delete jobs_spider/
Plan 01 - facade migration (ARCH-06/07):
- boss.py: import from spiderJobs.platforms.boss.{api,client,sign}
- qcwy.py: import from spiderJobs.platforms.job51.{api,client}
- zhilian.py: import from spiderJobs.platforms.zhilian.{api,client,sign}
- All 3 Service classes: +4 async_* methods via asyncio.to_thread()
Plan 02 - deprecation + cleanup (ARCH-08):
- 11 private copy files (_base, _http_client, _boss/job51/zhilian *): DEPRECATED header
- jobs_spider/ directory: fully deleted (user request)
Full regression: 106 passed in 0.61s
|
2026-03-21 19:36:24 +08:00 |
|
zfc
|
3d7e96845d
|
up
|
2026-01-24 17:07:34 +08:00 |
|
|
|
59bfefff0e
|
feat: 优化公司数据去重逻辑,扩大检查范围到90天
|
2026-01-14 22:14:33 +08:00 |
|