win
|
42280f8bed
|
ip
|
2026-03-24 01:49:31 +08:00 |
|
win
|
20cb35fc6e
|
fix: remove print(data) from push_to_remote to keep logs from filling the disk
Each job carries tens of KB of base_data; printing it in full to stdout made
the Docker json-file log grow without bound (930GB+)
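A minimal sketch of the idea behind this fix, not the actual change: log a one-line summary of each payload instead of the payload itself. The names `summarize_payload`, `push_to_remote_sketch`, and the `job_id` field are assumptions for illustration; only `push_to_remote` and `base_data` come from the commit message.

```python
import json
import logging

logger = logging.getLogger("push")


def summarize_payload(data: dict) -> str:
    """Return a short one-line summary instead of the full payload."""
    raw = json.dumps(data, ensure_ascii=False)
    job_id = data.get("job_id", "?")  # hypothetical field name
    return f"job_id={job_id} bytes={len(raw.encode('utf-8'))}"


def push_to_remote_sketch(data: dict) -> None:
    # Hypothetical stand-in for push_to_remote: log the summary only,
    # never the tens of KB held in base_data.
    logger.info("push %s", summarize_payload(data))
    # The actual HTTP push is omitted in this sketch.
```

With this shape, the log line stays a few dozen bytes regardless of how large `base_data` is.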
|
2026-03-23 10:58:41 +08:00 |
|
win
|
78eee99c2f
|
All done
|
2026-03-22 23:22:30 +08:00 |
|
win
|
45fba5697e
|
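fix: scheduled-task jobs on the company channel now also go through push_mapper + enrichment
company_cleaning_job → sync_company_jobs → store_batch(channel="company")
Previously the job configs for channel="company" had no push_mapper, so:
- no push_data_list was generated → push_to_remote was never called
- company_desc enrichment was never triggered
Added the corresponding push_mapper to the channel="company" config of all three platforms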
|
2026-03-22 22:01:06 +08:00 |
|
win
|
24918a272b
|
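feat: crawler optimization (company_desc enrichment, Boss detail fetching, URL fix)
- Add company_enrichment.py: automatically fill in company_desc when a job is stored
(check MySQL first, fall back to calling the platform API to fetch and store it)
- Boss crawler: after the search list, call the batch detail endpoint item by item to get full data
(jobBaseInfoVO/brandComInfoVO), reporting each item as soon as it is fetched
- Boss push_mapper: support both the old and new API formats (flat / nested VO)
- Boss token: on startup, automatically read the mpt/wt2 stored in the database via the backend API
- Boss client: strip header values so stray whitespace does not cause request failures
- qcwy URL: build jobs.51job.com-format URLs from jobId/coId
- max_pages default changed to 100 for all three platforms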
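The "check MySQL first, fall back to the platform API" order described for company_desc enrichment could look roughly like the sketch below. It is an illustration under assumptions, not the contents of company_enrichment.py: the function name, the `company_id` key, and the three injected callables are all hypothetical.

```python
from typing import Callable, Optional


def enrich_company_desc(
    job: dict,
    lookup_db: Callable[[str], Optional[str]],   # hypothetical MySQL lookup
    fetch_api: Callable[[str], Optional[str]],   # hypothetical platform API call
    save_db: Callable[[str, str], None],         # hypothetical MySQL insert
) -> dict:
    """Return the job with company_desc filled in when one can be found."""
    if job.get("company_desc"):
        return job  # already present, nothing to do
    company_id = job["company_id"]  # hypothetical key name
    desc = lookup_db(company_id)
    if desc is None:
        # Not stored yet: fall back to the platform API and persist the result
        # so the next job from the same company is served from the database.
        desc = fetch_api(company_id)
        if desc:
            save_db(company_id, desc)
    if desc:
        job = {**job, "company_desc": desc}
    return job
```

Passing the lookups in as callables keeps the sketch testable without a database; a real implementation would presumably bind them to the MySQL session and the platform client.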
|
2026-03-22 21:54:19 +08:00 |
|
win
|
3d202c3486
|
feat(05): data pipeline optimization (DATA-01, DATA-04)
Plan 01 - DATA-01: 30-day window dedup fix:
- dedup.py: both single-field and double-field SQL queries now include
AND created_at > now() - INTERVAL 30 DAY
- tests/ingest/test_dedup.py: 6 mock tests validating 30-day window
Plan 02 - DATA-04: company vs search job channel separation:
- schemas/ingest.py: ChannelType.COMPANY = 'company'
- configs/boss.py: register channel='company' config
- configs/qcwy.py: register channel='company' config
- configs/zhilian.py: register channel='company' config
- company_jobs_sync.py: store_batch(..., 'mini', ...) → (..., 'company', ...)
DATA-02: confirmed already complete (job.py has /data/batch-async endpoint)
DATA-03: confirmed already complete (company_cleaner.py full pipeline)
Full regression: 112 passed (106 existing + 6 new)
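As a rough illustration of the 30-day window described under DATA-01, the helper below builds a dedup query for one or two fields and appends the `created_at` condition quoted in the commit message. The function name, the table name, and the column names are assumptions; dedup.py itself is not shown in this log.

```python
def build_dedup_query(table: str, fields: list[str], window_days: int = 30) -> str:
    """Build a parameterized existence check limited to a recent time window.

    `table` and `fields` are interpolated directly, so they must come from
    trusted configuration, never from user input. Field values are left as
    %s placeholders for the database driver to bind.
    """
    if not fields:
        raise ValueError("at least one dedup field is required")
    conditions = " AND ".join(f"{name} = %s" for name in fields)
    return (
        f"SELECT 1 FROM {table} WHERE {conditions} "
        f"AND created_at > now() - INTERVAL {int(window_days)} DAY LIMIT 1"
    )
```

Called with one field it yields the single-field query, and with two fields the double-field query; both carry the same window clause, which is the property the commit's mock tests are said to validate.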
|
2026-03-21 19:50:06 +08:00 |
|
win
|
2b94f15b56
|
fix(04): correct architecture — private files use crawler_core directly
Architecture clarification from user: spiderJobs/ is standalone execution,
NOT meant to be imported by app/. Correct dependency graph:
      crawler_core  ← shared base library
       ↑        ↑
spiderJobs    app/services/crawler/
(standalone)  (FastAPI backend, private layer)
Changes:
- boss.py/qcwy.py/zhilian.py: revert import back to private _boss_api etc.
- _boss/job51/zhilian_api.py: use crawler_core.base.Result/BaseFetcher/BaseSearcher
+ fix self._http → self.http_client
- _boss/job51/zhilian_client.py: use crawler_core.http_client.HTTPClient
+ _boss_client uses crawler_core.boss.sign.BossSign directly
- _boss/job51/zhilian_sign.py: backward-compat stubs → crawler_core.*.sign
Full regression: 106 passed in 0.68s
|
2026-03-21 19:39:30 +08:00 |
|
win
|
3aadbd128b
|
feat(04): migrate facade to spiderJobs.platforms.* + asyncio bridge; delete jobs_spider/
Plan 01 - facade migration (ARCH-06/07):
- boss.py: import from spiderJobs.platforms.boss.{api,client,sign}
- qcwy.py: import from spiderJobs.platforms.job51.{api,client}
- zhilian.py: import from spiderJobs.platforms.zhilian.{api,client,sign}
- All 3 Service classes: +4 async_* methods via asyncio.to_thread()
Plan 02 - deprecation + cleanup (ARCH-08):
- 11 private copy files (_base, _http_client, _boss/job51/zhilian *): DEPRECATED header
- jobs_spider/ directory: fully deleted (user request)
Full regression: 106 passed in 0.61s
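The asyncio bridge mentioned in Plan 01 can be sketched as follows: a blocking method stays as it is, and an `async_*` wrapper hands it to a worker thread via `asyncio.to_thread()` (Python 3.9+). `BossServiceSketch` and its `search` method are hypothetical stand-ins; the real Service classes and their four `async_*` methods are not shown in this log.

```python
import asyncio
import time


class BossServiceSketch:
    """Hypothetical stand-in for one of the three Service classes."""

    def search(self, keyword: str) -> list[str]:
        # Simulates a blocking HTTP call made by the synchronous crawler code.
        time.sleep(0.01)
        return [f"{keyword}-1", f"{keyword}-2"]

    async def async_search(self, keyword: str) -> list[str]:
        # Run the blocking method in a worker thread so the event loop
        # (for example a FastAPI request handler) is not blocked.
        return await asyncio.to_thread(self.search, keyword)


async def main() -> list[list[str]]:
    service = BossServiceSketch()
    # Both searches run concurrently in separate threads.
    return await asyncio.gather(
        service.async_search("python"),
        service.async_search("golang"),
    )


if __name__ == "__main__":
    print(asyncio.run(main()))
```

The wrapper adds no logic of its own, so the synchronous and asynchronous entry points cannot drift apart.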
|
2026-03-21 19:36:24 +08:00 |
|
|
|
59bfefff0e
|
feat: optimize company data dedup logic, widen the check window to 90 days
|
2026-01-14 22:14:33 +08:00 |
|