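# Crawler Data Reporting API Documentation

> Applicable version: JobData v1.x
> Last updated: 2026-03-20
> Intended audience: engineers taking over crawler development or backend integration

---

## Table of Contents

1. [Overall Architecture](#1-overall-architecture)
2. [Authentication](#2-authentication)
3. [Core Reporting Endpoints](#3-core-reporting-endpoints)
   - 3.1 [Asynchronous Batch Reporting (Recommended)](#31-asynchronous-batch-reporting-recommended)
   - 3.2 [Synchronous Batch Reporting](#32-synchronous-batch-reporting)
   - 3.3 [Synchronous Single-Record Reporting](#33-synchronous-single-record-reporting)
   - 3.4 [Platform-Specific Convenience Endpoints](#34-platform-specific-convenience-endpoints)
4. [Platform Data Structures](#4-platform-data-structures)
   - 4.1 [BOSS Zhipin](#41-boss-zhipin-platformboss)
   - 4.2 [51job](#42-51job-platformqcwy)
   - 4.3 [Zhaopin](#43-zhaopin-platformzhilian)
5. [Deduplication Rules](#5-deduplication-rules)
6. [Crawler Usage Examples](#6-crawler-usage-examples)
7. [Auxiliary Endpoints](#7-auxiliary-endpoints)
8. [Data Storage](#8-data-storage)
9. [FAQ](#9-faq)

---
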
## 1. Overall Architecture

```
Crawlers (BOSS Zhipin / 51job / Zhaopin)
        │
        │  POST /api/v1/universal/data/batch-store-async
        ▼
FastAPI backend (app/)
        │
        ├── Deduplication check (ClickHouse, last 90 days)
        ├── Write to ClickHouse (job_data database)
        └── Forward to external data platform (qixin.com)
```

The crawlers for all three platforms **call the same set of endpoints**. The `platform` field identifies the source, and the `data_type` field identifies the record type (job or company).

---

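## 2. Authentication

The data reporting endpoints are **internal and require no authentication**.

By convention, crawlers send the following header with every request:

```
token: dev
```

> Note: `dev` is the development-mode token. The backend does not verify a signature and lets the request through. To enable authentication in a production deployment, switch to a JWT token (HS256, valid for 7 days).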

---

## 3. Core Reporting Endpoints

**Base URL** (local development): `http://localhost:8000`

The following two path prefixes are fully equivalent and behave identically:
- `/api/v1/universal`
- `/api/v1/job`

---

### 3.1 Asynchronous Batch Reporting (Recommended)

**All three platform crawlers use this endpoint.** It returns 202 immediately and writes the data asynchronously in the background.

```
POST /api/v1/universal/data/batch-store-async
```

**Request Headers**

```
Content-Type: application/json
token: dev
```

**Request Body**

```json
{
  "data_list": [
    { ...raw job or company JSON... }
  ],
  "data_type": "job",
  "platform": "boss",
  "check_duplicate": true
}
```

| Field | Type | Required | Description |
|------|------|:----:|------|
| `data_list` | `List[Dict]` | ✅ | List of raw records; see Section 4 for the structure |
| `data_type` | `string` | ✅ | `job` (job posting) or `company` (company) |
| `platform` | `string` | ✅ | `boss` / `qcwy` / `zhilian` |
| `check_duplicate` | `bool` | ❌ | Defaults to `true`; when `false`, deduplication is skipped and the data is written directly |

**Response (HTTP 202)**

```json
{
  "code": 202,
  "message": "批量数据已加入异步处理队列,共 10 条",
  "platform": "boss",
  "data_type": "job"
}
```

The backend returns `message` in Chinese; the sample above reads "Batch data has been added to the asynchronous processing queue, 10 records in total".
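
The request body described above can be assembled and checked on the crawler side before it is sent. The sketch below is a hypothetical helper (`build_batch_payload` is not part of the project) that enforces the allowed `platform` and `data_type` values from the field table; the backend may apply additional validation not covered here.

```python
from typing import Any, Dict, List

# Allowed values taken from the field table in this section.
VALID_PLATFORMS = {"boss", "qcwy", "zhilian"}
VALID_DATA_TYPES = {"job", "company"}


def build_batch_payload(
    data_list: List[Dict[str, Any]],
    data_type: str,
    platform: str,
    check_duplicate: bool = True,
) -> Dict[str, Any]:
    """Return a request body for the batch endpoints, rejecting bad input early."""
    if platform not in VALID_PLATFORMS:
        raise ValueError(f"unsupported platform: {platform!r}")
    if data_type not in VALID_DATA_TYPES:
        raise ValueError(f"unsupported data_type: {data_type!r}")
    if not all(isinstance(item, dict) for item in data_list):
        raise TypeError("every element of data_list must be a dict")
    return {
        "data_list": list(data_list),
        "data_type": data_type,
        "platform": platform,
        "check_duplicate": check_duplicate,
    }
```

Failing fast on a misspelled platform name keeps invalid batches out of the asynchronous queue, where an error would otherwise only surface in backend logs.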

---

### 3.2 Synchronous Batch Reporting

This endpoint waits until every record has been written before returning, so the response carries detailed success and failure counts.

```
POST /api/v1/universal/data/batch-store
```

**Request Body**: same as 3.1.

**Response (HTTP 200)**

```json
{
  "code": 200,
  "message": "批量处理完成: 成功 8 条,失败 0 条,重复 2 条",
  "data": {
    "total": 10,
    "success": 8,
    "failed": 0,
    "duplicate": 2,
    "errors": []
  },
  "platform": "boss",
  "data_type": "job"
}
```
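
A crawler can use the statistics in this response to decide whether a batch needs attention. The sketch below is a hypothetical helper that interprets the `data` object shown above. It assumes that `success + failed + duplicate` equals `total`, which holds for the sample response but is not stated as a guarantee, and it makes no assumption about the shape of the entries in `errors`.

```python
from typing import Any, Dict


def summarize_batch_response(body: Dict[str, Any]) -> Dict[str, Any]:
    """Condense a synchronous batch response into a few decision flags."""
    stats = body.get("data") or {}
    total = stats.get("total", 0)
    success = stats.get("success", 0)
    failed = stats.get("failed", 0)
    duplicate = stats.get("duplicate", 0)
    return {
        # True only when the call succeeded and no record failed.
        "ok": body.get("code") == 200 and failed == 0,
        # Sanity check under the assumption that the counters partition total.
        "accounted_for": success + failed + duplicate == total,
        "needs_retry": failed > 0,
        "errors": list(stats.get("errors") or []),
    }
```

Duplicates are deliberately not treated as failures, matching the behavior described in Section 5.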

---

### 3.3 Synchronous Single-Record Reporting

```
POST /api/v1/universal/data/store
```

**Request Body**

```json
{
  "data": { ...single raw JSON record... },
  "data_type": "job",
  "platform": "boss",
  "check_duplicate": true
}
```

Note: the field name is `data` (a single record), not `data_list`.

**Response (HTTP 200)**

```json
{
  "code": 200,
  "message": "JSON数据存储成功",
  "data": {
    "success": true,
    "message": "JSON数据存储成功",
    "duplicate": false,
    "table_name": "boss_job",
    "storage_type": "json"
  },
  "platform": "boss",
  "data_type": "job"
}
```

When the record is a duplicate, `duplicate` is `true` and `message` is `"数据重复,跳过插入"` ("Duplicate data, insert skipped"); the HTTP status is still 200.

---

### 3.4 Platform-Specific Convenience Endpoints

The request body is the raw JSON object itself (no `platform`/`data_type` wrapper is needed), equivalent to the `data` field in 3.3:

| URL | Platform | Type |
|-----|------|------|
| `POST /api/v1/job/boss/job` | BOSS Zhipin (BOSS直聘) | Job |
| `POST /api/v1/job/boss/company` | BOSS Zhipin (BOSS直聘) | Company |
| `POST /api/v1/job/qcwy/job` | 51job (前程无忧) | Job |
| `POST /api/v1/job/qcwy/company` | 51job (前程无忧) | Company |
| `POST /api/v1/job/zhilian/job` | Zhaopin (智联招聘) | Job |
| `POST /api/v1/job/zhilian/company` | Zhaopin (智联招聘) | Company |
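
All six URLs in the table follow the pattern `/api/v1/job/{platform}/{data_type}`, so a crawler can derive the endpoint instead of hard-coding it. The helper below is a hypothetical sketch; its name and the validation it performs are not part of the project.

```python
def convenience_url(base_url: str, platform: str, data_type: str) -> str:
    """Build the platform-specific endpoint URL listed in the table above."""
    if platform not in {"boss", "qcwy", "zhilian"}:
        raise ValueError(f"unsupported platform: {platform!r}")
    if data_type not in {"job", "company"}:
        raise ValueError(f"unsupported data_type: {data_type!r}")
    # Strip a trailing slash so the result never contains a double slash.
    return f"{base_url.rstrip('/')}/api/v1/job/{platform}/{data_type}"
```

With the local base URL from this document, `convenience_url("http://localhost:8000", "qcwy", "company")` gives the fourth row of the table.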

---

## 4. Platform Data Structures

> The structures below describe each element of `data_list`, that is, the raw API response body from each platform (passed through as-is, with no conversion required).

---

### 4.1 BOSS Zhipin (platform=boss)

#### Job (data_type=job)

Data source: BOSS WeChat mini-program endpoint `/wapi/zpgeek/miniapp/job/detail.json`

```json
{
  "jobBaseInfoVO": {
    "jobId": "123456",
    "encryptJobId": "abc123",
    "positionName": "Python Engineer",
    "locationName": "Shanghai",
    "locationDesc": "No. XX, XX Road, Pudong New Area, Shanghai",
    "jobDesc": "Responsible for data collection and processing...",
    "degreeName": "Bachelor's",
    "experienceName": "3-5 years",
    "lowSalary": 15,
    "highSalary": 25,
    "requiredSkills": ["Python", "Web scraping", "ClickHouse"],
    "salaryWelfareInfo": ["Social insurance and housing fund", "Flexible hours"]
  },
  "brandComInfoVO": {
    "encryptBrandId": "brand_abc",
    "brandName": "Example Technology Co., Ltd.",
    "industryName": "Internet",
    "scaleName": "100-499 employees",
    "introduce": "Company profile..."
  },
  "bossBaseInfoVO": {
    "brandName": "Zhang (HR)"
  }
}
```

**Key deduplication field**: `jobBaseInfoVO.jobId`

#### Company (data_type=company)

Data source: BOSS WeChat mini-program endpoint `/wapi/zpgeek/miniapp/brand/detail.json`

```json
{
  "name": "Example Technology Co., Ltd.",
  "companyFullInfoVO": {
    "name": "Example Technology Co., Ltd. (full name)"
  }
}
```

**Key deduplication field**: `name` or `companyFullInfoVO.name` (stored as `company_name`)

---

### 4.2 51job (platform=qcwy)

#### Job (data_type=job)

Data source: 51job mobile app API

```json
{
  "jobId": "JL123456789",
  "updateDateTime": "2026-03-20 10:00:00",
  "jobName": "Data Engineer",
  "companyName": "Example Company",
  "fullCompanyName": "Example Company Full Name Co., Ltd.",
  "coId": "CO123456",
  "jobDescribe": "Responsibilities: ...",
  "degreeString": "Bachelor's",
  "workYearString": "3-5 years",
  "jobSalaryMax": 20000,
  "jobSalaryMin": 15000,
  "provideSalaryString": "15k-20k",
  "termStr": "Full-time",
  "companySizeString": "500-999 employees",
  "companyTypeString": "Private enterprise",
  "major1Str": "Internet / E-commerce",
  "major2Str": "Data services",
  "jobWelfareCodeDataList": [
    { "chineseTitle": "五险一金", "typeTitle": "Social insurance", "code": "001" }
  ],
  "jobTagsForOrder": ["Python", "Spark", "Hive"],
  "location": "Shanghai",
  "workLocation": {
    "workAddress": "Pudong New Area",
    "address": "XX Road, Pudong New Area, Shanghai"
  },
  "jobAreaString": "Shanghai",
  "jobAreaLevelDetail": {
    "cityString": "Shanghai",
    "landMarkString": "Lujiazui"
  },
  "confirmDateString": "2026-03-20",
  "jobHref": "https://www.51job.com/...",
  "companyHref": "https://www.51job.com/..."
}
```

**Key deduplication fields**: `jobId` + `updateDateTime` (the two fields are jointly unique)

#### Company (data_type=company)

```json
{
  "companyName": "Example Company",
  "fullCompanyName": "Example Company Full Name Co., Ltd."
}
```

**Key deduplication field**: `companyName`

---

### 4.3 Zhaopin (platform=zhilian)

#### Job (data_type=job)

Data source: Zhaopin PC search endpoint `https://fe-api.zhaopin.com/c/i/search/positions`

```json
{
  "number": "ZL20260320001",
  "firstPublishTime": "2026-03-20T10:00:00",
  "name": "Backend Developer",
  "jobId": "J001",
  "companyName": "Example Company",
  "companyId": "C001",
  "salary60": "15k-25k",
  "jobSummary": "Job description: responsible for backend service development...",
  "education": "Bachelor's",
  "workingExp": "3-5 years",
  "workType": "Full-time",
  "workCity": "Shanghai",
  "cityDistrict": "Pudong New Area",
  "companySize": "500-999 employees",
  "propertyName": "Private enterprise",
  "industryName": "Internet",
  "skillLabel": [
    { "value": "Go" },
    { "value": "Python" }
  ],
  "recruitNumber": 3,
  "positionURL": "https://www.zhaopin.com/...",
  "companyUrl": "https://www.zhaopin.com/...",
  "companyDesc": "Company description (supplemented from an additional endpoint)"
}
```

**Key deduplication fields**: `number` + `firstPublishTime` (the two fields are jointly unique)

#### Company (data_type=company)

```json
{
  "companyName": "Example Company",
  "name": "Example Company"
}
```

**Key deduplication field**: `companyName` or `name`

---

## 5. Deduplication Rules

| Platform | Data type | Deduplication fields | ClickHouse table |
|------|----------|----------|---------------|
| boss | job | `jobBaseInfoVO.jobId` | `boss_job` |
| boss | company | `name` / `companyFullInfoVO.name` | `boss_company` |
| qcwy | job | `jobId` + `updateDateTime` | `qcwy_job` |
| qcwy | company | `companyName` | `qcwy_company` |
| zhilian | job | `number` + `firstPublishTime` | `zhilian_job` |
| zhilian | company | `companyName` / `name` | `zhilian_company` |

- Deduplication scope: records stored within the **last 90 days**.
- Duplicates do not raise an error; the endpoint returns 200 as usual with `duplicate: true`.
- Pass `check_duplicate: false` to skip deduplication and force the write (intended for testing).
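
The rules in the table can also be applied on the crawler side to drop repeats within a single batch before upload. The following is a minimal sketch, not the backend's implementation: the helper names are hypothetical, and for the rows with two alternative fields it assumes the first non-empty field in the listed order wins, which the table does not specify.

```python
from typing import Any, Dict, List, Optional, Tuple

# Each key component is a tuple of alternative field paths (first non-empty wins).
DEDUP_FIELDS = {
    ("boss", "job"): (("jobBaseInfoVO.jobId",),),
    ("boss", "company"): (("name", "companyFullInfoVO.name"),),
    ("qcwy", "job"): (("jobId",), ("updateDateTime",)),
    ("qcwy", "company"): (("companyName",),),
    ("zhilian", "job"): (("number",), ("firstPublishTime",)),
    ("zhilian", "company"): (("companyName", "name"),),
}


def _get(record: Dict[str, Any], path: str) -> Any:
    """Follow a dotted path through nested dicts; return None if missing."""
    current: Any = record
    for key in path.split("."):
        if not isinstance(current, dict):
            return None
        current = current.get(key)
    return current


def dedup_key(platform: str, data_type: str, record: Dict[str, Any]) -> Optional[Tuple[str, ...]]:
    """Return the deduplication key, or None if a required field is absent."""
    parts = []
    for alternatives in DEDUP_FIELDS[(platform, data_type)]:
        values = (_get(record, path) for path in alternatives)
        value = next((v for v in values if v not in (None, "")), None)
        if value is None:
            return None
        parts.append(str(value))
    return tuple(parts)


def drop_local_duplicates(platform: str, data_type: str, records: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep the first record per key; records without a key are kept as-is."""
    seen = set()
    unique = []
    for record in records:
        key = dedup_key(platform, data_type, record)
        if key is not None:
            if key in seen:
                continue
            seen.add(key)
        unique.append(record)
    return unique
```

This only removes repeats inside one batch; the 90-day check against stored data still happens on the backend.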

---

## 6. Crawler Usage Examples

### BOSS Zhipin (`jobs_spider/boss/boos_api.py`)

```python
import requests

API_BASE_URL = "http://localhost:8000"


def push_job(zp_data: dict):
    """Push a single job record."""
    payload = {
        "data_list": [zp_data],
        "data_type": "job",
        "platform": "boss"
    }
    resp = requests.post(
        f"{API_BASE_URL}/api/v1/universal/data/batch-store-async",
        headers={
            "accept": "application/json",
            "token": "dev",
            "Content-Type": "application/json"
        },
        json=payload,
        timeout=30
    )
    resp.raise_for_status()  # surface HTTP errors instead of parsing an error page
    return resp.json()


def push_company(zp_data: dict):
    """Push a single company record."""
    payload = {
        "data_list": [zp_data],
        "data_type": "company",
        "platform": "boss"
    }
    resp = requests.post(
        f"{API_BASE_URL}/api/v1/universal/data/batch-store-async",
        headers={
            "accept": "application/json",
            "token": "dev",
            "Content-Type": "application/json"
        },
        json=payload,
        timeout=30
    )
    resp.raise_for_status()  # surface HTTP errors instead of parsing an error page
    return resp.json()
```

---

### 51job (`jobs_spider/qcwy/qcwy.py`)

```python
import requests
import socket

API_BASE_URL = "http://localhost:8000"
local_ip = socket.gethostbyname(socket.gethostname())


def report_data(data: list, data_type: str = "job"):
    """Report a batch of records."""
    payload = {
        "data_list": data,
        "data_type": data_type,  # "job" or "company"
        "platform": "qcwy"
    }
    resp = requests.post(
        f"{API_BASE_URL}/api/v1/universal/data/batch-store-async",
        json=payload,
        headers={
            "accept": "application/json",
            "token": "dev",  # dev-mode token, see Section 2
            "Content-Type": "application/json",
            "X-Forwarded-For": local_ip  # pass the real IP for log tracing
        },
        timeout=300
    )
    resp.raise_for_status()  # surface HTTP errors instead of parsing an error page
    return resp.json()
```

---

### Zhaopin (`jobs_spider/zhilian/zhilian_single.py`)

```python
import requests

API_BASE_URL = "http://localhost:8000"


def report_data(data_list: list, data_type: str = "job"):
    """Report a batch of records."""
    payload = {
        "data_list": data_list,
        "data_type": data_type,  # "job" or "company"
        "platform": "zhilian"
    }
    resp = requests.post(
        f"{API_BASE_URL}/api/v1/universal/data/batch-store-async",
        json=payload,
        headers={
            "accept": "application/json",
            "token": "dev",  # dev-mode token, see Section 2
            "Content-Type": "application/json"
        },
        timeout=300
    )
    resp.raise_for_status()  # surface HTTP errors instead of parsing an error page
    return resp.json()
```
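
The `report_data` functions above send whatever list they are given in a single request. When a crawl produces a large list, splitting it into fixed-size chunks keeps each request small. The sketch below is a hypothetical helper; the chunk size of 100 is an arbitrary assumption, since this document does not state a batch-size limit.

```python
from typing import Any, Dict, Iterator, List


def chunked(records: List[Dict[str, Any]], size: int = 100) -> Iterator[List[Dict[str, Any]]]:
    """Yield consecutive slices of at most `size` records."""
    if size <= 0:
        raise ValueError("size must be a positive integer")
    for start in range(0, len(records), size):
        yield records[start:start + size]


# Usage sketch with the report_data function from the examples above:
# for batch in chunked(all_jobs, size=100):
#     report_data(batch, data_type="job")
```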

---

## 7. Auxiliary Endpoints

While running, the crawlers also call the following auxiliary endpoints:

| Endpoint | Description | Main consumer |
|------|------|-----------|
| `GET /api/v1/token/tokens?page=1&page_size=10` | List the available MPT tokens | BOSS crawler |
| `GET /api/v1/keyword/available?source=boss&limit=1&reserve=True` | Fetch the next unused keyword (city + job title combination) | BOSS crawler |
| `POST /api/v1/keyword/mark-used` | Mark keywords as used | BOSS crawler |
| `GET /api/v1/stats` | Query the number of stored records per platform | Monitoring / operations |
| `GET /api/v1/platforms` | List the supported platforms and their deduplication field configuration | Debugging |
| `GET /api/v1/universal/data?platform=boss&data_type=job&page=1&page_size=20` | Page through stored data | Debugging |

### Request Body for Marking Keywords as Used

```json
{
  "source": "boss",
  "ids": [1, 2, 3]
}
```
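
The two keyword endpoints are typically used as a pair: reserve a keyword, crawl it, then mark it as used. The sketch below only builds the request URL and body shown in this section; the helper names are hypothetical, and the response format of `/keyword/available` is not documented here, so parsing it is left out.

```python
from typing import Any, Dict, Iterable
from urllib.parse import urlencode


def available_keyword_url(base_url: str, source: str, limit: int = 1, reserve: bool = True) -> str:
    """Build the URL for fetching the next unused keyword(s)."""
    # The table above passes the flag as the literal string "True".
    query = urlencode({"source": source, "limit": limit, "reserve": str(reserve)})
    return f"{base_url.rstrip('/')}/api/v1/keyword/available?{query}"


def mark_used_body(source: str, ids: Iterable[int]) -> Dict[str, Any]:
    """Build the JSON body for POST /api/v1/keyword/mark-used."""
    return {"source": source, "ids": [int(i) for i in ids]}
```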

---

## 8. Data Storage

### ClickHouse Table Schema

All tables live in the `job_data` database and use ENGINE = `MergeTree()`.

**Common columns (present in every table):**

| Column | Type | Description |
|------|------|------|
| `id` | UInt64 | Auto-incrementing ID |
| `json_data` | String | Raw JSON string (stored in full) |
| `created_at` | DateTime | Time the record was stored |
| `updated_at` | DateTime | Time the record was last updated |

**Extra columns per table (used for deduplication queries):**

| Table | Extra columns |
|------|--------|
| `boss_job` | `job_id String` |
| `boss_company` | `company_name String` |
| `qcwy_job` | `job_id String`, `update_date_time String` |
| `qcwy_company` | `company_name String` |
| `zhilian_job` | `number String`, `first_publish_time String` |
| `zhilian_company` | `company_name String` |

### Unified Query View

The `job_analytics` view combines the three job tables with UNION ALL and provides a single query entry point:

| Column | Description |
|------|------|
| `source` | Source platform (boss/qcwy/zhilian) |
| `job_id` | Unique job identifier |
| `position_name` | Job title |
| `company_name` | Company name |
| `salary_text` | Salary description |
| `city` | City |
| `experience_required` | Required experience |
| `education` | Required education |
| `created_at` | Time the record was stored |
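
As an example of using the unified view, the sketch below builds a ClickHouse query that counts recently stored jobs per platform. It assumes the view is reachable as `job_data.job_analytics` (this document places the tables in `job_data` but does not say where the view lives), and the function name is hypothetical. Executing the query is left to whichever ClickHouse client the project uses.

```python
def jobs_per_source_sql(days: int = 7) -> str:
    """Return a query counting jobs per source stored in the last `days` days."""
    if days <= 0:
        raise ValueError("days must be a positive integer")
    return (
        "SELECT source, count() AS jobs "
        "FROM job_data.job_analytics "  # assumed location of the view
        f"WHERE created_at >= now() - INTERVAL {int(days)} DAY "
        "GROUP BY source "
        "ORDER BY jobs DESC"
    )
```

Because `days` is coerced to an integer before interpolation, the generated statement cannot carry arbitrary text from the caller.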
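
---

## 9. FAQ

**Q: The report call returned 202, but the data is not in the database. Why?**
A: Writes through the asynchronous endpoint are delayed (typically 1-5 seconds). Use the synchronous `batch-store` endpoint to confirm the write result immediately.

**Q: How can I tell whether a record already exists?**
A: Call the synchronous single-record endpoint; `duplicate: true` in the response means the record already exists.

**Q: Will `check_duplicate: false` lead to duplicate data?**
A: Yes. Use it only for testing and debugging, and keep the default `true` in production.

**Q: The three platforms have very different data structures. How can they be analyzed together?**
A: Use the `job_analytics` view, which maps the fields of the three job tables to a unified set of column names.

**Q: What should I do when the crawler reports a timeout error?**
A: A timeout of 30 s is recommended for the asynchronous endpoint and 300 s for the synchronous endpoints, which wait for the write to complete. If requests still time out, check the ClickHouse connection status.

**Q: Is `token: dev` safe in production?**
A: No. In production, replace it with a JWT token and attach authentication middleware to the endpoints.

---

*This document was generated automatically by the JobData project. Contact the project maintainers with any questions.*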