- ARCH-04: job51 migrated to crawler_core (no old deps) - ARCH-05: zhilian migrated to crawler_core (no old deps) - 34 new mock tests (17 job51 + 17 zhilian) - Added _parse_zhilian_response custom parser for zhilian API format - Fixed POST Searcher _request() overrides for job51/zhilian - Full regression: 98 passed in 0.12s
16 KiB
Codebase Concerns
Analysis Date: 2026-03-21
Security Considerations
CRITICAL: Hardcoded Credentials in Source Code
Area: Configuration
- Risk: Real database passwords, SMTP credentials, and IP addresses committed to version control.
- Files:
app/settings/config.py - Exposed values (lines 27–52):
- MySQL root password (
jobdata123) with public IP121.4.126.241:3306 - ClickHouse user/pass defaults (
data_user/data_pass) with same public IP - QQ SMTP app password (
jkncedjzkummbbdi) and recipient email addresses
- MySQL root password (
- Current mitigation:
pydantic-settingsallows environment variable override, but defaults are live credentials that will be exposed if the repo is shared or CI logs are leaked. - Recommendations: Remove all real default values; replace with
Noneor clearly fake placeholders; enforce required-field validation at startup; rotate exposed credentials immediately.
CRITICAL: Unauthenticated Data Ingest Endpoints
Area: API Authentication
- Risk: Any actor with network access can push arbitrary data directly into ClickHouse.
- Files:
app/api/v1/__init__.py(lines 32–34),app/api/v1/job/job.py - The
/ingest,/job, and/universalrouter groups are mounted without anyDependPermissionorDependAuthdependency. The data store, batch-store, and batch-store-async endpoints accept unauthenticated POST requests. - Trigger: Send a POST to
/api/v1/universal/data/batch-store-asyncwith any payload. No token required. - Workaround: The CLAUDE.md acknowledges these as "internal call" endpoints, but they are publicly reachable with no network-level isolation noted.
- Recommendations: Add at minimum IP allowlist middleware or a static shared secret header check for ingest routes; or mount them behind
DependAuth.
HIGH: Dev Token Bypass in Production Auth
Area: Authentication
- Risk: If
APP_ENV=development, any request withtoken: devheader is granted the first user's full privileges. - Files:
app/core/dependency.py(lines 28–30) - The check
os.getenv("APP_ENV", "production") == "development"defaults to"production", but there is no enforcement that this variable is set correctly in production deployments. A misconfigured environment would expose the bypass. - Recommendations: Remove the dev bypass entirely; use test fixtures instead; add a startup assertion that the bypass is disabled in non-dev environments.
HIGH: SQL Injection via f-string Queries
Area: ClickHouse Queries
- Risk: User-controlled values inserted directly into query strings allow injection.
- Files:
reclean_qcwy_jobs.py(line 34):f"SELECT json_data FROM job_data.qcwy_job WHERE job_id = '{job_id}' LIMIT 1"—job_idcomes from URL parsing of user-provided input.app/repositories/clickhouse_repo.py(lines 101–108):group_by_columnparameter is interpolated directly into the SELECT and GROUP BY clauses without validation.app/services/ingest/service.py(lines 109, 112):tablevalue is derived from config (lower risk) but still concatenated into f-strings.app/core/scheduler.py(lines 57, 63):tablenames from a fixed list, but pattern is risky.
- Current mitigation: ClickHouse parameterized queries are used in
dedup.py, but inconsistently applied elsewhere. - Recommendations: Use parameterized queries (
parameters={}) for all user-influenced values; validategroup_by_columnagainst an allowlist; never interpolate user input into SQL strings.
MEDIUM: CORS Misconfiguration
Area: CORS Policy
- Risk:
CORS_ALLOW_CREDENTIALS = FalsewhileCORS_ALLOW_HEADERS = ["*"]is set. If credentials are later enabled, wildcard headers become a security liability. - Files:
app/settings/config.py(lines 13–16) - Recommendations: Restrict
CORS_ORIGINSto known production domains; avoid wildcard headers.
MEDIUM: Hardcoded Public IP in Spider Default
Area: Spider Configuration
- Risk: Hardcoded IP
124.222.106.226:9999as the defaultAPI_BASE_URLin all spider scripts andspiderJobs/runner. If that IP changes hands or is exposed, all crawled data goes to an unintended recipient. - Files:
jobs_spider/boss/boos_api.py(line 17)jobs_spider/qcwy/qcwy.py(line 28)jobs_spider/zhilian/zhilian_single.py(line 41)spiderJobs/runner/api_client.py(line 31)
- Recommendations: Remove hardcoded IP defaults; require
API_BASE_URLto be explicitly set via environment variable with no default.
Tech Debt
Duplicate Spider Implementations
Area: Code Organization
- Issue: Two parallel spider hierarchies exist for the same three platforms.
jobs_spider/boss/,jobs_spider/qcwy/,jobs_spider/zhilian/— original scriptsspiderJobs/platforms/boss/,spiderJobs/platforms/job51/,spiderJobs/platforms/zhilian/— refactored versionapp/services/crawler/— yet another crawler abstraction inside the app
- Files:
jobs_spider/,spiderJobs/,app/services/crawler/ - Impact: Bug fixes and anti-detection improvements must be applied in multiple places; the relationship between these directories is not documented.
- Fix approach: Decide on one canonical implementation; deprecate and delete the others; update CLAUDE.md to clarify which is the active spider.
God File: jobs_spider/boss/boos_api.py
Area: File Size / Complexity
- Issue: Single file containing 2281 lines covering API client, IP strategy, anomaly detection, session management, main loop, and push logic.
- Files:
jobs_spider/boss/boos_api.py - Impact: Extremely difficult to test or modify any single responsibility; cognitive load is very high.
- Fix approach: Extract
SmartIPManager,IPAnomalyDetector,IPStrategyConfiginto separate modules; separate the main crawl loop from the API client class.
CleaningService and CompanyCleaner Code Duplication
Area: Service Layer
- Issue: Both classes replicate identical token-loading logic (
_ensure_boss_token_loaded), proxy-setting logic (_apply_proxy), and service instantiation (BossService, QcwyService, ZhilianService). - Files:
app/services/cleaning.py(lines 37–48, 31–34)app/services/company_cleaner.py(lines 60–74, 54–58)
- Impact: Token refresh logic is maintained in two places; a bug fix in one will not propagate to the other.
- Fix approach: Extract a shared
CrawlerContextorCrawlerMixinclass providing token management and proxy configuration.
ClickHouse Schema Changes Require Manual DDL
Area: Database Migrations
- Issue: ClickHouse table definitions in
app/core/clickhouse_init.pyuseCREATE TABLE IF NOT EXISTS, meaning schema changes (new columns, index changes) are never automatically applied to existing tables. There is no migration tooling equivalent to Aerich for ClickHouse. - Files:
app/core/clickhouse_init.py - Impact: If a column is added to the DDL, existing production tables silently retain the old schema; queries referencing new columns will fail at runtime.
- Fix approach: Implement
ALTER TABLE ... ADD COLUMN IF NOT EXISTSstatements for schema evolution; or adopt a ClickHouse migration tool (e.g.,ch-migrations).
Startup File Lock Not Robust to Crashes
Area: Multi-Worker Initialization
- Issue:
init_data()usesos.mkdir(".startup_lock")as a bare file lock with no TTL. If a worker crashes mid-initialization, the lock directory is never cleaned up, and subsequent restarts will skip all migrations and seed data. - Files:
app/core/init_app.py(lines 322–349) - Impact: Silent data-initialization failure on restart after crash; can result in missing admin user or broken menu state.
- Fix approach: Replace with the existing
DistributedLockclass fromapp/core/locks.pywhich has TTL support; or at minimum add a time-based stale-lock cleanup.
Performance Bottlenecks
ClickHouse Singleton Connection Per-Worker
Area: Database Connections
- Problem:
clickhouse_managerinapp/core/clickhouse.pyis a module-level singleton that lazily creates one connection. With 20 uvicorn workers (UVICORN_WORKERS=20), each worker maintains its own connection, and the connection pool inside each is limited tomaxsize=5. Under concurrent load, requests block waiting for the pool. - Files:
app/core/clickhouse.py(lines 31–58) - Cause: A new
PoolManageris created on everyget_clickhouse_client()call but the singleton pattern meansget_client()only calls this once per worker — pool configuration cannot be tuned per endpoint. - Improvement path: Use a connection pool library or connection factory pattern; expose pool size as a configurable setting; consider
clickhouse-poolfor proper async pooling.
Dedup Query Scans Entire Table
Area: ClickHouse Deduplication
- Problem:
batch_dedup_filterinapp/services/ingest/dedup.pyissuesSELECT key_col FROM table WHERE key_col IN (...)with no time-bound filter. For large tables this is a full-table scan on each ingest batch. - Files:
app/services/ingest/dedup.py(lines 51, 65) - Cause: No
PREWHERE created_at > now() - INTERVAL N DAYconstraint added. - Improvement path: Add a configurable recency window to dedup queries (e.g., 30-day lookback matching the existing
days_backlogic incompany_cleaner.py).
Synchronous Crawler Calls Blocking Async Workers
Area: Concurrency
- Problem: All crawler service methods (BossService, QcwyService, ZhilianService) use synchronous
requestsinternally, wrapped withasyncio.to_thread(). Under high concurrency, this can exhaust the thread pool. - Files:
app/services/cleaning.py(lines 186, 190, 194, etc.),app/services/company_cleaner.py(lines 318, 321, 325) - Cause: The underlying crawler libraries use
requests(sync), nothttpx(async). - Improvement path: Migrate crawlers to
httpx.AsyncClientfor true async I/O, or cap thread pool size viaasyncio.to_threadsemaphore.
Fragile Areas
group_by_column API Parameter Lacks Allowlist Validation
Area: Analytics API
- Files:
app/repositories/clickhouse_repo.py(lines 86–115),app/api/v1/analytics.py - Why fragile:
group_by_columnis passed directly into the ClickHouse query without validation against allowed column names. An invalid column name causes a runtime ClickHouse error that propagates as a 500 to the client; a malicious value enables injection. - Safe modification: Add a
ALLOWED_GROUP_COLUMNS = frozenset({"source", "city", "channel", ...})check before query construction. - Test coverage: No tests for analytics query building.
DistributedLock File Lock Has Race Window
Area: Distributed Locking
- Files:
app/core/locks.py(lines 64–92) - Why fragile: The stale-lock cleanup path (lines 81–87) involves
shutil.rmtreefollowed bymkdir— a TOCTOU race where two workers can both detect staleness, both delete the directory, and both attempt to re-create it. One wins, one raisesFileExistsErrorand falls through to returnFalse, but the winner's lock is not guaranteed to be the intended holder. - Safe modification: Use
os.renamefrom a temp directory as an atomic lock-acquire primitive. - Test coverage: None.
CleaningService Returns Mixed Types (bool | Dict)
Area: Type Consistency
- Files:
app/services/cleaning.py(lines 158–165, 167–202, 204–239, 241–268) - Why fragile: Multiple
clean_*methods returnUnion[bool, Dict[str, Any]]. Callers must defensively checkisinstance(result, bool)before accessing dict keys (seeprocess_single_itemat lines 128–135). This makes the API fragile — any new call site that forgets theboolcheck will raiseAttributeError. - Safe modification: Define a
CleanResultdataclass or TypedDict; always return it. - Test coverage: None.
company_cleaning_job Assumes 5-Minute Completion
Area: Scheduler
- Files:
app/core/scheduler.py(lines 171–202) - Why fragile: The job comment explicitly notes it processes 50 collect + 30 process operations and "should complete in 2–3 minutes" to fit within the 5-minute schedule window. If the external crawlers slow down (rate-limited, network issues), the job overruns its window, causing the next scheduled run to be blocked by the distributed lock, leading to queue backlog.
- Safe modification: Add timeout guards around
collect_pending_companiesandprocess_pending_companies; or dynamically reduce batch sizes based on elapsed time.
Missing Critical Features
No Rate Limiting on Any API Endpoint
Problem: FastAPI has no rate-limiting middleware configured. The data ingest endpoints (which have no auth) and the cleaning trigger endpoints can be called at arbitrary rates.
- Blocks: Protection against accidental or malicious flooding of ClickHouse with writes.
- Recommendation: Add
slowapior custom middleware with per-IP rate limiting on ingest and crawl-triggering endpoints.
No Input Size Limit on Batch Ingest
Problem: IngestBatchRequest.data_list has no max_items constraint. A single request with tens of thousands of job records will be processed synchronously (or queued as a background task) with no guard on memory consumption.
- Files:
app/schemas/ingest.py,app/api/v1/job/job.py - Recommendation: Add
Field(max_length=1000)or equivalent todata_list; add body size limit in uvicorn/nginx.
Test Coverage Gaps
Service Layer: Virtually Untested
- What's not tested:
CleaningService,CompanyCleaner,IngestService,AnalyticsService,DistributedLock,ClickHouseManager - Files:
app/services/cleaning.py,app/services/company_cleaner.py,app/services/ingest/service.py,app/services/analytics_service.py,app/core/locks.py,app/core/clickhouse.py - Risk: Core data routing, deduplication, and cleaning logic can regress silently. The existing tests only cover two pure-function helpers (
extract_company_fields,normalize_company_id,_extract_*_jobs). - Priority: High
API Layer: No Integration Tests
- What's not tested: All FastAPI route handlers, auth flow, permission enforcement, error responses.
- Files:
app/api/v1/(all modules) - Risk: Auth bypass regressions, schema changes breaking clients, and incorrect HTTP status codes go undetected.
- Priority: High
Spider Logic: No Unit Tests
- What's not tested:
IPAnomalyDetector.detect(),SmartIPManagerstate transitions, response parsing inBossZhipinAPI,QcwyApi,ZhilianApi. - Files:
jobs_spider/boss/boos_api.py,jobs_spider/qcwy/qcwy.py,jobs_spider/zhilian/zhilian_single.py - Risk: Anti-detection logic breaks silently when platforms change their response format.
- Priority: Medium
ClickHouse Schema: No Schema Evolution Tests
- What's not tested: Behavior of
clickhouse_init.pywhen tables already exist with old schemas; column migration paths. - Files:
app/core/clickhouse_init.py - Risk: Production schema drift goes unnoticed until a query fails at runtime.
- Priority: Medium
Known Issues / Observed Bugs
Silent Error Swallowing in Spider Code
- Symptoms: Data silently not collected; no error surfaced to caller.
- Files:
jobs_spider/qcwy/qcwy.py(lines 178–179, 191–192, 1047–1048, 1146–1147),jobs_spider/qcwy/search_company_jobs.py(lines 240–241, 512–513, 538–539, 601–602),jobs_spider/boss/boos_api.py(lines 400, 407–408, 508–509, 518–519, 1779–1780),jobs_spider/zhilian/company_spider.py(manyexcept: passblocks) - Trigger: Network errors, parsing errors, or unexpected response shapes in spider code are caught by bare
except Exception: passblocks. - Workaround: Check logs for INFO-level skips; monitor push success counts.
_send_email Uses Blocking SMTP in Async Context
- Symptoms: Scheduler event loop blocked for up to several seconds during email send.
- Files:
app/core/scheduler.py(lines 308–336) - Trigger: Any
stats_joborip_alert_jobexecution that sends email calls synchronoussmtplib.SMTPfrom the async event loop without wrapping inasyncio.to_thread. - Workaround: Email delivery may be delayed; event loop latency spikes during scheduled runs.
cleanup_old_records Deletes All Done/Failed Records Without Archive
- Symptoms: Historical processing data is permanently deleted daily; no audit trail of what was cleaned.
- Files:
app/services/company_cleaner.py(lines 329–331),app/core/scheduler.py(line 220) - Trigger:
daily_cleanup_jobcallsCompanyCleaningQueue.filter(status__in=["done", "failed"]).delete()— a hard delete with no archiving or retention window.
Concerns audit: 2026-03-21