win 00a727519f docs(phase-3): complete execution — 2/2 plans, 98 tests passing
- ARCH-04: job51 migrated to crawler_core (no old deps)
- ARCH-05: zhilian migrated to crawler_core (no old deps)
- 34 new mock tests (17 job51 + 17 zhilian)
- Added _parse_zhilian_response custom parser for zhilian API format
- Fixed POST Searcher _request() overrides for job51/zhilian
- Full regression: 98 passed in 0.12s
2026-03-21 19:19:17 +08:00

256 lines
16 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Codebase Concerns
**Analysis Date:** 2026-03-21
---
## Security Considerations
### CRITICAL: Hardcoded Credentials in Source Code
**Area:** Configuration
- Risk: Real database passwords, SMTP credentials, and IP addresses committed to version control.
- Files: `app/settings/config.py`
- Exposed values (lines 2752):
- MySQL root password (`jobdata123`) with public IP `121.4.126.241:3306`
- ClickHouse user/pass defaults (`data_user` / `data_pass`) with same public IP
- QQ SMTP app password (`jkncedjzkummbbdi`) and recipient email addresses
- Current mitigation: `pydantic-settings` allows environment variable override, but defaults are live credentials that will be exposed if the repo is shared or CI logs are leaked.
- Recommendations: Remove all real default values; replace with `None` or clearly fake placeholders; enforce required-field validation at startup; rotate exposed credentials immediately.
### CRITICAL: Unauthenticated Data Ingest Endpoints
**Area:** API Authentication
- Risk: Any actor with network access can push arbitrary data directly into ClickHouse.
- Files: `app/api/v1/__init__.py` (lines 3234), `app/api/v1/job/job.py`
- The `/ingest`, `/job`, and `/universal` router groups are mounted **without** any `DependPermission` or `DependAuth` dependency. The data store, batch-store, and batch-store-async endpoints accept unauthenticated POST requests.
- Trigger: Send a POST to `/api/v1/universal/data/batch-store-async` with any payload. No token required.
- Workaround: The CLAUDE.md acknowledges these as "internal call" endpoints, but they are publicly reachable with no network-level isolation noted.
- Recommendations: Add at minimum IP allowlist middleware or a static shared secret header check for ingest routes; or mount them behind `DependAuth`.
### HIGH: Dev Token Bypass in Production Auth
**Area:** Authentication
- Risk: If `APP_ENV=development`, any request with `token: dev` header is granted the first user's full privileges.
- Files: `app/core/dependency.py` (lines 2830)
- The check `os.getenv("APP_ENV", "production") == "development"` defaults to `"production"`, but there is no enforcement that this variable is set correctly in production deployments. A misconfigured environment would expose the bypass.
- Recommendations: Remove the dev bypass entirely; use test fixtures instead; add a startup assertion that the bypass is disabled in non-dev environments.
### HIGH: SQL Injection via f-string Queries
**Area:** ClickHouse Queries
- Risk: User-controlled values inserted directly into query strings allow injection.
- Files:
- `reclean_qcwy_jobs.py` (line 34): `f"SELECT json_data FROM job_data.qcwy_job WHERE job_id = '{job_id}' LIMIT 1"``job_id` comes from URL parsing of user-provided input.
- `app/repositories/clickhouse_repo.py` (lines 101108): `group_by_column` parameter is interpolated directly into the SELECT and GROUP BY clauses without validation.
- `app/services/ingest/service.py` (lines 109, 112): `table` value is derived from config (lower risk) but still concatenated into f-strings.
- `app/core/scheduler.py` (lines 57, 63): `table` names from a fixed list, but pattern is risky.
- Current mitigation: ClickHouse parameterized queries are used in `dedup.py`, but inconsistently applied elsewhere.
- Recommendations: Use parameterized queries (`parameters={}`) for all user-influenced values; validate `group_by_column` against an allowlist; never interpolate user input into SQL strings.
### MEDIUM: CORS Misconfiguration
**Area:** CORS Policy
- Risk: `CORS_ALLOW_CREDENTIALS = False` while `CORS_ALLOW_HEADERS = ["*"]` is set. If credentials are later enabled, wildcard headers become a security liability.
- Files: `app/settings/config.py` (lines 1316)
- Recommendations: Restrict `CORS_ORIGINS` to known production domains; avoid wildcard headers.
### MEDIUM: Hardcoded Public IP in Spider Default
**Area:** Spider Configuration
- Risk: Hardcoded IP `124.222.106.226:9999` as the default `API_BASE_URL` in all spider scripts and `spiderJobs/` runner. If that IP changes hands or is exposed, all crawled data goes to an unintended recipient.
- Files:
- `jobs_spider/boss/boos_api.py` (line 17)
- `jobs_spider/qcwy/qcwy.py` (line 28)
- `jobs_spider/zhilian/zhilian_single.py` (line 41)
- `spiderJobs/runner/api_client.py` (line 31)
- Recommendations: Remove hardcoded IP defaults; require `API_BASE_URL` to be explicitly set via environment variable with no default.
---
## Tech Debt
### Duplicate Spider Implementations
**Area:** Code Organization
- Issue: Two parallel spider hierarchies exist for the same three platforms.
- `jobs_spider/boss/`, `jobs_spider/qcwy/`, `jobs_spider/zhilian/` — original scripts
- `spiderJobs/platforms/boss/`, `spiderJobs/platforms/job51/`, `spiderJobs/platforms/zhilian/` — refactored version
- `app/services/crawler/` — yet another crawler abstraction inside the app
- Files: `jobs_spider/`, `spiderJobs/`, `app/services/crawler/`
- Impact: Bug fixes and anti-detection improvements must be applied in multiple places; the relationship between these directories is not documented.
- Fix approach: Decide on one canonical implementation; deprecate and delete the others; update CLAUDE.md to clarify which is the active spider.
### God File: `jobs_spider/boss/boos_api.py`
**Area:** File Size / Complexity
- Issue: Single file containing 2281 lines covering API client, IP strategy, anomaly detection, session management, main loop, and push logic.
- Files: `jobs_spider/boss/boos_api.py`
- Impact: Extremely difficult to test or modify any single responsibility; cognitive load is very high.
- Fix approach: Extract `SmartIPManager`, `IPAnomalyDetector`, `IPStrategyConfig` into separate modules; separate the main crawl loop from the API client class.
### `CleaningService` and `CompanyCleaner` Code Duplication
**Area:** Service Layer
- Issue: Both classes replicate identical token-loading logic (`_ensure_boss_token_loaded`), proxy-setting logic (`_apply_proxy`), and service instantiation (BossService, QcwyService, ZhilianService).
- Files:
- `app/services/cleaning.py` (lines 3748, 3134)
- `app/services/company_cleaner.py` (lines 6074, 5458)
- Impact: Token refresh logic is maintained in two places; a bug fix in one will not propagate to the other.
- Fix approach: Extract a shared `CrawlerContext` or `CrawlerMixin` class providing token management and proxy configuration.
### ClickHouse Schema Changes Require Manual DDL
**Area:** Database Migrations
- Issue: ClickHouse table definitions in `app/core/clickhouse_init.py` use `CREATE TABLE IF NOT EXISTS`, meaning schema changes (new columns, index changes) are never automatically applied to existing tables. There is no migration tooling equivalent to Aerich for ClickHouse.
- Files: `app/core/clickhouse_init.py`
- Impact: If a column is added to the DDL, existing production tables silently retain the old schema; queries referencing new columns will fail at runtime.
- Fix approach: Implement `ALTER TABLE ... ADD COLUMN IF NOT EXISTS` statements for schema evolution; or adopt a ClickHouse migration tool (e.g., `ch-migrations`).
### Startup File Lock Not Robust to Crashes
**Area:** Multi-Worker Initialization
- Issue: `init_data()` uses `os.mkdir(".startup_lock")` as a bare file lock with no TTL. If a worker crashes mid-initialization, the lock directory is never cleaned up, and subsequent restarts will skip all migrations and seed data.
- Files: `app/core/init_app.py` (lines 322349)
- Impact: Silent data-initialization failure on restart after crash; can result in missing admin user or broken menu state.
- Fix approach: Replace with the existing `DistributedLock` class from `app/core/locks.py` which has TTL support; or at minimum add a time-based stale-lock cleanup.
---
## Performance Bottlenecks
### ClickHouse Singleton Connection Per-Worker
**Area:** Database Connections
- Problem: `clickhouse_manager` in `app/core/clickhouse.py` is a module-level singleton that lazily creates one connection. With 20 uvicorn workers (`UVICORN_WORKERS=20`), each worker maintains its own connection, and the connection pool inside each is limited to `maxsize=5`. Under concurrent load, requests block waiting for the pool.
- Files: `app/core/clickhouse.py` (lines 3158)
- Cause: A new `PoolManager` is created on every `get_clickhouse_client()` call but the singleton pattern means `get_client()` only calls this once per worker — pool configuration cannot be tuned per endpoint.
- Improvement path: Use a connection pool library or connection factory pattern; expose pool size as a configurable setting; consider `clickhouse-pool` for proper async pooling.
### Dedup Query Scans Entire Table
**Area:** ClickHouse Deduplication
- Problem: `batch_dedup_filter` in `app/services/ingest/dedup.py` issues `SELECT key_col FROM table WHERE key_col IN (...)` with no time-bound filter. For large tables this is a full-table scan on each ingest batch.
- Files: `app/services/ingest/dedup.py` (lines 51, 65)
- Cause: No `PREWHERE created_at > now() - INTERVAL N DAY` constraint added.
- Improvement path: Add a configurable recency window to dedup queries (e.g., 30-day lookback matching the existing `days_back` logic in `company_cleaner.py`).
### Synchronous Crawler Calls Blocking Async Workers
**Area:** Concurrency
- Problem: All crawler service methods (BossService, QcwyService, ZhilianService) use synchronous `requests` internally, wrapped with `asyncio.to_thread()`. Under high concurrency, this can exhaust the thread pool.
- Files: `app/services/cleaning.py` (lines 186, 190, 194, etc.), `app/services/company_cleaner.py` (lines 318, 321, 325)
- Cause: The underlying crawler libraries use `requests` (sync), not `httpx` (async).
- Improvement path: Migrate crawlers to `httpx.AsyncClient` for true async I/O, or cap thread pool size via `asyncio.to_thread` semaphore.
---
## Fragile Areas
### `group_by_column` API Parameter Lacks Allowlist Validation
**Area:** Analytics API
- Files: `app/repositories/clickhouse_repo.py` (lines 86115), `app/api/v1/analytics.py`
- Why fragile: `group_by_column` is passed directly into the ClickHouse query without validation against allowed column names. An invalid column name causes a runtime ClickHouse error that propagates as a 500 to the client; a malicious value enables injection.
- Safe modification: Add a `ALLOWED_GROUP_COLUMNS = frozenset({"source", "city", "channel", ...})` check before query construction.
- Test coverage: No tests for analytics query building.
### `DistributedLock` File Lock Has Race Window
**Area:** Distributed Locking
- Files: `app/core/locks.py` (lines 6492)
- Why fragile: The stale-lock cleanup path (lines 8187) involves `shutil.rmtree` followed by `mkdir` — a TOCTOU race where two workers can both detect staleness, both delete the directory, and both attempt to re-create it. One wins, one raises `FileExistsError` and falls through to return `False`, but the winner's lock is not guaranteed to be the intended holder.
- Safe modification: Use `os.rename` from a temp directory as an atomic lock-acquire primitive.
- Test coverage: None.
### `CleaningService` Returns Mixed Types (`bool | Dict`)
**Area:** Type Consistency
- Files: `app/services/cleaning.py` (lines 158165, 167202, 204239, 241268)
- Why fragile: Multiple `clean_*` methods return `Union[bool, Dict[str, Any]]`. Callers must defensively check `isinstance(result, bool)` before accessing dict keys (see `process_single_item` at lines 128135). This makes the API fragile — any new call site that forgets the `bool` check will raise `AttributeError`.
- Safe modification: Define a `CleanResult` dataclass or TypedDict; always return it.
- Test coverage: None.
### `company_cleaning_job` Assumes 5-Minute Completion
**Area:** Scheduler
- Files: `app/core/scheduler.py` (lines 171202)
- Why fragile: The job comment explicitly notes it processes 50 collect + 30 process operations and "should complete in 23 minutes" to fit within the 5-minute schedule window. If the external crawlers slow down (rate-limited, network issues), the job overruns its window, causing the next scheduled run to be blocked by the distributed lock, leading to queue backlog.
- Safe modification: Add timeout guards around `collect_pending_companies` and `process_pending_companies`; or dynamically reduce batch sizes based on elapsed time.
---
## Missing Critical Features
### No Rate Limiting on Any API Endpoint
**Problem:** FastAPI has no rate-limiting middleware configured. The data ingest endpoints (which have no auth) and the cleaning trigger endpoints can be called at arbitrary rates.
- Blocks: Protection against accidental or malicious flooding of ClickHouse with writes.
- Recommendation: Add `slowapi` or custom middleware with per-IP rate limiting on ingest and crawl-triggering endpoints.
### No Input Size Limit on Batch Ingest
**Problem:** `IngestBatchRequest.data_list` has no `max_items` constraint. A single request with tens of thousands of job records will be processed synchronously (or queued as a background task) with no guard on memory consumption.
- Files: `app/schemas/ingest.py`, `app/api/v1/job/job.py`
- Recommendation: Add `Field(max_length=1000)` or equivalent to `data_list`; add body size limit in uvicorn/nginx.
---
## Test Coverage Gaps
### Service Layer: Virtually Untested
- What's not tested: `CleaningService`, `CompanyCleaner`, `IngestService`, `AnalyticsService`, `DistributedLock`, `ClickHouseManager`
- Files: `app/services/cleaning.py`, `app/services/company_cleaner.py`, `app/services/ingest/service.py`, `app/services/analytics_service.py`, `app/core/locks.py`, `app/core/clickhouse.py`
- Risk: Core data routing, deduplication, and cleaning logic can regress silently. The existing tests only cover two pure-function helpers (`extract_company_fields`, `normalize_company_id`, `_extract_*_jobs`).
- Priority: **High**
### API Layer: No Integration Tests
- What's not tested: All FastAPI route handlers, auth flow, permission enforcement, error responses.
- Files: `app/api/v1/` (all modules)
- Risk: Auth bypass regressions, schema changes breaking clients, and incorrect HTTP status codes go undetected.
- Priority: **High**
### Spider Logic: No Unit Tests
- What's not tested: `IPAnomalyDetector.detect()`, `SmartIPManager` state transitions, response parsing in `BossZhipinAPI`, `QcwyApi`, `ZhilianApi`.
- Files: `jobs_spider/boss/boos_api.py`, `jobs_spider/qcwy/qcwy.py`, `jobs_spider/zhilian/zhilian_single.py`
- Risk: Anti-detection logic breaks silently when platforms change their response format.
- Priority: **Medium**
### ClickHouse Schema: No Schema Evolution Tests
- What's not tested: Behavior of `clickhouse_init.py` when tables already exist with old schemas; column migration paths.
- Files: `app/core/clickhouse_init.py`
- Risk: Production schema drift goes unnoticed until a query fails at runtime.
- Priority: **Medium**
---
## Known Issues / Observed Bugs
### Silent Error Swallowing in Spider Code
- Symptoms: Data silently not collected; no error surfaced to caller.
- Files: `jobs_spider/qcwy/qcwy.py` (lines 178179, 191192, 10471048, 11461147), `jobs_spider/qcwy/search_company_jobs.py` (lines 240241, 512513, 538539, 601602), `jobs_spider/boss/boos_api.py` (lines 400, 407408, 508509, 518519, 17791780), `jobs_spider/zhilian/company_spider.py` (many `except: pass` blocks)
- Trigger: Network errors, parsing errors, or unexpected response shapes in spider code are caught by bare `except Exception: pass` blocks.
- Workaround: Check logs for INFO-level skips; monitor push success counts.
### `_send_email` Uses Blocking SMTP in Async Context
- Symptoms: Scheduler event loop blocked for up to several seconds during email send.
- Files: `app/core/scheduler.py` (lines 308336)
- Trigger: Any `stats_job` or `ip_alert_job` execution that sends email calls synchronous `smtplib.SMTP` from the async event loop without wrapping in `asyncio.to_thread`.
- Workaround: Email delivery may be delayed; event loop latency spikes during scheduled runs.
### `cleanup_old_records` Deletes All Done/Failed Records Without Archive
- Symptoms: Historical processing data is permanently deleted daily; no audit trail of what was cleaned.
- Files: `app/services/company_cleaner.py` (lines 329331), `app/core/scheduler.py` (line 220)
- Trigger: `daily_cleanup_job` calls `CompanyCleaningQueue.filter(status__in=["done", "failed"]).delete()` — a hard delete with no archiving or retention window.
---
*Concerns audit: 2026-03-21*