- ARCH-04: job51 migrated to crawler_core (no old deps) - ARCH-05: zhilian migrated to crawler_core (no old deps) - 34 new mock tests (17 job51 + 17 zhilian) - Added _parse_zhilian_response custom parser for zhilian API format - Fixed POST Searcher _request() overrides for job51/zhilian - Full regression: 98 passed in 0.12s
256 lines
16 KiB
Markdown
256 lines
16 KiB
Markdown
# Codebase Concerns
|
||
|
||
**Analysis Date:** 2026-03-21
|
||
|
||
---
|
||
|
||
## Security Considerations
|
||
|
||
### CRITICAL: Hardcoded Credentials in Source Code
|
||
|
||
**Area:** Configuration
|
||
- Risk: Real database passwords, SMTP credentials, and IP addresses committed to version control.
|
||
- Files: `app/settings/config.py`
|
||
- Exposed values (lines 27–52):
|
||
- MySQL root password (`jobdata123`) with public IP `121.4.126.241:3306`
|
||
- ClickHouse user/pass defaults (`data_user` / `data_pass`) with same public IP
|
||
- QQ SMTP app password (`jkncedjzkummbbdi`) and recipient email addresses
|
||
- Current mitigation: `pydantic-settings` allows environment variable override, but defaults are live credentials that will be exposed if the repo is shared or CI logs are leaked.
|
||
- Recommendations: Remove all real default values; replace with `None` or clearly fake placeholders; enforce required-field validation at startup; rotate exposed credentials immediately.
|
||
|
||
### CRITICAL: Unauthenticated Data Ingest Endpoints
|
||
|
||
**Area:** API Authentication
|
||
- Risk: Any actor with network access can push arbitrary data directly into ClickHouse.
|
||
- Files: `app/api/v1/__init__.py` (lines 32–34), `app/api/v1/job/job.py`
|
||
- The `/ingest`, `/job`, and `/universal` router groups are mounted **without** any `DependPermission` or `DependAuth` dependency. The data store, batch-store, and batch-store-async endpoints accept unauthenticated POST requests.
|
||
- Trigger: Send a POST to `/api/v1/universal/data/batch-store-async` with any payload. No token required.
|
||
- Workaround: The CLAUDE.md acknowledges these as "internal call" endpoints, but they are publicly reachable with no network-level isolation noted.
|
||
- Recommendations: Add at minimum IP allowlist middleware or a static shared secret header check for ingest routes; or mount them behind `DependAuth`.
|
||
|
||
### HIGH: Dev Token Bypass in Production Auth
|
||
|
||
**Area:** Authentication
|
||
- Risk: If `APP_ENV=development`, any request with `token: dev` header is granted the first user's full privileges.
|
||
- Files: `app/core/dependency.py` (lines 28–30)
|
||
- The check `os.getenv("APP_ENV", "production") == "development"` defaults to `"production"`, but there is no enforcement that this variable is set correctly in production deployments. A misconfigured environment would expose the bypass.
|
||
- Recommendations: Remove the dev bypass entirely; use test fixtures instead; add a startup assertion that the bypass is disabled in non-dev environments.
|
||
|
||
### HIGH: SQL Injection via f-string Queries
|
||
|
||
**Area:** ClickHouse Queries
|
||
- Risk: User-controlled values inserted directly into query strings allow injection.
|
||
- Files:
|
||
- `reclean_qcwy_jobs.py` (line 34): `f"SELECT json_data FROM job_data.qcwy_job WHERE job_id = '{job_id}' LIMIT 1"` — `job_id` comes from URL parsing of user-provided input.
|
||
- `app/repositories/clickhouse_repo.py` (lines 101–108): `group_by_column` parameter is interpolated directly into the SELECT and GROUP BY clauses without validation.
|
||
- `app/services/ingest/service.py` (lines 109, 112): `table` value is derived from config (lower risk) but still concatenated into f-strings.
|
||
- `app/core/scheduler.py` (lines 57, 63): `table` names from a fixed list, but pattern is risky.
|
||
- Current mitigation: ClickHouse parameterized queries are used in `dedup.py`, but inconsistently applied elsewhere.
|
||
- Recommendations: Use parameterized queries (`parameters={}`) for all user-influenced values; validate `group_by_column` against an allowlist; never interpolate user input into SQL strings.
|
||
|
||
### MEDIUM: CORS Misconfiguration
|
||
|
||
**Area:** CORS Policy
|
||
- Risk: `CORS_ALLOW_CREDENTIALS = False` while `CORS_ALLOW_HEADERS = ["*"]` is set. If credentials are later enabled, wildcard headers become a security liability.
|
||
- Files: `app/settings/config.py` (lines 13–16)
|
||
- Recommendations: Restrict `CORS_ORIGINS` to known production domains; avoid wildcard headers.
|
||
|
||
### MEDIUM: Hardcoded Public IP in Spider Default
|
||
|
||
**Area:** Spider Configuration
|
||
- Risk: Hardcoded IP `124.222.106.226:9999` as the default `API_BASE_URL` in all spider scripts and `spiderJobs/` runner. If that IP changes hands or is exposed, all crawled data goes to an unintended recipient.
|
||
- Files:
|
||
- `jobs_spider/boss/boos_api.py` (line 17)
|
||
- `jobs_spider/qcwy/qcwy.py` (line 28)
|
||
- `jobs_spider/zhilian/zhilian_single.py` (line 41)
|
||
- `spiderJobs/runner/api_client.py` (line 31)
|
||
- Recommendations: Remove hardcoded IP defaults; require `API_BASE_URL` to be explicitly set via environment variable with no default.
|
||
|
||
---
|
||
|
||
## Tech Debt
|
||
|
||
### Duplicate Spider Implementations
|
||
|
||
**Area:** Code Organization
|
||
- Issue: Two parallel spider hierarchies exist for the same three platforms.
|
||
- `jobs_spider/boss/`, `jobs_spider/qcwy/`, `jobs_spider/zhilian/` — original scripts
|
||
- `spiderJobs/platforms/boss/`, `spiderJobs/platforms/job51/`, `spiderJobs/platforms/zhilian/` — refactored version
|
||
- `app/services/crawler/` — yet another crawler abstraction inside the app
|
||
- Files: `jobs_spider/`, `spiderJobs/`, `app/services/crawler/`
|
||
- Impact: Bug fixes and anti-detection improvements must be applied in multiple places; the relationship between these directories is not documented.
|
||
- Fix approach: Decide on one canonical implementation; deprecate and delete the others; update CLAUDE.md to clarify which is the active spider.
|
||
|
||
### God File: `jobs_spider/boss/boos_api.py`
|
||
|
||
**Area:** File Size / Complexity
|
||
- Issue: Single file containing 2281 lines covering API client, IP strategy, anomaly detection, session management, main loop, and push logic.
|
||
- Files: `jobs_spider/boss/boos_api.py`
|
||
- Impact: Extremely difficult to test or modify any single responsibility; cognitive load is very high.
|
||
- Fix approach: Extract `SmartIPManager`, `IPAnomalyDetector`, `IPStrategyConfig` into separate modules; separate the main crawl loop from the API client class.
|
||
|
||
### `CleaningService` and `CompanyCleaner` Code Duplication
|
||
|
||
**Area:** Service Layer
|
||
- Issue: Both classes replicate identical token-loading logic (`_ensure_boss_token_loaded`), proxy-setting logic (`_apply_proxy`), and service instantiation (BossService, QcwyService, ZhilianService).
|
||
- Files:
|
||
- `app/services/cleaning.py` (lines 37–48, 31–34)
|
||
- `app/services/company_cleaner.py` (lines 60–74, 54–58)
|
||
- Impact: Token refresh logic is maintained in two places; a bug fix in one will not propagate to the other.
|
||
- Fix approach: Extract a shared `CrawlerContext` or `CrawlerMixin` class providing token management and proxy configuration.
|
||
|
||
### ClickHouse Schema Changes Require Manual DDL
|
||
|
||
**Area:** Database Migrations
|
||
- Issue: ClickHouse table definitions in `app/core/clickhouse_init.py` use `CREATE TABLE IF NOT EXISTS`, meaning schema changes (new columns, index changes) are never automatically applied to existing tables. There is no migration tooling equivalent to Aerich for ClickHouse.
|
||
- Files: `app/core/clickhouse_init.py`
|
||
- Impact: If a column is added to the DDL, existing production tables silently retain the old schema; queries referencing new columns will fail at runtime.
|
||
- Fix approach: Implement `ALTER TABLE ... ADD COLUMN IF NOT EXISTS` statements for schema evolution; or adopt a ClickHouse migration tool (e.g., `ch-migrations`).
|
||
|
||
### Startup File Lock Not Robust to Crashes
|
||
|
||
**Area:** Multi-Worker Initialization
|
||
- Issue: `init_data()` uses `os.mkdir(".startup_lock")` as a bare file lock with no TTL. If a worker crashes mid-initialization, the lock directory is never cleaned up, and subsequent restarts will skip all migrations and seed data.
|
||
- Files: `app/core/init_app.py` (lines 322–349)
|
||
- Impact: Silent data-initialization failure on restart after crash; can result in missing admin user or broken menu state.
|
||
- Fix approach: Replace with the existing `DistributedLock` class from `app/core/locks.py` which has TTL support; or at minimum add a time-based stale-lock cleanup.
|
||
|
||
---
|
||
|
||
## Performance Bottlenecks
|
||
|
||
### ClickHouse Singleton Connection Per-Worker
|
||
|
||
**Area:** Database Connections
|
||
- Problem: `clickhouse_manager` in `app/core/clickhouse.py` is a module-level singleton that lazily creates one connection. With 20 uvicorn workers (`UVICORN_WORKERS=20`), each worker maintains its own connection, and the connection pool inside each is limited to `maxsize=5`. Under concurrent load, requests block waiting for the pool.
|
||
- Files: `app/core/clickhouse.py` (lines 31–58)
|
||
- Cause: A new `PoolManager` is created on every `get_clickhouse_client()` call but the singleton pattern means `get_client()` only calls this once per worker — pool configuration cannot be tuned per endpoint.
|
||
- Improvement path: Use a connection pool library or connection factory pattern; expose pool size as a configurable setting; consider `clickhouse-pool` for proper async pooling.
|
||
|
||
### Dedup Query Scans Entire Table
|
||
|
||
**Area:** ClickHouse Deduplication
|
||
- Problem: `batch_dedup_filter` in `app/services/ingest/dedup.py` issues `SELECT key_col FROM table WHERE key_col IN (...)` with no time-bound filter. For large tables this is a full-table scan on each ingest batch.
|
||
- Files: `app/services/ingest/dedup.py` (lines 51, 65)
|
||
- Cause: No `PREWHERE created_at > now() - INTERVAL N DAY` constraint added.
|
||
- Improvement path: Add a configurable recency window to dedup queries (e.g., 30-day lookback matching the existing `days_back` logic in `company_cleaner.py`).
|
||
|
||
### Synchronous Crawler Calls Blocking Async Workers
|
||
|
||
**Area:** Concurrency
|
||
- Problem: All crawler service methods (BossService, QcwyService, ZhilianService) use synchronous `requests` internally, wrapped with `asyncio.to_thread()`. Under high concurrency, this can exhaust the thread pool.
|
||
- Files: `app/services/cleaning.py` (lines 186, 190, 194, etc.), `app/services/company_cleaner.py` (lines 318, 321, 325)
|
||
- Cause: The underlying crawler libraries use `requests` (sync), not `httpx` (async).
|
||
- Improvement path: Migrate crawlers to `httpx.AsyncClient` for true async I/O, or cap thread pool size via `asyncio.to_thread` semaphore.
|
||
|
||
---
|
||
|
||
## Fragile Areas
|
||
|
||
### `group_by_column` API Parameter Lacks Allowlist Validation
|
||
|
||
**Area:** Analytics API
|
||
- Files: `app/repositories/clickhouse_repo.py` (lines 86–115), `app/api/v1/analytics.py`
|
||
- Why fragile: `group_by_column` is passed directly into the ClickHouse query without validation against allowed column names. An invalid column name causes a runtime ClickHouse error that propagates as a 500 to the client; a malicious value enables injection.
|
||
- Safe modification: Add a `ALLOWED_GROUP_COLUMNS = frozenset({"source", "city", "channel", ...})` check before query construction.
|
||
- Test coverage: No tests for analytics query building.
|
||
|
||
### `DistributedLock` File Lock Has Race Window
|
||
|
||
**Area:** Distributed Locking
|
||
- Files: `app/core/locks.py` (lines 64–92)
|
||
- Why fragile: The stale-lock cleanup path (lines 81–87) involves `shutil.rmtree` followed by `mkdir` — a TOCTOU race where two workers can both detect staleness, both delete the directory, and both attempt to re-create it. One wins, one raises `FileExistsError` and falls through to return `False`, but the winner's lock is not guaranteed to be the intended holder.
|
||
- Safe modification: Use `os.rename` from a temp directory as an atomic lock-acquire primitive.
|
||
- Test coverage: None.
|
||
|
||
### `CleaningService` Returns Mixed Types (`bool | Dict`)
|
||
|
||
**Area:** Type Consistency
|
||
- Files: `app/services/cleaning.py` (lines 158–165, 167–202, 204–239, 241–268)
|
||
- Why fragile: Multiple `clean_*` methods return `Union[bool, Dict[str, Any]]`. Callers must defensively check `isinstance(result, bool)` before accessing dict keys (see `process_single_item` at lines 128–135). This makes the API fragile — any new call site that forgets the `bool` check will raise `AttributeError`.
|
||
- Safe modification: Define a `CleanResult` dataclass or TypedDict; always return it.
|
||
- Test coverage: None.
|
||
|
||
### `company_cleaning_job` Assumes 5-Minute Completion
|
||
|
||
**Area:** Scheduler
|
||
- Files: `app/core/scheduler.py` (lines 171–202)
|
||
- Why fragile: The job comment explicitly notes it processes 50 collect + 30 process operations and "should complete in 2–3 minutes" to fit within the 5-minute schedule window. If the external crawlers slow down (rate-limited, network issues), the job overruns its window, causing the next scheduled run to be blocked by the distributed lock, leading to queue backlog.
|
||
- Safe modification: Add timeout guards around `collect_pending_companies` and `process_pending_companies`; or dynamically reduce batch sizes based on elapsed time.
|
||
|
||
---
|
||
|
||
## Missing Critical Features
|
||
|
||
### No Rate Limiting on Any API Endpoint
|
||
|
||
**Problem:** FastAPI has no rate-limiting middleware configured. The data ingest endpoints (which have no auth) and the cleaning trigger endpoints can be called at arbitrary rates.
|
||
- Blocks: Protection against accidental or malicious flooding of ClickHouse with writes.
|
||
- Recommendation: Add `slowapi` or custom middleware with per-IP rate limiting on ingest and crawl-triggering endpoints.
|
||
|
||
### No Input Size Limit on Batch Ingest
|
||
|
||
**Problem:** `IngestBatchRequest.data_list` has no `max_items` constraint. A single request with tens of thousands of job records will be processed synchronously (or queued as a background task) with no guard on memory consumption.
|
||
- Files: `app/schemas/ingest.py`, `app/api/v1/job/job.py`
|
||
- Recommendation: Add `Field(max_length=1000)` or equivalent to `data_list`; add body size limit in uvicorn/nginx.
|
||
|
||
---
|
||
|
||
## Test Coverage Gaps
|
||
|
||
### Service Layer: Virtually Untested
|
||
|
||
- What's not tested: `CleaningService`, `CompanyCleaner`, `IngestService`, `AnalyticsService`, `DistributedLock`, `ClickHouseManager`
|
||
- Files: `app/services/cleaning.py`, `app/services/company_cleaner.py`, `app/services/ingest/service.py`, `app/services/analytics_service.py`, `app/core/locks.py`, `app/core/clickhouse.py`
|
||
- Risk: Core data routing, deduplication, and cleaning logic can regress silently. The existing tests only cover two pure-function helpers (`extract_company_fields`, `normalize_company_id`, `_extract_*_jobs`).
|
||
- Priority: **High**
|
||
|
||
### API Layer: No Integration Tests
|
||
|
||
- What's not tested: All FastAPI route handlers, auth flow, permission enforcement, error responses.
|
||
- Files: `app/api/v1/` (all modules)
|
||
- Risk: Auth bypass regressions, schema changes breaking clients, and incorrect HTTP status codes go undetected.
|
||
- Priority: **High**
|
||
|
||
### Spider Logic: No Unit Tests
|
||
|
||
- What's not tested: `IPAnomalyDetector.detect()`, `SmartIPManager` state transitions, response parsing in `BossZhipinAPI`, `QcwyApi`, `ZhilianApi`.
|
||
- Files: `jobs_spider/boss/boos_api.py`, `jobs_spider/qcwy/qcwy.py`, `jobs_spider/zhilian/zhilian_single.py`
|
||
- Risk: Anti-detection logic breaks silently when platforms change their response format.
|
||
- Priority: **Medium**
|
||
|
||
### ClickHouse Schema: No Schema Evolution Tests
|
||
|
||
- What's not tested: Behavior of `clickhouse_init.py` when tables already exist with old schemas; column migration paths.
|
||
- Files: `app/core/clickhouse_init.py`
|
||
- Risk: Production schema drift goes unnoticed until a query fails at runtime.
|
||
- Priority: **Medium**
|
||
|
||
---
|
||
|
||
## Known Issues / Observed Bugs
|
||
|
||
### Silent Error Swallowing in Spider Code
|
||
|
||
- Symptoms: Data silently not collected; no error surfaced to caller.
|
||
- Files: `jobs_spider/qcwy/qcwy.py` (lines 178–179, 191–192, 1047–1048, 1146–1147), `jobs_spider/qcwy/search_company_jobs.py` (lines 240–241, 512–513, 538–539, 601–602), `jobs_spider/boss/boos_api.py` (lines 400, 407–408, 508–509, 518–519, 1779–1780), `jobs_spider/zhilian/company_spider.py` (many `except: pass` blocks)
|
||
- Trigger: Network errors, parsing errors, or unexpected response shapes in spider code are caught by bare `except Exception: pass` blocks.
|
||
- Workaround: Check logs for INFO-level skips; monitor push success counts.
|
||
|
||
### `_send_email` Uses Blocking SMTP in Async Context
|
||
|
||
- Symptoms: Scheduler event loop blocked for up to several seconds during email send.
|
||
- Files: `app/core/scheduler.py` (lines 308–336)
|
||
- Trigger: Any `stats_job` or `ip_alert_job` execution that sends email calls synchronous `smtplib.SMTP` from the async event loop without wrapping in `asyncio.to_thread`.
|
||
- Workaround: Email delivery may be delayed; event loop latency spikes during scheduled runs.
|
||
|
||
### `cleanup_old_records` Deletes All Done/Failed Records Without Archive
|
||
|
||
- Symptoms: Historical processing data is permanently deleted daily; no audit trail of what was cleaned.
|
||
- Files: `app/services/company_cleaner.py` (lines 329–331), `app/core/scheduler.py` (line 220)
|
||
- Trigger: `daily_cleanup_job` calls `CompanyCleaningQueue.filter(status__in=["done", "failed"]).delete()` — a hard delete with no archiving or retention window.
|
||
|
||
---
|
||
|
||
*Concerns audit: 2026-03-21*
|