# Codebase Concerns **Analysis Date:** 2026-03-21 --- ## Security Considerations ### CRITICAL: Hardcoded Credentials in Source Code **Area:** Configuration - Risk: Real database passwords, SMTP credentials, and IP addresses committed to version control. - Files: `app/settings/config.py` - Exposed values (lines 27–52): - MySQL root password (`jobdata123`) with public IP `121.4.126.241:3306` - ClickHouse user/pass defaults (`data_user` / `data_pass`) with same public IP - QQ SMTP app password (`jkncedjzkummbbdi`) and recipient email addresses - Current mitigation: `pydantic-settings` allows environment variable override, but defaults are live credentials that will be exposed if the repo is shared or CI logs are leaked. - Recommendations: Remove all real default values; replace with `None` or clearly fake placeholders; enforce required-field validation at startup; rotate exposed credentials immediately. ### CRITICAL: Unauthenticated Data Ingest Endpoints **Area:** API Authentication - Risk: Any actor with network access can push arbitrary data directly into ClickHouse. - Files: `app/api/v1/__init__.py` (lines 32–34), `app/api/v1/job/job.py` - The `/ingest`, `/job`, and `/universal` router groups are mounted **without** any `DependPermission` or `DependAuth` dependency. The data store, batch-store, and batch-store-async endpoints accept unauthenticated POST requests. - Trigger: Send a POST to `/api/v1/universal/data/batch-store-async` with any payload. No token required. - Workaround: The CLAUDE.md acknowledges these as "internal call" endpoints, but they are publicly reachable with no network-level isolation noted. - Recommendations: Add at minimum IP allowlist middleware or a static shared secret header check for ingest routes; or mount them behind `DependAuth`. ### HIGH: Dev Token Bypass in Production Auth **Area:** Authentication - Risk: If `APP_ENV=development`, any request with `token: dev` header is granted the first user's full privileges. - Files: `app/core/dependency.py` (lines 28–30) - The check `os.getenv("APP_ENV", "production") == "development"` defaults to `"production"`, but there is no enforcement that this variable is set correctly in production deployments. A misconfigured environment would expose the bypass. - Recommendations: Remove the dev bypass entirely; use test fixtures instead; add a startup assertion that the bypass is disabled in non-dev environments. ### HIGH: SQL Injection via f-string Queries **Area:** ClickHouse Queries - Risk: User-controlled values inserted directly into query strings allow injection. - Files: - `reclean_qcwy_jobs.py` (line 34): `f"SELECT json_data FROM job_data.qcwy_job WHERE job_id = '{job_id}' LIMIT 1"` — `job_id` comes from URL parsing of user-provided input. - `app/repositories/clickhouse_repo.py` (lines 101–108): `group_by_column` parameter is interpolated directly into the SELECT and GROUP BY clauses without validation. - `app/services/ingest/service.py` (lines 109, 112): `table` value is derived from config (lower risk) but still concatenated into f-strings. - `app/core/scheduler.py` (lines 57, 63): `table` names from a fixed list, but pattern is risky. - Current mitigation: ClickHouse parameterized queries are used in `dedup.py`, but inconsistently applied elsewhere. - Recommendations: Use parameterized queries (`parameters={}`) for all user-influenced values; validate `group_by_column` against an allowlist; never interpolate user input into SQL strings. ### MEDIUM: CORS Misconfiguration **Area:** CORS Policy - Risk: `CORS_ALLOW_CREDENTIALS = False` while `CORS_ALLOW_HEADERS = ["*"]` is set. If credentials are later enabled, wildcard headers become a security liability. - Files: `app/settings/config.py` (lines 13–16) - Recommendations: Restrict `CORS_ORIGINS` to known production domains; avoid wildcard headers. ### MEDIUM: Hardcoded Public IP in Spider Default **Area:** Spider Configuration - Risk: Hardcoded IP `124.222.106.226:9999` as the default `API_BASE_URL` in all spider scripts and `spiderJobs/` runner. If that IP changes hands or is exposed, all crawled data goes to an unintended recipient. - Files: - `jobs_spider/boss/boos_api.py` (line 17) - `jobs_spider/qcwy/qcwy.py` (line 28) - `jobs_spider/zhilian/zhilian_single.py` (line 41) - `spiderJobs/runner/api_client.py` (line 31) - Recommendations: Remove hardcoded IP defaults; require `API_BASE_URL` to be explicitly set via environment variable with no default. --- ## Tech Debt ### Duplicate Spider Implementations **Area:** Code Organization - Issue: Two parallel spider hierarchies exist for the same three platforms. - `jobs_spider/boss/`, `jobs_spider/qcwy/`, `jobs_spider/zhilian/` — original scripts - `spiderJobs/platforms/boss/`, `spiderJobs/platforms/job51/`, `spiderJobs/platforms/zhilian/` — refactored version - `app/services/crawler/` — yet another crawler abstraction inside the app - Files: `jobs_spider/`, `spiderJobs/`, `app/services/crawler/` - Impact: Bug fixes and anti-detection improvements must be applied in multiple places; the relationship between these directories is not documented. - Fix approach: Decide on one canonical implementation; deprecate and delete the others; update CLAUDE.md to clarify which is the active spider. ### God File: `jobs_spider/boss/boos_api.py` **Area:** File Size / Complexity - Issue: Single file containing 2281 lines covering API client, IP strategy, anomaly detection, session management, main loop, and push logic. - Files: `jobs_spider/boss/boos_api.py` - Impact: Extremely difficult to test or modify any single responsibility; cognitive load is very high. - Fix approach: Extract `SmartIPManager`, `IPAnomalyDetector`, `IPStrategyConfig` into separate modules; separate the main crawl loop from the API client class. ### `CleaningService` and `CompanyCleaner` Code Duplication **Area:** Service Layer - Issue: Both classes replicate identical token-loading logic (`_ensure_boss_token_loaded`), proxy-setting logic (`_apply_proxy`), and service instantiation (BossService, QcwyService, ZhilianService). - Files: - `app/services/cleaning.py` (lines 37–48, 31–34) - `app/services/company_cleaner.py` (lines 60–74, 54–58) - Impact: Token refresh logic is maintained in two places; a bug fix in one will not propagate to the other. - Fix approach: Extract a shared `CrawlerContext` or `CrawlerMixin` class providing token management and proxy configuration. ### ClickHouse Schema Changes Require Manual DDL **Area:** Database Migrations - Issue: ClickHouse table definitions in `app/core/clickhouse_init.py` use `CREATE TABLE IF NOT EXISTS`, meaning schema changes (new columns, index changes) are never automatically applied to existing tables. There is no migration tooling equivalent to Aerich for ClickHouse. - Files: `app/core/clickhouse_init.py` - Impact: If a column is added to the DDL, existing production tables silently retain the old schema; queries referencing new columns will fail at runtime. - Fix approach: Implement `ALTER TABLE ... ADD COLUMN IF NOT EXISTS` statements for schema evolution; or adopt a ClickHouse migration tool (e.g., `ch-migrations`). ### Startup File Lock Not Robust to Crashes **Area:** Multi-Worker Initialization - Issue: `init_data()` uses `os.mkdir(".startup_lock")` as a bare file lock with no TTL. If a worker crashes mid-initialization, the lock directory is never cleaned up, and subsequent restarts will skip all migrations and seed data. - Files: `app/core/init_app.py` (lines 322–349) - Impact: Silent data-initialization failure on restart after crash; can result in missing admin user or broken menu state. - Fix approach: Replace with the existing `DistributedLock` class from `app/core/locks.py` which has TTL support; or at minimum add a time-based stale-lock cleanup. --- ## Performance Bottlenecks ### ClickHouse Singleton Connection Per-Worker **Area:** Database Connections - Problem: `clickhouse_manager` in `app/core/clickhouse.py` is a module-level singleton that lazily creates one connection. With 20 uvicorn workers (`UVICORN_WORKERS=20`), each worker maintains its own connection, and the connection pool inside each is limited to `maxsize=5`. Under concurrent load, requests block waiting for the pool. - Files: `app/core/clickhouse.py` (lines 31–58) - Cause: A new `PoolManager` is created on every `get_clickhouse_client()` call but the singleton pattern means `get_client()` only calls this once per worker — pool configuration cannot be tuned per endpoint. - Improvement path: Use a connection pool library or connection factory pattern; expose pool size as a configurable setting; consider `clickhouse-pool` for proper async pooling. ### Dedup Query Scans Entire Table **Area:** ClickHouse Deduplication - Problem: `batch_dedup_filter` in `app/services/ingest/dedup.py` issues `SELECT key_col FROM table WHERE key_col IN (...)` with no time-bound filter. For large tables this is a full-table scan on each ingest batch. - Files: `app/services/ingest/dedup.py` (lines 51, 65) - Cause: No `PREWHERE created_at > now() - INTERVAL N DAY` constraint added. - Improvement path: Add a configurable recency window to dedup queries (e.g., 30-day lookback matching the existing `days_back` logic in `company_cleaner.py`). ### Synchronous Crawler Calls Blocking Async Workers **Area:** Concurrency - Problem: All crawler service methods (BossService, QcwyService, ZhilianService) use synchronous `requests` internally, wrapped with `asyncio.to_thread()`. Under high concurrency, this can exhaust the thread pool. - Files: `app/services/cleaning.py` (lines 186, 190, 194, etc.), `app/services/company_cleaner.py` (lines 318, 321, 325) - Cause: The underlying crawler libraries use `requests` (sync), not `httpx` (async). - Improvement path: Migrate crawlers to `httpx.AsyncClient` for true async I/O, or cap thread pool size via `asyncio.to_thread` semaphore. --- ## Fragile Areas ### `group_by_column` API Parameter Lacks Allowlist Validation **Area:** Analytics API - Files: `app/repositories/clickhouse_repo.py` (lines 86–115), `app/api/v1/analytics.py` - Why fragile: `group_by_column` is passed directly into the ClickHouse query without validation against allowed column names. An invalid column name causes a runtime ClickHouse error that propagates as a 500 to the client; a malicious value enables injection. - Safe modification: Add a `ALLOWED_GROUP_COLUMNS = frozenset({"source", "city", "channel", ...})` check before query construction. - Test coverage: No tests for analytics query building. ### `DistributedLock` File Lock Has Race Window **Area:** Distributed Locking - Files: `app/core/locks.py` (lines 64–92) - Why fragile: The stale-lock cleanup path (lines 81–87) involves `shutil.rmtree` followed by `mkdir` — a TOCTOU race where two workers can both detect staleness, both delete the directory, and both attempt to re-create it. One wins, one raises `FileExistsError` and falls through to return `False`, but the winner's lock is not guaranteed to be the intended holder. - Safe modification: Use `os.rename` from a temp directory as an atomic lock-acquire primitive. - Test coverage: None. ### `CleaningService` Returns Mixed Types (`bool | Dict`) **Area:** Type Consistency - Files: `app/services/cleaning.py` (lines 158–165, 167–202, 204–239, 241–268) - Why fragile: Multiple `clean_*` methods return `Union[bool, Dict[str, Any]]`. Callers must defensively check `isinstance(result, bool)` before accessing dict keys (see `process_single_item` at lines 128–135). This makes the API fragile — any new call site that forgets the `bool` check will raise `AttributeError`. - Safe modification: Define a `CleanResult` dataclass or TypedDict; always return it. - Test coverage: None. ### `company_cleaning_job` Assumes 5-Minute Completion **Area:** Scheduler - Files: `app/core/scheduler.py` (lines 171–202) - Why fragile: The job comment explicitly notes it processes 50 collect + 30 process operations and "should complete in 2–3 minutes" to fit within the 5-minute schedule window. If the external crawlers slow down (rate-limited, network issues), the job overruns its window, causing the next scheduled run to be blocked by the distributed lock, leading to queue backlog. - Safe modification: Add timeout guards around `collect_pending_companies` and `process_pending_companies`; or dynamically reduce batch sizes based on elapsed time. --- ## Missing Critical Features ### No Rate Limiting on Any API Endpoint **Problem:** FastAPI has no rate-limiting middleware configured. The data ingest endpoints (which have no auth) and the cleaning trigger endpoints can be called at arbitrary rates. - Blocks: Protection against accidental or malicious flooding of ClickHouse with writes. - Recommendation: Add `slowapi` or custom middleware with per-IP rate limiting on ingest and crawl-triggering endpoints. ### No Input Size Limit on Batch Ingest **Problem:** `IngestBatchRequest.data_list` has no `max_items` constraint. A single request with tens of thousands of job records will be processed synchronously (or queued as a background task) with no guard on memory consumption. - Files: `app/schemas/ingest.py`, `app/api/v1/job/job.py` - Recommendation: Add `Field(max_length=1000)` or equivalent to `data_list`; add body size limit in uvicorn/nginx. --- ## Test Coverage Gaps ### Service Layer: Virtually Untested - What's not tested: `CleaningService`, `CompanyCleaner`, `IngestService`, `AnalyticsService`, `DistributedLock`, `ClickHouseManager` - Files: `app/services/cleaning.py`, `app/services/company_cleaner.py`, `app/services/ingest/service.py`, `app/services/analytics_service.py`, `app/core/locks.py`, `app/core/clickhouse.py` - Risk: Core data routing, deduplication, and cleaning logic can regress silently. The existing tests only cover two pure-function helpers (`extract_company_fields`, `normalize_company_id`, `_extract_*_jobs`). - Priority: **High** ### API Layer: No Integration Tests - What's not tested: All FastAPI route handlers, auth flow, permission enforcement, error responses. - Files: `app/api/v1/` (all modules) - Risk: Auth bypass regressions, schema changes breaking clients, and incorrect HTTP status codes go undetected. - Priority: **High** ### Spider Logic: No Unit Tests - What's not tested: `IPAnomalyDetector.detect()`, `SmartIPManager` state transitions, response parsing in `BossZhipinAPI`, `QcwyApi`, `ZhilianApi`. - Files: `jobs_spider/boss/boos_api.py`, `jobs_spider/qcwy/qcwy.py`, `jobs_spider/zhilian/zhilian_single.py` - Risk: Anti-detection logic breaks silently when platforms change their response format. - Priority: **Medium** ### ClickHouse Schema: No Schema Evolution Tests - What's not tested: Behavior of `clickhouse_init.py` when tables already exist with old schemas; column migration paths. - Files: `app/core/clickhouse_init.py` - Risk: Production schema drift goes unnoticed until a query fails at runtime. - Priority: **Medium** --- ## Known Issues / Observed Bugs ### Silent Error Swallowing in Spider Code - Symptoms: Data silently not collected; no error surfaced to caller. - Files: `jobs_spider/qcwy/qcwy.py` (lines 178–179, 191–192, 1047–1048, 1146–1147), `jobs_spider/qcwy/search_company_jobs.py` (lines 240–241, 512–513, 538–539, 601–602), `jobs_spider/boss/boos_api.py` (lines 400, 407–408, 508–509, 518–519, 1779–1780), `jobs_spider/zhilian/company_spider.py` (many `except: pass` blocks) - Trigger: Network errors, parsing errors, or unexpected response shapes in spider code are caught by bare `except Exception: pass` blocks. - Workaround: Check logs for INFO-level skips; monitor push success counts. ### `_send_email` Uses Blocking SMTP in Async Context - Symptoms: Scheduler event loop blocked for up to several seconds during email send. - Files: `app/core/scheduler.py` (lines 308–336) - Trigger: Any `stats_job` or `ip_alert_job` execution that sends email calls synchronous `smtplib.SMTP` from the async event loop without wrapping in `asyncio.to_thread`. - Workaround: Email delivery may be delayed; event loop latency spikes during scheduled runs. ### `cleanup_old_records` Deletes All Done/Failed Records Without Archive - Symptoms: Historical processing data is permanently deleted daily; no audit trail of what was cleaned. - Files: `app/services/company_cleaner.py` (lines 329–331), `app/core/scheduler.py` (line 220) - Trigger: `daily_cleanup_job` calls `CompanyCleaningQueue.filter(status__in=["done", "failed"]).delete()` — a hard delete with no archiving or retention window. --- *Concerns audit: 2026-03-21*