JobData/.planning/codebase/CONCERNS.md

# Codebase Concerns

**Analysis Date:** 2026-03-21

---

## Security Considerations

### CRITICAL: Hardcoded Credentials in Source Code

**Area:** Configuration
- Risk: Real database passwords, SMTP credentials, and IP addresses committed to version control.
- Files: `app/settings/config.py`
- Exposed values (lines 27–52):
  - MySQL root password (`jobdata123`) with public IP `121.4.126.241:3306`
  - ClickHouse user/pass defaults (`data_user` / `data_pass`) with same public IP
  - QQ SMTP app password (`jkncedjzkummbbdi`) and recipient email addresses
- Current mitigation: `pydantic-settings` allows environment variable override, but defaults are live credentials that will be exposed if the repo is shared or CI logs are leaked.
- Recommendations: Remove all real default values; replace with `None` or clearly fake placeholders; enforce required-field validation at startup; rotate exposed credentials immediately.

### CRITICAL: Unauthenticated Data Ingest Endpoints

**Area:** API Authentication
- Risk: Any actor with network access can push arbitrary data directly into ClickHouse.
- Files: `app/api/v1/__init__.py` (lines 32–34), `app/api/v1/job/job.py`
- The `/ingest`, `/job`, and `/universal` router groups are mounted **without** any `DependPermission` or `DependAuth` dependency. The data store, batch-store, and batch-store-async endpoints accept unauthenticated POST requests.
- Trigger: Send a POST to `/api/v1/universal/data/batch-store-async` with any payload. No token required.
- Workaround: The CLAUDE.md acknowledges these as "internal call" endpoints, but they are publicly reachable with no network-level isolation noted.
- Recommendations: Add at minimum IP allowlist middleware or a static shared secret header check for ingest routes; or mount them behind `DependAuth`.

### HIGH: Dev Token Bypass in Production Auth

**Area:** Authentication
- Risk: If `APP_ENV=development`, any request with `token: dev` header is granted the first user's full privileges.
- Files: `app/core/dependency.py` (lines 28–30)
- The check `os.getenv("APP_ENV", "production") == "development"` defaults to `"production"`, but there is no enforcement that this variable is set correctly in production deployments. A misconfigured environment would expose the bypass.
- Recommendations: Remove the dev bypass entirely; use test fixtures instead; add a startup assertion that the bypass is disabled in non-dev environments.

### HIGH: SQL Injection via f-string Queries

**Area:** ClickHouse Queries
- Risk: User-controlled values inserted directly into query strings allow injection.
- Files:
  - `reclean_qcwy_jobs.py` (line 34): `f"SELECT json_data FROM job_data.qcwy_job WHERE job_id = '{job_id}' LIMIT 1"` — `job_id` comes from URL parsing of user-provided input.
  - `app/repositories/clickhouse_repo.py` (lines 101–108): `group_by_column` parameter is interpolated directly into the SELECT and GROUP BY clauses without validation.
  - `app/services/ingest/service.py` (lines 109, 112): `table` value is derived from config (lower risk) but still concatenated into f-strings.
  - `app/core/scheduler.py` (lines 57, 63): `table` names from a fixed list, but pattern is risky.
- Current mitigation: ClickHouse parameterized queries are used in `dedup.py`, but inconsistently applied elsewhere.
- Recommendations: Use parameterized queries (`parameters={}`) for all user-influenced values; validate `group_by_column` against an allowlist; never interpolate user input into SQL strings.

### MEDIUM: CORS Misconfiguration

**Area:** CORS Policy
- Risk: `CORS_ALLOW_CREDENTIALS = False` while `CORS_ALLOW_HEADERS = ["*"]` is set. If credentials are later enabled, wildcard headers become a security liability.
- Files: `app/settings/config.py` (lines 13–16)
- Recommendations: Restrict `CORS_ORIGINS` to known production domains; avoid wildcard headers.

### MEDIUM: Hardcoded Public IP in Spider Default

**Area:** Spider Configuration
- Risk: Hardcoded IP `124.222.106.226:9999` as the default `API_BASE_URL` in all spider scripts and `spiderJobs/` runner. If that IP changes hands or is exposed, all crawled data goes to an unintended recipient.
- Files:
  - `jobs_spider/boss/boos_api.py` (line 17)
  - `jobs_spider/qcwy/qcwy.py` (line 28)
  - `jobs_spider/zhilian/zhilian_single.py` (line 41)
  - `spiderJobs/runner/api_client.py` (line 31)
- Recommendations: Remove hardcoded IP defaults; require `API_BASE_URL` to be explicitly set via environment variable with no default.

---

## Tech Debt

### Duplicate Spider Implementations

**Area:** Code Organization
- Issue: Two parallel spider hierarchies exist for the same three platforms.
  - `jobs_spider/boss/`, `jobs_spider/qcwy/`, `jobs_spider/zhilian/` — original scripts
  - `spiderJobs/platforms/boss/`, `spiderJobs/platforms/job51/`, `spiderJobs/platforms/zhilian/` — refactored version
  - `app/services/crawler/` — yet another crawler abstraction inside the app
- Files: `jobs_spider/`, `spiderJobs/`, `app/services/crawler/`
- Impact: Bug fixes and anti-detection improvements must be applied in multiple places; the relationship between these directories is not documented.
- Fix approach: Decide on one canonical implementation; deprecate and delete the others; update CLAUDE.md to clarify which is the active spider.

### God File: `jobs_spider/boss/boos_api.py`

**Area:** File Size / Complexity
- Issue: Single file containing 2281 lines covering API client, IP strategy, anomaly detection, session management, main loop, and push logic.
- Files: `jobs_spider/boss/boos_api.py`
- Impact: Extremely difficult to test or modify any single responsibility; cognitive load is very high.
- Fix approach: Extract `SmartIPManager`, `IPAnomalyDetector`, `IPStrategyConfig` into separate modules; separate the main crawl loop from the API client class.

### `CleaningService` and `CompanyCleaner` Code Duplication

**Area:** Service Layer
- Issue: Both classes replicate identical token-loading logic (`_ensure_boss_token_loaded`), proxy-setting logic (`_apply_proxy`), and service instantiation (BossService, QcwyService, ZhilianService).
- Files:
  - `app/services/cleaning.py` (lines 37–48, 31–34)
  - `app/services/company_cleaner.py` (lines 60–74, 54–58)
- Impact: Token refresh logic is maintained in two places; a bug fix in one will not propagate to the other.
- Fix approach: Extract a shared `CrawlerContext` or `CrawlerMixin` class providing token management and proxy configuration.

### ClickHouse Schema Changes Require Manual DDL

**Area:** Database Migrations
- Issue: ClickHouse table definitions in `app/core/clickhouse_init.py` use `CREATE TABLE IF NOT EXISTS`, meaning schema changes (new columns, index changes) are never automatically applied to existing tables. There is no migration tooling equivalent to Aerich for ClickHouse.
- Files: `app/core/clickhouse_init.py`
- Impact: If a column is added to the DDL, existing production tables silently retain the old schema; queries referencing new columns will fail at runtime.
- Fix approach: Implement `ALTER TABLE ... ADD COLUMN IF NOT EXISTS` statements for schema evolution; or adopt a ClickHouse migration tool (e.g., `ch-migrations`).

### Startup File Lock Not Robust to Crashes

**Area:** Multi-Worker Initialization
- Issue: `init_data()` uses `os.mkdir(".startup_lock")` as a bare file lock with no TTL. If a worker crashes mid-initialization, the lock directory is never cleaned up, and subsequent restarts will skip all migrations and seed data.
- Files: `app/core/init_app.py` (lines 322–349)
- Impact: Silent data-initialization failure on restart after crash; can result in missing admin user or broken menu state.
- Fix approach: Replace with the existing `DistributedLock` class from `app/core/locks.py` which has TTL support; or at minimum add a time-based stale-lock cleanup.

---

## Performance Bottlenecks

### ClickHouse Singleton Connection Per-Worker

**Area:** Database Connections
- Problem: `clickhouse_manager` in `app/core/clickhouse.py` is a module-level singleton that lazily creates one connection. With 20 uvicorn workers (`UVICORN_WORKERS=20`), each worker maintains its own connection, and the connection pool inside each is limited to `maxsize=5`. Under concurrent load, requests block waiting for the pool.
- Files: `app/core/clickhouse.py` (lines 31–58)
- Cause: A new `PoolManager` is created on every `get_clickhouse_client()` call but the singleton pattern means `get_client()` only calls this once per worker — pool configuration cannot be tuned per endpoint.
- Improvement path: Use a connection pool library or connection factory pattern; expose pool size as a configurable setting; consider `clickhouse-pool` for proper async pooling.

### Dedup Query Scans Entire Table

**Area:** ClickHouse Deduplication
- Problem: `batch_dedup_filter` in `app/services/ingest/dedup.py` issues `SELECT key_col FROM table WHERE key_col IN (...)` with no time-bound filter. For large tables this is a full-table scan on each ingest batch.
- Files: `app/services/ingest/dedup.py` (lines 51, 65)
- Cause: No `PREWHERE created_at > now() - INTERVAL N DAY` constraint added.
- Improvement path: Add a configurable recency window to dedup queries (e.g., 30-day lookback matching the existing `days_back` logic in `company_cleaner.py`).

### Synchronous Crawler Calls Blocking Async Workers

**Area:** Concurrency
- Problem: All crawler service methods (BossService, QcwyService, ZhilianService) use synchronous `requests` internally, wrapped with `asyncio.to_thread()`. Under high concurrency, this can exhaust the thread pool.
- Files: `app/services/cleaning.py` (lines 186, 190, 194, etc.), `app/services/company_cleaner.py` (lines 318, 321, 325)
- Cause: The underlying crawler libraries use `requests` (sync), not `httpx` (async).
- Improvement path: Migrate crawlers to `httpx.AsyncClient` for true async I/O, or cap thread pool size via `asyncio.to_thread` semaphore.

---

## Fragile Areas

### `group_by_column` API Parameter Lacks Allowlist Validation

**Area:** Analytics API
- Files: `app/repositories/clickhouse_repo.py` (lines 86–115), `app/api/v1/analytics.py`
- Why fragile: `group_by_column` is passed directly into the ClickHouse query without validation against allowed column names. An invalid column name causes a runtime ClickHouse error that propagates as a 500 to the client; a malicious value enables injection.
- Safe modification: Add a `ALLOWED_GROUP_COLUMNS = frozenset({"source", "city", "channel", ...})` check before query construction.
- Test coverage: No tests for analytics query building.

### `DistributedLock` File Lock Has Race Window

**Area:** Distributed Locking
- Files: `app/core/locks.py` (lines 64–92)
- Why fragile: The stale-lock cleanup path (lines 81–87) involves `shutil.rmtree` followed by `mkdir` — a TOCTOU race where two workers can both detect staleness, both delete the directory, and both attempt to re-create it. One wins, one raises `FileExistsError` and falls through to return `False`, but the winner's lock is not guaranteed to be the intended holder.
- Safe modification: Use `os.rename` from a temp directory as an atomic lock-acquire primitive.
- Test coverage: None.

### `CleaningService` Returns Mixed Types (`bool | Dict`)

**Area:** Type Consistency
- Files: `app/services/cleaning.py` (lines 158–165, 167–202, 204–239, 241–268)
- Why fragile: Multiple `clean_*` methods return `Union[bool, Dict[str, Any]]`. Callers must defensively check `isinstance(result, bool)` before accessing dict keys (see `process_single_item` at lines 128–135). This makes the API fragile — any new call site that forgets the `bool` check will raise `AttributeError`.
- Safe modification: Define a `CleanResult` dataclass or TypedDict; always return it.
- Test coverage: None.

### `company_cleaning_job` Assumes 5-Minute Completion

**Area:** Scheduler
- Files: `app/core/scheduler.py` (lines 171–202)
- Why fragile: The job comment explicitly notes it processes 50 collect + 30 process operations and "should complete in 2–3 minutes" to fit within the 5-minute schedule window. If the external crawlers slow down (rate-limited, network issues), the job overruns its window, causing the next scheduled run to be blocked by the distributed lock, leading to queue backlog.
- Safe modification: Add timeout guards around `collect_pending_companies` and `process_pending_companies`; or dynamically reduce batch sizes based on elapsed time.

---

## Missing Critical Features

### No Rate Limiting on Any API Endpoint

**Problem:** FastAPI has no rate-limiting middleware configured. The data ingest endpoints (which have no auth) and the cleaning trigger endpoints can be called at arbitrary rates.
- Blocks: Protection against accidental or malicious flooding of ClickHouse with writes.
- Recommendation: Add `slowapi` or custom middleware with per-IP rate limiting on ingest and crawl-triggering endpoints.

### No Input Size Limit on Batch Ingest

**Problem:** `IngestBatchRequest.data_list` has no `max_items` constraint. A single request with tens of thousands of job records will be processed synchronously (or queued as a background task) with no guard on memory consumption.
- Files: `app/schemas/ingest.py`, `app/api/v1/job/job.py`
- Recommendation: Add `Field(max_length=1000)` or equivalent to `data_list`; add body size limit in uvicorn/nginx.

---

## Test Coverage Gaps

### Service Layer: Virtually Untested

- What's not tested: `CleaningService`, `CompanyCleaner`, `IngestService`, `AnalyticsService`, `DistributedLock`, `ClickHouseManager`
- Files: `app/services/cleaning.py`, `app/services/company_cleaner.py`, `app/services/ingest/service.py`, `app/services/analytics_service.py`, `app/core/locks.py`, `app/core/clickhouse.py`
- Risk: Core data routing, deduplication, and cleaning logic can regress silently. The existing tests only cover two pure-function helpers (`extract_company_fields`, `normalize_company_id`, `_extract_*_jobs`).
- Priority: **High**

### API Layer: No Integration Tests

- What's not tested: All FastAPI route handlers, auth flow, permission enforcement, error responses.
- Files: `app/api/v1/` (all modules)
- Risk: Auth bypass regressions, schema changes breaking clients, and incorrect HTTP status codes go undetected.
- Priority: **High**

### Spider Logic: No Unit Tests

- What's not tested: `IPAnomalyDetector.detect()`, `SmartIPManager` state transitions, response parsing in `BossZhipinAPI`, `QcwyApi`, `ZhilianApi`.
- Files: `jobs_spider/boss/boos_api.py`, `jobs_spider/qcwy/qcwy.py`, `jobs_spider/zhilian/zhilian_single.py`
- Risk: Anti-detection logic breaks silently when platforms change their response format.
- Priority: **Medium**

### ClickHouse Schema: No Schema Evolution Tests

- What's not tested: Behavior of `clickhouse_init.py` when tables already exist with old schemas; column migration paths.
- Files: `app/core/clickhouse_init.py`
- Risk: Production schema drift goes unnoticed until a query fails at runtime.
- Priority: **Medium**

---

## Known Issues / Observed Bugs

### Silent Error Swallowing in Spider Code

- Symptoms: Data silently not collected; no error surfaced to caller.
- Files: `jobs_spider/qcwy/qcwy.py` (lines 178–179, 191–192, 1047–1048, 1146–1147), `jobs_spider/qcwy/search_company_jobs.py` (lines 240–241, 512–513, 538–539, 601–602), `jobs_spider/boss/boos_api.py` (lines 400, 407–408, 508–509, 518–519, 1779–1780), `jobs_spider/zhilian/company_spider.py` (many `except: pass` blocks)
- Trigger: Network errors, parsing errors, or unexpected response shapes in spider code are caught by bare `except Exception: pass` blocks.
- Workaround: Check logs for INFO-level skips; monitor push success counts.

### `_send_email` Uses Blocking SMTP in Async Context

- Symptoms: Scheduler event loop blocked for up to several seconds during email send.
- Files: `app/core/scheduler.py` (lines 308–336)
- Trigger: Any `stats_job` or `ip_alert_job` execution that sends email calls synchronous `smtplib.SMTP` from the async event loop without wrapping in `asyncio.to_thread`.
- Workaround: Email delivery may be delayed; event loop latency spikes during scheduled runs.

### `cleanup_old_records` Deletes All Done/Failed Records Without Archive

- Symptoms: Historical processing data is permanently deleted daily; no audit trail of what was cleaned.
- Files: `app/services/company_cleaner.py` (lines 329–331), `app/core/scheduler.py` (line 220)
- Trigger: `daily_cleanup_job` calls `CompanyCleaningQueue.filter(status__in=["done", "failed"]).delete()` — a hard delete with no archiving or retention window.

---

*Concerns audit: 2026-03-21*