win 00a727519f docs(phase-3): complete execution — 2/2 plans, 98 tests passing
- ARCH-04: job51 migrated to crawler_core (no old deps)
- ARCH-05: zhilian migrated to crawler_core (no old deps)
- 34 new mock tests (17 job51 + 17 zhilian)
- Added _parse_zhilian_response custom parser for zhilian API format
- Fixed POST Searcher _request() overrides for job51/zhilian
- Full regression: 98 passed in 0.12s
2026-03-21 19:19:17 +08:00

16 KiB
Raw Permalink Blame History

Codebase Concerns

Analysis Date: 2026-03-21


Security Considerations

CRITICAL: Hardcoded Credentials in Source Code

Area: Configuration

  • Risk: Real database passwords, SMTP credentials, and IP addresses committed to version control.
  • Files: app/settings/config.py
  • Exposed values (lines 2752):
    • MySQL root password (jobdata123) with public IP 121.4.126.241:3306
    • ClickHouse user/pass defaults (data_user / data_pass) with same public IP
    • QQ SMTP app password (jkncedjzkummbbdi) and recipient email addresses
  • Current mitigation: pydantic-settings allows environment variable override, but defaults are live credentials that will be exposed if the repo is shared or CI logs are leaked.
  • Recommendations: Remove all real default values; replace with None or clearly fake placeholders; enforce required-field validation at startup; rotate exposed credentials immediately.

CRITICAL: Unauthenticated Data Ingest Endpoints

Area: API Authentication

  • Risk: Any actor with network access can push arbitrary data directly into ClickHouse.
  • Files: app/api/v1/__init__.py (lines 3234), app/api/v1/job/job.py
  • The /ingest, /job, and /universal router groups are mounted without any DependPermission or DependAuth dependency. The data store, batch-store, and batch-store-async endpoints accept unauthenticated POST requests.
  • Trigger: Send a POST to /api/v1/universal/data/batch-store-async with any payload. No token required.
  • Workaround: The CLAUDE.md acknowledges these as "internal call" endpoints, but they are publicly reachable with no network-level isolation noted.
  • Recommendations: Add at minimum IP allowlist middleware or a static shared secret header check for ingest routes; or mount them behind DependAuth.

HIGH: Dev Token Bypass in Production Auth

Area: Authentication

  • Risk: If APP_ENV=development, any request with token: dev header is granted the first user's full privileges.
  • Files: app/core/dependency.py (lines 2830)
  • The check os.getenv("APP_ENV", "production") == "development" defaults to "production", but there is no enforcement that this variable is set correctly in production deployments. A misconfigured environment would expose the bypass.
  • Recommendations: Remove the dev bypass entirely; use test fixtures instead; add a startup assertion that the bypass is disabled in non-dev environments.

HIGH: SQL Injection via f-string Queries

Area: ClickHouse Queries

  • Risk: User-controlled values inserted directly into query strings allow injection.
  • Files:
    • reclean_qcwy_jobs.py (line 34): f"SELECT json_data FROM job_data.qcwy_job WHERE job_id = '{job_id}' LIMIT 1"job_id comes from URL parsing of user-provided input.
    • app/repositories/clickhouse_repo.py (lines 101108): group_by_column parameter is interpolated directly into the SELECT and GROUP BY clauses without validation.
    • app/services/ingest/service.py (lines 109, 112): table value is derived from config (lower risk) but still concatenated into f-strings.
    • app/core/scheduler.py (lines 57, 63): table names from a fixed list, but pattern is risky.
  • Current mitigation: ClickHouse parameterized queries are used in dedup.py, but inconsistently applied elsewhere.
  • Recommendations: Use parameterized queries (parameters={}) for all user-influenced values; validate group_by_column against an allowlist; never interpolate user input into SQL strings.

MEDIUM: CORS Misconfiguration

Area: CORS Policy

  • Risk: CORS_ALLOW_CREDENTIALS = False while CORS_ALLOW_HEADERS = ["*"] is set. If credentials are later enabled, wildcard headers become a security liability.
  • Files: app/settings/config.py (lines 1316)
  • Recommendations: Restrict CORS_ORIGINS to known production domains; avoid wildcard headers.

MEDIUM: Hardcoded Public IP in Spider Default

Area: Spider Configuration

  • Risk: Hardcoded IP 124.222.106.226:9999 as the default API_BASE_URL in all spider scripts and spiderJobs/ runner. If that IP changes hands or is exposed, all crawled data goes to an unintended recipient.
  • Files:
    • jobs_spider/boss/boos_api.py (line 17)
    • jobs_spider/qcwy/qcwy.py (line 28)
    • jobs_spider/zhilian/zhilian_single.py (line 41)
    • spiderJobs/runner/api_client.py (line 31)
  • Recommendations: Remove hardcoded IP defaults; require API_BASE_URL to be explicitly set via environment variable with no default.

Tech Debt

Duplicate Spider Implementations

Area: Code Organization

  • Issue: Two parallel spider hierarchies exist for the same three platforms.
    • jobs_spider/boss/, jobs_spider/qcwy/, jobs_spider/zhilian/ — original scripts
    • spiderJobs/platforms/boss/, spiderJobs/platforms/job51/, spiderJobs/platforms/zhilian/ — refactored version
    • app/services/crawler/ — yet another crawler abstraction inside the app
  • Files: jobs_spider/, spiderJobs/, app/services/crawler/
  • Impact: Bug fixes and anti-detection improvements must be applied in multiple places; the relationship between these directories is not documented.
  • Fix approach: Decide on one canonical implementation; deprecate and delete the others; update CLAUDE.md to clarify which is the active spider.

God File: jobs_spider/boss/boos_api.py

Area: File Size / Complexity

  • Issue: Single file containing 2281 lines covering API client, IP strategy, anomaly detection, session management, main loop, and push logic.
  • Files: jobs_spider/boss/boos_api.py
  • Impact: Extremely difficult to test or modify any single responsibility; cognitive load is very high.
  • Fix approach: Extract SmartIPManager, IPAnomalyDetector, IPStrategyConfig into separate modules; separate the main crawl loop from the API client class.

CleaningService and CompanyCleaner Code Duplication

Area: Service Layer

  • Issue: Both classes replicate identical token-loading logic (_ensure_boss_token_loaded), proxy-setting logic (_apply_proxy), and service instantiation (BossService, QcwyService, ZhilianService).
  • Files:
    • app/services/cleaning.py (lines 3748, 3134)
    • app/services/company_cleaner.py (lines 6074, 5458)
  • Impact: Token refresh logic is maintained in two places; a bug fix in one will not propagate to the other.
  • Fix approach: Extract a shared CrawlerContext or CrawlerMixin class providing token management and proxy configuration.

ClickHouse Schema Changes Require Manual DDL

Area: Database Migrations

  • Issue: ClickHouse table definitions in app/core/clickhouse_init.py use CREATE TABLE IF NOT EXISTS, meaning schema changes (new columns, index changes) are never automatically applied to existing tables. There is no migration tooling equivalent to Aerich for ClickHouse.
  • Files: app/core/clickhouse_init.py
  • Impact: If a column is added to the DDL, existing production tables silently retain the old schema; queries referencing new columns will fail at runtime.
  • Fix approach: Implement ALTER TABLE ... ADD COLUMN IF NOT EXISTS statements for schema evolution; or adopt a ClickHouse migration tool (e.g., ch-migrations).

Startup File Lock Not Robust to Crashes

Area: Multi-Worker Initialization

  • Issue: init_data() uses os.mkdir(".startup_lock") as a bare file lock with no TTL. If a worker crashes mid-initialization, the lock directory is never cleaned up, and subsequent restarts will skip all migrations and seed data.
  • Files: app/core/init_app.py (lines 322349)
  • Impact: Silent data-initialization failure on restart after crash; can result in missing admin user or broken menu state.
  • Fix approach: Replace with the existing DistributedLock class from app/core/locks.py which has TTL support; or at minimum add a time-based stale-lock cleanup.

Performance Bottlenecks

ClickHouse Singleton Connection Per-Worker

Area: Database Connections

  • Problem: clickhouse_manager in app/core/clickhouse.py is a module-level singleton that lazily creates one connection. With 20 uvicorn workers (UVICORN_WORKERS=20), each worker maintains its own connection, and the connection pool inside each is limited to maxsize=5. Under concurrent load, requests block waiting for the pool.
  • Files: app/core/clickhouse.py (lines 3158)
  • Cause: A new PoolManager is created on every get_clickhouse_client() call but the singleton pattern means get_client() only calls this once per worker — pool configuration cannot be tuned per endpoint.
  • Improvement path: Use a connection pool library or connection factory pattern; expose pool size as a configurable setting; consider clickhouse-pool for proper async pooling.

Dedup Query Scans Entire Table

Area: ClickHouse Deduplication

  • Problem: batch_dedup_filter in app/services/ingest/dedup.py issues SELECT key_col FROM table WHERE key_col IN (...) with no time-bound filter. For large tables this is a full-table scan on each ingest batch.
  • Files: app/services/ingest/dedup.py (lines 51, 65)
  • Cause: No PREWHERE created_at > now() - INTERVAL N DAY constraint added.
  • Improvement path: Add a configurable recency window to dedup queries (e.g., 30-day lookback matching the existing days_back logic in company_cleaner.py).

Synchronous Crawler Calls Blocking Async Workers

Area: Concurrency

  • Problem: All crawler service methods (BossService, QcwyService, ZhilianService) use synchronous requests internally, wrapped with asyncio.to_thread(). Under high concurrency, this can exhaust the thread pool.
  • Files: app/services/cleaning.py (lines 186, 190, 194, etc.), app/services/company_cleaner.py (lines 318, 321, 325)
  • Cause: The underlying crawler libraries use requests (sync), not httpx (async).
  • Improvement path: Migrate crawlers to httpx.AsyncClient for true async I/O, or cap thread pool size via asyncio.to_thread semaphore.

Fragile Areas

group_by_column API Parameter Lacks Allowlist Validation

Area: Analytics API

  • Files: app/repositories/clickhouse_repo.py (lines 86115), app/api/v1/analytics.py
  • Why fragile: group_by_column is passed directly into the ClickHouse query without validation against allowed column names. An invalid column name causes a runtime ClickHouse error that propagates as a 500 to the client; a malicious value enables injection.
  • Safe modification: Add a ALLOWED_GROUP_COLUMNS = frozenset({"source", "city", "channel", ...}) check before query construction.
  • Test coverage: No tests for analytics query building.

DistributedLock File Lock Has Race Window

Area: Distributed Locking

  • Files: app/core/locks.py (lines 6492)
  • Why fragile: The stale-lock cleanup path (lines 8187) involves shutil.rmtree followed by mkdir — a TOCTOU race where two workers can both detect staleness, both delete the directory, and both attempt to re-create it. One wins, one raises FileExistsError and falls through to return False, but the winner's lock is not guaranteed to be the intended holder.
  • Safe modification: Use os.rename from a temp directory as an atomic lock-acquire primitive.
  • Test coverage: None.

CleaningService Returns Mixed Types (bool | Dict)

Area: Type Consistency

  • Files: app/services/cleaning.py (lines 158165, 167202, 204239, 241268)
  • Why fragile: Multiple clean_* methods return Union[bool, Dict[str, Any]]. Callers must defensively check isinstance(result, bool) before accessing dict keys (see process_single_item at lines 128135). This makes the API fragile — any new call site that forgets the bool check will raise AttributeError.
  • Safe modification: Define a CleanResult dataclass or TypedDict; always return it.
  • Test coverage: None.

company_cleaning_job Assumes 5-Minute Completion

Area: Scheduler

  • Files: app/core/scheduler.py (lines 171202)
  • Why fragile: The job comment explicitly notes it processes 50 collect + 30 process operations and "should complete in 23 minutes" to fit within the 5-minute schedule window. If the external crawlers slow down (rate-limited, network issues), the job overruns its window, causing the next scheduled run to be blocked by the distributed lock, leading to queue backlog.
  • Safe modification: Add timeout guards around collect_pending_companies and process_pending_companies; or dynamically reduce batch sizes based on elapsed time.

Missing Critical Features

No Rate Limiting on Any API Endpoint

Problem: FastAPI has no rate-limiting middleware configured. The data ingest endpoints (which have no auth) and the cleaning trigger endpoints can be called at arbitrary rates.

  • Blocks: Protection against accidental or malicious flooding of ClickHouse with writes.
  • Recommendation: Add slowapi or custom middleware with per-IP rate limiting on ingest and crawl-triggering endpoints.

No Input Size Limit on Batch Ingest

Problem: IngestBatchRequest.data_list has no max_items constraint. A single request with tens of thousands of job records will be processed synchronously (or queued as a background task) with no guard on memory consumption.

  • Files: app/schemas/ingest.py, app/api/v1/job/job.py
  • Recommendation: Add Field(max_length=1000) or equivalent to data_list; add body size limit in uvicorn/nginx.

Test Coverage Gaps

Service Layer: Virtually Untested

  • What's not tested: CleaningService, CompanyCleaner, IngestService, AnalyticsService, DistributedLock, ClickHouseManager
  • Files: app/services/cleaning.py, app/services/company_cleaner.py, app/services/ingest/service.py, app/services/analytics_service.py, app/core/locks.py, app/core/clickhouse.py
  • Risk: Core data routing, deduplication, and cleaning logic can regress silently. The existing tests only cover two pure-function helpers (extract_company_fields, normalize_company_id, _extract_*_jobs).
  • Priority: High

API Layer: No Integration Tests

  • What's not tested: All FastAPI route handlers, auth flow, permission enforcement, error responses.
  • Files: app/api/v1/ (all modules)
  • Risk: Auth bypass regressions, schema changes breaking clients, and incorrect HTTP status codes go undetected.
  • Priority: High

Spider Logic: No Unit Tests

  • What's not tested: IPAnomalyDetector.detect(), SmartIPManager state transitions, response parsing in BossZhipinAPI, QcwyApi, ZhilianApi.
  • Files: jobs_spider/boss/boos_api.py, jobs_spider/qcwy/qcwy.py, jobs_spider/zhilian/zhilian_single.py
  • Risk: Anti-detection logic breaks silently when platforms change their response format.
  • Priority: Medium

ClickHouse Schema: No Schema Evolution Tests

  • What's not tested: Behavior of clickhouse_init.py when tables already exist with old schemas; column migration paths.
  • Files: app/core/clickhouse_init.py
  • Risk: Production schema drift goes unnoticed until a query fails at runtime.
  • Priority: Medium

Known Issues / Observed Bugs

Silent Error Swallowing in Spider Code

  • Symptoms: Data silently not collected; no error surfaced to caller.
  • Files: jobs_spider/qcwy/qcwy.py (lines 178179, 191192, 10471048, 11461147), jobs_spider/qcwy/search_company_jobs.py (lines 240241, 512513, 538539, 601602), jobs_spider/boss/boos_api.py (lines 400, 407408, 508509, 518519, 17791780), jobs_spider/zhilian/company_spider.py (many except: pass blocks)
  • Trigger: Network errors, parsing errors, or unexpected response shapes in spider code are caught by bare except Exception: pass blocks.
  • Workaround: Check logs for INFO-level skips; monitor push success counts.

_send_email Uses Blocking SMTP in Async Context

  • Symptoms: Scheduler event loop blocked for up to several seconds during email send.
  • Files: app/core/scheduler.py (lines 308336)
  • Trigger: Any stats_job or ip_alert_job execution that sends email calls synchronous smtplib.SMTP from the async event loop without wrapping in asyncio.to_thread.
  • Workaround: Email delivery may be delayed; event loop latency spikes during scheduled runs.

cleanup_old_records Deletes All Done/Failed Records Without Archive

  • Symptoms: Historical processing data is permanently deleted daily; no audit trail of what was cleaned.
  • Files: app/services/company_cleaner.py (lines 329331), app/core/scheduler.py (line 220)
  • Trigger: daily_cleanup_job calls CompanyCleaningQueue.filter(status__in=["done", "failed"]).delete() — a hard delete with no archiving or retention window.

Concerns audit: 2026-03-21