win 00a727519f docs(phase-3): complete execution — 2/2 plans, 98 tests passing

- ARCH-04: job51 migrated to crawler_core (no old deps)
- ARCH-05: zhilian migrated to crawler_core (no old deps)
- 34 new mock tests (17 job51 + 17 zhilian)
- Added _parse_zhilian_response custom parser for zhilian API format
- Fixed POST Searcher _request() overrides for job51/zhilian
- Full regression: 98 passed in 0.12s

2026-03-21 19:19:17 +08:00

16 KiB

Raw Blame History

Codebase Concerns

Analysis Date: 2026-03-21

Security Considerations

CRITICAL: Hardcoded Credentials in Source Code

Area: Configuration

Risk: Real database passwords, SMTP credentials, and IP addresses committed to version control.
Files: app/settings/config.py
Exposed values (lines 27–52):
- MySQL root password (jobdata123) with public IP 121.4.126.241:3306
- ClickHouse user/pass defaults (data_user / data_pass) with same public IP
- QQ SMTP app password (jkncedjzkummbbdi) and recipient email addresses
Current mitigation: pydantic-settings allows environment variable override, but defaults are live credentials that will be exposed if the repo is shared or CI logs are leaked.
Recommendations: Remove all real default values; replace with None or clearly fake placeholders; enforce required-field validation at startup; rotate exposed credentials immediately.

CRITICAL: Unauthenticated Data Ingest Endpoints

Area: API Authentication

Risk: Any actor with network access can push arbitrary data directly into ClickHouse.
Files: app/api/v1/__init__.py (lines 32–34), app/api/v1/job/job.py
The /ingest, /job, and /universal router groups are mounted without any DependPermission or DependAuth dependency. The data store, batch-store, and batch-store-async endpoints accept unauthenticated POST requests.
Trigger: Send a POST to /api/v1/universal/data/batch-store-async with any payload. No token required.
Workaround: The CLAUDE.md acknowledges these as "internal call" endpoints, but they are publicly reachable with no network-level isolation noted.
Recommendations: Add at minimum IP allowlist middleware or a static shared secret header check for ingest routes; or mount them behind DependAuth.

HIGH: Dev Token Bypass in Production Auth

Area: Authentication

Risk: If APP_ENV=development, any request with token: dev header is granted the first user's full privileges.
Files: app/core/dependency.py (lines 28–30)
The check os.getenv("APP_ENV", "production") == "development" defaults to "production", but there is no enforcement that this variable is set correctly in production deployments. A misconfigured environment would expose the bypass.
Recommendations: Remove the dev bypass entirely; use test fixtures instead; add a startup assertion that the bypass is disabled in non-dev environments.

HIGH: SQL Injection via f-string Queries

Area: ClickHouse Queries

Risk: User-controlled values inserted directly into query strings allow injection.
Files:
- reclean_qcwy_jobs.py (line 34): f"SELECT json_data FROM job_data.qcwy_job WHERE job_id = '{job_id}' LIMIT 1" — job_id comes from URL parsing of user-provided input.
- app/repositories/clickhouse_repo.py (lines 101–108): group_by_column parameter is interpolated directly into the SELECT and GROUP BY clauses without validation.
- app/services/ingest/service.py (lines 109, 112): table value is derived from config (lower risk) but still concatenated into f-strings.
- app/core/scheduler.py (lines 57, 63): table names from a fixed list, but pattern is risky.
Current mitigation: ClickHouse parameterized queries are used in dedup.py, but inconsistently applied elsewhere.
Recommendations: Use parameterized queries (parameters={}) for all user-influenced values; validate group_by_column against an allowlist; never interpolate user input into SQL strings.

MEDIUM: CORS Misconfiguration

Area: CORS Policy

Risk: CORS_ALLOW_CREDENTIALS = False while CORS_ALLOW_HEADERS = ["*"] is set. If credentials are later enabled, wildcard headers become a security liability.
Files: app/settings/config.py (lines 13–16)
Recommendations: Restrict CORS_ORIGINS to known production domains; avoid wildcard headers.

MEDIUM: Hardcoded Public IP in Spider Default

Area: Spider Configuration

Risk: Hardcoded IP 124.222.106.226:9999 as the default API_BASE_URL in all spider scripts and spiderJobs/ runner. If that IP changes hands or is exposed, all crawled data goes to an unintended recipient.
Files:
- jobs_spider/boss/boos_api.py (line 17)
- jobs_spider/qcwy/qcwy.py (line 28)
- jobs_spider/zhilian/zhilian_single.py (line 41)
- spiderJobs/runner/api_client.py (line 31)
Recommendations: Remove hardcoded IP defaults; require API_BASE_URL to be explicitly set via environment variable with no default.

Tech Debt

Duplicate Spider Implementations

Area: Code Organization

Issue: Two parallel spider hierarchies exist for the same three platforms.
- jobs_spider/boss/, jobs_spider/qcwy/, jobs_spider/zhilian/ — original scripts
- spiderJobs/platforms/boss/, spiderJobs/platforms/job51/, spiderJobs/platforms/zhilian/ — refactored version
- app/services/crawler/ — yet another crawler abstraction inside the app
Files: jobs_spider/, spiderJobs/, app/services/crawler/
Impact: Bug fixes and anti-detection improvements must be applied in multiple places; the relationship between these directories is not documented.
Fix approach: Decide on one canonical implementation; deprecate and delete the others; update CLAUDE.md to clarify which is the active spider.

God File: `jobs_spider/boss/boos_api.py`

Area: File Size / Complexity

Issue: Single file containing 2281 lines covering API client, IP strategy, anomaly detection, session management, main loop, and push logic.
Files: jobs_spider/boss/boos_api.py
Impact: Extremely difficult to test or modify any single responsibility; cognitive load is very high.
Fix approach: Extract SmartIPManager, IPAnomalyDetector, IPStrategyConfig into separate modules; separate the main crawl loop from the API client class.

`CleaningService` and `CompanyCleaner` Code Duplication

Area: Service Layer

Issue: Both classes replicate identical token-loading logic (_ensure_boss_token_loaded), proxy-setting logic (_apply_proxy), and service instantiation (BossService, QcwyService, ZhilianService).
Files:
- app/services/cleaning.py (lines 37–48, 31–34)
- app/services/company_cleaner.py (lines 60–74, 54–58)
Impact: Token refresh logic is maintained in two places; a bug fix in one will not propagate to the other.
Fix approach: Extract a shared CrawlerContext or CrawlerMixin class providing token management and proxy configuration.

ClickHouse Schema Changes Require Manual DDL

Area: Database Migrations

Issue: ClickHouse table definitions in app/core/clickhouse_init.py use CREATE TABLE IF NOT EXISTS, meaning schema changes (new columns, index changes) are never automatically applied to existing tables. There is no migration tooling equivalent to Aerich for ClickHouse.
Files: app/core/clickhouse_init.py
Impact: If a column is added to the DDL, existing production tables silently retain the old schema; queries referencing new columns will fail at runtime.
Fix approach: Implement ALTER TABLE ... ADD COLUMN IF NOT EXISTS statements for schema evolution; or adopt a ClickHouse migration tool (e.g., ch-migrations).

Startup File Lock Not Robust to Crashes

Area: Multi-Worker Initialization

Issue: init_data() uses os.mkdir(".startup_lock") as a bare file lock with no TTL. If a worker crashes mid-initialization, the lock directory is never cleaned up, and subsequent restarts will skip all migrations and seed data.
Files: app/core/init_app.py (lines 322–349)
Impact: Silent data-initialization failure on restart after crash; can result in missing admin user or broken menu state.
Fix approach: Replace with the existing DistributedLock class from app/core/locks.py which has TTL support; or at minimum add a time-based stale-lock cleanup.

Performance Bottlenecks

ClickHouse Singleton Connection Per-Worker

Area: Database Connections

Problem: clickhouse_manager in app/core/clickhouse.py is a module-level singleton that lazily creates one connection. With 20 uvicorn workers (UVICORN_WORKERS=20), each worker maintains its own connection, and the connection pool inside each is limited to maxsize=5. Under concurrent load, requests block waiting for the pool.
Files: app/core/clickhouse.py (lines 31–58)
Cause: A new PoolManager is created on every get_clickhouse_client() call but the singleton pattern means get_client() only calls this once per worker — pool configuration cannot be tuned per endpoint.
Improvement path: Use a connection pool library or connection factory pattern; expose pool size as a configurable setting; consider clickhouse-pool for proper async pooling.

Dedup Query Scans Entire Table

Area: ClickHouse Deduplication

Problem: batch_dedup_filter in app/services/ingest/dedup.py issues SELECT key_col FROM table WHERE key_col IN (...) with no time-bound filter. For large tables this is a full-table scan on each ingest batch.
Files: app/services/ingest/dedup.py (lines 51, 65)
Cause: No PREWHERE created_at > now() - INTERVAL N DAY constraint added.
Improvement path: Add a configurable recency window to dedup queries (e.g., 30-day lookback matching the existing days_back logic in company_cleaner.py).

Synchronous Crawler Calls Blocking Async Workers

Area: Concurrency

Problem: All crawler service methods (BossService, QcwyService, ZhilianService) use synchronous requests internally, wrapped with asyncio.to_thread(). Under high concurrency, this can exhaust the thread pool.
Files: app/services/cleaning.py (lines 186, 190, 194, etc.), app/services/company_cleaner.py (lines 318, 321, 325)
Cause: The underlying crawler libraries use requests (sync), not httpx (async).
Improvement path: Migrate crawlers to httpx.AsyncClient for true async I/O, or cap thread pool size via asyncio.to_thread semaphore.

Fragile Areas

`group_by_column` API Parameter Lacks Allowlist Validation

Area: Analytics API

Files: app/repositories/clickhouse_repo.py (lines 86–115), app/api/v1/analytics.py
Why fragile: group_by_column is passed directly into the ClickHouse query without validation against allowed column names. An invalid column name causes a runtime ClickHouse error that propagates as a 500 to the client; a malicious value enables injection.
Safe modification: Add a ALLOWED_GROUP_COLUMNS = frozenset({"source", "city", "channel", ...}) check before query construction.
Test coverage: No tests for analytics query building.

`DistributedLock` File Lock Has Race Window

Area: Distributed Locking

Files: app/core/locks.py (lines 64–92)
Why fragile: The stale-lock cleanup path (lines 81–87) involves shutil.rmtree followed by mkdir — a TOCTOU race where two workers can both detect staleness, both delete the directory, and both attempt to re-create it. One wins, one raises FileExistsError and falls through to return False, but the winner's lock is not guaranteed to be the intended holder.
Safe modification: Use os.rename from a temp directory as an atomic lock-acquire primitive.
Test coverage: None.

`CleaningService` Returns Mixed Types (`bool | Dict`)

Area: Type Consistency

Files: app/services/cleaning.py (lines 158–165, 167–202, 204–239, 241–268)
Why fragile: Multiple clean_* methods return Union[bool, Dict[str, Any]]. Callers must defensively check isinstance(result, bool) before accessing dict keys (see process_single_item at lines 128–135). This makes the API fragile — any new call site that forgets the bool check will raise AttributeError.
Safe modification: Define a CleanResult dataclass or TypedDict; always return it.
Test coverage: None.

`company_cleaning_job` Assumes 5-Minute Completion

Area: Scheduler

Files: app/core/scheduler.py (lines 171–202)
Why fragile: The job comment explicitly notes it processes 50 collect + 30 process operations and "should complete in 2–3 minutes" to fit within the 5-minute schedule window. If the external crawlers slow down (rate-limited, network issues), the job overruns its window, causing the next scheduled run to be blocked by the distributed lock, leading to queue backlog.
Safe modification: Add timeout guards around collect_pending_companies and process_pending_companies; or dynamically reduce batch sizes based on elapsed time.

Missing Critical Features

No Rate Limiting on Any API Endpoint

Problem: FastAPI has no rate-limiting middleware configured. The data ingest endpoints (which have no auth) and the cleaning trigger endpoints can be called at arbitrary rates.

Blocks: Protection against accidental or malicious flooding of ClickHouse with writes.
Recommendation: Add slowapi or custom middleware with per-IP rate limiting on ingest and crawl-triggering endpoints.

No Input Size Limit on Batch Ingest

Problem: IngestBatchRequest.data_list has no max_items constraint. A single request with tens of thousands of job records will be processed synchronously (or queued as a background task) with no guard on memory consumption.

Files: app/schemas/ingest.py, app/api/v1/job/job.py
Recommendation: Add Field(max_length=1000) or equivalent to data_list; add body size limit in uvicorn/nginx.

Test Coverage Gaps

Service Layer: Virtually Untested

What's not tested: CleaningService, CompanyCleaner, IngestService, AnalyticsService, DistributedLock, ClickHouseManager
Files: app/services/cleaning.py, app/services/company_cleaner.py, app/services/ingest/service.py, app/services/analytics_service.py, app/core/locks.py, app/core/clickhouse.py
Risk: Core data routing, deduplication, and cleaning logic can regress silently. The existing tests only cover two pure-function helpers (extract_company_fields, normalize_company_id, _extract_*_jobs).
Priority: High

API Layer: No Integration Tests

What's not tested: All FastAPI route handlers, auth flow, permission enforcement, error responses.
Files: app/api/v1/ (all modules)
Risk: Auth bypass regressions, schema changes breaking clients, and incorrect HTTP status codes go undetected.
Priority: High

Spider Logic: No Unit Tests

What's not tested: IPAnomalyDetector.detect(), SmartIPManager state transitions, response parsing in BossZhipinAPI, QcwyApi, ZhilianApi.
Files: jobs_spider/boss/boos_api.py, jobs_spider/qcwy/qcwy.py, jobs_spider/zhilian/zhilian_single.py
Risk: Anti-detection logic breaks silently when platforms change their response format.
Priority: Medium

ClickHouse Schema: No Schema Evolution Tests

What's not tested: Behavior of clickhouse_init.py when tables already exist with old schemas; column migration paths.
Files: app/core/clickhouse_init.py
Risk: Production schema drift goes unnoticed until a query fails at runtime.
Priority: Medium

Known Issues / Observed Bugs

Silent Error Swallowing in Spider Code

Symptoms: Data silently not collected; no error surfaced to caller.
Files: jobs_spider/qcwy/qcwy.py (lines 178–179, 191–192, 1047–1048, 1146–1147), jobs_spider/qcwy/search_company_jobs.py (lines 240–241, 512–513, 538–539, 601–602), jobs_spider/boss/boos_api.py (lines 400, 407–408, 508–509, 518–519, 1779–1780), jobs_spider/zhilian/company_spider.py (many except: pass blocks)
Trigger: Network errors, parsing errors, or unexpected response shapes in spider code are caught by bare except Exception: pass blocks.
Workaround: Check logs for INFO-level skips; monitor push success counts.

`_send_email` Uses Blocking SMTP in Async Context

Symptoms: Scheduler event loop blocked for up to several seconds during email send.
Files: app/core/scheduler.py (lines 308–336)
Trigger: Any stats_job or ip_alert_job execution that sends email calls synchronous smtplib.SMTP from the async event loop without wrapping in asyncio.to_thread.
Workaround: Email delivery may be delayed; event loop latency spikes during scheduled runs.

`cleanup_old_records` Deletes All Done/Failed Records Without Archive

Symptoms: Historical processing data is permanently deleted daily; no audit trail of what was cleaned.
Files: app/services/company_cleaner.py (lines 329–331), app/core/scheduler.py (line 220)
Trigger: daily_cleanup_job calls CompanyCleaningQueue.filter(status__in=["done", "failed"]).delete() — a hard delete with no archiving or retention window.

Concerns audit: 2026-03-21

16 KiB Raw Blame History Unescape Escape