
B. Runtime Reliability

Parent: Code Review Index
Priority: blocks scaling beyond one concurrent reviewer
Surface affected: backend (server/), Dockerfile, CI

These findings decide whether the hosted runtime behaves correctly under light pilot load. Most are one-shot fixes, not architectural rewrites.


B1 — JSON file stores have no locking or atomic writes

  • Status: Open
  • Impact: H | Complexity: M | Time: 1 W
  • Files: server/review_store.py:38-44, server/model_store.py, server/dfm_review_jobs.py, server/dfm_candidate_session_store.py, server/analysis_runs.py

Finding. Stores read and write JSON files via naïve open()/json.dump() patterns. No atomic rename, no fcntl lock, no versioning. Two concurrent requests for the same model can interleave reads and writes and corrupt state. The wiki promises an "immutable activity log with audit trail" (MVP v0 spec) — that claim cannot be made until this is fixed.

Action for Codex.
1. Introduce server/_atomic_json.py (a sketch follows this list) with:
   - read_json(path) -> dict
   - write_json_atomic(path, data) — write to path.tmp, fsync, then os.replace(path.tmp, path).
   - with_file_lock(path) — context manager using filelock (cross-platform), keyed on path + ".lock".
2. Refactor each _read_store/_write_store pair in the five store modules to use the helpers, holding the lock around the full read-modify-write sequence.
3. Add a concurrency regression test that spawns 10 threads each calling create_ticket on the same model; assert the final count equals 10.
4. Document the per-store lock scope in a short docs/contracts/persistence.md.
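A minimal sketch of the helper module, assuming filelock is added as a dependency (the read/write helpers themselves are stdlib-only):

    # server/_atomic_json.py: sketch only; names match the action items above.
    import json
    import os
    from contextlib import contextmanager
    from pathlib import Path

    from filelock import FileLock  # assumed new dependency


    def read_json(path: str) -> dict:
        """Return the parsed JSON document, or an empty dict if the file is missing."""
        p = Path(path)
        if not p.exists():
            return {}
        with p.open("r", encoding="utf-8") as fh:
            return json.load(fh)


    def write_json_atomic(path: str, data: dict) -> None:
        """Write to path.tmp, fsync, then atomically replace the target file."""
        tmp_path = f"{path}.tmp"
        with open(tmp_path, "w", encoding="utf-8") as fh:
            json.dump(data, fh, indent=2)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp_path, path)  # atomic on POSIX and Windows


    @contextmanager
    def with_file_lock(path: str, timeout: float = 10.0):
        """Hold a cross-process lock keyed on path + '.lock' around a read-modify-write."""
        lock = FileLock(f"{path}.lock", timeout=timeout)
        with lock:
            yield

Each store's read-modify-write then becomes: enter with_file_lock(path), call read_json, mutate the dict, call write_json_atomic, all before the lock is released.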

Acceptance criteria.
- Regression test passes repeatedly (run in CI under pytest -n auto).
- write_json_atomic is the only writer of reviews.json, models.json, etc.
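The regression test from step 3 could look like the sketch below; create_ticket is named in the action item, while list_tickets and STORE_PATH are assumed names to adapt to the real review_store API:

    # tests/test_review_store_concurrency.py: sketch; create_ticket/list_tickets/STORE_PATH are assumed.
    import threading

    from server import review_store


    def test_concurrent_create_ticket_does_not_lose_writes(tmp_path, monkeypatch):
        # Point the store at an isolated file so the test never touches real data.
        monkeypatch.setattr(review_store, "STORE_PATH", tmp_path / "reviews.json", raising=False)

        def worker(i: int) -> None:
            review_store.create_ticket("model-1", {"note": f"ticket-{i}"})

        threads = [threading.Thread(target=worker, args=(i,)) for i in range(10)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

        tickets = review_store.list_tickets("model-1")
        assert len(tickets) == 10  # no interleaved read-modify-write lost a ticket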

Depends on: none. Unlocks F1.


B2 — Async route handlers do blocking CAD work on the event loop

  • Status: Open
  • Impact: H | Complexity: M | Time: 1 W
  • Files: server/main.py (every async def handler that calls cad_service.*, part_facts_service.*, dfm_review_v2.*, etc.), server/cad_service.py, server/cad_service_occ.py, server/part_facts.py

Finding. Every route handler is async def, but the services they call are synchronous and do blocking CPU/IO work (FreeCAD, OpenCascade, numpy, trimesh). Each such call stalls the event loop for the entire request, capping concurrency at ~1–2 requests before p50 degrades sharply.

Action for Codex.
1. Identify the set of handlers that do blocking work. Use grep -nE "async def.*(review|dfm|cad|vision|part_facts|cnc|draftlint|canonical)" as a starting list.
2. For each blocking call, wrap it in await asyncio.to_thread(...) or await run_in_threadpool(...) (FastAPI re-exports it from Starlette); a before/after sketch follows this list. Do not convert handlers to def — we want to keep async for request parsing and streaming responses.
3. Tune uvicorn workers and the threadpool size: expose UVICORN_WORKERS and STARLETTE_THREADPOOL_SIZE env vars.
4. Add a minimal load test (scripts/loadtest_dfm.py) using httpx.AsyncClient that issues 10 concurrent review requests and measures p50/p95 wall-clock — baseline before and after.
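A before/after sketch of step 2; the route, handler name, and dfm_review_v2.run_review are placeholders for the real blocking calls in server/main.py:

    # Illustrative only: handler name, route, and run_review are assumptions.
    import asyncio

    from fastapi import FastAPI

    from server import dfm_review_v2  # assumed import path

    app = FastAPI()

    # Before: the synchronous CAD call blocks the event loop for the whole request.
    # @app.post("/api/models/{model_id}/dfm/review-v2")
    # async def run_dfm_review(model_id: str):
    #     return dfm_review_v2.run_review(model_id)   # blocking FreeCAD/OCC work

    # After: the blocking call runs on a worker thread; the coroutine only awaits it.
    @app.post("/api/models/{model_id}/dfm/review-v2")
    async def run_dfm_review(model_id: str):
        return await asyncio.to_thread(dfm_review_v2.run_review, model_id)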

Acceptance criteria.
- Under 10 concurrent /api/models/{id}/dfm/review-v2 requests on a dev machine, event-loop latency on the /healthz endpoint stays under 100 ms (measure with a sidecar loop; see the load-test sketch below).
- No handler performs a synchronous cad_service.* or part_facts_service.* call in the request coroutine.
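A sketch of scripts/loadtest_dfm.py from step 4, combined with the sidecar /healthz probe from the acceptance criteria; base URL, model id, auth, and request body are placeholders:

    # scripts/loadtest_dfm.py: sketch; base URL, model id, auth, and request body are placeholders.
    import asyncio
    import contextlib
    import statistics
    import time

    import httpx

    BASE_URL = "http://localhost:8000"   # placeholder
    MODEL_ID = "example-model-id"        # placeholder
    CONCURRENCY = 10


    async def one_review(client: httpx.AsyncClient) -> float:
        start = time.perf_counter()
        # Add auth headers / JSON body here as the real endpoint requires.
        await client.post(f"{BASE_URL}/api/models/{MODEL_ID}/dfm/review-v2", timeout=300)
        return time.perf_counter() - start


    async def healthz_probe(client: httpx.AsyncClient, samples: list[float]) -> None:
        # Sidecar loop: sample /healthz latency while the reviews are in flight.
        while True:
            start = time.perf_counter()
            await client.get(f"{BASE_URL}/healthz", timeout=10)
            samples.append(time.perf_counter() - start)
            await asyncio.sleep(0.25)


    async def main() -> None:
        health_samples: list[float] = []
        async with httpx.AsyncClient() as client:
            probe = asyncio.create_task(healthz_probe(client, health_samples))
            durations = await asyncio.gather(*(one_review(client) for _ in range(CONCURRENCY)))
            probe.cancel()
            with contextlib.suppress(asyncio.CancelledError):
                await probe
        p50 = statistics.median(durations)
        p95 = statistics.quantiles(durations, n=20)[18]
        print(f"review p50={p50:.2f}s p95={p95:.2f}s")
        if health_samples:
            print(f"/healthz max latency={max(health_samples):.3f}s")


    if __name__ == "__main__":
        asyncio.run(main())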

Depends on: B4 (the /healthz endpoint helps verify the acceptance criteria).


B3 — Jobs run as bare daemon threads; no durability, retry, or resume

  • Status: Open
  • Impact: H | Complexity: M | Time: 1 W
  • Files: server/dfm_review_jobs.py:209-280, server/part_facts_warmup.py, server/dfm_candidate_geometry_worker.py

Finding. DfmReviewJobCoordinator.enqueue spawns a threading.Thread(daemon=True) per request. If the process crashes or restarts, the job is silently lost and the job-status JSON is left in whatever state it last flushed. No retry. No dead-letter. No observability.

Action for Codex. Two acceptable paths. Pick one explicitly and document the choice in docs/contracts/jobs.md:

Path 1 — Minimal, no new infra (a sketch follows Path 2 below).
1. Persist a "job intent" JSON before starting the thread (status queued, payload, created_at, attempts).
2. On process startup, scan the jobs folder for queued or running status entries and either resume them or mark them failed_due_to_restart with a clear reason.
3. Add exponential-backoff retry (max 2 attempts) around the thread body for retryable_exceptions.
4. Expose /api/admin/jobs (authed) that lists job statuses from disk.

Path 2 — Real job queue.
1. Introduce arq (Redis-backed, asyncio-native) or RQ.
2. Move job payloads to the queue; keep the on-disk status as a cache for fast lookups.
3. Run a worker in a sidecar container; update the Dockerfile and entrypoint accordingly.
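A minimal sketch of Path 1's intent persistence, retry wrapper, and startup recovery, assuming the B1 helpers exist; the file layout, field names, and JOBS_DIR are illustrative:

    # server/job_intent.py: Path 1 sketch; layout, field names, and JOBS_DIR are illustrative.
    import time
    import uuid
    from pathlib import Path

    # Helpers from B1 (see that finding's sketch).
    from server._atomic_json import read_json, write_json_atomic, with_file_lock

    JOBS_DIR = Path("data/jobs")   # placeholder location


    def persist_intent(payload: dict) -> str:
        """Write the job intent to disk before the worker thread is started."""
        job_id = uuid.uuid4().hex
        JOBS_DIR.mkdir(parents=True, exist_ok=True)
        write_json_atomic(str(JOBS_DIR / f"{job_id}.json"), {
            "job_id": job_id,
            "status": "queued",
            "payload": payload,
            "created_at": time.time(),
            "attempts": 0,
        })
        return job_id


    def _update(path: str, **fields) -> None:
        """Read-modify-write the job file under the cross-process lock."""
        with with_file_lock(path):
            job = read_json(path)
            job.update(fields)
            write_json_atomic(path, job)


    def run_with_retry(job_id: str, fn, retryable_exceptions: tuple, max_attempts: int = 2) -> None:
        """Thread body: exponential-backoff retry around fn(); every status transition is persisted."""
        path = str(JOBS_DIR / f"{job_id}.json")
        status = "failed"
        try:
            for attempt in range(1, max_attempts + 1):
                _update(path, status="running", attempts=attempt)
                try:
                    fn()
                    status = "succeeded"
                    return
                except retryable_exceptions:
                    time.sleep(2 ** attempt)   # back off before the next attempt
        finally:
            _update(path, status=status)       # non-retryable exceptions still land here as "failed"


    def recover_on_startup() -> None:
        """Scan the jobs folder on boot so no job is left looking 'running' after a restart."""
        for job_file in JOBS_DIR.glob("*.json"):
            job = read_json(str(job_file))
            if job.get("status") in ("queued", "running"):
                job["status"] = "failed_due_to_restart"
                job["reason"] = "backend restarted before the job finished"
                write_json_atomic(str(job_file), job)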

Acceptance criteria.
- Restarting the backend mid-job leaves no orphaned running status (Path 1) or resumes the job (Path 2).
- At least one retryable failure is retried successfully in a test.

Depends on: B1 (stores must be lock-safe before durability matters).


B4 — No real /healthz or /readyz endpoints

  • Status: Open
  • Impact: M | Complexity: L | Time: 1 D
  • Files: server/main.py (app-level routes), server/freecad_setup.py, _OPTIONAL_SERVICE_STARTUP_ERRORS in main.py:158

Finding. cad_service_occ, vision_analysis_service, and fusion_analysis_service can silently fail to start; the failure is stored in _OPTIONAL_SERVICE_STARTUP_ERRORS and logged once. There is no health endpoint for Cloudflare / Railway / uptime pingers to discover a degraded deploy.

Action for Codex.
1. Add GET /healthz (cheap, always 200 if the process is alive) — public, no auth.
2. Add GET /readyz — checks that cad_service and canonical_scene_service are available and that the storage paths are writable; reports _OPTIONAL_SERVICE_STARTUP_ERRORS in the body. Returns 200 when core services are up, 503 otherwise. (A sketch of the first two endpoints follows this list.)
3. Add GET /api/status (authed) — returns full runtime diagnostics (what _log_cad_runtime_diagnostics logs today).
4. Configure a Dockerfile HEALTHCHECK against /healthz.
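A sketch of the first two endpoints; the placeholder objects stand in for the real module-level handles in server/main.py, and the readiness checks are illustrative:

    # Sketch for server/main.py; checks and placeholder objects are illustrative.
    import os

    from fastapi import FastAPI
    from fastapi.responses import JSONResponse

    app = FastAPI()   # already exists in the real server/main.py

    # Placeholders standing in for the real module-level handles in server/main.py.
    cad_service = object()
    canonical_scene_service = object()
    _OPTIONAL_SERVICE_STARTUP_ERRORS: dict[str, str] = {}


    @app.get("/healthz")
    async def healthz():
        # Liveness only: 200 whenever the process can serve a request. Public, no auth.
        return {"status": "ok"}


    @app.get("/readyz")
    async def readyz():
        problems: dict[str, str] = {}

        # Core services must have initialised (assumed to be None when startup failed).
        for name, service in (("cad_service", cad_service),
                              ("canonical_scene_service", canonical_scene_service)):
            if service is None:
                problems[name] = "not initialised"

        # Storage paths must exist and be writable.
        for env_var in ("MODELS_DIR", "PROCESS_DIR"):
            path = os.environ.get(env_var, "")
            if not path or not os.access(path, os.W_OK):
                problems[env_var] = f"not writable: {path!r}"

        body = {
            "ready": not problems,
            "problems": problems,
            "optional_service_errors": _OPTIONAL_SERVICE_STARTUP_ERRORS,
        }
        return JSONResponse(status_code=200 if not problems else 503, content=body)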

Acceptance criteria.
- Tests for all three endpoints.
- docker inspect shows the healthcheck passing.
- Introducing a failure in cad_service_occ startup results in /readyz returning 503 with a descriptive body.

Depends on: none.


B5 — requirements.txt is unpinned; no lock file

  • Status: Open
  • Impact: H | Complexity: L | Time: 1 D
  • Files: server/requirements.txt, server/requirements_pythonocc.txt (also corrupted — see E7), .github/workflows/dfm-vision-fusion-gate.yml

Finding. Every dependency is floating. FastAPI, pydantic, numpy, trimesh, reportlab — all resolve to whatever pip picks today. Today's green CI does not guarantee tomorrow's green CI. Reproducible pilot builds are impossible.

Action for Codex.
1. Introduce uv (fast, simple) or pip-tools. Keep requirements.in with the direct deps; generate a fully pinned requirements.txt.
2. Commit the lock file; pin Python to 3.11.x explicitly.
3. Update the Dockerfile to install from the locked file.
4. Update CI to install from the locked file.
5. Add a scripts/upgrade_deps.sh that regenerates the lock and prints a diff.

Acceptance criteria.
- pip install -r server/requirements.txt on a clean env always yields the same resolved versions.
- CI installs from the lock file.

Depends on: none. Paired with E2, E4.


B6 — Docker container runs as root

  • Status: Open
  • Impact: M | Complexity: L | Time: 0.5 D
  • Files: Dockerfile (last USER root block), docker-entrypoint.sh

Finding. The image ends with USER root after chown'ing volume paths to ${MAMBA_USER}, and the CMD runs via micromamba run without re-dropping privileges. Running Python as root inside a container exposes the kernel attack surface to any in-process vulnerability.

Action for Codex.
1. Perform the mkdir/chown under USER root, then switch back to USER ${MAMBA_USER} before the CMD.
2. Ensure the entrypoint does not need root — move the chown -R of MODELS_DIR/PROCESS_DIR into the image build if possible, or into a startup script that runs as root only when a PREPARE_DIRS=1 flag is set.
3. Verify via docker run rapiddraft id -u → non-zero.

Acceptance criteria.
- Inside the running container, whoami is not root.
- Uploads and DFM reviews continue to work.

Depends on: none.


B7 — No structured logging, metrics, or tracing

  • Status: Open
  • Impact: M | Complexity: M | Time: 3 D
  • Files: server/main.py, new middleware module

Finding. Logging is plain string formatting via logging.getLogger(__name__). No request IDs, no durations, no structured fields, no metrics endpoint, no tracing. Troubleshooting a slow pilot review means searching free-text logs.

Action for Codex.
1. Add a request ID middleware: generate a ULID, put it on request.state.request_id, and include it in logs and response headers (a middleware sketch follows this list).
2. Configure structlog or logging with a JSON formatter; emit one log line per request with method, path, status, duration_ms, request_id, and identity_email if present.
3. Add /metrics (authed or IP-allowlisted) using prometheus_client: counters for requests, histograms for duration, a gauge for in-flight jobs.
4. Optionally wire OpenTelemetry for distributed tracing — defer if a backend is not yet chosen.
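A sketch of the request-ID middleware and JSON access log; the logger name is illustrative and uuid4 stands in until a ULID generator is chosen:

    # server/request_logging.py: sketch; logger name and the ULID substitute are illustrative.
    import json
    import logging
    import time
    import uuid

    from starlette.requests import Request

    logger = logging.getLogger("rapiddraft.access")


    async def request_logging_middleware(request: Request, call_next):
        # uuid4 stands in here; swap for a real ULID generator per step 1 above.
        request_id = uuid.uuid4().hex
        request.state.request_id = request_id

        start = time.perf_counter()
        response = await call_next(request)
        duration_ms = round((time.perf_counter() - start) * 1000, 2)

        response.headers["X-Request-ID"] = request_id
        logger.info(json.dumps({
            "request_id": request_id,
            "method": request.method,
            "path": request.url.path,
            "status": response.status_code,
            "duration_ms": duration_ms,
            "identity_email": getattr(request.state, "identity_email", None),
        }))
        return response

Registered in server/main.py with app.middleware("http")(request_logging_middleware); structlog can replace the json.dumps call once adopted.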

Acceptance criteria.
- Every response carries X-Request-ID.
- Log lines are valid JSON and contain request_id, path, duration_ms, status.
- /metrics returns Prometheus text format.

Depends on: none.