Files
spark-control/package
Grant 000c55febe v0.8.0 - Deep health probes + auto-restart on CUDA wedge
deep_health.py:
- Synthetic probes per service, all payloads generated in-memory (BytesIO), never written to disk:
  - Parakeet: 1s of digital silence via in-memory WAV → POST /v1/audio/transcriptions
  - Magpie:   short 'hi' text → POST /v1/audio/synthesize (multipart form-data, real TTS API endpoint discovered via openapi.json)
  - vLLM:     1-token completion against currently-loaded model
- Background loop runs every 5 minutes (configurable). Best-effort: exceptions in the loop never kill it.
- Auto-restart on wedge-pattern errors (cudaErrorUnknown / CUFFT_INTERNAL_ERROR / 500 / Engine core init failed): docker restart of the affected container.
  - Rate-limited: max 3 restarts per service per 30 min.
  - Cooldown: 120 s between consecutive restarts on the same service.
  - 60 s startup grace before any auto-restart can fire after the app boots.
- Probe failures + recoveries logged via record_report(source='deep-health') into the connectivity history alongside the polling-based transitions.

API:
- GET /api/deep-health: per-service last result + auto-restart counters
- POST /api/deep-health/{service}/run: manual trigger now

UI:
- Service cards show 'Deep check ok/FAILED <time> <latency>' inline, plus a ↻ button to run-now
- Auto-restart count in 30-min window surfaced on the card when > 0
- Inline error excerpt shown for failed probes

Bug fix: server.py app startup hook was placed before the FastAPI app object was constructed (would crash on import). Moved after.
2026-05-12 14:41:01 -05:00
..
2026-05-12 09:52:53 -05:00