v0.8.0 - Deep health probes + auto-restart on CUDA wedge
deep_health.py:
- Synthetic probes per service, all payloads generated in-memory (BytesIO), never written to disk:
  - Parakeet: 1 s of digital silence via in-memory WAV → POST /v1/audio/transcriptions
  - Magpie: short 'hi' text → POST /v1/audio/synthesize (multipart form-data, real TTS API endpoint discovered via openapi.json)
  - vLLM: 1-token completion against currently-loaded model
- Background loop runs every 5 minutes (configurable). Best-effort: exceptions in the loop never kill it.
- Auto-restart on wedge-pattern errors (cudaErrorUnknown / CUFFT_INTERNAL_ERROR / 500 / Engine core init failed): docker restart of the affected container.
  - Rate-limited: max 3 restarts per service per 30 min.
  - Cooldown: 120 s between consecutive restarts on the same service.
  - 60 s startup grace before any auto-restart can fire after the app boots.
- Probe failures + recoveries logged via record_report(source='deep-health') into the connectivity history alongside the polling-based transitions.
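
For illustration only, the probe payload and restart gating described above could be sketched as follows. This is not the code in deep_health.py: the names silent_wav_bytes, is_wedge and RestartGate are hypothetical, the 16 kHz mono 16-bit format is an assumption, and reading "500" as an HTTP status code is an assumption too. Only the limits (3 restarts per 30 min, 120 s cooldown, 60 s grace) and the error strings come from this message.

```python
import io
import time
import wave
from collections import deque

# Error substrings listed in this commit message as wedge signals.
WEDGE_PATTERNS = ("cudaErrorUnknown", "CUFFT_INTERNAL_ERROR", "Engine core init failed")


def silent_wav_bytes(seconds=1.0, sample_rate=16000):
    """Build a mono 16-bit PCM WAV of digital silence without touching disk.

    The sample rate is an assumption; the commit only says "1s of silence".
    """
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    return buf.getvalue()


def is_wedge(status_code, body):
    """Return True for an HTTP 500 (assumed meaning of "500") or a known pattern."""
    return status_code == 500 or any(p in body for p in WEDGE_PATTERNS)


class RestartGate:
    """Decides whether an auto-restart may fire for a given service."""

    def __init__(self, max_restarts=3, window_s=1800, cooldown_s=120,
                 grace_s=60, clock=time.monotonic):
        self.max_restarts = max_restarts
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self.grace_s = grace_s
        self._clock = clock          # injectable so the gate can be tested
        self._boot = clock()         # reference point for the startup grace
        self._history = {}           # service name -> deque of restart times

    def _recent(self, service, now):
        """Return the service's restart times, dropping those outside the window."""
        hist = self._history.setdefault(service, deque())
        while hist and now - hist[0] >= self.window_s:
            hist.popleft()
        return hist

    def allow(self, service):
        now = self._clock()
        if now - self._boot < self.grace_s:
            return False             # still inside the startup grace period
        hist = self._recent(service, now)
        if len(hist) >= self.max_restarts:
            return False             # rate limit reached for this window
        if hist and now - hist[-1] < self.cooldown_s:
            return False             # too soon after the previous restart
        return True

    def record(self, service):
        """Note that a restart was just issued for the service."""
        self._history.setdefault(service, deque()).append(self._clock())

    def count(self, service):
        """Restarts inside the current window (the counter shown on the card)."""
        return len(self._recent(service, self._clock()))
```

A caller would check is_wedge(...) on a failed probe, then gate.allow(service) before issuing the docker restart, and call gate.record(service) afterwards. Passing the clock in keeps the gate testable without sleeping.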
API:
- GET /api/deep-health: per-service last result + auto-restart counters
- POST /api/deep-health/{service}/run: trigger a probe for one service immediately
UI:
- Service cards show 'Deep check ok/FAILED <time> <latency>' inline, plus a ↻ button to run the probe now
- Auto-restart count in 30-min window surfaced on the card when > 0
- Inline error excerpt shown for failed probes
Bug fix: the app startup hook in server.py was registered before the FastAPI app object was constructed, which would crash on import. It is now registered after the app is constructed.
@@ -622,6 +622,19 @@ main {
 .service-card .row .v.copyable.copied { outline: 1px solid var(--accent); background: rgba(74, 222, 128, 0.05); }
 .service-card .row .icon-btn { padding: 3px 6px; }
 .service-card .row .icon-btn svg { width: 12px; height: 12px; }
+.service-card .deep-row .deep-v { display: flex; align-items: center; gap: 6px; font-family: inherit; flex-wrap: wrap; }
+.service-card .dh-ok { color: var(--accent); }
+.service-card .dh-fail { color: var(--error); font-weight: 500; }
+.service-card .dh-run-btn { font-family: inherit; }
+.service-card .deep-error {
+  padding: 4px 8px;
+  background: rgba(239, 68, 68, 0.06);
+  border-left: 2px solid var(--error);
+  border-radius: 4px;
+  font-family: ui-monospace, SFMono-Regular, Menlo, monospace;
+  font-size: 11px;
+  word-break: break-word;
+}
 
 .service-actions {
   display: flex;