v0.8.0 - Deep health probes + auto-restart on CUDA wedge
deep_health.py:
- Synthetic probes per service, all payloads generated in-memory (BytesIO), never written to disk:
  - Parakeet: 1 s of digital silence via in-memory WAV → POST /v1/audio/transcriptions
  - Magpie: short 'hi' text → POST /v1/audio/synthesize (multipart form-data, real TTS API endpoint discovered via openapi.json)
  - vLLM: 1-token completion against currently-loaded model
- Background loop runs every 5 minutes (configurable). Best-effort: exceptions in the loop never kill it.
- Auto-restart on wedge-pattern errors (cudaErrorUnknown / CUFFT_INTERNAL_ERROR / 500 / Engine core init failed): docker restart of the affected container.
  - Rate-limited: max 3 restarts per service per 30 min.
  - Cooldown: 120 s between consecutive restarts on the same service.
  - 60 s startup grace before any auto-restart can fire after the app boots.
- Probe failures + recoveries logged via record_report(source='deep-health') into the connectivity history alongside the polling-based transitions.
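
The restart limits listed above (startup grace, per-service cooldown, max restarts per window) can be sketched as a small gate object. This is a minimal illustration, not the code in deep_health.py: the names `RestartGate`, `allow`, `looks_wedged` and `WEDGE_PATTERNS` are hypothetical, and plain substring matching on the error text is an assumption.

```python
import time
from collections import deque

# Error fragments named in the commit message; substring matching is an assumption.
WEDGE_PATTERNS = ("cudaErrorUnknown", "CUFFT_INTERNAL_ERROR", "500", "Engine core init failed")


def looks_wedged(error: str) -> bool:
    """Return True if the probe error text contains any wedge pattern."""
    return any(pattern in error for pattern in WEDGE_PATTERNS)


class RestartGate:
    """Hypothetical gate: decides whether one more auto-restart may fire."""

    def __init__(self, max_restarts=3, window_s=1800.0, cooldown_s=120.0,
                 grace_s=60.0, clock=time.monotonic):
        self.max_restarts = max_restarts
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self.grace_s = grace_s
        self._clock = clock
        self._booted_at = clock()
        self._history = {}  # service name -> deque of restart timestamps

    def allow(self, service: str) -> bool:
        """Record and permit a restart, or refuse it without recording."""
        now = self._clock()
        # Startup grace: nothing may fire right after the app boots.
        if now - self._booted_at < self.grace_s:
            return False
        history = self._history.setdefault(service, deque())
        # Forget restarts that have aged out of the rate-limit window.
        while history and now - history[0] >= self.window_s:
            history.popleft()
        # Rate limit: at most max_restarts inside the window.
        if len(history) >= self.max_restarts:
            return False
        # Cooldown: minimum spacing between consecutive restarts.
        if history and now - history[-1] < self.cooldown_s:
            return False
        history.append(now)
        return True
```

A caller would check `looks_wedged(result.error)` first and only then ask `gate.allow(service)` before issuing the docker restart. Injecting `clock` keeps the gate testable without sleeping.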
API:
- GET /api/deep-health: last probe result + auto-restart counters for each service
- POST /api/deep-health/{service}/run: run a single service's probe immediately
UI:
- Service cards show 'Deep check ok/FAILED <time> <latency>' inline, plus a ↻ button to run-now
- Auto-restart count in 30-min window surfaced on the card when > 0
- Inline error excerpt shown for failed probes
Bug fix: the startup hook in server.py was registered before the FastAPI app object was constructed, which would crash on import. It now sits after the app is created.
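
For reference, the Parakeet probe payload described in the message (one second of digital silence, built in a BytesIO and never written to disk) could be generated roughly as follows. This is a sketch using only the standard library; the function name `silent_wav` and the 16 kHz mono 16-bit format are assumptions, not taken from deep_health.py.

```python
import io
import wave


def silent_wav(seconds: float = 1.0, sample_rate: int = 16000) -> bytes:
    """Return a mono 16-bit PCM WAV of digital silence, built entirely in memory."""
    n_frames = int(seconds * sample_rate)
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)        # mono
        wav.setsampwidth(2)        # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * n_frames)  # all-zero samples = silence
    # wave does not close a file object it did not open, so buf is still readable.
    return buf.getvalue()
```

The returned bytes can be passed as the file part of a multipart upload to the transcription endpoint.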
@@ -12,6 +12,7 @@ from typing import Literal
 from .config import Settings
 from .connectivity import get_mac, record_report, record_state, summary as connectivity_summary
 from .custom_services import add_custom_service, delete_custom_service
+from .deep_health import DeepHealth
 from .download import DownloadManager
 from .hardware import HardwareProbe
 from .health import check_magpie, check_parakeet, check_vllm
@@ -33,9 +34,22 @@ download_manager = DownloadManager(settings)
 update_manager = UpdateManager(settings)
 hardware_probe = HardwareProbe(settings)
 nim_manager = NimManager(settings)
+deep_health = DeepHealth(settings)
 
 app = FastAPI(title="spark-control", version="0.1.0")
 
+
+@app.on_event("startup")
+async def _start_deep_health() -> None:
+    # Fire-and-forget; the loop catches its own exceptions.
+    asyncio.create_task(deep_health.run_periodic())
+
+
+@app.on_event("shutdown")
+async def _stop_deep_health() -> None:
+    deep_health.stop()
+
+
 _STATIC_DIR = Path(__file__).resolve().parent / "static"
 app.mount("/static", StaticFiles(directory=_STATIC_DIR), name="static")
 
@@ -137,6 +151,27 @@ async def get_connectivity() -> dict:
     return connectivity_summary()
 
 
+@app.get("/api/deep-health")
+async def get_deep_health() -> dict:
+    """Last result + auto-restart counters for each service's synthetic probe."""
+    return deep_health.summary()
+
+
+@app.post("/api/deep-health/{service}/run")
+async def run_deep_health(service: str) -> dict:
+    """Manually run a single service's deep-health probe right now."""
+    if service not in deep_health.PROBES:
+        raise HTTPException(404, f"unknown service: {service}")
+    result = await deep_health.run_one(service)
+    return {
+        "ok": result.ok,
+        "at": result.at,
+        "latency_ms": result.latency_ms,
+        "error": result.error,
+        "note": result.note,
+    }
+
+
 class HealthEventBody(BaseModel):
     service: str  # e.g. "parakeet", "magpie", "vllm"
     ok: bool  # true on success, false on failure
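
The startup hook in the diff schedules deep_health.run_periodic() fire-and-forget and relies on the loop catching its own exceptions, with deep_health.stop() called on shutdown. A loop with that shape might look like the sketch below. `PeriodicRunner` and its `tick` callback are hypothetical stand-ins; only the method names `run_periodic` and `stop` and the 5-minute default come from the commit.

```python
import asyncio
import logging
from typing import Awaitable, Callable, Optional

log = logging.getLogger("deep-health-sketch")


class PeriodicRunner:
    """Hypothetical best-effort loop: one failing round never ends the loop."""

    def __init__(self, tick: Callable[[], Awaitable[None]], interval_s: float = 300.0):
        self._tick = tick
        self._interval_s = interval_s
        self._stop_event: Optional[asyncio.Event] = None

    async def run_periodic(self) -> None:
        # Create the event inside the running loop so it binds to that loop.
        self._stop_event = asyncio.Event()
        while not self._stop_event.is_set():
            try:
                await self._tick()
            except Exception:
                # Swallow and log; cancellation is a BaseException and still propagates.
                log.exception("probe round failed; continuing")
            try:
                # Sleep for the interval, but wake immediately if stop() is called.
                await asyncio.wait_for(self._stop_event.wait(), timeout=self._interval_s)
            except asyncio.TimeoutError:
                pass

    def stop(self) -> None:
        """Ask the loop to exit after the current round."""
        if self._stop_event is not None:
            self._stop_event.set()
```

Waiting on the event with a timeout, rather than calling asyncio.sleep, lets a shutdown hook end the loop without waiting out the full interval.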