v0.8.0:4 - vLLM deep-health: 'no model loaded' is idle, not a wedge
Previously, a ConnectError on /v1/models classified vLLM as failing, which fed into the wedge auto-restart heuristic. But when no model is loaded (the normal idle state between swaps, or after a failed swap leaves the vllm_node container up with no process serving), nothing is listening on 8888. That is by design, not a wedge.

The vLLM probe now does a two-step check:

1. GET /v1/models. ConnectError or empty list -> ok=true with note='no model currently loaded (idle)'. No auto-restart is triggered (it wouldn't help anyway: restarting vllm_node kills any loaded model and doesn't load a new one).
2. If a model is loaded, POST a 1-token chat completion. A 5xx here is a genuine wedge worth restarting for.

Result: deep-health correctly reports 'no model loaded' as informational rather than flagging it as a failure. Auto-restart for vLLM only fires when a model is actually loaded AND inference fails, which is the right semantics.
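The decision table described in this message can be sketched as a pure function, with no network calls, to make the four outcomes easy to see. This is only an illustration: the `classify` helper, its parameters, the trimmed-down `ProbeResult`, and the wording of the completion error are assumptions for this sketch, not the code in the commit.

```python
from dataclasses import dataclass
from typing import List, Optional

IDLE_NOTE = "no model currently loaded (idle)"


@dataclass
class ProbeResult:
    # Hypothetical stand-in with only the fields this sketch needs.
    ok: bool
    note: Optional[str] = None
    error: Optional[str] = None


def classify(
    list_status: Optional[int],
    model_ids: List[str],
    completion_status: Optional[int],
) -> ProbeResult:
    """Map raw probe observations to a result.

    list_status is None when the connection to /v1/models failed.
    completion_status is only consulted when a model is listed.
    """
    # Step 1a: nothing listening is treated as idle, not as a failure.
    if list_status is None:
        return ProbeResult(ok=True, note=IDLE_NOTE)
    # Step 1b: the server answered, but not with a 2xx.
    if not 200 <= list_status < 300:
        return ProbeResult(ok=False, error=f"list_models HTTP {list_status}")
    # Step 1c: the server is up but serves no model, which is also idle.
    if not model_ids:
        return ProbeResult(ok=True, note=IDLE_NOTE)
    # Step 2: a model is loaded, so the 1-token completion must succeed.
    if completion_status is not None and 200 <= completion_status < 300:
        return ProbeResult(ok=True, note=f"model={model_ids[0]}")
    return ProbeResult(ok=False, error=f"completion HTTP {completion_status}")
```

For example, `classify(None, [], None)` and `classify(200, [], None)` both report idle with ok=True, while `classify(200, ["some-model"], 500)` is the only path that reports a loaded model failing inference.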
@@ -173,16 +173,38 @@ class DeepHealth:
         if not s.spark1_host:
             return ProbeResult(ok=False, at=now_iso, error="not configured")
         base = f"http://{s.spark1_host}:{s.vllm_port}"
+        # Step 1: is there a model loaded?
         try:
             async with httpx.AsyncClient(timeout=5.0) as c:
                 r = await c.get(f"{base}/v1/models")
-                r.raise_for_status()
-                models = r.json().get("data") or []
-                if not models:
-                    return ProbeResult(ok=False, at=now_iso, error="no model loaded")
-                model_id = models[0]["id"]
-        except Exception as e:
-            return ProbeResult(ok=False, at=now_iso, error=f"list models: {type(e).__name__}: {e}")
+                if 200 <= r.status_code < 300:
+                    models = r.json().get("data") or []
+                else:
+                    # 5xx on /v1/models suggests something wedged after a model loaded
+                    return ProbeResult(
+                        ok=False,
+                        at=now_iso,
+                        error=f"list_models HTTP {r.status_code}: {r.text[:240]}",
+                    )
+        except Exception:
+            # Connection refused / timeout: usually means no vLLM process listening
+            # (the vllm_node container is alive but no `vllm serve` is running yet).
+            # That's an idle state, not a wedge — don't trigger auto-restart.
+            return ProbeResult(
+                ok=True,
+                at=now_iso,
+                note="no model currently loaded (idle)",
+            )
+
+        if not models:
+            return ProbeResult(
+                ok=True,
+                at=now_iso,
+                note="no model currently loaded (idle)",
+            )
+
+        model_id = models[0]["id"]
+        # Step 2: model is loaded; verify it can actually complete a 1-token request.
         t0 = time.monotonic()
         try:
             async with httpx.AsyncClient(timeout=PROBE_TIMEOUT_SEC) as c:
@@ -197,7 +219,7 @@ class DeepHealth:
                 )
             latency = round((time.monotonic() - t0) * 1000)
             if 200 <= r.status_code < 300:
-                return ProbeResult(ok=True, at=now_iso, latency_ms=latency)
+                return ProbeResult(ok=True, at=now_iso, latency_ms=latency, note=f"model={model_id}")
             return ProbeResult(
                 ok=False,
                 at=now_iso,
@@ -1,7 +1,7 @@
 import { VersionInfo, IMPOSSIBLE } from '@start9labs/start-sdk'
 
 export const v0_1_0 = VersionInfo.of({
-  version: '0.8.0:3',
+  version: '0.8.0:4',
   releaseNotes: {
     en_US:
       'v0.8: deep health probes. Every 5 minutes, Spark Control sends a tiny synthetic inference request to each service (1 second of silent audio to Parakeet, short text to Magpie, 1-token completion to vLLM). All payloads are generated in-memory and never written to disk. If a probe returns CUDA-error / 5xx signals while the container is still "up" — i.e. the classic Triton-wedge pattern where /health stays green but real inference fails — Spark Control automatically restarts the affected container. Rate-limited to 3 auto-restarts per service per 30 minutes. Each service card now shows the last deep-check timestamp, latency, and an inline "Run now" button. Failures and recoveries are logged into the connectivity history with source=deep-health.',
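The release notes mention a limit of 3 auto-restarts per service per 30 minutes. How Spark Control enforces that limit is not shown in this diff. One plausible approach is a per-service sliding window, sketched below. The `RestartLimiter` class, its method names, and the choice of a sliding rather than fixed window are all assumptions made for illustration.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class RestartLimiter:
    """Hypothetical sliding-window limiter: N restarts per service per window."""

    def __init__(self, max_restarts: int = 3, window_sec: float = 30 * 60) -> None:
        self.max_restarts = max_restarts
        self.window_sec = window_sec
        # Timestamps of restarts that were allowed, oldest first, per service.
        self._history: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, service: str, now: float) -> bool:
        """Return True and record the restart if the service is under its limit."""
        history = self._history[service]
        # Forget restarts that have aged out of the window.
        while history and now - history[0] >= self.window_sec:
            history.popleft()
        if len(history) >= self.max_restarts:
            return False
        history.append(now)
        return True
```

A caller would pass a monotonic timestamp in seconds, for example `limiter.allow("vllm", time.monotonic())`, and skip the restart when the call returns False. Taking the clock as an argument keeps the limiter deterministic under test.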