v0.8.0:4 - vLLM deep-health: 'no model loaded' is idle, not a wedge

Previously, a ConnectError on /v1/models classified vLLM as failing, which would feed into the wedge auto-restart heuristic. But when no model is loaded (the normal idle state between swaps, or after a failed swap leaves the vllm_node container up with no process serving), nothing is listening on 8888. That is by design, not a wedge.

The vLLM probe now does a two-step check:

1. GET /v1/models. ConnectError or empty list -> ok=true with note='no model currently loaded (idle)'. No auto-restart is triggered (it wouldn't help anyway: restarting vllm_node kills any loaded model and doesn't load a new one). A non-2xx response from this endpoint is still reported as a failure.
2. If a model is loaded, POST a 1-token chat completion. A 5xx here is a genuine wedge worth restarting for.

Result: deep-health reports 'no model loaded' as informational rather than flagging it as a failure. Auto-restart for vLLM fires only when a model is actually loaded AND inference fails.
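
The decision logic of the two-step check can be sketched without any network code. This is a minimal illustration, not the code in this commit: `classify_vllm`, the `list_models` and `complete` callables, and the trimmed-down `ProbeResult` are hypothetical stand-ins, and `OSError` is assumed here as the stdlib analogue of a refused or timed-out connection.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

IDLE_NOTE = "no model currently loaded (idle)"


@dataclass
class ProbeResult:
    # Simplified stand-in for the ProbeResult used in the diff (fields assumed).
    ok: bool
    note: Optional[str] = None
    error: Optional[str] = None


def classify_vllm(
    list_models: Callable[[], Tuple[int, List[dict]]],
    complete: Callable[[str], int],
) -> ProbeResult:
    """Idle is healthy; only a loaded model that fails inference is not."""
    # Step 1: ask which models are being served.
    try:
        status, models = list_models()
    except OSError:
        # Nothing is listening on the port: idle, not a wedge.
        return ProbeResult(ok=True, note=IDLE_NOTE)
    if not 200 <= status < 300:
        return ProbeResult(ok=False, error=f"list_models HTTP {status}")
    if not models:
        return ProbeResult(ok=True, note=IDLE_NOTE)

    # Step 2: a model is loaded, so it must be able to complete a request.
    model_id = models[0]["id"]
    status = complete(model_id)
    if 200 <= status < 300:
        return ProbeResult(ok=True, note=f"model={model_id}")
    return ProbeResult(ok=False, error=f"chat completion HTTP {status}")
```

Passing the two HTTP calls in as callables keeps the classification testable on its own; only the last branch (model loaded, completion fails) is the case that should feed the auto-restart heuristic.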
2026-05-12 14:50:00 -05:00
parent 8ac455f5f5
commit 1602b3b3b4
2 changed files with 30 additions and 8 deletions
@@ -173,16 +173,38 @@ class DeepHealth:
        if not s.spark1_host:
            return ProbeResult(ok=False, at=now_iso, error="not configured")
        base = f"http://{s.spark1_host}:{s.vllm_port}"
+        # Step 1: is there a model loaded?
        try:
            async with httpx.AsyncClient(timeout=5.0) as c:
                r = await c.get(f"{base}/v1/models")
-                r.raise_for_status()
+            if 200 <= r.status_code < 300:
                models = r.json().get("data") or []
-            if not models:
-                return ProbeResult(ok=False, at=now_iso, error="no model loaded")
-            model_id = models[0]["id"]
-        except Exception as e:
-            return ProbeResult(ok=False, at=now_iso, error=f"list models: {type(e).__name__}: {e}")
+            else:
+                # Any non-2xx on /v1/models (typically a 5xx) suggests something wedged after a model loaded
+                return ProbeResult(
+                    ok=False,
+                    at=now_iso,
+                    error=f"list_models HTTP {r.status_code}: {r.text[:240]}",
+                )
+        except Exception:
+            # Connection refused / timeout: usually means no vLLM process listening
+            # (the vllm_node container is alive but no `vllm serve` is running yet).
+            # That's an idle state, not a wedge — don't trigger auto-restart.
+            return ProbeResult(
+                ok=True,
+                at=now_iso,
+                note="no model currently loaded (idle)",
+            )
+
+        if not models:
+            return ProbeResult(
+                ok=True,
+                at=now_iso,
+                note="no model currently loaded (idle)",
+            )
+
+        model_id = models[0]["id"]
+        # Step 2: model is loaded; verify it can actually complete a 1-token request.
        t0 = time.monotonic()
        try:
            async with httpx.AsyncClient(timeout=PROBE_TIMEOUT_SEC) as c:
@@ -197,7 +219,7 @@ class DeepHealth:
                )
            latency = round((time.monotonic() - t0) * 1000)
            if 200 <= r.status_code < 300:
-                return ProbeResult(ok=True, at=now_iso, latency_ms=latency)
+                return ProbeResult(ok=True, at=now_iso, latency_ms=latency, note=f"model={model_id}")
            return ProbeResult(
                ok=False,
                at=now_iso,