v0.4.0 - NIM installer + dashboard resilience

Hotfix (was v0.3.1):
- services.py: cache 'unreachable' per (host,user) for 25s so a dead Spark doesn't hang every /api/services call behind 6s ssh timeout
- ssh_run timeout reduced 10 -> 6s for docker_state probes
- hardware probe: shorter SSH timeout (6s), longer cache TTL for failures (25s)
- JS pollStatus retries loadModels() if state.models is empty (recovers from cold-start proxy timeout)
- Unreachable hardware card now includes troubleshooting steps (Spark Control cannot SSH into an unreachable Spark to restart it)

v0.4 NIM installer:
- nim.py module: curated SUGGESTED_NIMS list (Parakeet, Magpie, Riva) + NimManager that runs docker login nvcr.io + docker pull + docker run -d --gpus all -p PORT:PORT -v VOL:/opt/nim/.cache -e NGC_API_KEY -e ... --restart=unless-stopped + chown the volume to uid 1000 + restart. Streams all output via SSE; redacts the API key from log lines.
- custom_services.py: persists installed NIMs to /data/services-overrides.yaml so they appear in the services panel after install
- services.py: merges custom services into the panel
- /api/nim/catalog GET, /api/nim/install POST + GET/SSE
- /api/services/{name} DELETE for custom services
- UI: '+ Install NIM' button next to 'Always-on services'; modal lists curated images each with a 'Pick' button + a custom-image form; installation runs in a second dialog with phase + elapsed timer + collapsible log
- NGC API key field added to Configure Sparks (masked); injected as NGC_API_KEY env var into the container

Package: bump 0.4.0:0; main.ts adds SERVICES_OVERRIDES + NGC_API_KEY env vars
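
The install step described in the message (a docker run invocation plus key redaction in the streamed log) might look roughly like the sketch below. This is not the code from nim.py: the helper names build_run_argv and redact, the mask string, and the example image, volume, and key are all hypothetical, and the elided "-e ..." flags from the message are deliberately left out.

```python
def build_run_argv(image: str, port: int, volume: str) -> list[str]:
    """Build the docker run argv using only the flags named in the commit message."""
    return [
        "docker", "run", "-d",
        "--gpus", "all",
        "-p", f"{port}:{port}",
        "-v", f"{volume}:/opt/nim/.cache",
        # "-e NAME" without a value forwards NAME from the caller's environment,
        # so the key itself never appears on the command line.
        "-e", "NGC_API_KEY",
        "--restart=unless-stopped",
        image,
    ]


def redact(line: str, secret: str, mask: str = "***") -> str:
    """Replace every occurrence of the secret in a log line (mask is an assumption)."""
    if not secret:
        return line  # nothing to hide; also avoids str.replace("", ...) surprises
    return line.replace(secret, mask)


if __name__ == "__main__":
    # Hypothetical image and volume names, for illustration only.
    argv = build_run_argv("nvcr.io/example/some-nim:latest", 8000, "nim-cache")
    print(" ".join(argv))
    print(redact("Login with key nvapi-EXAMPLE succeeded", "nvapi-EXAMPLE"))
```

Passing the command as an argv list rather than a shell string avoids quoting problems when the image name comes from the custom-image form.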
2026-05-12 12:32:29 -05:00
parent e88fdcfde4
commit 1889ab45fb
13 changed files with 690 additions and 10 deletions
@@ -84,12 +84,16 @@ def _parse(out: str) -> dict:
 class HardwareProbe:
    """Caches results briefly to avoid hammering the Sparks."""

-    def __init__(self, settings: Settings, ttl_sec: float = 4.0) -> None:
+    def __init__(self, settings: Settings, ttl_sec: float = 4.0, fail_ttl_sec: float = 25.0) -> None:
        self.settings = settings
        self.ttl_sec = ttl_sec
+        self.fail_ttl_sec = fail_ttl_sec
        self._cache: dict[str, tuple[float, dict]] = {}
        self._locks: dict[str, asyncio.Lock] = {}

+    def _ttl_for(self, value: dict) -> float:
+        return self.ttl_sec if value.get("reachable") else self.fail_ttl_sec
+
    def _lock(self, key: str) -> asyncio.Lock:
        if key not in self._locks:
            self._locks[key] = asyncio.Lock()
@@ -108,12 +112,18 @@ class HardwareProbe:
        async with self._lock(key):
            now = time.monotonic()
            cached = self._cache.get(key)
-            if cached and (now - cached[0] < self.ttl_sec):
+            if cached and (now - cached[0] < self._ttl_for(cached[1])):
                return cached[1]
-            rc, out, err = await ssh_run(host, user, _PROBE, self.settings, timeout=12)
+            # Probe with a short 6 s timeout. A host recently marked unreachable
+            # is served from the cache check above and never reaches this call.
+            rc, out, err = await ssh_run(host, user, _PROBE, self.settings, timeout=6)
            if rc != 0:
+                # Cache failures for a slightly longer TTL so the dashboard isn't
+                # blocked behind 6 s of SSH timeout on every poll.
                result = {"reachable": False, "configured": True, "host": host, "error": err.strip() or out.strip() or f"rc={rc}"}
-            else:
-                result = {"reachable": True, "configured": True, "host": host, **_parse(out)}
+                self._cache[key] = (now, result)
+                # The longer failure TTL is applied on read by _ttl_for(); nothing extra is stored.
+                return result
+            result = {"reachable": True, "configured": True, "host": host, **_parse(out)}
            self._cache[key] = (now, result)
            return result
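
The failure-TTL behaviour in the hunk above can be sketched in isolation. The class below is an illustration, not part of the commit: TwoTtlCache and its injectable clock parameter are hypothetical names, and the asyncio locking and SSH call from HardwareProbe are omitted so the expiry rule is easy to see.

```python
import time
from typing import Callable, Optional


class TwoTtlCache:
    """Cache whose entries expire after ttl_sec if reachable, else after fail_ttl_sec."""

    def __init__(
        self,
        ttl_sec: float = 4.0,
        fail_ttl_sec: float = 25.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.ttl_sec = ttl_sec
        self.fail_ttl_sec = fail_ttl_sec
        self._clock = clock
        self._entries: dict[str, tuple[float, dict]] = {}

    def put(self, key: str, value: dict) -> None:
        # Store the insertion time alongside the value; the TTL is chosen on read.
        self._entries[key] = (self._clock(), value)

    def get(self, key: str) -> Optional[dict]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        ttl = self.ttl_sec if value.get("reachable") else self.fail_ttl_sec
        return value if self._clock() - stored_at < ttl else None


if __name__ == "__main__":
    now = [0.0]  # fake clock so the example does not need to sleep
    cache = TwoTtlCache(clock=lambda: now[0])
    cache.put("spark-a", {"reachable": True})
    cache.put("spark-b", {"reachable": False})
    now[0] = 5.0
    print(cache.get("spark-a"))  # success entry is past its 4 s TTL
    print(cache.get("spark-b"))  # failure entry is still inside its 25 s TTL
```

Choosing the TTL at read time from the cached value, as _ttl_for does, means the cache tuple keeps a single shape and no per-entry expiry needs to be stored.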