3 Commits

Author SHA1 Message Date
Grant ee8c2406b8 v0.6.0 - Service-level connectivity tracking + passive failure-report endpoint
connectivity.py:
- Generalized 'spark' subject to any string; renamed 'spark' field to 'subject'
- Legacy v0.5 events with the old 'spark' field are migrated transparently on read (kind defaults to 'transition')
- New record_report(subject, ok, source, detail, latency_ms): always appends an event with kind='report'; does NOT mutate the current state (only active polling is authoritative)
- summary() returns events normalized to the new schema

Wiring:
- /api/status now calls record_state for vllm/parakeet/magpie (dedup on no-change)
- /api/services calls record_state for each service after its http check
- Result: dashboard observes service-level transitions automatically with no extra polling

Passive endpoint:
- POST /api/health-event with {service, ok, source?, error?, ms?}
- Useful for external apps (e.g. Open WebUI) to surface sub-poll-interval failures the dashboard would otherwise miss

UI:
- Connectivity dialog groups events by subject (hosts ordered first, then services)
- Per-subject summary shows transition count, down count, report count, failed-report count
- Transitions and reports render inline with distinct styling; reports show source app + error + latency
- Legacy v0.5 events render unchanged

Docs:
- README documents /api/health-event with a curl example

Package: bump to 0.6.0:0
2026-05-12 13:19:27 -05:00
Grant a02f4db850 v0.5.0 - Wake-on-LAN + connectivity history
wol.py:
- build_magic_packet(): standard 6x0xFF + 16x MAC layout
- send_local_broadcast(): direct from container (ports 9 + 7 for safety)
- send_via_peer(): preferred path; SSHes to the OTHER Spark and runs a Python one-liner there so the packet originates on the target's LAN segment (most reliable)
- MAC validation + normalization
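
The packet layout and broadcast path listed above can be sketched as follows. wol.py itself is not part of the diff shown on this page, so this is only an illustration: the function names match the commit message, but the bodies, the `normalize_mac` helper, the `ports`/`address` parameters and the 255.255.255.255 broadcast address are assumptions.

```python
import re
import socket


def normalize_mac(mac: str) -> str:
    """Return the MAC as 12 lowercase hex digits, or raise ValueError.

    Hypothetical helper: accepts ':', '-' or '.' as separators.
    """
    digits = re.sub(r"[:\-.]", "", mac.strip()).lower()
    if not re.fullmatch(r"[0-9a-f]{12}", digits):
        raise ValueError(f"invalid MAC address: {mac!r}")
    return digits


def build_magic_packet(mac: str) -> bytes:
    """Standard layout: 6 bytes of 0xFF followed by the MAC repeated 16 times."""
    return b"\xff" * 6 + bytes.fromhex(normalize_mac(mac)) * 16


def send_local_broadcast(mac: str, ports=(9, 7), address="255.255.255.255") -> None:
    """Send the magic packet as a UDP broadcast to each port (assumed defaults)."""
    packet = build_magic_packet(mac)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        for port in ports:
            sock.sendto(packet, (address, port))
```

The packet is always 102 bytes (6 + 16 x 6), which gives a cheap sanity check before sending.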

connectivity.py:
- /data/connectivity.json persistence (thread-safe, atomic rename)
- Stores per-Spark current state + last_change timestamp + rolling 200-event log
- Records up/down transitions; computes down_seconds / up_seconds durations
- MAC cache populated lazily during hardware probes

hardware.py:
- Probe now reads MAC via /sys/class/net/<default-route-iface>/address
- After each probe, record_state() emits a transition event if state changed
- record_mac() caches the address so WoL works when the Spark next goes down

Endpoints:
- GET /api/connectivity: macs, current state, last_change, events[]
- POST /api/spark/{name}/wake: tries via-peer first, falls back to direct broadcast

UI:
- Unreachable hardware card shows the cached MAC + 'Wake (WoL)' button (only if MAC known)
- New 'Connectivity log' button opens a modal with per-Spark transition history (last 25 each), including duration of each prior up/down period
- pollHardware also pulls /api/connectivity so WoL buttons appear without an extra fetch

Package: bump 0.5.0:0; main.ts sets CONNECTIVITY_LOG=/data/connectivity.json
2026-05-12 12:51:49 -05:00
Grant 1889ab45fb v0.4.0 - NIM installer + dashboard resilience
Hotfix (was v0.3.1):
- services.py: cache 'unreachable' per (host,user) for 25s so a dead Spark doesn't hang every /api/services call behind 6s ssh timeout
- ssh_run timeout reduced 10 -> 6s for docker_state probes
- hardware probe: shorter SSH timeout (6s), longer cache TTL for failures (25s)
- JS pollStatus retries loadModels() if state.models is empty (recovers from cold-start proxy timeout)
- Unreachable hardware card now includes troubleshooting steps (Spark Control cannot SSH into an unreachable Spark to restart it)

v0.4 NIM installer:
- nim.py module: curated SUGGESTED_NIMS list (Parakeet, Magpie, Riva) + NimManager that runs docker login nvcr.io + docker pull + docker run -d --gpus all -p PORT:PORT -v VOL:/opt/nim/.cache -e NGC_API_KEY -e ... --restart=unless-stopped + chown the volume to uid 1000 + restart. Streams all output via SSE; redacts the API key from log lines.
- custom_services.py: persists installed NIMs to /data/services-overrides.yaml so they appear in the services panel after install
- services.py: merges custom services into the panel
- /api/nim/catalog GET, /api/nim/install POST + GET/SSE
- /api/services/{name} DELETE for custom services
- UI: '+ Install NIM' button next to 'Always-on services'; modal lists curated images each with a 'Pick' button + a custom-image form; installation runs in a second dialog with phase + elapsed timer + collapsible log
- NGC API key field added to Configure Sparks (masked); injected as NGC_API_KEY env var into the container
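
For a client that wants to follow an install without the bundled UI, the SSE frames can be parsed with a few lines of Python. This is a sketch only: it assumes the frame shapes used by the stream handler in this diff (unnamed `data:` frames carrying `{"line": ...}`, plus named `phase` and `done` events with JSON payloads), and `parse_sse` is a hypothetical helper, not part of the package.

```python
import json
from typing import Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[tuple[str, dict]]:
    """Yield (event_name, payload) pairs from an iterable of SSE text lines.

    Frames are separated by blank lines; a frame without an 'event:' field
    uses the default name 'message'. Payloads are assumed to be JSON.
    """
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current frame.
            if data:
                yield event, json.loads("\n".join(data))
            event, data = "message", []
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].lstrip())
```

Feeding it the decoded lines of the HTTP response body yields log lines as `("message", {"line": ...})` and lets the caller stop once a `done` event arrives.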

Package: bump 0.4.0:0; main.ts adds SERVICES_OVERRIDES + NGC_API_KEY env vars
2026-05-12 12:32:29 -05:00
16 changed files with 1228 additions and 11 deletions
+18
@@ -84,6 +84,24 @@ Other services on your LAN can hit `GET /api/endpoints` to learn where the curre
`base_url` is filled in whenever Configure Sparks has been completed (even if the underlying service isn't currently up). Pair the URL with `ready: true` to safely route traffic.
## Reporting failures from external apps
Spark Control polls every 5 s, so a brief blip in Parakeet/Magpie/vLLM availability can slip between polls and never make it into the connectivity log. To capture short failures, an external app (e.g. Open WebUI) can POST whenever a call fails (or succeeds):
```bash
curl -X POST http://<dashboard-url>/api/health-event \
-H 'content-type: application/json' \
-d '{
"service": "parakeet",
"ok": false,
"source": "open-webui",
"error": "HTTP 503",
"ms": 420
}'
```
Fields: `service` (required), `ok` (required), `source` (optional, free-form), `error` (optional), `ms` (optional latency). Each POST appends a `report` event to the connectivity log alongside the polling-based transition events.
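
The same report can be sent from application code. The following is a minimal sketch using only the Python standard library; `build_health_event` and `report_health_event` are hypothetical helper names, and the 2-second timeout and the choice to swallow network errors (so that reporting never breaks the calling app) are assumptions, not part of Spark Control.

```python
import json
import urllib.request


def build_health_event(service, ok, source=None, error=None, ms=None) -> dict:
    """Build the JSON body; optional fields are omitted when not provided."""
    payload = {"service": service, "ok": bool(ok)}
    optional = {"source": source, "error": error, "ms": ms}
    payload.update({k: v for k, v in optional.items() if v is not None})
    return payload


def report_health_event(base_url: str, payload: dict, timeout: float = 2.0) -> bool:
    """POST the payload to /api/health-event; return True on a 2xx response."""
    req = urllib.request.Request(
        base_url.rstrip("/") + "/api/health-event",
        data=json.dumps(payload).encode("utf-8"),
        headers={"content-type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False
```

A caller would typically wrap its own request to Parakeet, Magpie or vLLM, measure the latency, and pass the result to `report_health_event` with the dashboard's base URL.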
## Status
**v0.2.3** — installed and verified on a Start9 server. Five bundled LLMs in the catalog (qwen3-vl, gemma4, qwen36, qwen3-235b-fp8, qwen2.5-72b), plus any custom models added through the UI.
+2
@@ -43,6 +43,7 @@ class Settings:
magpie_port: int
bind_port: int
open_webui_url: str
ngc_api_key: str
@classmethod
def from_env(cls) -> "Settings":
@@ -68,6 +69,7 @@ class Settings:
magpie_port=int(_env("MAGPIE_PORT", "9000")),
bind_port=int(_env("BIND_PORT", "9999")),
open_webui_url=_env("OPEN_WEBUI_URL", ""),
ngc_api_key=_env("NGC_API_KEY", ""),
)
@property
+190
@@ -0,0 +1,190 @@
"""Track up/down transitions for any subject (Sparks AND services) and cache MACs.
Persisted to /data/connectivity.json. Schema:
{
"macs": { "spark1": "aa:bb:..", "spark2": "11:22:.." },
"current": { "spark1": "up", "parakeet": "up", "magpie": "down", ... },
"last_change": { ... },
"events": [
# Active-probe transition (logged when state flips during polling)
{ "subject": "spark2", "at": "...", "kind": "transition",
"transition": "down" },
{ "subject": "spark2", "at": "...", "kind": "transition",
"transition": "up", "down_seconds": 4500 },
# Passive report (logged whenever an external app POSTs to
# /api/health-event regardless of state change)
{ "subject": "parakeet", "at": "...", "kind": "report",
"ok": false, "source": "open-webui",
"detail": "Connection refused", "latency_ms": 320 },
]
}
Legacy events from v0.5 with `spark` instead of `subject` and no `kind` field
are read transparently as kind="transition".
"""
from __future__ import annotations
import json
import os
import threading
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional
MAX_EVENTS = 200 # rolling window — plenty for showing recent history
def _path() -> str:
return os.environ.get("CONNECTIVITY_LOG", "/data/connectivity.json")
_lock = threading.Lock()
def _read() -> dict:
try:
with open(_path()) as f:
return json.load(f) or {}
except (FileNotFoundError, json.JSONDecodeError):
return {}
def _write(data: dict) -> None:
p = _path()
Path(p).parent.mkdir(parents=True, exist_ok=True)
tmp = p + ".tmp"
with open(tmp, "w") as f:
json.dump(data, f, indent=2, sort_keys=False)
os.replace(tmp, p)
def load() -> dict:
with _lock:
d = _read()
d.setdefault("macs", {})
d.setdefault("current", {})
d.setdefault("last_change", {})
d.setdefault("events", [])
return d
def record_mac(subject: str, mac: Optional[str]) -> None:
if not mac:
return
with _lock:
d = _read()
d.setdefault("macs", {})
if d["macs"].get(subject) != mac:
d["macs"][subject] = mac
_write(d)
def record_state(subject: str, reachable: bool) -> Optional[dict]:
"""Update current state for `subject`. If it differs from the last seen
state, append a transition event. Returns the event dict if a transition
was recorded, else None.
`subject` can be a Spark host key (spark1/spark2) or a service name
(parakeet/magpie/vllm).
"""
new_state = "up" if reachable else "down"
now = datetime.now(timezone.utc).isoformat().replace("+00:00", "Z")
with _lock:
d = _read()
d.setdefault("macs", {})
d.setdefault("current", {})
d.setdefault("last_change", {})
d.setdefault("events", [])
prev = d["current"].get(subject)
if prev == new_state:
return None
event: dict = {
"subject": subject,
"at": now,
"kind": "transition",
"transition": new_state,
}
# When we have a previous state and timestamp, compute duration
last_change = d["last_change"].get(subject)
if prev and last_change:
try:
prev_dt = datetime.fromisoformat(last_change.replace("Z", "+00:00"))
duration = (datetime.now(timezone.utc) - prev_dt).total_seconds()
if prev == "down" and new_state == "up":
event["down_seconds"] = round(duration)
if prev == "up" and new_state == "down":
event["up_seconds"] = round(duration)
except ValueError:
pass
d["current"][subject] = new_state
d["last_change"][subject] = now
d["events"].append(event)
if len(d["events"]) > MAX_EVENTS:
d["events"] = d["events"][-MAX_EVENTS:]
_write(d)
return event
def record_report(
subject: str,
*,
ok: bool,
source: str = "external",
detail: str = "",
latency_ms: Optional[int] = None,
) -> dict:
"""Record a passive report from an external caller (e.g. Open WebUI got a
503 calling Parakeet). Always appended to the events list; does NOT change
the active-probe state (which only the polling probe is authoritative on).
"""
now = datetime.now(timezone.utc).isoformat().replace("+00:00", "Z")
with _lock:
d = _read()
d.setdefault("events", [])
event: dict = {
"subject": subject,
"at": now,
"kind": "report",
"ok": bool(ok),
"source": source or "external",
}
if detail:
event["detail"] = detail
if latency_ms is not None:
event["latency_ms"] = int(latency_ms)
d["events"].append(event)
if len(d["events"]) > MAX_EVENTS:
d["events"] = d["events"][-MAX_EVENTS:]
_write(d)
return event
def get_mac(subject: str) -> Optional[str]:
d = load()
return d.get("macs", {}).get(subject)
def _normalize_event(e: dict) -> dict:
"""Promote legacy v0.5 events to the v0.6 shape so the UI sees one schema."""
if "subject" in e:
e.setdefault("kind", "transition")
return e
# Legacy: had "spark" + "transition" only
if "spark" in e:
e["subject"] = e.pop("spark")
e.setdefault("kind", "transition")
return e
def summary() -> dict:
"""Compact summary for the UI: known MACs, current state, recent events."""
d = load()
events = [_normalize_event(dict(e)) for e in d.get("events", [])]
return {
"macs": d.get("macs", {}),
"current": d.get("current", {}),
"last_change": d.get("last_change", {}),
"events": events[-80:],
}
+59
@@ -0,0 +1,59 @@
"""User-installed services persist in /data/services-overrides.yaml.
Format:
custom:
- key: my-riva
kind: stt
host: <spark-2-ip>
user: <spark-user>
container: riva-asr
port: 8001
health_path: /health
image: nvcr.io/nim/nvidia/riva-multilingual:latest
"""
from __future__ import annotations
import os
from pathlib import Path
import yaml
def _path() -> str:
return os.environ.get("SERVICES_OVERRIDES", "/data/services-overrides.yaml")
def load_custom_services() -> list[dict]:
try:
with open(_path()) as f:
data = yaml.safe_load(f) or {}
except FileNotFoundError:
return []
return data.get("custom") or []
def add_custom_service(entry: dict) -> None:
p = _path()
Path(p).parent.mkdir(parents=True, exist_ok=True)
data: dict = {}
try:
with open(p) as f:
data = yaml.safe_load(f) or {}
except FileNotFoundError:
pass
custom = data.get("custom") or []
custom = [c for c in custom if c.get("key") != entry["key"]]
custom.append(entry)
data["custom"] = custom
with open(p, "w") as f:
yaml.safe_dump(data, f, sort_keys=False)
def delete_custom_service(key: str) -> None:
p = _path()
try:
with open(p) as f:
data = yaml.safe_load(f) or {}
except FileNotFoundError:
return
data["custom"] = [c for c in (data.get("custom") or []) if c.get("key") != key]
with open(p, "w") as f:
yaml.safe_dump(data, f, sort_keys=False)
+23 -5
@@ -10,6 +10,7 @@ import time
from typing import Any
from .config import Settings
from .connectivity import record_mac, record_state
from .ssh import ssh_run
@@ -23,6 +24,8 @@ echo MEMORY=$(free -b 2>/dev/null | awk '/^Mem:/ {print $2, $3}')
echo DISK=$(df -B1 / 2>/dev/null | awk 'NR==2 {print $2, $3}')
echo GPU=$(nvidia-smi --query-gpu=name,utilization.gpu,temperature.gpu,power.draw,memory.total --format=csv,noheader,nounits 2>/dev/null | head -1)
echo GPU_MEM_USED_MIB=$(nvidia-smi --query-compute-apps=used_gpu_memory --format=csv,noheader,nounits 2>/dev/null | awk '{s+=$1} END {print s+0}')
DEFIF=$(ip route show default 2>/dev/null | awk '{print $5; exit}')
echo MAC=$(cat /sys/class/net/$DEFIF/address 2>/dev/null)
""".strip()
@@ -78,18 +81,25 @@ def _parse(out: str) -> dict:
# Sum per-process compute memory (works even on unified-memory systems)
if info.get("gpu_mem_used_mib"):
parsed["gpu_mem_used_mib"] = _parse_int(info["gpu_mem_used_mib"])
+# MAC address on the default-route interface (for Wake-on-LAN)
+if info.get("mac"):
+parsed["mac"] = info["mac"].lower()
return parsed
class HardwareProbe:
"""Caches results briefly to avoid hammering the Sparks."""
-def __init__(self, settings: Settings, ttl_sec: float = 4.0) -> None:
+def __init__(self, settings: Settings, ttl_sec: float = 4.0, fail_ttl_sec: float = 25.0) -> None:
self.settings = settings
self.ttl_sec = ttl_sec
+self.fail_ttl_sec = fail_ttl_sec
self._cache: dict[str, tuple[float, dict]] = {}
self._locks: dict[str, asyncio.Lock] = {}
+def _ttl_for(self, value: dict) -> float:
+return self.ttl_sec if value.get("reachable") else self.fail_ttl_sec
def _lock(self, key: str) -> asyncio.Lock:
if key not in self._locks:
self._locks[key] = asyncio.Lock()
@@ -108,12 +118,20 @@ class HardwareProbe:
async with self._lock(key):
now = time.monotonic()
cached = self._cache.get(key)
-if cached and (now - cached[0] < self.ttl_sec):
+if cached and (now - cached[0] < self._ttl_for(cached[1])):
return cached[1]
-rc, out, err = await ssh_run(host, user, _PROBE, self.settings, timeout=12)
+# Use a shorter timeout for the connect phase; if a previous probe
+# marked this host unreachable, return the cached failure immediately.
+rc, out, err = await ssh_run(host, user, _PROBE, self.settings, timeout=6)
if rc != 0:
result = {"reachable": False, "configured": True, "host": host, "error": err.strip() or out.strip() or f"rc={rc}"}
-else:
-result = {"reachable": True, "configured": True, "host": host, **_parse(out)}
+self._cache[key] = (now, result)
+record_state(key, False)
+return result
+parsed = _parse(out)
+result = {"reachable": True, "configured": True, "host": host, **parsed}
self._cache[key] = (now, result)
+record_state(key, True)
+if parsed.get("mac"):
+record_mac(key, parsed["mac"])
return result
+202
@@ -0,0 +1,202 @@
"""NVIDIA NIM container install / lifecycle.
Two pieces:
* A small curated catalog of NIM images (so users don't have to copy/paste
huge nvcr.io URLs).
* An installer that SSHes into the target Spark, runs `docker pull` then
`docker run -d --gpus all -p PORT:PORT -v VOLUME:/opt/nim/.cache
-e NGC_API_KEY=... IMAGE` and streams output.
Custom services also persist via `overrides.add_custom_service()` so the
Services panel can show them.
"""
from __future__ import annotations
import asyncio
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional
from .config import Settings
from .ssh import ssh_stream, StreamHandle
# Curated list. These are the most useful NIM containers for a dual-Spark
# audio-and-LLM setup. Browse the full catalog at
# https://catalog.ngc.nvidia.com/orgs/nim/teams/nvidia
CATALOG_URL = "https://catalog.ngc.nvidia.com/orgs/nim/teams/nvidia/containers"
SUGGESTED_NIMS: list[dict] = [
{
"key": "parakeet-tdt-0.6b-v3",
"name": "Parakeet TDT 0.6B v3",
"image": "nvcr.io/nim/nvidia/parakeet-tdt-0-6b-v3:latest",
"default_container": "parakeet-asr",
"default_port": 8000,
"kind": "stt",
"description": "Streaming speech-to-text (English). Used by Open WebUI for voice input. ~1 GB.",
"homepage": "https://catalog.ngc.nvidia.com/orgs/nim/teams/nvidia/containers/parakeet-tdt-0-6b-v3",
},
{
"key": "magpie-tts-multilingual",
"name": "Magpie TTS Multilingual",
"image": "nvcr.io/nim/nvidia/magpie-tts-multilingual:latest",
"default_container": "magpie-tts",
"default_port": 9000,
"kind": "tts",
"description": "Multilingual text-to-speech. Counterpart to Parakeet for 'read aloud'. ~3 GB.",
"homepage": "https://catalog.ngc.nvidia.com/orgs/nim/teams/nvidia/containers/magpie-tts-multilingual",
},
{
"key": "riva-multilingual",
"name": "Riva Multilingual ASR",
"image": "nvcr.io/nim/nvidia/riva-multilingual:latest",
"default_container": "riva-asr",
"default_port": 8001,
"kind": "stt",
"description": "NVIDIA Riva speech-recognition multi-language model. Larger and more accurate than Parakeet.",
"homepage": "https://catalog.ngc.nvidia.com/orgs/nim/teams/nvidia",
},
]
@dataclass
class NimInstallJob:
id: str
image: str
container: str
port: int
host: str
user: str
volume: Optional[str]
started_at: str
state: str = "starting" # starting | pulling | running | done | failed
phase: str = "Starting…"
lines: list[str] = field(default_factory=list)
returncode: Optional[int] = None
finished_at: Optional[str] = None
def append(self, line: str) -> None:
self.lines.append(line)
if len(self.lines) > 1000:
del self.lines[: len(self.lines) - 1000]
class NimManager:
def __init__(self, settings: Settings) -> None:
self.settings = settings
self.lock = asyncio.Lock()
self.jobs: dict[str, NimInstallJob] = {}
self.current_job_id: Optional[str] = None
def get(self, job_id: str) -> NimInstallJob | None:
return self.jobs.get(job_id)
async def trigger(
self,
*,
image: str,
container: str,
port: int,
host: str,
user: str,
volume: str | None = None,
extra_env: dict[str, str] | None = None,
) -> NimInstallJob:
if self.lock.locked():
raise RuntimeError("Another NIM install is already in progress")
if not host or not user:
raise RuntimeError("target host not configured")
if not self.settings.ngc_api_key:
raise RuntimeError(
"NGC_API_KEY is not set. Open Configure Sparks in StartOS and paste your NGC personal API key (free at https://ngc.nvidia.com/setup/personal-key)."
)
job = NimInstallJob(
id=uuid.uuid4().hex[:8],
image=image,
container=container,
port=port,
host=host,
user=user,
volume=volume or f"{container}-cache",
started_at=datetime.now(timezone.utc).isoformat(),
)
self.jobs[job.id] = job
self.current_job_id = job.id
asyncio.create_task(self._run(job, extra_env or {}))
return job
async def _run(self, job: NimInstallJob, extra_env: dict[str, str]) -> None:
async with self.lock:
try:
await self._do(job, extra_env)
if job.state != "failed":
job.state = "done"
job.returncode = 0
job.phase = "Done"
except Exception as e:
job.append(f"[error] {type(e).__name__}: {e}")
job.state = "failed"
if job.returncode is None:
job.returncode = 1
finally:
job.finished_at = datetime.now(timezone.utc).isoformat()
if self.current_job_id == job.id:
self.current_job_id = None
async def _do(self, job: NimInstallJob, extra_env: dict[str, str]) -> None:
# Build the bash one-liner. We use docker login non-interactively with the NGC API key.
env_parts = [f'-e NGC_API_KEY=$NGC_API_KEY']
for k, v in extra_env.items():
env_parts.append(f"-e {k}={v}")
env_str = " ".join(env_parts)
cmd = (
f"set -e; "
f"export NGC_API_KEY='{self.settings.ngc_api_key}'; "
f"echo '=== docker login nvcr.io ==='; "
f"echo \"$NGC_API_KEY\" | docker login nvcr.io -u '$oauthtoken' --password-stdin; "
f"echo '=== docker pull {job.image} (this can be 1-10 GB) ==='; "
f"docker pull {job.image}; "
f"echo '=== remove any prior container with the same name ==='; "
f"docker rm -f {job.container} 2>/dev/null || true; "
f"echo '=== docker run -d --gpus all -p {job.port}:{job.port} -v {job.volume}:/opt/nim/.cache {env_str} --name {job.container} --restart unless-stopped {job.image} ==='; "
f"docker run -d --gpus all "
f"-p {job.port}:{job.port} "
f"-v {job.volume}:/opt/nim/.cache "
f"{env_str} "
f"--name {job.container} "
f"--restart unless-stopped "
f"{job.image}; "
f"echo '=== ensuring cache volume is writable by uid 1000 (riva-server) ==='; "
f"docker run --rm -v {job.volume}:/cache alpine chown -R 1000:1000 /cache && "
f"docker restart {job.container}; "
f"echo '=== install complete; container is starting up and will download its model on first boot ==='"
)
job.append(f"$ <install command for {job.image} on {job.host}>")
job.state = "pulling"
job.phase = "Pulling image from nvcr.io (this can take a few minutes)…"
handle = StreamHandle()
async for line in ssh_stream(job.host, job.user, cmd, self.settings, handle=handle):
# Don't log lines containing the api key
if self.settings.ngc_api_key and self.settings.ngc_api_key in line:
continue
job.append(line)
if "docker pull" in line:
job.phase = "Pulling image from nvcr.io…"
elif "Login Succeeded" in line:
job.phase = "Logged in to NGC; pulling image…"
elif "Pull complete" in line:
job.phase = "Pulling layers…"
elif "Status: Downloaded newer image" in line or "Image is up to date" in line:
job.phase = "Image ready; starting container…"
elif "docker run -d" in line:
job.state = "running"
job.phase = "Container starting; downloading model on first boot…"
rc = handle.returncode or 0
if rc != 0:
job.state = "failed"
job.returncode = rc
+188
@@ -10,15 +10,19 @@ from pydantic import BaseModel
from typing import Literal
from .config import Settings
from .connectivity import get_mac, record_report, record_state, summary as connectivity_summary
from .custom_services import add_custom_service, delete_custom_service
from .download import DownloadManager
from .hardware import HardwareProbe
from .health import check_magpie, check_parakeet, check_vllm
from .models import load_catalog
from .nim import SUGGESTED_NIMS, CATALOG_URL, NimManager
from .overrides import add_custom, delete_custom, extract_knobs_from_args, load_overrides, set_knobs
from .services import docker_state, run_action, services_from_settings
from .ssh import ssh_run
from .swap import SwapManager
from .updates import UpdateManager, get_update_status
from .wol import send_local_broadcast, send_via_peer
settings = Settings.from_env()
@@ -27,6 +31,7 @@ swap_manager = SwapManager(settings, catalog)
download_manager = DownloadManager(settings)
update_manager = UpdateManager(settings)
hardware_probe = HardwareProbe(settings)
nim_manager = NimManager(settings)
app = FastAPI(title="spark-control", version="0.1.0")
@@ -125,6 +130,81 @@ async def get_hardware() -> dict:
return await hardware_probe.fetch()
@app.get("/api/connectivity")
async def get_connectivity() -> dict:
"""Up/down transition log per Spark + cached MACs."""
return connectivity_summary()
class HealthEventBody(BaseModel):
service: str # e.g. "parakeet", "magpie", "vllm"
ok: bool # true on success, false on failure
source: str | None = None # what app reported (e.g. "open-webui")
error: str | None = None # optional detail
ms: int | None = None # optional latency
@app.post("/api/health-event")
async def post_health_event(body: HealthEventBody) -> dict:
"""Passive endpoint: any LAN app can POST here when its call to one of our
services succeeds or (more usefully) fails. We log the report into the
connectivity history so a brief blip that polling misses still surfaces.
Example:
curl -X POST http://<dashboard>/api/health-event \\
-H 'content-type: application/json' \\
-d '{"service":"parakeet","ok":false,"error":"503","source":"open-webui","ms":420}'
"""
if not body.service.strip():
raise HTTPException(400, "service is required")
event = record_report(
body.service.strip(),
ok=body.ok,
source=(body.source or "external").strip(),
detail=(body.error or "").strip(),
latency_ms=body.ms,
)
return {"ok": True, "recorded": event}
@app.post("/api/spark/{name}/wake")
async def wake_spark(name: str) -> dict:
"""Send a Wake-on-LAN magic packet for the named Spark.
Tries the OTHER Spark (if reachable) first because the packet has to
originate on the target's LAN segment to be reliable. Falls back to a
direct UDP broadcast from this container.
"""
if name not in ("spark1", "spark2"):
raise HTTPException(404, f"unknown spark: {name}")
mac = get_mac(name)
if not mac:
raise HTTPException(400, f"MAC for {name} not yet known; bring it up once so we can probe it, then this will work next time it sleeps")
# Find the peer's connectivity to decide the path.
other = "spark2" if name == "spark1" else "spark1"
other_host = settings.spark1_host if other == "spark1" else settings.spark2_host
other_user = settings.spark1_user if other == "spark1" else settings.spark2_user
delivered_via = None
via_peer_ok = False
via_peer_err = ""
if other_host and other_user:
via_peer_ok, via_peer_err = await send_via_peer(other_host, other_user, mac, settings)
if via_peer_ok:
delivered_via = other
if not via_peer_ok:
# Fall back to direct from this container
try:
send_local_broadcast(mac)
delivered_via = "container"
except Exception as e:
raise HTTPException(500, f"WoL failed: peer={via_peer_err!r} container={e!r}")
return {"ok": True, "spark": name, "mac": mac, "delivered_via": delivered_via}
@app.get("/api/services")
async def get_services() -> dict:
"""Lifecycle state of always-on support services (Parakeet, Magpie, …).
@@ -167,9 +247,113 @@ async def get_services() -> dict:
results = await asyncio.gather(*[one(n) for n in services.keys()])
for name, info in results:
out[name] = info
# Feed http reachability into the connectivity log (transition-only)
record_state(name, bool(info.get("http_ready")))
return out
@app.get("/api/nim/catalog")
async def get_nim_catalog() -> dict:
return {
"catalog_url": CATALOG_URL,
"ngc_key_configured": bool(settings.ngc_api_key),
"suggested": SUGGESTED_NIMS,
}
class NimInstallBody(BaseModel):
image: str
container: str
port: int
host: Literal["spark1", "spark2"] = "spark2"
kind: str = ""
register: bool = True # write to custom services overrides after install
@app.post("/api/nim/install")
async def post_nim_install(body: NimInstallBody) -> dict:
target_host = settings.spark1_host if body.host == "spark1" else settings.spark2_host
target_user = settings.spark1_user if body.host == "spark1" else settings.spark2_user
try:
job = await nim_manager.trigger(
image=body.image,
container=body.container,
port=body.port,
host=target_host,
user=target_user,
)
except RuntimeError as e:
raise HTTPException(409 if "in progress" in str(e) else 400, str(e))
if body.register:
# Persist in custom services so the panel shows it after install.
add_custom_service({
"key": body.container,
"kind": body.kind or "nim",
"host": target_host,
"user": target_user,
"container": body.container,
"port": body.port,
"image": body.image,
})
return {"job_id": job.id, "image": job.image, "container": job.container, "state": job.state}
@app.get("/api/nim/install/{job_id}")
async def get_nim_install(job_id: str) -> dict:
job = nim_manager.get(job_id)
if job is None:
raise HTTPException(404, "no such job")
return {
"id": job.id,
"image": job.image,
"container": job.container,
"port": job.port,
"host": job.host,
"state": job.state,
"phase": job.phase,
"started_at": job.started_at,
"finished_at": job.finished_at,
"returncode": job.returncode,
"lines": job.lines,
}
@app.get("/api/nim/install/{job_id}/stream")
async def stream_nim_install(job_id: str):
job = nim_manager.get(job_id)
if job is None:
raise HTTPException(404, "no such job")
async def gen():
sent = 0
last_phase = None
while True:
n = len(job.lines)
if n > sent:
for line in job.lines[sent:n]:
yield f"data: {json.dumps({'line': line})}\n\n"
sent = n
if job.phase != last_phase:
yield f"event: phase\ndata: {json.dumps({'state': job.state, 'phase': job.phase})}\n\n"
last_phase = job.phase
if job.returncode is not None and sent >= len(job.lines):
yield f"event: done\ndata: {json.dumps({'state': job.state, 'returncode': job.returncode})}\n\n"
return
await asyncio.sleep(0.5)
return StreamingResponse(gen(), media_type="text/event-stream")
@app.delete("/api/services/{name}")
async def del_service(name: str) -> dict:
# Only allow deleting custom services (not the bundled parakeet/magpie keys)
if name in ("parakeet", "magpie"):
raise HTTPException(400, "built-in service; cannot delete (use Configure Sparks to point at a different host)")
delete_custom_service(name)
return {"ok": True, "name": name}
@app.post("/api/services/{name}/{action}")
async def service_action(name: str, action: str) -> dict:
services = services_from_settings(settings)
@@ -221,6 +405,10 @@ async def get_status() -> dict:
check_parakeet(settings), check_parakeet(settings),
check_magpie(settings), check_magpie(settings),
) )
# Feed health into the connectivity log (deduped — only logs on transition)
record_state("vllm", bool(vllm.get("ok")))
record_state("parakeet", bool(parakeet.get("ok")))
record_state("magpie", bool(magpie.get("ok")))
current_key = _identify_current_model(vllm.get("current_model")) current_key = _identify_current_model(vllm.get("current_model"))
return { return {
"configured": settings.configured, "configured": settings.configured,
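
The `record_state` calls in this hunk depend on deduplication: a poll that observes no change should not add an event. connectivity.py is not part of this hunk, so the following is only a sketch of that idea under assumed names (`TransitionLog`, an in-memory list, and the choice not to log the first observation); the real module persists to `/data/connectivity.json` and may differ in detail.

```python
from datetime import datetime, timezone


class TransitionLog:
    """Hypothetical in-memory stand-in for the connectivity log."""

    def __init__(self, max_events: int = 200) -> None:
        self.current: dict[str, bool] = {}
        self.events: list[dict] = []
        self.max_events = max_events

    def record_state(self, subject: str, ok: bool) -> bool:
        """Append an event only when `subject` changes state.

        Returns True if a transition was logged.
        """
        previous = self.current.get(subject)
        self.current[subject] = ok
        if previous is None or previous == ok:
            # First observation or no change: nothing to log.
            return False
        self.events.append({
            "subject": subject,
            "kind": "transition",
            "transition": "up" if ok else "down",
            "at": datetime.now(timezone.utc).isoformat(),
        })
        # Keep a rolling window of the most recent events.
        del self.events[:-self.max_events]
        return True


log = TransitionLog()
for ok in (True, True, False, False, True):
    log.record_state("vllm", ok)
transitions = [e["transition"] for e in log.events]
```

Five polls produce only two events here, one `down` and one `up`, so calling `record_state` on every `/api/status` poll does not flood the log.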
+42 -2
@@ -5,6 +5,7 @@ machinery. We just run `docker start|stop|restart <container>` via SSH on the
appropriate host.
"""
from __future__ import annotations

import time
from dataclasses import dataclass
from typing import Literal, Optional
@@ -12,6 +13,25 @@ from .config import Settings
from .ssh import ssh_run

# Cache the "unreachable" verdict per (host, user) for a short period so that a
# repeated docker_state call doesn't re-pay the 6 s SSH connect timeout each time.
_UNREACHABLE_TTL = 25.0
_unreachable_cache: dict[tuple[str, str], float] = {}


def _is_recently_unreachable(host: str, user: str) -> bool:
    ts = _unreachable_cache.get((host, user))
    return bool(ts and time.monotonic() - ts < _UNREACHABLE_TTL)


def _mark_unreachable(host: str, user: str) -> None:
    _unreachable_cache[(host, user)] = time.monotonic()


def _clear_unreachable(host: str, user: str) -> None:
    _unreachable_cache.pop((host, user), None)


ServiceName = Literal["parakeet", "magpie"]
ServiceAction = Literal["start", "stop", "restart"]
@@ -27,7 +47,8 @@ class ServiceDef:
def services_from_settings(s: Settings) -> dict[str, ServiceDef]:
-    return {
+    from .custom_services import load_custom_services
    out: dict[str, ServiceDef] = {
        "parakeet": ServiceDef(
            name="parakeet",
            kind="stt",
@@ -45,19 +66,38 @@ def services_from_settings(s: Settings) -> dict[str, ServiceDef]:
            port=s.magpie_port,
        ),
    }
    for entry in load_custom_services():
        key = entry.get("key")
        if not key or key in out:
            continue
        out[key] = ServiceDef(
            name=key,
            kind=entry.get("kind", ""),
            host=entry.get("host", ""),
            user=entry.get("user", ""),
            container=entry.get("container", key),
            port=int(entry.get("port", 0)),
        )
    return out

async def docker_state(settings: Settings, svc: ServiceDef) -> dict:
    """Get docker state (running, exited, restarting, etc.) + restart count."""
    if not svc.host or not svc.user:
        return {"state": "unconfigured", "restart_count": None, "uptime": None}
    if _is_recently_unreachable(svc.host, svc.user):
        return {"state": "unreachable", "host_unreachable": True, "restart_count": None, "uptime": None}
    cmd = (
        f"docker inspect {svc.container} "
        f"--format '{{{{.State.Status}}}}|{{{{.State.StartedAt}}}}|{{{{.RestartCount}}}}|{{{{.State.ExitCode}}}}|{{{{.State.Error}}}}' "
        f"2>&1 || echo 'NOT_FOUND'"
    )
-    rc, out, _ = await ssh_run(svc.host, svc.user, cmd, settings, timeout=10)
+    rc, out, _ = await ssh_run(svc.host, svc.user, cmd, settings, timeout=6)
    out = out.strip()
    if rc == 124 or "timeout after" in out.lower():
        _mark_unreachable(svc.host, svc.user)
        return {"state": "unreachable", "host_unreachable": True, "restart_count": None, "uptime": None}
    _clear_unreachable(svc.host, svc.user)
    if rc != 0 or out.startswith("NOT_FOUND") or "Error" in out and "no such object" in out.lower():
        return {"state": "missing", "restart_count": None, "uptime": None, "raw": out}
    parts = out.split("|")
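
The unreachable-host cache in this file is a small negative cache with a time-to-live: a failure is remembered for 25 seconds, during which `docker_state` answers immediately instead of waiting on SSH again. The sketch below isolates that pattern with an injectable clock so the expiry can be exercised without sleeping. `NegativeCache`, its `clock` parameter, and the sample `("spark2", "ubuntu")` key are illustrative names, not part of the commit.

```python
import time
from typing import Callable, Hashable


class NegativeCache:
    """Remember recent failures for `ttl` seconds (illustrative only)."""

    def __init__(self, ttl: float, clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl
        self.clock = clock
        self._failed_at: dict[Hashable, float] = {}

    def mark(self, key: Hashable) -> None:
        """Record a failure for `key` at the current clock reading."""
        self._failed_at[key] = self.clock()

    def clear(self, key: Hashable) -> None:
        """Forget any recorded failure, e.g. after a successful call."""
        self._failed_at.pop(key, None)

    def is_recent(self, key: Hashable) -> bool:
        ts = self._failed_at.get(key)
        # Compare against None explicitly: a timestamp of 0.0 is still valid.
        return ts is not None and self.clock() - ts < self.ttl


# Fake clock held in a list so the lambda can observe later updates.
now = [100.0]
cache = NegativeCache(ttl=25.0, clock=lambda: now[0])
cache.mark(("spark2", "ubuntu"))
hit_early = cache.is_recent(("spark2", "ubuntu"))
now[0] += 30.0  # advance past the TTL
hit_late = cache.is_recent(("spark2", "ubuntu"))
```

Using a monotonic clock, as the commit does, keeps the TTL immune to wall-clock adjustments on the host.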
+274
@@ -121,10 +121,110 @@ function bar(usedPct, warn) {
async function pollHardware() {
try {
state.hardware = await fetchJSON('/api/hardware');
try { state.connectivity = await fetchJSON('/api/connectivity'); } catch {}
renderHardware();
} catch (e) { console.warn('hardware poll failed', e); }
}
function fmtDuration(sec) {
if (sec == null) return '';
if (sec < 60) return `${Math.round(sec)}s`;
if (sec < 3600) return `${Math.round(sec / 60)}m`;
if (sec < 86400) {
const h = Math.floor(sec / 3600);
const m = Math.round((sec % 3600) / 60);
return m ? `${h}h ${m}m` : `${h}h`;
}
const d = Math.floor(sec / 86400);
const h = Math.round((sec % 86400) / 3600);
return h ? `${d}d ${h}h` : `${d}d`;
}
function openConnectivityDialog() {
const dlg = el('#connectivity-dialog');
const content = el('#connectivity-content');
const c = state.connectivity || {};
const events = c.events || [];
if (events.length === 0) {
content.innerHTML = '<div class="muted small">No events recorded yet. Once a Spark or service goes down and back up (or an external app reports a failure), entries appear here.</div>';
dlg.showModal();
return;
}
const bySubject = {};
for (const e of events) {
const subj = e.subject || e.spark || 'unknown'; // legacy fallback
(bySubject[subj] = bySubject[subj] || []).push(e);
}
// Sort subjects: hosts first, then services, alphabetical
const hostOrder = ['spark1', 'spark2'];
const subjects = Object.keys(bySubject).sort((a, b) => {
const ia = hostOrder.indexOf(a);
const ib = hostOrder.indexOf(b);
if (ia >= 0 && ib >= 0) return ia - ib;
if (ia >= 0) return -1;
if (ib >= 0) return 1;
return a.localeCompare(b);
});
const html = subjects.map((subj) => {
const evs = bySubject[subj];
const transitions = evs.filter(e => (e.kind || 'transition') === 'transition');
const reports = evs.filter(e => e.kind === 'report');
const downs = transitions.filter(e => e.transition === 'down').length;
const failedReports = reports.filter(e => !e.ok).length;
const mac = c.macs?.[subj];
const summaryParts = [];
if (transitions.length) summaryParts.push(`${transitions.length} probe transition${transitions.length===1?'':'s'} (${downs} down)`);
if (reports.length) summaryParts.push(`${reports.length} app report${reports.length===1?'':'s'} (${failedReports} failed)`);
const isHost = hostOrder.includes(subj);
return `
<div class="conn-spark">
<h4>${escapeHtml(subj)}${isHost ? ' <span class="muted small">[host]</span>' : ' <span class="muted small">[service]</span>'}${mac ? ` <span class="muted small">${escapeHtml(mac)}</span>` : ''}</h4>
<div class="conn-summary">${summaryParts.join(' · ') || 'no events'}</div>
${evs.slice(-30).reverse().map(e => renderConnEvent(e)).join('')}
</div>
`;
}).join('');
content.innerHTML = html;
dlg.showModal();
}
function renderConnEvent(e) {
const when = escapeHtml((e.at || '').replace('T', ' ').replace('Z', ''));
const kind = e.kind || 'transition';
if (kind === 'report') {
const ok = !!e.ok;
const source = escapeHtml(e.source || 'external');
const detail = e.detail ? `: ${escapeHtml(e.detail)}` : ''; // separator keeps the detail from running into the source name
const latency = e.latency_ms != null ? ` (${e.latency_ms} ms)` : '';
return `
<div class="conn-event ${ok ? 'up' : 'down'} report">
<span class="when">${when}</span>
<span class="what">${ok ? '◷ report: ok' : '◷ report: failed'} <span class="muted">from</span> ${source}${detail}</span>
<span class="dur">${latency}</span>
</div>
`;
}
const down = e.down_seconds != null ? `was down ${fmtDuration(e.down_seconds)}` : '';
const up = e.up_seconds != null ? `was up ${fmtDuration(e.up_seconds)}` : '';
return `
<div class="conn-event ${e.transition}">
<span class="when">${when}</span>
<span class="what">${e.transition === 'up' ? '↑ came back online' : '↓ dropped offline'}</span>
<span class="dur">${down}${up}</span>
</div>
`;
}
async function wakeSpark(name) {
try {
const r = await fetchJSON(`/api/spark/${name}/wake`, { method: 'POST' });
alert(`Wake-on-LAN sent to ${name} (MAC ${r.mac}, via ${r.delivered_via}). Give it ~30 seconds to wake; the card will go green when it comes back.`);
} catch (e) {
alert(`Wake failed: ${e.message}`);
}
}
function renderHardware() {
const panel = el('#hardware-panel');
const grid = el('#hardware-grid');
@@ -138,12 +238,30 @@ function renderHardware() {
const card = document.createElement('div');
if (!s.reachable) {
card.className = 'hw-card unreachable';
const mac = state.connectivity?.macs?.[key];
const wolRow = mac
? `<div class="wol-row">
<span class="mac-display">${escapeHtml(mac)}</span>
<span class="spacer"></span>
<button class="btn" data-wake="${escapeHtml(key)}">Wake (WoL)</button>
</div>`
: `<div class="muted small">MAC not yet known — once it's been up once with this dashboard installed, "Wake" will appear here.</div>`;
card.innerHTML = `
<div class="head">
<span class="name">${escapeHtml(key)}</span>
<span class="meta">unreachable</span>
</div>
<div class="muted small">${escapeHtml(s.host || '')}${escapeHtml(s.error || 'no response')}</div>
${wolRow}
<div class="muted small" style="line-height:1.5">
If Wake-on-LAN doesn't bring it back, manual steps:
<ol style="margin: 6px 0 0 18px; padding: 0;">
<li>Verify it's powered on (check the front LED).</li>
<li>Ping it from another LAN device.</li>
<li>Power-cycle it physically.</li>
<li>If it boots, this card will go green again automatically.</li>
</ol>
</div>
`;
grid.appendChild(card);
continue;
@@ -510,6 +628,10 @@ async function pollStatus() {
renderCurrent(status);
renderEndpoint(status);
renderHealth(status);
// If the models haven't loaded yet (init may have hit a transient proxy timeout), retry.
if (!state.models || Object.keys(state.models).length === 0) {
try { await loadModels(); } catch {}
}
// Refresh services state lazily; every 5s poll triggers this too.
try {
state.services = await fetchJSON('/api/services');
@@ -953,6 +1075,147 @@ function setupAdvancedDialog() {
el('#adv-gmu').addEventListener('input', (e) => { el('#adv-gmu-out').value = parseFloat(e.target.value).toFixed(2); });
}
// ===================== NIM installer =====================
const nimState = {
catalog: null,
job_id: null,
eventsource: null,
timer: null,
started_at: null,
};
async function loadNimCatalog() {
try {
nimState.catalog = await fetchJSON('/api/nim/catalog');
el('#nim-catalog-link').href = nimState.catalog.catalog_url;
const warn = el('#nim-key-warn');
if (!nimState.catalog.ngc_key_configured) {
warn.classList.add('nim-key-warn');
warn.innerHTML = '⚠️ NGC API key not set. Open <strong>Configure Sparks</strong> in StartOS and paste your NGC personal API key, otherwise installs will fail. <a href="https://ngc.nvidia.com/setup/personal-key" target="_blank" rel="noopener">Get a key</a>';
} else {
warn.classList.remove('nim-key-warn');
warn.textContent = '';
}
const grid = el('#nim-suggested');
grid.innerHTML = '';
for (const s of nimState.catalog.suggested || []) {
const card = document.createElement('div');
card.className = 'nim-card';
card.innerHTML = `
<div class="info">
<div class="name">${escapeHtml(s.name)} <span class="muted small">· ${escapeHtml(s.kind || 'nim')}</span></div>
<div class="desc">${escapeHtml(s.description || '')}</div>
<div class="img">${escapeHtml(s.image)}</div>
<div class="links">${s.homepage ? `<a href="${escapeHtml(s.homepage)}" target="_blank" rel="noopener">View on NGC ↗</a>` : ''}</div>
</div>
<button type="button" class="btn primary nim-pick" data-image="${escapeHtml(s.image)}" data-container="${escapeHtml(s.default_container)}" data-port="${s.default_port}" data-kind="${escapeHtml(s.kind)}">Pick</button>
`;
grid.appendChild(card);
}
grid.querySelectorAll('.nim-pick').forEach(btn => {
btn.addEventListener('click', () => {
el('#nim-image').value = btn.dataset.image;
el('#nim-container').value = btn.dataset.container;
el('#nim-port').value = btn.dataset.port;
el('#nim-kind').value = btn.dataset.kind || 'nim';
});
});
} catch (e) { console.warn('nim catalog failed', e); }
}
function openNimDialog() {
loadNimCatalog();
el('#nim-dialog').showModal();
}
async function submitNim(e) {
e.preventDefault();
const body = {
image: el('#nim-image').value.trim(),
container: el('#nim-container').value.trim(),
port: parseInt(el('#nim-port').value, 10),
host: el('#nim-host').value,
kind: el('#nim-kind').value,
};
if (!body.image || !body.container || !body.port) {
alert('Image, container name, and port are required.');
return;
}
try {
const r = await fetchJSON('/api/nim/install', {
method: 'POST',
headers: { 'content-type': 'application/json' },
body: JSON.stringify(body),
});
el('#nim-dialog').close();
attachNimProgress(r.job_id);
} catch (e) {
alert('Install failed: ' + e.message);
}
}
function nimTimerStart(at) {
nimState.started_at = at;
if (nimState.timer) clearInterval(nimState.timer);
const tick = () => {
if (!nimState.started_at) return;
const sec = Math.max(0, Math.floor((Date.now() - nimState.started_at) / 1000));
const m = Math.floor(sec / 60);
const s = sec % 60;
el('#nim-prog-elapsed').textContent = `${m}:${s.toString().padStart(2, '0')}`;
};
tick();
nimState.timer = setInterval(tick, 500);
}
async function attachNimProgress(jobId) {
nimState.job_id = jobId;
el('#nim-prog-log').textContent = '';
el('#nim-prog-title').textContent = 'Installing…';
el('#nim-progress-dialog').showModal();
try {
const snap = await fetchJSON(`/api/nim/install/${jobId}`);
nimTimerStart(Date.parse(snap.started_at));
el('#nim-prog-phase').textContent = snap.phase || 'Working…';
el('#nim-prog-log').textContent = (snap.lines || []).join('\n');
if (snap.returncode !== null) { onNimDone(snap); return; }
} catch { nimTimerStart(Date.now()); }
const es = new EventSource(`/api/nim/install/${jobId}/stream`);
nimState.eventsource = es;
es.onmessage = ev => {
try {
const d = JSON.parse(ev.data);
if (d.line !== undefined) {
const log = el('#nim-prog-log');
log.textContent += d.line + '\n';
log.scrollTop = log.scrollHeight;
}
} catch {}
};
es.addEventListener('phase', ev => {
try { el('#nim-prog-phase').textContent = JSON.parse(ev.data).phase; } catch {}
});
es.addEventListener('done', ev => {
let d = {}; try { d = JSON.parse(ev.data); } catch {}
onNimDone(d);
});
es.onerror = () => { es.close(); nimState.eventsource = null; };
}
function onNimDone(d) {
if (nimState.eventsource) { nimState.eventsource.close(); nimState.eventsource = null; }
if (nimState.timer) { clearInterval(nimState.timer); nimState.timer = null; }
if (d.state === 'failed') {
el('#nim-prog-title').textContent = `Failed (rc=${d.returncode})`;
el('#nim-prog-phase').textContent = 'Failed';
} else {
el('#nim-prog-title').textContent = 'Installed';
el('#nim-prog-phase').textContent = 'Done ✓ — service will appear when the container reports healthy.';
}
pollStatus();
}
// ===================== Explain context (LLM commit summary) =====================
let explainEventSource = null;
@@ -1149,6 +1412,17 @@ async function init() {
el('#ub-apply').addEventListener('click', applyUpdate);
el('#ub-explain').addEventListener('click', explainContext);
el('#dl-repo').addEventListener('input', updateDlHfLink);
el('#open-nim').addEventListener('click', openNimDialog);
el('#nim-cancel').addEventListener('click', () => el('#nim-dialog').close());
el('#nim-form').addEventListener('submit', submitNim);
el('#nim-prog-close').addEventListener('click', () => el('#nim-progress-dialog').close());
el('#open-connectivity').addEventListener('click', openConnectivityDialog);
el('#connectivity-close').addEventListener('click', () => el('#connectivity-dialog').close());
// Wake-on-LAN buttons live on unreachable hardware cards; delegate.
el('#hardware-grid').addEventListener('click', (e) => {
const btn = e.target.closest('[data-wake]');
if (btn) wakeSpark(btn.dataset.wake);
});
setupCatalogDialog();
setupAdvancedDialog();
// Open WebUI link from /api/config
+74 -2
@@ -26,8 +26,22 @@
</section>
<section id="hardware-panel" class="hardware-panel hidden">
-<h2 class="section-title">Spark hardware</h2>
+<div class="section-header">
<h2 class="section-title">Spark hardware</h2>
<button id="open-connectivity" class="btn small-btn">Connectivity log</button>
</div>
<div id="hardware-grid" class="hardware-grid"></div>
<dialog id="connectivity-dialog" class="modal">
<form method="dialog" class="modal-form">
<h3>Spark connectivity history</h3>
<p class="muted small">Most recent up/down transitions per Spark. Tracked since this dashboard was installed.</p>
<div id="connectivity-content" class="connectivity-content"></div>
<div class="modal-actions">
<button type="button" id="connectivity-close" class="btn">Close</button>
</div>
</form>
</dialog>
</section>
<section id="endpoint-panel" class="endpoint-panel hidden">
@@ -76,8 +90,66 @@
</section>
<section id="services-panel" class="services hidden">
-<h2 class="section-title">Always-on services</h2>
+<div class="section-header">
<h2 class="section-title">Always-on services</h2>
<button id="open-nim" class="btn small-btn">+ Install NIM</button>
</div>
<div id="services-grid" class="services-grid"></div>
<dialog id="nim-dialog" class="modal">
<form method="dialog" class="modal-form" id="nim-form">
<h3>Install an NVIDIA NIM container</h3>
<p class="muted small" id="nim-key-warn"></p>
<p class="muted small">Pick a curated container below or paste any image from <a href="#" id="nim-catalog-link" target="_blank" rel="noopener">the NGC NIM catalog</a>. Spark Control will <code>docker pull</code> and <code>docker run</code> it on the target Spark.</p>
<div id="nim-suggested" class="nim-grid"></div>
<fieldset class="modal-fieldset">
<legend>Custom image</legend>
<label class="modal-row"><span>Image (nvcr.io/...)</span><input type="text" id="nim-image" placeholder="nvcr.io/nim/nvidia/<name>:latest"></label>
<label class="modal-row"><span>Container name</span><input type="text" id="nim-container" placeholder="my-service"></label>
<label class="modal-row"><span>Port</span><input type="number" id="nim-port" min="1" max="65535"></label>
<label class="modal-row"><span>Kind</span>
<select id="nim-kind">
<option value="nim">NIM (other)</option>
<option value="stt">STT (speech-to-text)</option>
<option value="tts">TTS (text-to-speech)</option>
<option value="vision">Vision</option>
<option value="embedding">Embedding</option>
</select>
</label>
<label class="modal-row"><span>Target Spark</span>
<select id="nim-host">
<option value="spark2">Spark 2 (default for support services)</option>
<option value="spark1">Spark 1 (head node)</option>
</select>
</label>
</fieldset>
<div class="modal-actions">
<button type="button" id="nim-cancel" class="btn">Cancel</button>
<button type="submit" class="btn primary" id="nim-start">Install</button>
</div>
</form>
</dialog>
<dialog id="nim-progress-dialog" class="modal">
<form method="dialog" class="modal-form">
<h3 id="nim-prog-title">Installing…</h3>
<div class="phase-row">
<div class="phase" id="nim-prog-phase">Starting…</div>
<span class="spacer"></span>
<span class="timer" id="nim-prog-elapsed">0:00</span>
</div>
<details open>
<summary class="muted small">Log</summary>
<pre id="nim-prog-log" class="log"></pre>
</details>
<div class="modal-actions">
<button type="button" id="nim-prog-close" class="btn">Close</button>
</div>
</form>
</dialog>
</section>
<section id="models-section">
+70
@@ -376,6 +376,45 @@ main {
.hw-card .head .meta { color: var(--muted); font-size: 12px; margin-left: auto; }
.hw-card.unreachable { border-color: rgba(239, 68, 68, 0.4); }
.hw-card.unreachable .name { color: var(--error); }
.hw-card.unreachable ol { color: var(--muted); }
.hw-card .wol-row {
margin-top: 8px;
display: flex;
align-items: center;
gap: 8px;
font-size: 12px;
color: var(--muted);
}
.hw-card .wol-row .btn { padding: 5px 10px; font-size: 12px; }
.hw-card .mac-display { font-family: ui-monospace, SFMono-Regular, Menlo, monospace; }
.connectivity-content {
max-height: 360px;
overflow-y: auto;
border: 1px solid var(--border);
border-radius: 6px;
padding: 10px;
background: var(--surface-2);
}
.conn-spark { margin-bottom: 16px; }
.conn-spark h4 { font-size: 13px; margin: 0 0 8px; color: var(--text); }
.conn-event {
font-size: 12px;
display: flex;
gap: 10px;
padding: 4px 0;
border-bottom: 1px solid rgba(255,255,255,0.04);
font-family: ui-monospace, SFMono-Regular, Menlo, monospace;
}
.conn-event:last-child { border-bottom: 0; }
.conn-event .when { color: var(--muted); flex-shrink: 0; }
.conn-event .what { flex: 1; }
.conn-event.up .what { color: var(--accent); }
.conn-event.down .what { color: var(--error); }
.conn-event.report .what { font-style: italic; }
.conn-event .muted { color: var(--muted); font-style: normal; }
.conn-event .dur { color: var(--muted); }
.conn-summary { color: var(--muted); font-size: 11px; padding: 4px 0 10px; }
.hw-metric { display: flex; align-items: center; gap: 10px; font-size: 12px; }
.hw-metric .label { color: var(--muted); width: 56px; flex-shrink: 0; text-transform: uppercase; letter-spacing: 0.05em; font-size: 11px; }
.hw-metric .bar { flex: 1; height: 8px; background: var(--surface-2); border-radius: 4px; overflow: hidden; position: relative; }
@@ -477,6 +516,37 @@ main {
#dl-log-details { margin-top: 12px; }
#dl-log-details summary { cursor: pointer; padding: 4px 0; }
/* ===== NIM install dialog ===== */
.modal#nim-dialog,
.modal#nim-progress-dialog { max-width: 640px; }
.nim-grid {
display: grid;
gap: 8px;
grid-template-columns: 1fr;
max-height: 240px;
overflow-y: auto;
margin-bottom: 4px;
}
.nim-card {
background: var(--surface-2);
border: 1px solid var(--border);
border-radius: 6px;
padding: 10px 12px;
display: flex;
gap: 10px;
align-items: flex-start;
}
.nim-card .info { flex: 1; }
.nim-card .name { font-weight: 600; font-size: 13px; }
.nim-card .desc { color: var(--muted); font-size: 12px; margin-top: 4px; }
.nim-card .img { font-family: ui-monospace, SFMono-Regular, Menlo, monospace; color: #6b6b75; font-size: 11px; margin-top: 4px; word-break: break-all; }
.nim-card .btn { padding: 6px 12px; font-size: 12px; flex-shrink: 0; }
.nim-card .links { font-size: 11px; margin-top: 4px; }
.nim-card .links a { color: var(--info); text-decoration: none; }
.nim-card .links a:hover { text-decoration: underline; }
.nim-key-warn { color: var(--warn); }
/* ===== Section titles ===== */
.section-title {
+69
@@ -0,0 +1,69 @@
"""Wake-on-LAN.

Two delivery paths, tried in order:
  1. SSH into the other Spark and have IT broadcast — most reliable because the
     packet originates from the same LAN subnet as the sleeping Spark.
  2. Direct UDP broadcast from this container. May or may not work depending
     on the StartOS container's network namespace.

The DGX Spark's NIC must have WoL enabled in firmware/OS for either path to
actually wake the box; this module just delivers the magic packet correctly.
"""
from __future__ import annotations

import asyncio
import re
import socket

from .config import Settings
from .ssh import ssh_run

_MAC_RE = re.compile(r"^[0-9a-fA-F]{2}([:-]?[0-9a-fA-F]{2}){5}$")


def normalize_mac(mac: str) -> str:
    mac = mac.strip().lower()
    if not _MAC_RE.match(mac):
        raise ValueError(f"invalid MAC address: {mac!r}")
    return mac.replace("-", ":")


def build_magic_packet(mac: str) -> bytes:
    mac_bytes = bytes.fromhex(normalize_mac(mac).replace(":", ""))
    return b"\xff" * 6 + mac_bytes * 16


def send_local_broadcast(mac: str, broadcast: str = "255.255.255.255", port: int = 9) -> None:
    """Send from THIS container. May not reach the LAN in some topologies."""
    pkt = build_magic_packet(mac)
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        s.sendto(pkt, (broadcast, port))
        # Also send to port 7 (alternate WoL convention) for safety
        s.sendto(pkt, (broadcast, 7))
    finally:
        s.close()


async def send_via_peer(host: str, user: str, mac: str, settings: Settings) -> tuple[bool, str]:
    """Use a different (reachable) Spark to send the WoL packet to its peer.

    Uses Python 3 (always present on the Sparks for vLLM) to avoid depending on
    wakeonlan / etherwake being installed.
    """
    normalized = normalize_mac(mac)
    mac_hex = normalized.replace(":", "")
    py = (
        "python3 -c \""
        "import socket; "
        f"m=bytes.fromhex('{mac_hex}'); "
        "s=socket.socket(socket.AF_INET, socket.SOCK_DGRAM); "
        "s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1); "
        "s.sendto(b'\\xff'*6 + m*16, ('255.255.255.255', 9)); "
        "s.sendto(b'\\xff'*6 + m*16, ('255.255.255.255', 7)); "
        "print('sent')\""
    )
    rc, out, err = await ssh_run(host, user, py, settings, timeout=8)
    return rc == 0 and "sent" in out, (err.strip() or out.strip() or f"rc={rc}")
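
As a quick check of the packet layout used by both delivery paths, the sketch below rebuilds the magic packet independently of this module: 6 bytes of 0xFF followed by the 6-byte MAC repeated 16 times, for 102 bytes in total. The function name `magic_packet` and the sample MAC are illustrative, and the validation is deliberately simpler than `normalize_mac` above.

```python
def magic_packet(mac: str) -> bytes:
    """Build a Wake-on-LAN magic packet for a colon- or dash-separated MAC."""
    hex_digits = mac.strip().lower().replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError(f"invalid MAC address: {mac!r}")
    mac_bytes = bytes.fromhex(hex_digits)  # raises ValueError on non-hex input
    # Synchronization stream (6 x 0xFF) followed by 16 copies of the MAC.
    return b"\xff" * 6 + mac_bytes * 16


pkt = magic_packet("AA-BB-CC-DD-EE-FF")
```

Nothing is sent on the network here; the block only confirms the byte layout that `send_local_broadcast` and the SSH one-liner both rely on.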
@@ -85,6 +85,15 @@ const inputSpec = InputSpec.of({
placeholder: 'e.g. https://open-webui.yourserver.local',
masked: false,
}),
ngc_api_key: Value.text({
name: 'NGC API key (optional)',
description:
'NVIDIA NGC personal API key — needed to install NIM containers (Parakeet, Magpie, etc.) from nvcr.io. Get one free at https://ngc.nvidia.com/setup/personal-key. Stored only on this Start9 server; passed to docker as the NGC_API_KEY env var when installing NIM services.',
required: false,
default: null,
placeholder: 'starts with "nvapi-..."',
masked: true,
}),
})
export const configureSparks = sdk.Action.withInput(
@@ -16,6 +16,8 @@ export const sparkConfigSchema = z.object({
magpie_container: z.string().catch(''),
// Optional Open WebUI deep-link
open_webui_url: z.string().catch(''),
// Optional NGC API key for pulling NIM containers from nvcr.io/nim/...
ngc_api_key: z.string().catch(''),
})
export type SparkConfig = z.infer<typeof sparkConfigSchema>
+4
@@ -20,6 +20,7 @@ export const main = sdk.setupMain(async ({ effects }) => {
magpie_user: '',
magpie_container: '',
open_webui_url: '',
ngc_api_key: '',
}
return sdk.Daemons.of(effects).addDaemon('primary', {
@@ -48,7 +49,10 @@ export const main = sdk.setupMain(async ({ effects }) => {
MAGPIE_USER: cfg.magpie_user,
MAGPIE_CONTAINER: cfg.magpie_container,
MODELS_OVERRIDES: '/data/models-overrides.yaml',
SERVICES_OVERRIDES: '/data/services-overrides.yaml',
CONNECTIVITY_LOG: '/data/connectivity.json',
OPEN_WEBUI_URL: cfg.open_webui_url,
NGC_API_KEY: cfg.ngc_api_key,
BIND_PORT: String(uiPort),
},
},
+2 -2
@@ -1,10 +1,10 @@
import { VersionInfo, IMPOSSIBLE } from '@start9labs/start-sdk'
export const v0_1_0 = VersionInfo.of({
-version: '0.3.0:1',
+version: '0.6.0:0',
releaseNotes: {
en_US:
-'v0.3: Spark hardware dashboard (RAM, disk, GPU memory + utilization, CPU load, uptime per Spark). Per-model Advanced settings now show plain-English hints tied to your actual GPU memory (e.g. "0.85 GPU util leaves ~18 GB free"). "Explain context" button on the update banner asks the loaded LLM to summarize pending commits in plain English. Optional Open WebUI URL in Configure Sparks shows a one-click "Open chat" button in the top bar. Downloads can now target Spark 1, Spark 2, or both. Each model card links out to its Hugging Face page.',
+'v0.6: Service-level connectivity tracking and a passive failure-report endpoint. The connectivity log now records up/down transitions for Parakeet, Magpie, and vLLM in addition to the Spark hosts (driven by the existing /api/status and /api/services polling). A new POST /api/health-event endpoint lets external apps (e.g. Open WebUI) record failures they observed even when the failure was brief enough to slip between polls. The Connectivity log dialog shows hosts and services with separate badges, and reports appear inline with their source app + error detail.',
},
migrations: {
up: async ({ effects }) => {},