1889ab45fb
Hotfix (was v0.3.1):
- services.py: cache 'unreachable' per (host,user) for 25s so a dead Spark doesn't hang every /api/services call behind a 6s SSH timeout
- ssh_run timeout reduced 10s -> 6s for docker_state probes
- hardware probe: shorter SSH timeout (6s), longer cache TTL for failures (25s)
- JS pollStatus retries loadModels() if state.models is empty (recovers from cold-start proxy timeout)
- Unreachable hardware card now includes troubleshooting steps (Spark Control cannot SSH into an unreachable Spark to restart it)
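The 'unreachable' cache from the hotfix can be pictured as a small negative-result cache keyed by (host, user). The sketch below is illustrative only and is not taken from services.py: the class name `UnreachableCache`, its methods, and the injectable `clock` parameter are all assumptions; only the 25s TTL comes from the notes above.

```python
from __future__ import annotations

import time

UNREACHABLE_TTL = 25.0  # seconds; the TTL named in the hotfix notes


class UnreachableCache:
    """Hypothetical helper: remembers (host, user) pairs whose SSH probe failed."""

    def __init__(self, ttl: float = UNREACHABLE_TTL, clock=time.monotonic) -> None:
        self._ttl = ttl
        self._clock = clock  # injectable so expiry can be exercised without sleeping
        self._failed_at: dict[tuple[str, str], float] = {}

    def mark_unreachable(self, host: str, user: str) -> None:
        # Record when the probe failed for this (host, user) pair.
        self._failed_at[(host, user)] = self._clock()

    def is_unreachable(self, host: str, user: str) -> bool:
        stamp = self._failed_at.get((host, user))
        if stamp is None:
            return False
        if self._clock() - stamp >= self._ttl:
            # Entry expired: forget it so the next call probes the host again.
            del self._failed_at[(host, user)]
            return False
        return True
```

A caller would check `is_unreachable(host, user)` before attempting SSH and return the cached 'unreachable' state immediately, so only one request per TTL window pays the SSH timeout.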
v0.4 NIM installer:
- nim.py module: curated SUGGESTED_NIMS list (Parakeet, Magpie, Riva) + NimManager, which runs docker login nvcr.io, docker pull, and docker run -d --gpus all -p PORT:PORT -v VOL:/opt/nim/.cache -e NGC_API_KEY -e ... --restart=unless-stopped, followed by a chown of the volume to uid 1000 and a restart. Streams all output via SSE; redacts the API key from log lines.
- custom_services.py: persists installed NIMs to /data/services-overrides.yaml so they appear in the services panel after install
- services.py: merges custom services into the panel
- /api/nim/catalog GET, /api/nim/install POST + GET/SSE
- /api/services/{name} DELETE for custom services
- UI: '+ Install NIM' button next to 'Always-on services'; modal lists curated images each with a 'Pick' button + a custom-image form; installation runs in a second dialog with phase + elapsed timer + collapsible log
- NGC API key field added to Configure Sparks (masked); injected as NGC_API_KEY env var into the container
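As a rough illustration of the "redacts the API key from log lines" and SSE streaming points above, the following sketch masks a secret and frames each line as a server-sent event. It is not the code in nim.py: the function names `redact_secret` and `sse_event` and the `***` mask are assumptions made for the example.

```python
def redact_secret(line: str, secret: str, mask: str = "***") -> str:
    """Return `line` with every occurrence of `secret` replaced by `mask`."""
    if not secret:
        # str.replace with an empty pattern would insert the mask between
        # every character, so leave the line untouched.
        return line
    return line.replace(secret, mask)


def sse_event(line: str, secret: str) -> str:
    """Redact `line`, then frame it as one server-sent event."""
    safe = redact_secret(line.rstrip("\n"), secret)
    # Each payload line gets its own "data:" field; a blank line ends the event.
    body = "\n".join(f"data: {part}" for part in safe.split("\n"))
    return body + "\n\n"
```

Redacting before framing means the key never reaches the browser, even if docker echoes it back in an error message.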
Package: bump to 0.4.0:0; main.ts adds SERVICES_OVERRIDES + NGC_API_KEY env vars
60 lines
1.5 KiB
Python
"""User-installed services persist in /data/services-overrides.yaml.

Format:
  custom:
    - key: my-riva
      kind: stt
      host: <spark-2-ip>
      user: <spark-user>
      container: riva-asr
      port: 8001
      health_path: /health
      image: nvcr.io/nim/nvidia/riva-multilingual:latest
"""
from __future__ import annotations

import os
from pathlib import Path

import yaml


def _path() -> str:
    # SERVICES_OVERRIDES overrides the default location.
    return os.environ.get("SERVICES_OVERRIDES", "/data/services-overrides.yaml")


def load_custom_services() -> list[dict]:
    # A missing file means nothing has been installed yet.
    try:
        with open(_path()) as f:
            data = yaml.safe_load(f) or {}
    except FileNotFoundError:
        return []
    return data.get("custom") or []


def add_custom_service(entry: dict) -> None:
    p = _path()
    Path(p).parent.mkdir(parents=True, exist_ok=True)
    data: dict = {}
    try:
        with open(p) as f:
            data = yaml.safe_load(f) or {}
    except FileNotFoundError:
        pass
    custom = data.get("custom") or []
    # Drop any existing entry with the same key so a reinstall replaces it.
    custom = [c for c in custom if c.get("key") != entry["key"]]
    custom.append(entry)
    data["custom"] = custom
    with open(p, "w") as f:
        yaml.safe_dump(data, f, sort_keys=False)


def delete_custom_service(key: str) -> None:
    p = _path()
    try:
        with open(p) as f:
            data = yaml.safe_load(f) or {}
    except FileNotFoundError:
        return  # nothing persisted, nothing to delete
    data["custom"] = [c for c in (data.get("custom") or []) if c.get("key") != key]
    with open(p, "w") as f:
        yaml.safe_dump(data, f, sort_keys=False)
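The commit notes say services.py merges these persisted entries into the services panel. One way that merge might look is sketched below, under stated assumptions: the `merge_services` name, the `"custom": True` marker, and the rule that a custom entry replaces a built-in one with the same key are all hypothetical and not confirmed by the source.

```python
from __future__ import annotations


def merge_services(builtin: list[dict], custom: list[dict]) -> list[dict]:
    """Hypothetical merge: combine built-in services with user-installed ones."""
    merged: dict[str, dict] = {}
    for svc in builtin:
        merged[svc["key"]] = dict(svc)  # copy so the caller's dicts stay untouched
    for svc in custom:
        # Assumption: a custom entry wins over a built-in one with the same key,
        # and is tagged so the UI can offer a delete action only for custom services.
        merged[svc["key"]] = {**svc, "custom": True}
    # Dicts keep insertion order, so built-ins come first and a replaced key
    # stays in its original position.
    return list(merged.values())
```

The `custom` argument would be whatever `load_custom_services()` returns, so an empty or missing overrides file simply leaves the built-in list unchanged.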