v0.4.0 - NIM installer + dashboard resilience

Hotfix (was v0.3.1):
- services.py: cache 'unreachable' per (host, user) for 25s so a dead Spark doesn't make every /api/services call wait out the 6s SSH timeout
- ssh_run timeout reduced from 10s to 6s for docker_state probes
- hardware probe: shorter SSH timeout (6s), longer cache TTL for failures (25s)
- JS pollStatus retries loadModels() if state.models is empty (recovers from cold-start proxy timeout)
- Unreachable hardware card now includes troubleshooting steps (Spark Control cannot SSH into an unreachable Spark to restart it)
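
The negative cache described in the hotfix notes could look roughly like the sketch below. It is not the code from services.py: the names UnreachableCache and probe_with_cache are hypothetical, and the probe is assumed to raise TimeoutError or OSError when the host cannot be reached. Only the 25s TTL and the (host, user) key come from the notes above.

```python
import time
from typing import Callable, Dict, Optional, Tuple

UNREACHABLE_TTL_S = 25.0  # TTL taken from the commit message


class UnreachableCache:
    """Remembers (host, user) pairs that recently failed to answer (hypothetical helper)."""

    def __init__(self, ttl_s: float = UNREACHABLE_TTL_S,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl_s = ttl_s
        self._clock = clock  # injectable so the expiry logic can be tested
        self._marked_at: Dict[Tuple[str, str], float] = {}

    def mark(self, host: str, user: str) -> None:
        """Record that this (host, user) pair just failed a probe."""
        self._marked_at[(host, user)] = self._clock()

    def is_unreachable(self, host: str, user: str) -> bool:
        """True while the failure is younger than the TTL; expired entries are dropped."""
        marked = self._marked_at.get((host, user))
        if marked is None:
            return False
        if self._clock() - marked >= self._ttl_s:
            del self._marked_at[(host, user)]
            return False
        return True


def probe_with_cache(cache: UnreachableCache, host: str, user: str,
                     probe: Callable[[str, str], str]) -> Optional[str]:
    """Run the probe unless the host is cached as unreachable; None means 'unreachable'."""
    if cache.is_unreachable(host, user):
        return None  # skip the slow SSH attempt entirely
    try:
        return probe(host, user)
    except (TimeoutError, OSError):  # assumption: the probe signals failure this way
        cache.mark(host, user)
        return None
```

With this shape, only the first request after a Spark goes down pays the SSH timeout; later requests inside the 25s window return immediately.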

v0.4 NIM installer:
- nim.py module: curated SUGGESTED_NIMS list (Parakeet, Magpie, Riva) plus a NimManager that runs, in order: docker login nvcr.io; docker pull; docker run -d --gpus all -p PORT:PORT -v VOL:/opt/nim/.cache -e NGC_API_KEY -e ... --restart=unless-stopped; chown of the volume to uid 1000; and a restart. All output is streamed via SSE, with the API key redacted from log lines.
- custom_services.py: persists installed NIMs to /data/services-overrides.yaml so they appear in the services panel after install
- services.py: merges custom services into the panel
- /api/nim/catalog GET, /api/nim/install POST + GET/SSE
- /api/services/{name} DELETE for custom services
- UI: '+ Install NIM' button next to 'Always-on services'; modal lists curated images each with a 'Pick' button + a custom-image form; installation runs in a second dialog with phase + elapsed timer + collapsible log
- NGC API key field added to Configure Sparks (masked); injected as NGC_API_KEY env var into the container
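
As an illustration of the docker run step and the log redaction listed above, here is a minimal sketch. It is not the NimManager code: build_docker_run, to_shell, and redact are hypothetical names, the --name flag is an assumption (the notes do not say how the container is named), and the elided "-e ..." variables are left out. Passing -e NGC_API_KEY without a value makes docker copy the variable from the calling environment, which keeps the key out of the command line.

```python
import shlex
from typing import List


def build_docker_run(image: str, container: str, port: int, volume: str) -> List[str]:
    """Assemble the docker run argv using the flags named in the commit message."""
    return [
        "docker", "run", "-d",
        "--name", container,                # assumption: not stated in the notes
        "--gpus", "all",
        "-p", f"{port}:{port}",
        "-v", f"{volume}:/opt/nim/.cache",
        "-e", "NGC_API_KEY",                # value is inherited from the environment
        "--restart=unless-stopped",
        image,
    ]


def to_shell(argv: List[str]) -> str:
    """Quote the argv into a single string, e.g. for sending over SSH."""
    return shlex.join(argv)


def redact(line: str, secret: str) -> str:
    """Replace every occurrence of the secret in a log line before it is streamed."""
    if not secret:
        return line  # nothing to hide; also avoids str.replace("") inserting markers
    return line.replace(secret, "***")
```

Usage, with a placeholder image name and key: to_shell(build_docker_run("nvcr.io/nim/nvidia/example:latest", "my-service", 9000, "my-service-cache")) yields the command string, and each output line would pass through redact(line, api_key) before being emitted as an SSE event.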

Package: bump 0.4.0:0; main.ts adds SERVICES_OVERRIDES + NGC_API_KEY env vars
Grant
2026-05-12 12:32:29 -05:00
parent e88fdcfde4
commit 1889ab45fb
13 changed files with 690 additions and 10 deletions
+59 -1
@@ -76,8 +76,66 @@
</section>
<section id="services-panel" class="services hidden">
<h2 class="section-title">Always-on services</h2>
<div class="section-header">
<h2 class="section-title">Always-on services</h2>
<button id="open-nim" class="btn small-btn">+ Install NIM</button>
</div>
<div id="services-grid" class="services-grid"></div>
<dialog id="nim-dialog" class="modal">
<form method="dialog" class="modal-form" id="nim-form">
<h3>Install an NVIDIA NIM container</h3>
<p class="muted small" id="nim-key-warn"></p>
<p class="muted small">Pick a curated container below or paste any image from <a href="#" id="nim-catalog-link" target="_blank" rel="noopener">the NGC NIM catalog</a>. Spark Control will <code>docker pull</code> and <code>docker run</code> it on the target Spark.</p>
<div id="nim-suggested" class="nim-grid"></div>
<fieldset class="modal-fieldset">
<legend>Custom image</legend>
<label class="modal-row"><span>Image (nvcr.io/...)</span><input type="text" id="nim-image" placeholder="nvcr.io/nim/nvidia/<name>:latest"></label>
<label class="modal-row"><span>Container name</span><input type="text" id="nim-container" placeholder="my-service"></label>
<label class="modal-row"><span>Port</span><input type="number" id="nim-port" min="1" max="65535"></label>
<label class="modal-row"><span>Kind</span>
<select id="nim-kind">
<option value="nim">NIM (other)</option>
<option value="stt">STT (speech-to-text)</option>
<option value="tts">TTS (text-to-speech)</option>
<option value="vision">Vision</option>
<option value="embedding">Embedding</option>
</select>
</label>
<label class="modal-row"><span>Target Spark</span>
<select id="nim-host">
<option value="spark2">Spark 2 (default for support services)</option>
<option value="spark1">Spark 1 (head node)</option>
</select>
</label>
</fieldset>
<div class="modal-actions">
<button type="button" id="nim-cancel" class="btn">Cancel</button>
<button type="submit" class="btn primary" id="nim-start">Install</button>
</div>
</form>
</dialog>
<dialog id="nim-progress-dialog" class="modal">
<form method="dialog" class="modal-form">
<h3 id="nim-prog-title">Installing…</h3>
<div class="phase-row">
<div class="phase" id="nim-prog-phase">Starting…</div>
<span class="spacer"></span>
<span class="timer" id="nim-prog-elapsed">0:00</span>
</div>
<details open>
<summary class="muted small">Log</summary>
<pre id="nim-prog-log" class="log"></pre>
</details>
<div class="modal-actions">
<button type="button" id="nim-prog-close" class="btn">Close</button>
</div>
</form>
</dialog>
</section>
<section id="models-section">