v0.4.0 - NIM installer + dashboard resilience
Hotfix (was v0.3.1):
- services.py: cache 'unreachable' per (host,user) for 25s so a dead Spark doesn't hang every /api/services call behind 6s ssh timeout
- ssh_run timeout reduced 10 -> 6s for docker_state probes
- hardware probe: shorter SSH timeout (6s), longer cache TTL for failures (25s)
- JS pollStatus retries loadModels() if state.models is empty (recovers from cold-start proxy timeout)
- Unreachable hardware card now includes troubleshooting steps (Spark Control cannot SSH into an unreachable Spark to restart it)
v0.4 NIM installer:
- nim.py module: curated SUGGESTED_NIMS list (Parakeet, Magpie, Riva) + NimManager that runs docker login nvcr.io + docker pull + docker run -d --gpus all -p PORT:PORT -v VOL:/opt/nim/.cache -e NGC_API_KEY -e ... --restart=unless-stopped + chown the volume to uid 1000 + restart. Streams all output via SSE; redacts the API key from log lines.
- custom_services.py: persists installed NIMs to /data/services-overrides.yaml so they appear in the services panel after install
- services.py: merges custom services into the panel
- /api/nim/catalog GET, /api/nim/install POST + GET/SSE
- /api/services/{name} DELETE for custom services
- UI: '+ Install NIM' button next to 'Always-on services'; modal lists curated images each with a 'Pick' button + a custom-image form; installation runs in a second dialog with phase + elapsed timer + collapsible log
- NGC API key field added to Configure Sparks (masked); injected as NGC_API_KEY env var into the container
Package: bump 0.4.0:0; main.ts adds SERVICES_OVERRIDES + NGC_API_KEY env vars
This commit is contained in:
@@ -85,6 +85,15 @@ const inputSpec = InputSpec.of({
|
||||
placeholder: 'e.g. https://open-webui.yourserver.local',
|
||||
masked: false,
|
||||
}),
|
||||
ngc_api_key: Value.text({
|
||||
name: 'NGC API key (optional)',
|
||||
description:
|
||||
'NVIDIA NGC personal API key — needed to install NIM containers (Parakeet, Magpie, etc.) from nvcr.io. Get one free at https://ngc.nvidia.com/setup/personal-key. Stored only on this Start9 server; passed to docker as the NGC_API_KEY env var when installing NIM services.',
|
||||
required: false,
|
||||
default: null,
|
||||
placeholder: 'starts with "nvapi-..."',
|
||||
masked: true,
|
||||
}),
|
||||
})
|
||||
|
||||
export const configureSparks = sdk.Action.withInput(
|
||||
|
||||
@@ -16,6 +16,8 @@ export const sparkConfigSchema = z.object({
|
||||
magpie_container: z.string().catch(''),
|
||||
// Optional Open WebUI deep-link
|
||||
open_webui_url: z.string().catch(''),
|
||||
// Optional NGC API key for pulling NIM containers from nvcr.io/nim/...
|
||||
ngc_api_key: z.string().catch(''),
|
||||
})
|
||||
|
||||
export type SparkConfig = z.infer<typeof sparkConfigSchema>
|
||||
|
||||
@@ -20,6 +20,7 @@ export const main = sdk.setupMain(async ({ effects }) => {
|
||||
magpie_user: '',
|
||||
magpie_container: '',
|
||||
open_webui_url: '',
|
||||
ngc_api_key: '',
|
||||
}
|
||||
|
||||
return sdk.Daemons.of(effects).addDaemon('primary', {
|
||||
@@ -48,7 +49,9 @@ export const main = sdk.setupMain(async ({ effects }) => {
|
||||
MAGPIE_USER: cfg.magpie_user,
|
||||
MAGPIE_CONTAINER: cfg.magpie_container,
|
||||
MODELS_OVERRIDES: '/data/models-overrides.yaml',
|
||||
SERVICES_OVERRIDES: '/data/services-overrides.yaml',
|
||||
OPEN_WEBUI_URL: cfg.open_webui_url,
|
||||
NGC_API_KEY: cfg.ngc_api_key,
|
||||
BIND_PORT: String(uiPort),
|
||||
},
|
||||
},
|
||||
|
||||
@@ -1,10 +1,10 @@
|
||||
import { VersionInfo, IMPOSSIBLE } from '@start9labs/start-sdk'
|
||||
|
||||
export const v0_1_0 = VersionInfo.of({
|
||||
version: '0.3.0:1',
|
||||
version: '0.4.0:0',
|
||||
releaseNotes: {
|
||||
en_US:
|
||||
'v0.3: Spark hardware dashboard (RAM, disk, GPU memory + utilization, CPU load, uptime per Spark). Per-model Advanced settings now show plain-English hints tied to your actual GPU memory (e.g. "0.85 GPU util leaves ~18 GB free"). "Explain context" button on the update banner asks the loaded LLM to summarize pending commits in plain English. Optional Open WebUI URL in Configure Sparks shows a one-click "Open chat" button in the top bar. Downloads can now target Spark 1, Spark 2, or both. Each model card links out to its Hugging Face page.',
|
||||
'v0.4: install NIM containers from the dashboard. New "+ Install NIM" button next to the services panel shows a curated catalog (Parakeet, Magpie, Riva...) plus a free-form image field. Streams docker pull + docker run output with phase + elapsed timer; persists installed services to /data/services-overrides.yaml so they show up in the services panel after install. Configure Sparks now has an NGC API key field (masked) needed for nvcr.io. v0.3.1 hotfix bundled in: hardware/services SSH timeouts shortened (6 s) and failures cached for 25 s so an unreachable Spark doesn\'t hang the whole dashboard. Hardware card for an unreachable Spark now includes troubleshooting steps.',
|
||||
},
|
||||
migrations: {
|
||||
up: async ({ effects }) => {},
|
||||
|
||||
Reference in New Issue
Block a user