spark-control

Author	SHA1	Message	Date
Grant	5827683a09	v0.6.0:1 - fix Qwen3.6 Mamba block-size assertion at launch. vLLM trips on launching Qwen3.6-35B-A3B-NVFP4 with: AssertionError: In Mamba cache align mode, block_size (2096) must be <= max_num_batched_tokens (2048). Qwen3.6 uses a Mamba-attention hybrid. The default --max-num-batched-tokens of 2048 is just under the model's required block_size of 2096. The upstream sibling recipe (qwen3.5-35b-a3b-fp8.yaml) sets it to 16384; use the same value. Earlier qwen36 swaps in this session worked, apparently because vLLM never reached the Mamba-validation code path on those launches (a different attention backend pick or an auto-retry). Whatever the reason, setting the flag explicitly avoids the problem. Also documented in known-issues.md.	2026-05-12 13:22:24 -05:00
Grant	ee8c2406b8	v0.6.0 - Service-level connectivity tracking + passive failure-report endpoint connectivity.py: - Generalized 'spark' subject to any string; renamed 'spark' field to 'subject' - Legacy v0.5 events with the old 'spark' field are migrated transparently on read (kind defaults to 'transition') - New record_report(subject, ok, source, detail, latency_ms): always appends an event with kind='report'; does NOT mutate the current state (only active polling is authoritative) - summary() returns events normalized to the new schema Wiring: - /api/status now calls record_state for vllm/parakeet/magpie (dedup on no-change) - /api/services calls record_state for each service after its http check - Result: dashboard observes service-level transitions automatically with no extra polling Passive endpoint: - POST /api/health-event with {service, ok, source?, error?, ms?} - Useful for external apps (e.g. Open WebUI) to surface sub-poll-interval failures the dashboard would otherwise miss UI: - Connectivity dialog groups events by subject (hosts ordered first, then services) - Per-subject summary shows transition count, down count, report count, failed-report count - Transitions and reports render inline with distinct styling; reports show source app + error + latency - Legacy v0.5 events render unchanged Docs: - README documents /api/health-event with a curl example Package: bump to 0.6.0:0	2026-05-12 13:19:27 -05:00
Grant	a02f4db850	v0.5.0 - Wake-on-LAN + connectivity history wol.py: - build_magic_packet(): standard 6x0xFF + 16x MAC layout - send_local_broadcast(): direct from container (ports 9 + 7 for safety) - send_via_peer(): preferred path; SSHes to the OTHER Spark and runs a Python one-liner there so the packet originates on the target's LAN segment (most reliable) - MAC validation + normalization connectivity.py: - /data/connectivity.json persistence (thread-safe, atomic rename) - Stores per-Spark current state + last_change timestamp + rolling 200-event log - Records up/down transitions; computes down_seconds / up_seconds durations - MAC cache populated lazily during hardware probes hardware.py: - Probe now reads MAC via /sys/class/net/<default-route-iface>/address - After each probe, record_state() emits a transition event if state changed - record_mac() caches the address so WoL works when the Spark next goes down Endpoints: - GET /api/connectivity: macs, current state, last_change, events[] - POST /api/spark/{name}/wake: tries via-peer first, falls back to direct broadcast UI: - Unreachable hardware card shows the cached MAC + 'Wake (WoL)' button (only if MAC known) - New 'Connectivity log' button opens a modal with per-Spark transition history (last 25 each), including duration of each prior up/down period - pollHardware also pulls /api/connectivity so WoL buttons appear without an extra fetch Package: bump 0.5.0:0; main.ts sets CONNECTIVITY_LOG=/data/connectivity.json	2026-05-12 12:51:49 -05:00
Grant	1889ab45fb	v0.4.0 - NIM installer + dashboard resilience Hotfix (was v0.3.1): - services.py: cache 'unreachable' per (host,user) for 25s so a dead Spark doesn't hang every /api/services call behind 6s ssh timeout - ssh_run timeout reduced 10 -> 6s for docker_state probes - hardware probe: shorter SSH timeout (6s), longer cache TTL for failures (25s) - JS pollStatus retries loadModels() if state.models is empty (recovers from cold-start proxy timeout) - Unreachable hardware card now includes troubleshooting steps (Spark Control cannot SSH into an unreachable Spark to restart it) v0.4 NIM installer: - nim.py module: curated SUGGESTED_NIMS list (Parakeet, Magpie, Riva) + NimManager that runs docker login nvcr.io + docker pull + docker run -d --gpus all -p PORT:PORT -v VOL:/opt/nim/.cache -e NGC_API_KEY -e ... --restart=unless-stopped + chown the volume to uid 1000 + restart. Streams all output via SSE; redacts the API key from log lines. - custom_services.py: persists installed NIMs to /data/services-overrides.yaml so they appear in the services panel after install - services.py: merges custom services into the panel - /api/nim/catalog GET, /api/nim/install POST + GET/SSE - /api/services/{name} DELETE for custom services - UI: '+ Install NIM' button next to 'Always-on services'; modal lists curated images each with a 'Pick' button + a custom-image form; installation runs in a second dialog with phase + elapsed timer + collapsible log - NGC API key field added to Configure Sparks (masked); injected as NGC_API_KEY env var into the container Package: bump 0.4.0:0; main.ts adds SERVICES_OVERRIDES + NGC_API_KEY env vars	2026-05-12 12:32:29 -05:00
Grant	e88fdcfde4	v0.3.0:1 - hotfix: parallel SSH probes + longer timeout - Hardware probes for spark1 and spark2 now run via asyncio.gather (parallel) so the worst-case wall time is max(per-probe), not sum - Bump per-probe SSH timeout from 8s to 12s to absorb first-call overhead (StrictHostKeyChecking=accept-new on first connect + nvidia-smi cold start) - Unreachable Spark now shows up cleanly in the UI as a single 'unreachable' card with the error message	2026-05-12 12:14:36 -05:00
Grant	64ce0fca10	v0.3.0 - Hardware dashboard + knob context + Explain context + Open WebUI link Hardware dashboard: - New hardware.py module: SSH probes each Spark for hostname, uptime, load+cores, RAM, disk, GPU (name, util, temp, power) + per-process GPU memory sum - DGX Spark uses unified memory (nvidia-smi memory.total returns N/A); fall back to per-process compute memory and compute fraction against system RAM. Marks with gpu_unified_memory=true. - 4s TTL cache in HardwareProbe to avoid hammering - /api/hardware returns per-Spark snapshot - UI: 'Spark hardware' section at the top with per-Spark cards (CPU load, RAM, GPU mem (unified), GPU util + temp + power, disk) — bars with warn threshold styling - Polls every 8s Knob context (tied to live hardware): - Each Advanced knob now shows plain-English help text - 'GPU memory %' shows '~N GB allocated · ~M GB left for OS/buffers' computed from actual Spark RAM - 'Max context' shows '~N pages of text' - Toggles show tradeoff descriptions Explain context: - '✨ Explain context' button on the update banner - /api/explain-updates POST: forwards pending commits to the loaded vLLM model and streams its response back as SSE - Renders into an expandable 'Explained by the loaded LLM' section under Pending commits - Reasoning tokens shown italicized when the model emits them Open WebUI integration: - New 'Open WebUI URL' optional field in Configure Sparks - /api/config exposes it; UI shows 'Open chat ↗' button in the top bar if set Downloads: - Third radio option: Spark 1 only / Spark 2 only / Both Sparks - Backend picks SSH target based on mode - HF repo link icon next to the input - Helper line about NVFP4 for Blackwell Model cards: - Repo name is now a clickable link to its Hugging Face page Package: bump 0.3.0:0	2026-05-12 12:00:15 -05:00
Grant	c6da6b0784	v0.2.4 - Hotfix: Unknown status + copy UX + update banner context Bug fix: - config.py: empty PARAKEET_CONTAINER / MAGPIE_CONTAINER env vars (from migrating to v0.2.0+ where the field is optional and saved as '') now fall back to 'parakeet-asr' / 'magpie-tts' via the 'or' idiom. Confirmed live: services classify as 'running' instead of 'unknown'. UX: - Replaced text 'Copy' buttons with compact icon buttons (clipboard SVG) - Endpoint Base URL + Model ID + curl snippet are now click-to-copy themselves (the value AND a separate icon button) - Service cards: host, base URL, and model are now three separate copyable rows - Update banner: leading explanatory line — 'Updates to eugr/spark-vllm-docker — the upstream project that orchestrates vLLM on your Sparks. These are not firmware, OS, or model updates.' with a link to the repo.	2026-05-12 11:45:55 -05:00
Grant	75fd0846b4	v0.2.3 - Per-model Advanced settings + catalog-add for downloaded models Backend: - overrides.py: read/write /data/models-overrides.yaml (knobs + custom entries) - apply_knobs_to_args(): strip matching flags from bundled vllm_args and append knob values, so knob changes properly override bundled defaults - extract_knobs_from_args(): seed UI knob values from bundled args so the Advanced dialog has correct starting state - models.py: load_catalog merges overrides on top of bundled yaml - GET /api/models returns effective_knobs per model - PUT /api/models/{key}/knobs persists knob changes - POST /api/models adds a custom catalog entry - DELETE /api/models/{key} removes a custom entry (bundled models cannot be deleted) - swap_manager.reload_catalog() called after each mutation so swaps see latest Frontend: - New 'Advanced' button on every card opens a modal dialog: max-model-len input, gpu-memory-utilization slider, three optimization checkboxes (fastsafetensors, prefix caching, FP8 KV cache). Save persists; Cancel discards. Custom models also have a Delete button. - After a successful download, automatically open the 'Add to catalog' dialog pre-filled with the repo, with the same knob defaults — user just enters key, display name, and clicks Save. - Custom catalog entries are tagged with a blue 'custom' pill on the card. Package: bump 0.2.3:0; main.ts sets MODELS_OVERRIDES=/data/models-overrides.yaml so overrides persist on the StartOS volume.	2026-05-12 11:30:47 -05:00
Grant	474417b458	v0.2.2 - spark-vllm-docker update checks + Apply Update Backend: - updates.py: get_update_status() runs git fetch + git rev-list --left-right --count HEAD...origin/main to learn ahead/behind/dirty, plus git log for pending commits - UpdateManager class with asyncio.Lock; one update at a time - POST /api/updates/apply triggers "git pull --ff-only && ./build-and-copy.sh -c" over SSH with streamed log + phase detection (Pulling / Building the vLLM container / Copying to peer Sparks) - GET /api/updates returns {ok, behind, ahead, dirty, current, log[], branch} Frontend: - Persistent banner near footer: hidden when up-to-date, blue when N commits behind, warn (orange) when local dirty changes block update - 'Show details' expands a list of pending commits - 'Apply update' triggers the long-running build with phase + elapsed timer + collapsible logs - Confirmation dialog explains the 5–40 min duration Package: bump 0.2.2:0	2026-05-12 11:26:55 -05:00
Grant	9dde938348	v0.2.1 - Model download with % progress Backend: - download.py module: drives ./hf-download.sh <repo> [-c --copy-parallel] over SSH, parses tqdm output (regex matches '8%|...| 2.06G/25.1G [03:20<18:35, 20.6MB/s]') into percent + bytes done/total + elapsed + ETA + rate - DownloadManager: in-memory job tracking with asyncio.Lock (one download at a time) - POST /api/download, GET /api/download/{id}, SSE /api/download/{id}/stream - Phase detection: Connecting / Fetching N files / Downloading / Copying to peer Sparks / Done Frontend: - '+ Download a new model' button next to LLM swap section title - Inline form: HF repo text field + solo/cluster radio + Cancel/Start - Progress UI: spinner, elapsed timer, phase label, percent fill, stats line (bytes/rate/ETA), collapsible raw logs Package: bump 0.2.1:0	2026-05-12 11:24:31 -05:00
Grant	27699a2469	v0.2.0 - Always-on services panel with per-service host config Dashboard: - New 'Always-on services' section with cards for Parakeet and Magpie - Each card: host:port, model loaded, status pill (Healthy/Unhealthy/Starting/Not configured) - Start, Restart, Stop buttons. Buttons disabled when not applicable for current state - Restart counter shown when > 1 (would have surfaced the old magpie crash loop) Backend: - New /api/services GET: docker container state + http health for each support service - New POST /api/services/{name}/{action} for start | stop | restart - services.py module: docker_state, run_action via SSH - config.py: PARAKEET_HOST/USER/CONTAINER and MAGPIE_* env vars, default to spark2_* - health.py: use per-service hosts (no longer hard-wired to spark2_host) Package: - sparkConfig.yaml.ts: add 6 new optional fields - configureSparks action: optional 'Parakeet host', 'Parakeet container', 'Magpie host', 'Magpie container' fields; descriptions explain they default to Spark 2 when blank - Handler normalizes nulls to empty strings before merge - main.ts: pass new env vars to container - bump to 0.2.0:0	2026-05-12 11:21:15 -05:00
Grant	4cda453c8a	0.1.0:4 - expose /api/endpoints as separate StartOS service interface Adds a second sdk.createInterface with type='api' and path='/api/endpoints' on the same uiPort (9999). StartOS dashboard now shows two service interfaces: Web UI and OpenAI-compatible API. The API URL is discoverable to other services without users needing to remember the /api/endpoints suffix.	2026-05-12 11:07:51 -05:00
Grant	2ba3da55b1	0.1.0:3 - Show Public Key layout + /api/endpoints service-discovery - showPublicKey now uses result.group: install command and raw key are each their own one-click copy box; description is brief - /api/endpoints returns stable shape { vllm, parakeet, magpie } with base_url + model + ready, for other LAN services to consume without hardcoding Spark IPs - health.py: parakeet/magpie now also expose base_url - README: documented /api/endpoints shape	2026-05-12 10:52:57 -05:00
Grant	51804b2e5e	0.1.0:2 - remove '<spark-user>' default everywhere (it's Alice's username, not factory) Per user correction: '<spark-user>' is not the DGX Spark factory default. Generic-ize: - configureSparks: no default user; placeholder 'your SSH username' - sparkConfig schema: empty string defaults - main.ts env fallback: empty - showPublicKey: drop the '<spark-user>' fallback; skip Spark if user not configured - Update feedback memory with the correction	2026-05-12 10:39:57 -05:00
Grant	0ddab99468	Bump to 0.1.0:1 — portability + endpoint display - configureSparks.ts: generic placeholders (e.g. 192.168.1.10), no Alice-specific IPs; descriptions explain the role of each node instead of naming specific hardware - showPublicKey.ts: reads sparkConfig.yaml; emits a ready-to-paste one-liner (KEY='...' followed by 'ssh user@host "echo $KEY >> authorized_keys"' for each configured Spark). Falls back to generic instructions if Configure Sparks hasn't been run yet. - /api/status now includes vllm.base_url for the OpenAI endpoint - New endpoint panel in UI: base URL + model ID rows with copy buttons + collapsible curl example - Bump version to 0.1.0:1	2026-05-12 10:38:18 -05:00
Grant	dd9d53060b	Add StartOS 0.4 package scaffold (manifest, main, interfaces, 2 actions) - package/Makefile + s9pk.mk + package.json + tsconfig.json - startos/manifest: dockerBuild source pointing at ../image/Dockerfile - startos/main: reads /data/config.yaml reactively, passes env vars to container - startos/interfaces: binds port 9999 as HTTP UI - startos/actions: showPublicKey (read /data/ssh/id_ed25519.pub), configureSparks - TS + JS bundle compile clean (tsc --noEmit, ncc build)	2026-05-12 09:36:15 -05:00

16 Commits
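
The v0.5.0 entry describes build_magic_packet() as the standard 6x0xFF + 16x MAC layout, plus MAC validation and normalization. A minimal sketch of that layout follows. The function bodies, the accepted separators, and the helper name normalize_mac are assumptions for illustration, not the code in wol.py; only the packet layout and the use of ports 9 and 7 come from the log.

```python
import re
import socket


def normalize_mac(mac: str) -> str:
    """Return the MAC as lowercase colon-separated hex, or raise ValueError."""
    # Assumption: accept ':', '-' and '.' as separators, nothing else.
    digits = re.sub(r"[:\-.]", "", mac.strip())
    if not re.fullmatch(r"[0-9A-Fa-f]{12}", digits):
        raise ValueError(f"invalid MAC address: {mac!r}")
    digits = digits.lower()
    return ":".join(digits[i:i + 2] for i in range(0, 12, 2))


def build_magic_packet(mac: str) -> bytes:
    """Six 0xFF bytes followed by the 6-byte MAC repeated 16 times (102 bytes)."""
    mac_bytes = bytes.fromhex(normalize_mac(mac).replace(":", ""))
    return b"\xff" * 6 + mac_bytes * 16


def send_local_broadcast(mac: str, ports=(9, 7)) -> None:
    """Send the packet as a UDP broadcast to each port (defined here, not called)."""
    packet = build_magic_packet(mac)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        for port in ports:
            sock.sendto(packet, ("255.255.255.255", port))
```

A broadcast sent this way only reaches the sender's own LAN segment, which is consistent with the log preferring the via-peer path.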
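
The v0.2.1 entry says download.py parses tqdm output such as '8%|...| 2.06G/25.1G [03:20<18:35, 20.6MB/s]' into percent, bytes done/total, elapsed, ETA, and rate. The pattern below is one way to do that, offered as a sketch: the regex, the group names, and the parse_tqdm_line helper are assumptions, and the real module may handle more tqdm variants.

```python
import re
from typing import Optional

# Hypothetical pattern for a tqdm byte-progress line; field names are illustrative.
TQDM_RE = re.compile(
    r"(?P<percent>\d+)%\|.*?\|\s*"
    r"(?P<done>[\d.]+[kKMGT]?)/(?P<total>[\d.]+[kKMGT]?)\s*"
    r"\[(?P<elapsed>[\d:]+)<(?P<eta>[\d:?]+),\s*(?P<rate>[^\]]+)\]"
)


def parse_tqdm_line(line: str) -> Optional[dict]:
    """Return the progress fields from one tqdm line, or None if it does not match."""
    match = TQDM_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["percent"] = int(fields["percent"])
    return fields
```

Lines that are not progress updates (for example phase messages) simply return None, so a caller can fall through to phase detection.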
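
The v0.2.3 entry describes apply_knobs_to_args() as stripping matching flags from the bundled vllm_args and appending the knob values, so that knob changes override bundled defaults. A rough sketch of that strip-then-append idea is below. It assumes args is a flat list of strings and that knobs are keyed by the flag name without the leading dashes; both are illustrative choices, not the signature in overrides.py.

```python
def apply_knobs_to_args(args: list, knobs: dict) -> list:
    """Drop any bundled flag that a knob overrides, then append the knob values."""
    flags = {f"--{name}" for name in knobs}
    kept = []
    i = 0
    while i < len(args):
        arg = args[i]
        name = arg.split("=", 1)[0]
        if name in flags:
            # Also skip a separate value token ("--flag value" form).
            has_separate_value = (
                "=" not in arg
                and i + 1 < len(args)
                and not args[i + 1].startswith("--")
            )
            i += 2 if has_separate_value else 1
            continue
        kept.append(arg)
        i += 1
    for name, value in knobs.items():
        if value is True:
            kept.append(f"--{name}")           # boolean toggle switched on
        elif value is False or value is None:
            continue                            # toggle switched off: flag stays removed
        else:
            kept.extend([f"--{name}", str(value)])
    return kept
```

Appending after stripping means a knob value never appears alongside the bundled default for the same flag.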
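
The v0.3.0:1 hotfix runs the spark1 and spark2 hardware probes through asyncio.gather so the worst-case wall time is the slowest probe rather than the sum, and an unreachable Spark shows up as a single 'unreachable' card. The sketch below illustrates that pattern with asyncio.sleep standing in for the SSH call. The function names, the card fields, and the delays are hypothetical.

```python
import asyncio
import time


async def probe(name: str, delay: float, fail: bool = False) -> dict:
    """Stand-in for one SSH hardware probe; `delay` simulates the round trip."""
    await asyncio.sleep(delay)
    if fail:
        raise TimeoutError(f"{name}: ssh timed out")
    return {"name": name, "reachable": True}


async def probe_all(specs: list) -> dict:
    """Run every probe concurrently and turn failures into 'unreachable' cards."""
    names = [spec[0] for spec in specs]
    results = await asyncio.gather(
        *(probe(*spec) for spec in specs), return_exceptions=True
    )
    cards = {}
    for name, result in zip(names, results):
        if isinstance(result, Exception):
            cards[name] = {"name": name, "reachable": False, "error": str(result)}
        else:
            cards[name] = result
    return cards


def run_demo() -> tuple:
    """Probe two simulated Sparks and report the cards plus the elapsed wall time."""
    start = time.perf_counter()
    cards = asyncio.run(probe_all([("spark1", 0.2, False), ("spark2", 0.3, True)]))
    return cards, time.perf_counter() - start
```

With return_exceptions=True, one dead host does not cancel or hide the result from the healthy one.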
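
The v0.6.0 entry notes that legacy v0.5 events carrying the old 'spark' field are migrated transparently on read, with kind defaulting to 'transition'. One way such a read-time normalization could look is sketched below. The helper name normalize_event and the other event fields used in the example ("state", "ts") are assumptions, not the schema in connectivity.py.

```python
def normalize_event(event: dict) -> dict:
    """Return a copy of the event in the v0.6 shape without mutating the input."""
    out = dict(event)
    # v0.5 stored the host under 'spark'; v0.6 generalizes it to 'subject'.
    if "subject" not in out and "spark" in out:
        out["subject"] = out.pop("spark")
    # v0.5 only recorded transitions, so that is the default kind.
    out.setdefault("kind", "transition")
    return out
```

Because the migration happens on read, the file on disk can keep a mix of old and new events.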
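
The v0.6.0 entry adds a passive endpoint, POST /api/health-event, taking {service, ok, source?, error?, ms?}. The sketch below builds such a request with the standard library so an external app could report a failure. The base URL and the helper name are placeholders, and the exact server-side validation is not described in the log, so treat the payload handling as an assumption.

```python
import json
import urllib.request


def build_health_event_request(base_url, service, ok, source=None, error=None, ms=None):
    """Build (but do not send) a POST /api/health-event request."""
    payload = {"service": service, "ok": ok}
    # Optional fields are left out entirely when not provided.
    for key, value in (("source", source), ("error", error), ("ms", ms)):
        if value is not None:
            payload[key] = value
    return urllib.request.Request(
        base_url.rstrip("/") + "/api/health-event",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Usage (hypothetical host; 9999 is the UI port named in the log):
# req = build_health_event_request("http://dashboard.local:9999", "vllm", False,
#                                  source="open-webui", error="timeout", ms=30000)
# urllib.request.urlopen(req, timeout=5)
```

Per the log, such reports are appended as events with kind='report' and do not change the current state, since only active polling is authoritative.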