spark-control

Author	SHA1	Message	Date
Grant	75c0ecfd08	docs: update README with v0.2 feature summary v0.2	2026-05-12 11:31:14 -05:00
Grant	75fd0846b4	v0.2.3 - Per-model Advanced settings + catalog-add for downloaded models Backend: - overrides.py: read/write /data/models-overrides.yaml (knobs + custom entries) - apply_knobs_to_args(): strip matching flags from bundled vllm_args and append knob values, so knob changes properly override bundled defaults - extract_knobs_from_args(): seed UI knob values from bundled args so the Advanced dialog has correct starting state - models.py: load_catalog merges overrides on top of bundled yaml - GET /api/models returns effective_knobs per model - PUT /api/models/{key}/knobs persists knob changes - POST /api/models adds a custom catalog entry - DELETE /api/models/{key} removes a custom entry (bundled models cannot be deleted) - swap_manager.reload_catalog() called after each mutation so swaps see latest Frontend: - New 'Advanced' button on every card opens a modal dialog: max-model-len input, gpu-memory-utilization slider, three optimization checkboxes (fastsafetensors, prefix caching, FP8 KV cache). Save persists; Cancel discards. Custom models also have a Delete button. - After a successful download, automatically open the 'Add to catalog' dialog pre-filled with the repo, with the same knob defaults — user just enters key, display name, and clicks Save. - Custom catalog entries are tagged with a blue 'custom' pill on the card. Package: bump 0.2.3:0; main.ts sets MODELS_OVERRIDES=/data/models-overrides.yaml so overrides persist on the StartOS volume.	2026-05-12 11:30:47 -05:00
Grant	474417b458	v0.2.2 - spark-vllm-docker update checks + Apply Update Backend: - updates.py: get_update_status() runs git fetch + git rev-list --left-right --count HEAD...origin/main to learn ahead/behind/dirty, plus git log for pending commits - UpdateManager class with asyncio.Lock; one update at a time - POST /api/updates/apply triggers "git pull --ff-only && ./build-and-copy.sh -c" over SSH with streamed log + phase detection (Pulling / Building the vLLM container / Copying to peer Sparks) - GET /api/updates returns {ok, behind, ahead, dirty, current, log[], branch} Frontend: - Persistent banner near footer: hidden when up-to-date, blue when N commits behind, warn (orange) when local dirty changes block update - 'Show details' expands a list of pending commits - 'Apply update' triggers the long-running build with phase + elapsed timer + collapsible logs - Confirmation dialog explains the 5–40 min duration Package: bump 0.2.2:0	2026-05-12 11:26:55 -05:00
Grant	9dde938348	v0.2.1 - Model download with % progress Backend: - download.py module: drives ./hf-download.sh <repo> [-c --copy-parallel] over SSH, parses tqdm output (regex matches '8%\|...\| 2.06G/25.1G [03:20<18:35, 20.6MB/s]') into percent + bytes done/total + elapsed + ETA + rate - DownloadManager: in-memory job tracking with asyncio.Lock (one download at a time) - POST /api/download, GET /api/download/{id}, SSE /api/download/{id}/stream - Phase detection: Connecting / Fetching N files / Downloading / Copying to peer Sparks / Done Frontend: - '+ Download a new model' button next to LLM swap section title - Inline form: HF repo text field + solo/cluster radio + Cancel/Start - Progress UI: spinner, elapsed timer, phase label, percent fill, stats line (bytes/rate/ETA), collapsible raw logs Package: bump 0.2.1:0	2026-05-12 11:24:31 -05:00
Grant	27699a2469	v0.2.0 - Always-on services panel with per-service host config Dashboard: - New 'Always-on services' section with cards for Parakeet and Magpie - Each card: host:port, model loaded, status pill (Healthy/Unhealthy/Starting/Not configured) - Start, Restart, Stop buttons. Buttons disabled when not applicable for current state - Restart counter shown when > 1 (would have surfaced the old magpie crash loop) Backend: - New /api/services GET: docker container state + http health for each support service - New POST /api/services/{name}/{action} for start \| stop \| restart - services.py module: docker_state, run_action via SSH - config.py: PARAKEET_HOST/USER/CONTAINER and MAGPIE_* env vars, default to spark2_* - health.py: use per-service hosts (no longer hard-wired to spark2_host) Package: - sparkConfig.yaml.ts: add 6 new optional fields - configureSparks action: optional 'Parakeet host', 'Parakeet container', 'Magpie host', 'Magpie container' fields; descriptions explain they default to Spark 2 when blank - Handler normalizes nulls to empty strings before merge - main.ts: pass new env vars to container - bump to 0.2.0:0	2026-05-12 11:21:15 -05:00
Grant	ed54f85442	known-issues: mark magpie crash loop RESOLVED with chown fix recipe Volume magpie-model-cache was owned by root, container drops to uid 1000. Fix: docker run --rm -v magpie-model-cache:/cache alpine chown -R 1000:1000 /cache + docker restart magpie-tts. After ~3 GB NGC model download, healthy on :9000.	2026-05-12 11:12:25 -05:00
Grant	4cda453c8a	0.1.0:4 - expose /api/endpoints as separate StartOS service interface Adds a second sdk.createInterface with type='api' and path='/api/endpoints' on the same uiPort (9999). StartOS dashboard now shows two service interfaces: Web UI and OpenAI-compatible API. The API URL is discoverable to other services without users needing to remember the /api/endpoints suffix.	2026-05-12 11:07:51 -05:00
Grant	2ba3da55b1	0.1.0:3 - Show Public Key layout + /api/endpoints service-discovery - showPublicKey now uses result.group: install command and raw key are each their own one-click copy box; description is brief - /api/endpoints returns stable shape { vllm, parakeet, magpie } with base_url + model + ready, for other LAN services to consume without hardcoding Spark IPs - health.py: parakeet/magpie now also expose base_url - README: documented /api/endpoints shape	2026-05-12 10:52:57 -05:00
Grant	51804b2e5e	0.1.0:2 - remove '<spark-user>' default everywhere (it's Alice's username, not factory) Per user correction: '<spark-user>' is not the DGX Spark factory default. Generic-ize: - configureSparks: no default user; placeholder 'your SSH username' - sparkConfig schema: empty string defaults - main.ts env fallback: empty - showPublicKey: drop the '<spark-user>' fallback; skip Spark if user not configured - Update feedback memory with the correction	2026-05-12 10:39:57 -05:00
Grant	0ddab99468	Bump to 0.1.0:1 — portability + endpoint display - configureSparks.ts: generic placeholders (e.g. 192.168.1.10), no Alice-specific IPs; descriptions explain the role of each node instead of naming his hardware - showPublicKey.ts: reads sparkConfig.yaml; emits a ready-to-paste one-liner (KEY='...' followed by 'ssh user@host "echo $KEY >> authorized_keys"' for each configured Spark). Falls back to generic instructions if Configure Sparks hasn't been run yet. - /api/status now includes vllm.base_url for the OpenAI endpoint - New endpoint panel in UI: base URL + model ID rows with copy buttons + collapsible curl example - Bump version to 0.1.0:1	2026-05-12 10:38:18 -05:00
Grant	87334f85f0	Add per-model descriptions + repo-cleanup polish - models.yaml: add 'description' field for all 5 models (generic, anyone-can-use) - ModelDef gains optional description: str \| None field - UI: render description below meta tags; mute the repo line further - escapeHtml() for safety in case descriptions/names contain HTML chars - Update runbook: how to add a new model with description	2026-05-12 10:19:09 -05:00
Grant	c0aebfc98b	Add friendly swap UI: timer + phase indicator + progress bar + collapsible logs - Elapsed timer (mm:ss) in top-right of swap panel - Phase display: Stopping / Starting / Loading weights (N/M shards) / Compiling / Warming up / Ready - Progress bar with smooth fill mapped from phase - Raw vLLM logs hidden behind <details> 'Show technical logs' - Detection from log content (safetensors %, torch.compile, Application startup, Ray cluster join) - Backfill from /api/swap/{id} on reattach (mid-swap reload works)	2026-05-12 10:11:14 -05:00
Grant	34bdbb7aba	Add Spark prerequisites section to runbook (spark-vllm-docker is upstream + Spark-side)	2026-05-12 10:05:17 -05:00
Grant	53a0b01d88	Update README with build flow + post-install steps; note IPv6/mDNS quirk	2026-05-12 10:03:37 -05:00
Grant	72bf754baa	Pack spark-control_x86_64.s9pk (55 MB) - Move models.yaml into image/ so the docker build context is self-contained - Fix manifest: dockerfile=../image/Dockerfile, workdir=../image - Add LICENSE (MIT) and assets/README.md (StartOS marketplace listing) - s9pk validates: id=spark-control, version=0.1.0:0, osVersion=0.4.0-beta.6, sdkVersion=1.3.3 - Image embeds python:3.12-slim + openssh-client + FastAPI app + models.yaml	2026-05-12 09:52:53 -05:00
Grant	342e150266	Add safe optimization flags to gemma4 + qwen36 (fastsafetensors, prefix-caching, fp8 kv) Aligned with sibling recipes in eugr/spark-vllm-docker. Applies on next swap to each model. First real swap gemma4 -> qwen36 succeeded in 5:30 with --moe_backend=flashinfer_cutlass.	2026-05-12 09:49:08 -05:00
Grant	dd9d53060b	Add StartOS 0.4 package scaffold (manifest, main, interfaces, 2 actions) - package/Makefile + s9pk.mk + package.json + tsconfig.json - startos/manifest: dockerBuild source pointing at ../image/Dockerfile - startos/main: reads /data/config.yaml reactively, passes env vars to container - startos/interfaces: binds port 9999 as HTTP UI - startos/actions: showPublicKey (read /data/ssh/id_ed25519.pub), configureSparks - TS + JS bundle compile clean (tsc --noEmit, ncc build)	2026-05-12 09:36:15 -05:00
Grant	ae8efa1754	Initial scaffold: image/ FastAPI app, models.yaml, docs - image/ FastAPI app: /api/status, /api/swap, /api/swap/{id}/stream, /api/test-connection - models.yaml: 5-model catalog (qwen3-vl, gemma4, qwen36, qwen3-235b-fp8, qwen25-72b) - README, runbook, known-issues - Dry-run swap verified against live Spark 1 (gemma4 currently loaded)	2026-05-12 09:29:13 -05:00
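
The v0.2.3 commit (75fd0846b4) says apply_knobs_to_args() strips matching flags from the bundled vllm_args and appends the knob values so that knob changes override bundled defaults. A minimal sketch of that idea follows. It is not the project's actual code: the function body, the knob-name-to-flag mapping ("--" + knob name), and the handling of both "--flag value" and "--flag=value" forms are assumptions, and only value-taking knobs such as max-model-len and gpu-memory-utilization are covered (the boolean checkbox knobs are left out).

```python
def apply_knobs_to_args(args, knobs):
    """Return a new argument list in which each knob overrides any bundled value.

    Assumption: every knob maps to a value-taking flag named "--<knob>".
    """
    flags = {f"--{name}" for name in knobs}
    kept = []
    skip_next = False
    for arg in args:
        if skip_next:
            # This is the value belonging to a flag we just dropped.
            skip_next = False
            continue
        if arg in flags:
            # "--flag value" form: drop the flag and its following value.
            skip_next = True
            continue
        if arg.split("=", 1)[0] in flags:
            # "--flag=value" form: drop the single combined token.
            continue
        kept.append(arg)
    # Append the knob values last so they are the only occurrence of each flag.
    for name, value in knobs.items():
        kept += [f"--{name}", str(value)]
    return kept


bundled = ["--max-model-len", "32768", "--gpu-memory-utilization=0.7", "--enable-prefix-caching"]
print(apply_knobs_to_args(bundled, {"max-model-len": 65536, "gpu-memory-utilization": 0.85}))
```

Flags that no knob touches (here --enable-prefix-caching) pass through unchanged, and the input list is not mutated.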

18 Commits
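
The v0.2.1 commit (9dde938348) says download.py parses tqdm output into percent, bytes done/total, elapsed, ETA, and rate. The sketch below shows one way such a line could be parsed. The regex, the function name parse_tqdm_line, and the returned field names are hypothetical, modeled only on the sample line quoted in that commit message; the project's real pattern may differ.

```python
import re

# Hypothetical pattern modeled on the sample quoted in the commit message:
#   8%|...| 2.06G/25.1G [03:20<18:35, 20.6MB/s]
TQDM_RE = re.compile(
    r"(?P<percent>\d{1,3})%\|.*?\|\s*"                         # "8%|<bar>|"
    r"(?P<done>[\d.]+[kMGT]?)/(?P<total>[\d.]+[kMGT]?)\s*"     # "2.06G/25.1G"
    r"\[(?P<elapsed>[\d:]+)<(?P<eta>[\d:?]+),\s*(?P<rate>[^\]]+)\]"  # "[03:20<18:35, 20.6MB/s]"
)


def parse_tqdm_line(line):
    """Return a dict of progress fields, or None if the line is not a tqdm bar."""
    match = TQDM_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["percent"] = int(fields["percent"])  # other fields stay as display strings
    return fields


sample = "  8%|#         | 2.06G/25.1G [03:20<18:35, 20.6MB/s]"
print(parse_tqdm_line(sample))
print(parse_tqdm_line("Fetching 12 files"))
```

Lines that do not look like a progress bar return None, so a caller can fall back to phase detection or raw logging for them.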