spark-control

Author	SHA1	Message	Date
Keysat	136a4713a1	v0.22.0:0 - configurable vllm port; gitea-release tooling; coexistence roadmap - Configure Sparks gains a vLLM port field (blank => 8888, our launch-cluster.sh default); VLLM_PORT plumbed configureSparks -> sparkConfig.yaml -> main.ts env -> config.py. So an adopter whose vLLM listens elsewhere (e.g. 8000) can fix the "vLLM unreachable" health check without rebuilding the package. - Harden numeric env parsing (config._env_int): a blank or malformed port now falls back to its default instead of crashing daemon startup (closes a P3 tech-debt item; the Configure panel passes unset optional fields as ""). - Add scripts/gitea-release.sh + `make release` to publish the built s9pk to Gitea Releases, so the OpenClaw adopter pulls updates with a read-only token instead of being hand-sent the package. - Capture the OpenClaw/Johnny-5 coexistence epic and the "control plane, not a job runner" stance in ROADMAP.md and Current state.	2026-06-17 19:45:09 -05:00
Keysat	39f8410623	v0.21.0:0 - matrix-bridge bot tile (status, update, restart, logs)	2026-06-15 22:57:40 -05:00
Keysat	9ef9226e0a	docs: split CLAUDE.md into path-scoped .claude/rules; fix dev/test commands - CLAUDE.md trimmed to whole-repo facts (58 lines); subsystem guidance moved to .claude/rules/{startos-package,fastapi-image,redaction,audio-speech}.md with paths: frontmatter so each loads only when matching files are touched - .gitignore: track .claude/rules/ while keeping the rest of .claude/ (settings.local.json) ignored - test-audio-with-speakers.sh: require audio-file arg in docs, replace owner-specific SPARK_CONTROL/VLLM defaults with generic ones (localhost dev server + Spark Control vLLM proxy), discover the loaded LLM via /api/status since /v1/models lists audio models only - document REDACTION_MAP_DB + CONNECTIVITY_LOG as required for local dev (/data only exists in the container) - prettier pass over startos/actions (formatting drift)	2026-06-11 19:12:23 -05:00
Keysat	8d839e3714	v0.13.0:4 - redaction gateway, embeddings proxy, expanded audio API - Add redaction gateway (redaction_gateway.py, redaction/ scrub + tests) - Add embeddings proxy and spark_embed service (Dockerfile + main.py) - Expand audio_proxy with speaker-aware handling; deep_health/health/server updates - Package: configureSparks action + sparkConfig model updates, manifest/main wiring - Docs: AUDIO_API, EMBEDDINGS, REDACTION_GATEWAY; HANDOFF and runbook/known-issues refresh	2026-06-11 17:45:57 -05:00
Grant	1889ab45fb	v0.4.0 - NIM installer + dashboard resilience Hotfix (was v0.3.1): - services.py: cache 'unreachable' per (host,user) for 25s so a dead Spark doesn't hang every /api/services call behind 6s ssh timeout - ssh_run timeout reduced 10 -> 6s for docker_state probes - hardware probe: shorter SSH timeout (6s), longer cache TTL for failures (25s) - JS pollStatus retries loadModels() if state.models is empty (recovers from cold-start proxy timeout) - Unreachable hardware card now includes troubleshooting steps (Spark Control cannot SSH into an unreachable Spark to restart it) v0.4 NIM installer: - nim.py module: curated SUGGESTED_NIMS list (Parakeet, Magpie, Riva) + NimManager that runs docker login nvcr.io + docker pull + docker run -d --gpus all -p PORT:PORT -v VOL:/opt/nim/.cache -e NGC_API_KEY -e ... --restart=unless-stopped + chown the volume to uid 1000 + restart. Streams all output via SSE; redacts the API key from log lines. - custom_services.py: persists installed NIMs to /data/services-overrides.yaml so they appear in the services panel after install - services.py: merges custom services into the panel - /api/nim/catalog GET, /api/nim/install POST + GET/SSE - /api/services/{name} DELETE for custom services - UI: '+ Install NIM' button next to 'Always-on services'; modal lists curated images each with a 'Pick' button + a custom-image form; installation runs in a second dialog with phase + elapsed timer + collapsible log - NGC API key field added to Configure Sparks (masked); injected as NGC_API_KEY env var into the container Package: bump 0.4.0:0; main.ts adds SERVICES_OVERRIDES + NGC_API_KEY env vars	2026-05-12 12:32:29 -05:00
Grant	64ce0fca10	v0.3.0 - Hardware dashboard + knob context + Explain context + Open WebUI link Hardware dashboard: - New hardware.py module: SSH probes each Spark for hostname, uptime, load+cores, RAM, disk, GPU (name, util, temp, power) + per-process GPU memory sum - DGX Spark uses unified memory (nvidia-smi memory.total returns N/A); fall back to per-process compute memory and compute fraction against system RAM. Marks with gpu_unified_memory=true. - 4s TTL cache in HardwareProbe to avoid hammering - /api/hardware returns per-Spark snapshot - UI: 'Spark hardware' section at the top with per-Spark cards (CPU load, RAM, GPU mem (unified), GPU util + temp + power, disk) — bars with warn threshold styling - Polls every 8s Knob context (tied to live hardware): - Each Advanced knob now shows plain-English help text - 'GPU memory %' shows '~N GB allocated · ~M GB left for OS/buffers' computed from actual Spark RAM - 'Max context' shows '~N pages of text' - Toggles show tradeoff descriptions Explain context: - '✨ Explain context' button on the update banner - /api/explain-updates POST: forwards pending commits to the loaded vLLM model and streams its response back as SSE - Renders into an expandable 'Explained by the loaded LLM' section under Pending commits - Reasoning tokens shown italicized when the model emits them Open WebUI integration: - New 'Open WebUI URL' optional field in Configure Sparks - /api/config exposes it; UI shows 'Open chat ↗' button in the top bar if set Downloads: - Third radio option: Spark 1 only / Spark 2 only / Both Sparks - Backend picks SSH target based on mode - HF repo link icon next to the input - Helper line about NVFP4 for Blackwell Model cards: - Repo name is now a clickable link to its Hugging Face page Package: bump 0.3.0:0	2026-05-12 12:00:15 -05:00
Grant	27699a2469	v0.2.0 - Always-on services panel with per-service host config Dashboard: - New 'Always-on services' section with cards for Parakeet and Magpie - Each card: host:port, model loaded, status pill (Healthy/Unhealthy/Starting/Not configured) - Start, Restart, Stop buttons. Buttons disabled when not applicable for current state - Restart counter shown when > 1 (would have surfaced the old magpie crash loop) Backend: - New /api/services GET: docker container state + http health for each support service - New POST /api/services/{name}/{action} for start | stop | restart - services.py module: docker_state, run_action via SSH - config.py: PARAKEET_HOST/USER/CONTAINER and MAGPIE_* env vars, default to spark2_* - health.py: use per-service hosts (no longer hard-wired to spark2_host) Package: - sparkConfig.yaml.ts: add 6 new optional fields - configureSparks action: optional 'Parakeet host', 'Parakeet container', 'Magpie host', 'Magpie container' fields; descriptions explain they default to Spark 2 when blank - Handler normalizes nulls to empty strings before merge - main.ts: pass new env vars to container - bump to 0.2.0:0	2026-05-12 11:21:15 -05:00
Grant	2ba3da55b1	0.1.0:3 - Show Public Key layout + /api/endpoints service-discovery - showPublicKey now uses result.group: install command and raw key are each their own one-click copy box; description is brief - /api/endpoints returns stable shape { vllm, parakeet, magpie } with base_url + model + ready, for other LAN services to consume without hardcoding Spark IPs - health.py: parakeet/magpie now also expose base_url - README: documented /api/endpoints shape	2026-05-12 10:52:57 -05:00
Grant	51804b2e5e	0.1.0:2 - remove '<spark-user>' default everywhere (it's Alice's username, not factory) Per user correction: '<spark-user>' is not the DGX Spark factory default. Generic-ize: - configureSparks: no default user; placeholder 'your SSH username' - sparkConfig schema: empty string defaults - main.ts env fallback: empty - showPublicKey: drop the '<spark-user>' fallback; skip Spark if user not configured - Update feedback memory with the correction	2026-05-12 10:39:57 -05:00
Grant	0ddab99468	Bump to 0.1.0:1 — portability + endpoint display - configureSparks.ts: generic placeholders (e.g. 192.168.1.10), no Alice-specific IPs; descriptions explain the role of each node instead of naming his hardware - showPublicKey.ts: reads sparkConfig.yaml; emits a ready-to-paste one-liner (KEY='...' followed by 'ssh user@host "echo $KEY >> authorized_keys"' for each configured Spark). Falls back to generic instructions if Configure Sparks hasn't been run yet. - /api/status now includes vllm.base_url for the OpenAI endpoint - New endpoint panel in UI: base URL + model ID rows with copy buttons + collapsible curl example - Bump version to 0.1.0:1	2026-05-12 10:38:18 -05:00
Grant	dd9d53060b	Add StartOS 0.4 package scaffold (manifest, main, interfaces, 2 actions) - package/Makefile + s9pk.mk + package.json + tsconfig.json - startos/manifest: dockerBuild source pointing at ../image/Dockerfile - startos/main: reads /data/config.yaml reactively, passes env vars to container - startos/interfaces: binds port 9999 as HTTP UI - startos/actions: showPublicKey (read /data/ssh/id_ed25519.pub), configureSparks - TS + JS bundle compile clean (tsc --noEmit, ncc build)	2026-05-12 09:36:15 -05:00

11 Commits
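
The v0.22.0:0 entry describes hardening numeric env parsing so that a blank or malformed port falls back to its default instead of crashing daemon startup. The actual config._env_int is not shown in this log, so the following is only a minimal sketch of that behavior. The function name env_int, its signature, and the optional env mapping (added here so the function can be exercised without touching the real environment) are assumptions for illustration.

```python
import os
from typing import Mapping, Optional


def env_int(name: str, default: int, env: Optional[Mapping[str, str]] = None) -> int:
    """Read an integer setting, falling back to `default` on bad input.

    Hypothetical helper: mirrors the behavior described for config._env_int,
    not its real implementation.
    """
    source = os.environ if env is None else env
    raw = source.get(name, "")
    try:
        # int("") and int("abc") both raise ValueError; a None value
        # has no .strip() and raises AttributeError.
        return int(raw.strip())
    except (ValueError, AttributeError):
        return default


# The Configure panel is described as passing unset optional fields as "",
# so a blank VLLM_PORT should resolve to the documented default of 8888.
port = env_int("VLLM_PORT", 8888, {"VLLM_PORT": ""})
```

Catching the parse error at the point of reading, rather than validating earlier in the pipeline, keeps the fallback in one place no matter which layer (the Configure action, sparkConfig.yaml, or main.ts) supplied the empty value.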