spark-control

Author	SHA1	Message	Date
Keysat	7e0759846f	v0.27.0:0 - in-app settings gear + swap-lock route fix. Move the ~20 optional cluster knobs out of the StartOS "Configure Sparks" action (now just the 4 required fields) and into a dashboard ⚙ Settings gear, backed by a /data/app_settings.json overlay keyed by env-var names. One shared mutable Settings instance + Settings.reload() applies edits live without a restart; existing installs' values migrate automatically on first boot. Also: support-service ports (parakeet/kokoro/embed/qdrant + vllm) are now configurable, and GET /api/swap/lock no longer 404s (it was shadowed by the /api/swap/{job_id} catch-all). WebhookNotifier is re-pointed on save so its url/secret reload live too.	2026-06-18 13:41:28 -05:00
Keysat	b67e001642	docs: v0.26.0:0 live + published to registry; surface Gemma-26B eval as next	2026-06-18 12:35:16 -05:00
Keysat	df9f244eae	v0.26.0:0 - disk-driven model menu (scan sparks; recipes; needs-setup) The dashboard menu is now the set of models actually downloaded on the Sparks, not a hard-coded catalog. models.yaml + overrides are reframed as launch recipes matched to an on-disk model by repo; an on-disk model with no recipe is flagged needs_setup and its launch settings are inferred from its config.json for a one-time operator confirmation (discovery.py). - delete now removes weights AND the menu card (delete_from_disk sweeps all hosts; the delete endpoint resolves keys via the live menu) - new GET /api/models/suggest; /api/models returns the menu + a recipes list (download autocomplete); GET /api/models/disk-status removed - dropped the two legacy Qwen recipes (235B FP8, 2.5 72B) - tests: +test_discovery.py (cache parsing, infer_recipe, build_menu merge)	2026-06-18 11:09:56 -05:00
Keysat	c0b35184ba	docs: trim Current state to live status — coordination epic shipped	2026-06-18 08:09:59 -05:00
Keysat	7ecd77f1e5	docs: defer raw-docker swap generalization — multi-node rationale recorded	2026-06-18 07:58:25 -05:00
Keysat	6bcda6e348	docs: v0.25.0:0 installed live — update Current state	2026-06-18 07:11:33 -05:00
Keysat	7ae6ab3ba8	v0.25.0:0 - cluster coordination layer (swap lock + webhook + schedule registry). GPU-arbiter safety layer for when automation, not just the dashboard, swaps models: - swap reservation lock (POST/GET/DELETE /api/swap/lock); 423-enforced in post_swap via a single-read gate, TTL-bounded, secret-token auth, human force-release override + dashboard banner - swap webhook (swap_complete/swap_failed) fired outside the swap lock, optional HMAC signature, configurable URL+secret - read-only schedule registry (GET/POST/DELETE /api/schedule) + dashboard panel. New module image/app/coordination.py; docs/COORDINATION.md for consumers; 22 offline tests in test_coordination.py.	2026-06-18 07:07:08 -05:00
Keysat	dd3d1412d4	docs: v0.24.0:0 committed/tagged/pushed — Gitea release asset + live install still pending	2026-06-17 23:11:14 -05:00
Keysat	26070eb191	v0.24.0:0 - configurable cluster topology (vllm container name, hide services, second-vllm monitor). Make the cluster topology configurable so an adopter wired differently (vLLM on both Sparks, port 8000, different container name, no Parakeet) can monitor without forking. Covers the OpenClaw report P4/P5/#6. - VLLM_CONTAINER override (default vllm_node), validated at the boundary and quote_arg-quoted into the swap log-tail + pre-flight validator exec. - DISABLED_SERVICES list: hidden services show no tile and are skipped by status/deep-health/connectivity probes (kills the Parakeet-on-8000 collision). - kind: vllm custom service monitors a second Spark's vLLM via the shared probe_vllm_endpoint; /api/endpoints gains a disabled flag. Swap mechanism intentionally not generalized to raw docker run (that's coordination, roadmap item 4).	2026-06-17 23:03:33 -05:00
Keysat	90394f891b	docs: v0.23.0 published, live install pending (mDNS); runbook sideload troubleshooting	2026-06-17 22:36:41 -05:00
Keysat	e783653ef0	v0.23.0:0 - local / fine-tuned model support Add models that live as a directory on a Spark (e.g. LoRA-merged fine-tunes), not just Hugging Face repos. - ModelDef gains local_path; a model must set exactly one of repo / local_path. The validator also enforces the local-path whitelist and that any --chat-template lives inside local_path (only that dir is mounted). - build_launch_command bind-mounts the dir into the vLLM container at the SAME host==container path via the launch script's VLLM_SPARK_EXTRA_DOCKER_ARGS hook, then `vllm serve <dir>`. No launch-cluster.sh change (verified the upstream expands that var unquoted; contract noted in runbook.md). - shellsafe.validate_local_path: absolute path, charset whitelist, no '.'/'..'. - POST /api/models validates the full entry via ModelDef before persisting, so a bad entry can't be written and then break catalog load; _merge_overrides skips an invalid override entry instead of failing the whole catalog. - disk.py size-probes a local path with du; disk-delete refused for local models. - UI: "+ Add local model" dialog, `local` badge, path shown instead of an HF link, delete button hidden for local models. - Tests: local launch + injection round-trip, chat-template location, traversal, exactly-one-source, _merge_overrides skip-invalid (94 pass). Reviewer-agent pass; findings addressed.	2026-06-17 22:27:41 -05:00
Keysat	56f7ea4444	fix: gitea-release.sh tolerate 404 on tag lookup; report HTTP errors; mark v0.22.0 published	2026-06-17 21:23:21 -05:00
Keysat	aaad57d88f	docs: mark v0.22.0:0 shipped + record Gitea-release distribution decision	2026-06-17 19:47:49 -05:00
Keysat	136a4713a1	v0.22.0:0 - configurable vllm port; gitea-release tooling; coexistence roadmap - Configure Sparks gains a vLLM port field (blank => 8888, our launch-cluster.sh default); VLLM_PORT plumbed configureSparks -> sparkConfig.yaml -> main.ts env -> config.py. So an adopter whose vLLM listens elsewhere (e.g. 8000) can fix the "vLLM unreachable" health check without rebuilding the package. - Harden numeric env parsing (config._env_int): a blank or malformed port now falls back to its default instead of crashing daemon startup (closes a P3 tech-debt item; the Configure panel passes unset optional fields as ""). - Add scripts/gitea-release.sh + `make release` to publish the built s9pk to Gitea Releases, so the OpenClaw adopter pulls updates with a read-only token instead of being hand-sent the package. - Capture the OpenClaw/Johnny-5 coexistence epic and the "control plane, not a job runner" stance in ROADMAP.md and Current state.	2026-06-17 19:45:09 -05:00
Keysat	c179389731	docs: trim Current state post-matrix-bridge ship; add bot-tile ops note to runbook	2026-06-15 23:18:28 -05:00
Keysat	9debeb4bbe	v0.21.0:1 - tidy host display for port-less bot tile	2026-06-15 23:09:24 -05:00
Keysat	39f8410623	v0.21.0:0 - matrix-bridge bot tile (status, update, restart, logs)	2026-06-15 22:57:40 -05:00
Keysat	e307a08f05	docs: refresh Current state for handoff — harness shipped, parakeet deferred, finished narrative pruned	2026-06-15 18:32:57 -05:00
Keysat	6238ac88f7	test: add offline pytest harness (build_launch_command injection, label-merge)	2026-06-15 17:24:49 -05:00
Keysat	e87158c492	v0.20.0:0 - per-spark ssh-key copy + wireguard status badge	2026-06-15 09:53:40 -05:00
Keysat	5341fcc506	Add inbox-check line; align .gitignore with canonical .claude policy. Cross-repo git-hygiene audit remediation: surface ~/Projects/standards/INBOX.md items at session start, and switch .gitignore to the deny-by-default .claude/* block (shared wiring allow-listed) plus the canonical secrets/env lines, per standards/portability.md.	2026-06-14 12:17:16 -05:00
Keysat	05d03beeeb	docs: handoff — trim Current state, move full-eval debt to ROADMAP, record SSH-input + CSRF conventions - AGENTS.md: rewrite Current state lean for v0.19.0:0; drop the now-completed full-eval triage block (history lives in git log + EVALUATION.md). - docs/guides/fastapi-image.md: add two durable conventions — user values crossing into SSH must go through shellsafe; new endpoints and the csrf_guard exempt-prefix rule. - ROADMAP.md: park the remaining non-blocking P2/P3 tech debt from the eval.	2026-06-12 17:10:03 -05:00
Keysat	56a519ff4f	docs: record git-history scrub; fix stale SHAs and IP-fragment remnants History was rewritten with git filter-repo to purge owner-specific values (IPs, hostnames, SSH username, key name, personal names) from all commits, tags, and messages — including three LAN IPs and one Start9 address the v0.18.0:1 working-tree scrub had missed (one still live in HEAD at docs/AUDIO_API.md). Verified 0 hits across all refs. - AGENTS.md: Portability + Repo-wart + work-queue #2 + shipping note updated; commit-SHA references repointed to post-rewrite SHAs (367d986->8d839e3). - EVALUATION.md: P0 owner-data finding marked resolved; cleaned shorthand IP-octet fragments (/.87, /11) left by the placeholder substitution.	2026-06-12 16:55:08 -05:00
Keysat	1c4e861783	v0.19.0:0 - harden cluster-control surface: ssh injection, qdrant path, csrf Triaged from a full independent evaluation (EVALUATION.md). Addresses the three P0/P1 code findings; the proxy/data APIs that downstream apps consume are deliberately untouched. - ssh command injection (P0): new shellsafe.py validates + shlex.quotes every user-supplied value crossing into an SSH command on the Sparks (model repo, vllm args/knobs, NIM image/container/volume/port/env, service names). Boundary validation on POST /api/models and POST /api/nim/install; quoting at every sink in models/download/nim/services. NGC key now quoted too. - qdrant path injection (P1): /api/search validates the collection name against a metacharacter-free whitelist and URL-encodes the path segment. - csrf (P1): csrf_guard middleware enforces same-origin on state-changing control endpoints; /v1/, /scrub, /rehydrate, /api/search, /api/audio/ and /api/health-event are exempt so external consumers are unaffected. Verified: injection survives only as a single quoted token, vLLM preflight shlex.split round-trip intact, CSRF behaviors covered via TestClient, both offline redaction suites still pass, tsc clean, s9pk rebuilt.	2026-06-12 16:36:33 -05:00
Keysat	98988057a2	v0.18.0:1 - scrub owner-specific hostnames, ips, usernames, names from tracked files. Replace real cluster IPs/hosts/usernames and example names with neutral placeholders across docs, ops notes, package install text, and the offline redaction test; delete the obsolete build-time starter prompt. Closes the portability audit's single blocker. No runtime behavior change.	2026-06-12 15:07:34 -05:00
Keysat	5e6db2f63b	docs: record canonical AGENTS.md / symlink layout convention	2026-06-12 14:31:54 -05:00
Keysat	6a6112a15f	restructure: AGENTS.md canonical + docs/guides with .claude/rules symlinks. Rename CLAUDE.md -> AGENTS.md (cross-vendor standard) with a relative CLAUDE.md symlink so Claude Code still loads it. Move each .claude/rules file into docs/guides/ (paths: frontmatter preserved) and replace the rules file with a relative symlink into the guide. Repoint the AGENTS.md index paragraph at docs/guides/ so non-Claude agents find the guides.	2026-06-12 14:27:17 -05:00
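
The v0.23.0:0 entry above summarizes shellsafe.validate_local_path as "absolute path, charset whitelist, no '.'/'..'", and the v0.19.0:0 entry describes quoting user-supplied values before they cross into an SSH command. The sketch below illustrates checks in that spirit only. The function names, the exact allowed character set, and the error messages are assumptions for illustration, not the repository's code.

```python
import re
import shlex

# Hypothetical whitelist: the log only says "charset whitelist", so the exact
# set of allowed characters here is an assumption.
_SAFE_PATH = re.compile(r"[A-Za-z0-9._/-]+")


def validate_local_path_sketch(path: str) -> str:
    """Return the path unchanged if it passes all three checks, else raise."""
    # 1. Must be absolute.
    if not path.startswith("/"):
        raise ValueError("local_path must be absolute")
    # 2. Every character must be in the whitelist (rejects spaces, ';', '$', ...).
    if not _SAFE_PATH.fullmatch(path):
        raise ValueError("local_path contains characters outside the whitelist")
    # 3. No '.' or '..' path segments, so traversal cannot leave the directory.
    if any(segment in (".", "..") for segment in path.split("/")):
        raise ValueError("local_path must not contain '.' or '..' segments")
    return path


def serve_command_sketch(path: str) -> str:
    """Validate first, then quote, so the value reaches the shell as one token."""
    return "vllm serve " + shlex.quote(validate_local_path_sketch(path))
```

Validating at the boundary and quoting again at the sink is deliberately redundant: either layer alone would stop a value such as `/models/x; rm -rf /`, and keeping both means a later loosening of the whitelist does not silently reopen the injection path.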

27 Commits
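
The v0.22.0:0 entry notes that numeric env parsing (config._env_int) was hardened so that a blank or malformed port falls back to its default instead of crashing daemon startup. A minimal sketch of that behavior follows. The function name `env_int_sketch` and its signature are assumptions for illustration; the log does not show the real implementation.

```python
import os


def env_int_sketch(name: str, default: int) -> int:
    """Read an integer from the environment, falling back to `default`.

    An unset variable, an empty string (an unset optional field passed as ""),
    or any non-numeric value yields the default rather than an exception.
    """
    raw = os.environ.get(name, "")
    try:
        return int(raw.strip())
    except ValueError:
        # int("") and int("abc") both raise ValueError; treat them the same.
        return default
```

For example, with `VLLM_PORT` set to an empty string, `env_int_sketch("VLLM_PORT", 8888)` returns the default 8888, while `VLLM_PORT=8000` returns 8000.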