Move the ~20 optional cluster knobs out of the StartOS "Configure Sparks"
action (now just the 4 required fields) and into a dashboard ⚙ Settings gear,
backed by a /data/app_settings.json overlay keyed by env-var names. One shared
mutable Settings instance + Settings.reload() applies edits live without a
restart; existing installs' values migrate automatically on first boot.
Also: support-service ports (parakeet/kokoro/embed/qdrant + vllm) are now
configurable, and GET /api/swap/lock no longer 404s (it was shadowed by the
/api/swap/{job_id} catch-all). WebhookNotifier is re-pointed on save so its
url/secret reload live too.
The dashboard menu is now the set of models actually downloaded on the
Sparks, not a hard-coded catalog. models.yaml + overrides are reframed as
launch recipes matched to an on-disk model by repo; an on-disk model with
no recipe is flagged needs_setup and its launch settings are inferred from
its config.json for a one-time operator confirmation (discovery.py).
- delete now removes weights AND the menu card (delete_from_disk sweeps all
hosts; the delete endpoint resolves keys via the live menu)
- new GET /api/models/suggest; /api/models returns the menu + a recipes list
(download autocomplete); GET /api/models/disk-status removed
- dropped the two legacy Qwen recipes (235B FP8, 2.5 72B)
- tests: +test_discovery.py (cache parsing, infer_recipe, build_menu merge)
Make the cluster topology configurable so an adopter wired differently
(vLLM on both Sparks, port 8000, different container name, no Parakeet)
can monitor without forking. Covers the OpenClaw report P4/P5/#6.
- VLLM_CONTAINER override (default vllm_node), validated at the boundary
and quote_arg-quoted into the swap log-tail + pre-flight validator exec.
- DISABLED_SERVICES list: hidden services show no tile and are skipped by
status/deep-health/connectivity probes (kills the Parakeet-on-8000
collision).
- kind: vllm custom service monitors a second Spark's vLLM via the shared
probe_vllm_endpoint; /api/endpoints gains a disabled flag.
Swap mechanism intentionally not generalized to raw docker run (that's
coordination, roadmap item 4).
Add models that live as a directory on a Spark (e.g. LoRA-merged fine-tunes),
not just Hugging Face repos.
- ModelDef gains local_path; a model must set exactly one of repo / local_path.
The validator also enforces the local-path whitelist and that any
--chat-template lives inside local_path (only that dir is mounted).
- build_launch_command bind-mounts the dir into the vLLM container at the SAME
host==container path via the launch script's VLLM_SPARK_EXTRA_DOCKER_ARGS hook,
then `vllm serve <dir>`. No launch-cluster.sh change (verified the upstream
expands that var unquoted; contract noted in runbook.md).
- shellsafe.validate_local_path: absolute path, charset whitelist, no '.'/'..'.
- POST /api/models validates the full entry via ModelDef before persisting, so a
bad entry can't be written and then break catalog load; _merge_overrides skips
an invalid override entry instead of failing the whole catalog.
- disk.py size-probes a local path with du; disk-delete refused for local models.
- UI: "+ Add local model" dialog, `local` badge, path shown instead of an HF
link, delete button hidden for local models.
- Tests: local launch + injection round-trip, chat-template location, traversal,
exactly-one-source, _merge_overrides skip-invalid (94 pass). Reviewer-agent
pass; findings addressed.
- Configure Sparks gains a vLLM port field (blank => 8888, our launch-cluster.sh
default); VLLM_PORT plumbed configureSparks -> sparkConfig.yaml -> main.ts env
-> config.py. So an adopter whose vLLM listens elsewhere (e.g. 8000) can fix
the "vLLM unreachable" health check without rebuilding the package.
- Harden numeric env parsing (config._env_int): a blank or malformed port now
falls back to its default instead of crashing daemon startup (closes a P3
tech-debt item; the Configure panel passes unset optional fields as "").
- Add scripts/gitea-release.sh + `make release` to publish the built s9pk to
Gitea Releases, so the OpenClaw adopter pulls updates with a read-only token
instead of being hand-sent the package.
- Capture the OpenClaw/Johnny-5 coexistence epic and the "control plane, not a
job runner" stance in ROADMAP.md and Current state.
Cross-repo git-hygiene audit remediation: surface ~/Projects/standards/INBOX.md items at session start, and switch .gitignore to the deny-by-default .claude/* block (shared wiring allow-listed) plus the canonical secrets/env lines — per standards/portability.md.
- AGENTS.md: rewrite Current state lean for v0.19.0:0; drop the now-completed
full-eval triage block (history lives in git log + EVALUATION.md).
- docs/guides/fastapi-image.md: add two durable conventions — user values
crossing into SSH must go through shellsafe; new endpoints and the
csrf_guard exempt-prefix rule.
- ROADMAP.md: park the remaining non-blocking P2/P3 tech debt from the eval.
History was rewritten with git filter-repo to purge owner-specific values
(IPs, hostnames, SSH username, key name, personal names) from all commits,
tags, and messages — including three LAN IPs and one Start9 address the
v0.18.0:1 working-tree scrub had missed (one still live in HEAD at
docs/AUDIO_API.md). Verified 0 hits across all refs.
- AGENTS.md: Portability + Repo-wart + work-queue #2 + shipping note updated;
commit-SHA references repointed to post-rewrite SHAs (367d986->8d839e3).
- EVALUATION.md: P0 owner-data finding marked resolved; cleaned shorthand
IP-octet fragments (/.87, /11) left by the placeholder substitution.
Triaged from a full independent evaluation (EVALUATION.md). Addresses the
three P0/P1 code findings; the proxy/data APIs that downstream apps consume
are deliberately untouched.
- ssh command injection (P0): new shellsafe.py validates + shlex.quotes every
user-supplied value crossing into an SSH command on the Sparks (model repo,
vllm args/knobs, NIM image/container/volume/port/env, service names).
Boundary validation on POST /api/models and POST /api/nim/install; quoting at
every sink in models/download/nim/services. NGC key now quoted too.
- qdrant path injection (P1): /api/search validates the collection name against
a metacharacter-free whitelist and URL-encodes the path segment.
- csrf (P1): csrf_guard middleware enforces same-origin on state-changing
control endpoints; /v1/*, /scrub, /rehydrate, /api/search, /api/audio/* and
/api/health-event are exempt so external consumers are unaffected.
Verified: injection survives only as a single quoted token, vLLM preflight
shlex.split round-trip intact, CSRF behaviors covered via TestClient, both
offline redaction suites still pass, tsc clean, s9pk rebuilt.
Replace real cluster IPs/hosts/usernames and example names with neutral
placeholders across docs, ops notes, package install text, and the offline
redaction test; delete the obsolete build-time starter prompt. Closes the
portability audit's single blocker. No runtime behavior change.
Rename CLAUDE.md -> AGENTS.md (cross-vendor standard) with a relative
CLAUDE.md symlink so Claude Code still loads it. Move each .claude/rules
file into docs/guides/ (paths: frontmatter preserved) and replace the
rules file with a relative symlink into the guide. Repoint the AGENTS.md
index paragraph at docs/guides/ so non-Claude agents find the guides.