From 26070eb1914b6f53f11680125e5e1f0757770ae5 Mon Sep 17 00:00:00 2001
From: Keysat
Date: Wed, 17 Jun 2026 23:03:33 -0500
Subject: [PATCH] v0.24.0:0 - configurable cluster topology (vllm container name, hide services, second-vllm monitor)

Make the cluster topology configurable so an adopter wired differently (vLLM on both Sparks, port 8000, different container name, no Parakeet) can monitor without forking. Covers the OpenClaw report P4/P5/#6.

- VLLM_CONTAINER override (default vllm_node), validated at the boundary and quote_arg-quoted into the swap log-tail + pre-flight validator exec.
- DISABLED_SERVICES list: hidden services show no tile and are skipped by status/deep-health/connectivity probes (kills the Parakeet-on-8000 collision).
- kind: vllm custom service monitors a second Spark's vLLM via the shared probe_vllm_endpoint; /api/endpoints gains a disabled flag.

Swap mechanism intentionally not generalized to raw docker run (that's coordination, roadmap item 4).
---
 AGENTS.md                                  |   4 +-
 ROADMAP.md                                 |   2 +-
 image/app/config.py                        |  42 ++++++
 image/app/custom_services.py               |  11 ++
 image/app/deep_health.py                   |   4 +
 image/app/health.py                        |  43 +++++--
 image/app/server.py                        |  28 ++--
 image/app/services.py                      |  15 ++-
 image/app/static/app.js                    |   4 +
 image/app/swap.py                          |   3 +-
 image/app/validate.py                      |   3 +-
 image/tests/test_topology.py               | 120 ++++++++++++++++++
 package/startos/actions/configureSparks.ts |  18 +++
 .../startos/fileModels/sparkConfig.yaml.ts |   5 +
 package/startos/main.ts                    |   4 +
 package/startos/versions/v0_1_0.ts         |   4 +-
 runbook.md                                 |  20 +++
 17 files changed, 304 insertions(+), 26 deletions(-)
 create mode 100644 image/tests/test_topology.py

diff --git a/AGENTS.md b/AGENTS.md
index c3287d8..bde7d87 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -57,10 +57,10 @@ Subsystem guidance lives in `docs/guides/` and loads when matching files are tou
 - **Live service runs v0.22.0:0** (installed and serving); **v0.23.0:0 is built, committed (`e783653`), tagged, and published to Gitea Releases but its
live install is PENDING** — see the P3 line below. Working features: swap dashboard; chat / transcribe / diarize(+chunk) / TTS proxies; embeddings + rerank + hybrid search (Qdrant); `/scrub` + `/rehydrate`; label-merge incl. dual-channel; per-Spark SSH-key copy + WireGuard `VPN ` hardware-card badge; configurable vLLM port (Configure Sparks field, blank ⇒ 8888). Local/fine-tuned model support lands live once v0.23.0:0 installs. Spark 2 audio stack healthy. Security hardening (v0.19.0:0 — shellsafe SSH-injection guard, Qdrant path-injection, same-origin CSRF guard) shipped and stable; evidence in `EVALUATION.md`. - **matrix-bridge bot tile (done, v0.21.0:1, verified live):** `bot`-kind service tile — status badge from docker-state only (no HTTP port), plus **Update** / Restart / Stop/Start / **View logs**. Code: `app/matrix_bridge.py` + `/api/matrix-bridge/{update,logs}` (update streams; 25-min cap; fail-loud). Driven directly as `modelo` on Spark 2 (**no `sudo -iu`** — spark2 has no passwordless sudo). User is a blank-default Configure-Sparks field (`matrix_bridge_user`); blank → tile hidden (portable). Host reuses `spark2_host` (`192.168.1.87` = the bot's box `spark-32d0`); container/dir/branch are env-overridable defaults. **Load-bearing ops dep:** Update's `git fetch` runs as `modelo`, which needs `modelo`'s `~/.ssh/config` pinning the Gitea deploy key with `IdentitiesOnly yes` — else the wrong key is offered and Gitea denies (publickey). Optional next, only if the bot dev asks: Docker `HEALTHCHECK` for running-but-disconnected detection (spec §Note). -- **Tests:** offline pytest harness in `image/tests/` — `cd image && .venv/bin/python -m pytest` (70 passing). Covers `build_launch_command` (incl. the shell-injection round-trip), the transcript↔diarizer label-merge, the `shellsafe` validators, and `matrix_bridge.build_update_command` (+ phase detection). Mock-heavy swap/proxy tests deliberately skipped (low ROI). 
Redaction + live-audio suites remain standalone scripts. +- **Tests:** offline pytest harness in `image/tests/` — `cd image && .venv/bin/python -m pytest` (102 passing). Covers `build_launch_command` (incl. the shell-injection round-trip + local-model bind-mount), the transcript↔diarizer label-merge, the `shellsafe` validators, `matrix_bridge.build_update_command` (+ phase detection), and the configurable-topology layer (`test_topology.py`: `DISABLED_SERVICES` parsing, `vllm_container` override, disabled-service skip in `services_from_settings` + `check_*`, `probe_vllm_endpoint`). Mock-heavy swap/proxy tests deliberately skipped (low ROI). Redaction + live-audio suites remain standalone scripts. - **Signal Engine "flakiness":** diagnosed as *not* a server bug — transient 1–4s unresponsiveness while the single GPU is busy. Client-side remedy (in-flight cap 2 / ceiling 3 / retry-on-timeout+503) drafted and **forwarded to that dev (owner confirmed 2026-06-15)**. Awaiting whether they want the measured concurrency knee. - **Stance (decided, not built):** no public interface / no API-token auth — LAN + WireGuard/Tailscale split-tunnel only; the CSRF guard covers the browser-driven vector. - **Known limits:** `/health` blips while the GPU is busy (mitigated client-side); dual-channel can miss a quiet local word under loud remote bleed; connectivity log misses sub-5s outages between 5s polls; diarizer caps at 4 speakers; matrix-bridge badge won't visibly flip on a fast `docker restart` (status re-checked only after the command returns). - **Infra gotcha (safety):** passwordless sudo is NOT configured on spark2 — design unprivileged probes for any Spark feature (the badge uses `ip`, not `sudo wg show`). spark2 sits on the `starttunnel` WireGuard subnet (`10.59.211.6/24`, survives reboot). Owner declined SSH-key rotation after the 2026-06-12 history scrub (only the key *name* leaked) — don't re-flag. 
- **Hosting:** self-hosted Gitea — remote `gitea`, branch `master`, over SSH; push after committing. (Wart: commit `8d839e3` is mislabeled `v0.13.0:4` but contains through v0.18.0:0.) -- **Next — committed 2026-06-17: OpenClaw/Johnny-5 coexistence epic (full plan + design stance in `ROADMAP.md` → "Cluster coordination").** Stance: Spark Control = control plane / GPU arbiter, **not** a job runner; business cron jobs live in separate services that *call* its swap API (swaps are already API-driven via `POST /api/swap`). Sequence: (1) **configurable `VLLM_PORT`** — SHIPPED **v0.22.0:0** (Configure-Sparks field, blank ⇒ 8888; + `_env_int` hardening in `config.py` so a blank/bad port no longer crashes startup, killing a P3 tech-debt item). Committed `136a471`, pushed, tagged `v0.22.0`, rebuilt clean, installed, and **published to the self-hosted Gitea Releases** 2026-06-17 (`make release` → `scripts/gitea-release.sh`, takes `GITEA_URL` + a write token). **Distribution model (decided 2026-06-17):** Gitea Releases + a read-only token the adopter's agent uses to pull the latest s9pk (`GET /api/v1/repos/grant/spark-control/releases/latest` → download the `.s9pk` asset → sideload). Note: Gitea returns `browser_download_url` on its `.local` ROOT_URL, which won't resolve off-LAN — a remote adopter pulls via whatever address reaches the Gitea (the WireGuard IP). (2) **local-path/fine-tuned models** — DONE in tree, staged as **v0.23.0:0** (`ModelDef.local_path` + exactly-one-source validator; swap bind-mounts the dir at the same container path via the launch script's `VLLM_SPARK_EXTRA_DOCKER_ARGS` hook, **no `launch-cluster.sh` change**; "+ Add local model" UI form + `local` badge; `validate_local_path`; disk-delete refused for local; 94 tests pass. 
Reviewer-agent pass done, findings addressed (path validation + chat-template-location guard folded into the `ModelDef` validator so YAML/override entries are checked too; `_merge_overrides` skips a bad entry instead of failing the whole catalog; `VLLM_SPARK_EXTRA_DOCKER_ARGS` contract documented in `runbook.md`). **Committed `e783653`, tagged `v0.23.0`, built clean, published to Gitea Releases — but `make install` to the live Start9 FAILED: `immense-voyage.local` wasn't resolving via mDNS from the Mac (server up at `192.168.1.72`; `start-cli -H ` reaches it but returns UNAUTHORIZED, auth bound to the registered `.local` host). FINISH-HERE: flush mDNS (`sudo dscacheutil -flushcache && sudo killall -HUP mDNSResponder`) or add a hosts entry, then re-run `cd package && make install`** (details in runbook → "Sideload can't reach the server"). Next: (3) configurable topology (service→Spark→port map + container names); (4) coordination layer (swap lock + swap webhook + schedule visibility) — only when our own automation lands. Still-open older threads: audio concurrency sweep (only if the Signal Engine dev wants the knee; needs a quiet window); optional matrix-bridge Docker `HEALTHCHECK` if the bot dev asks; Parakeet long-audio guard deferred (rationale in ROADMAP). +- **Next — committed 2026-06-17: OpenClaw/Johnny-5 coexistence epic (full plan + design stance in `ROADMAP.md` → "Cluster coordination").** Stance: Spark Control = control plane / GPU arbiter, **not** a job runner; business cron jobs live in separate services that *call* its swap API (swaps are already API-driven via `POST /api/swap`). Sequence: (1) **configurable `VLLM_PORT`** — SHIPPED **v0.22.0:0** (Configure-Sparks field, blank ⇒ 8888; + `_env_int` hardening in `config.py` so a blank/bad port no longer crashes startup, killing a P3 tech-debt item). 
Committed `136a471`, pushed, tagged `v0.22.0`, rebuilt clean, installed, and **published to the self-hosted Gitea Releases** 2026-06-17 (`make release` → `scripts/gitea-release.sh`, takes `GITEA_URL` + a write token). **Distribution model (decided 2026-06-17):** Gitea Releases + a read-only token the adopter's agent uses to pull the latest s9pk (`GET /api/v1/repos/grant/spark-control/releases/latest` → download the `.s9pk` asset → sideload). Note: Gitea returns `browser_download_url` on its `.local` ROOT_URL, which won't resolve off-LAN — a remote adopter pulls via whatever address reaches the Gitea (the WireGuard IP). (2) **local-path/fine-tuned models** — DONE in tree, staged as **v0.23.0:0** (`ModelDef.local_path` + exactly-one-source validator; swap bind-mounts the dir at the same container path via the launch script's `VLLM_SPARK_EXTRA_DOCKER_ARGS` hook, **no `launch-cluster.sh` change**; "+ Add local model" UI form + `local` badge; `validate_local_path`; disk-delete refused for local; 94 tests pass. Reviewer-agent pass done, findings addressed (path validation + chat-template-location guard folded into the `ModelDef` validator so YAML/override entries are checked too; `_merge_overrides` skips a bad entry instead of failing the whole catalog; `VLLM_SPARK_EXTRA_DOCKER_ARGS` contract documented in `runbook.md`). **Committed `e783653`, tagged `v0.23.0`, built clean, published to Gitea Releases — but `make install` to the live Start9 FAILED: `immense-voyage.local` wasn't resolving via mDNS from the Mac (server up at `192.168.1.72`; `start-cli -H ` reaches it but returns UNAUTHORIZED, auth bound to the registered `.local` host). FINISH-HERE: flush mDNS (`sudo dscacheutil -flushcache && sudo killall -HUP mDNSResponder`) or add a hosts entry, then re-run `cd package && make install`** (details in runbook → "Sideload can't reach the server"). (3) **configurable topology** — DONE in tree, staged as **v0.24.0:0** (built clean, not yet committed/installed). 
Three optional Configure-Sparks knobs: vLLM container name (`VLLM_CONTAINER`, blank ⇒ `vllm_node`, threaded into the swap log-tail + validator exec via `quote_arg`); "services to hide" (`DISABLED_SERVICES` comma list → `Settings.disabled_services` frozenset, skipped by `services_from_settings`, the `check_*` probes, deep-health `run_all`, and connectivity logging — kills the Parakeet-on-8000 collision); second-Spark vLLM monitor via a `kind: vllm` custom service in `services-overrides.yaml` (`probe_vllm_endpoint` shared with `check_vllm`). `/api/endpoints` gained a `disabled` flag; the health-dot hides when disabled. 102 tests pass (+8 in `test_topology.py`). Swap mechanism deliberately NOT generalized to raw `docker run` (that's coordination, item 4). Install pending — same mDNS situation as v0.23.0. Next: (4) coordination layer (swap lock + swap webhook + schedule visibility) — only when our own automation lands. Still-open older threads: audio concurrency sweep (only if the Signal Engine dev wants the knee; needs a quiet window); optional matrix-bridge Docker `HEALTHCHECK` if the bot dev asks; Parakeet long-audio guard deferred (rationale in ROADMAP). diff --git a/ROADMAP.md b/ROADMAP.md index 476c517..6a234f3 100644 --- a/ROADMAP.md +++ b/ROADMAP.md @@ -11,7 +11,7 @@ Driven by the one other Spark Control adopter (a colleague running OpenClaw + cr Sequenced: 1. **Configurable `VLLM_PORT`** — DONE, v0.22.0:0. Field in Configure Sparks (blank ⇒ 8888); numeric-setting parsing hardened so a blank/bad value falls back instead of crashing startup. Was the immediate "vLLM unreachable" bug for an adopter on port 8000. 2. **Local-path / fine-tuned model support** — DONE, v0.23.0:0. 
Catalog/`ModelDef` gained `local_path` (exactly one of `repo`/`local_path`); swap bind-mounts the dir into the vLLM container at the same path via the launch script's `VLLM_SPARK_EXTRA_DOCKER_ARGS` hook (no `launch-cluster.sh` change); "+ Add local model" form + `local` badge; disk-delete refused for local models; `validate_local_path` boundary check. His merged `ten31-v2` was the motivating case. -3. **Configurable topology** — make the service→Spark→port map and container names configurable so the package stops assuming our exact layout. Lets an adopter monitor vLLM on *both* Sparks, use a different container name, and stop the Parakeet probe from hitting a vLLM that shares its port — without forking. (Covers report P4 multi-Spark vLLM, P5 container name, and the Parakeet-port collision #6.) +3. **Configurable topology** — DONE, v0.24.0:0. Three optional Configure-Sparks knobs: vLLM container name (`VLLM_CONTAINER`, blank ⇒ `vllm_node`; threaded through the swap log-tail + pre-flight validator via `quote_arg`); "services to hide" (`DISABLED_SERVICES`, comma list — hidden services show no tile and are skipped by status/deep-health/connectivity probes, killing the Parakeet-on-8000 collision); and a second-Spark vLLM monitor via a `kind: vllm` custom service in `services-overrides.yaml` (read-only tile probed through the shared `probe_vllm_endpoint`). `/api/endpoints` gained a `disabled` flag. Covers report P4/P5/#6. (Generalizing the *swap* mechanism to the adopter's raw `docker run` was deliberately left out — that's coordination, item 4; he swaps via his own crons and uses Spark Control to monitor.) 4. **Coordination layer** — build when our own automation actually lands (zero value until something other than the dashboard swaps models): - **Swap lock** with holder + TTL (`POST` / `GET` / `DELETE /api/swap/lock`). An external scheduler acquires it before swapping; the dashboard then refuses manual swaps and shows who holds the GPU and until when. 
Enforced by the swap path, not advisory.
    - **Swap-event webhook** (`swap_complete` / `swap_failed`) to a configurable URL, so downstream consumers update their provider config when the running model changes.
diff --git a/image/app/config.py b/image/app/config.py
index e0d50aa..75107c4 100644
--- a/image/app/config.py
+++ b/image/app/config.py
@@ -1,13 +1,44 @@
 from __future__ import annotations
+import logging
 import os
 from dataclasses import dataclass
 from pathlib import Path
 
+from .shellsafe import validate_container
+
+log = logging.getLogger(__name__)
+
 
 def _env(name: str, default: str = "") -> str:
     return os.environ.get(name, default)
 
 
+def _env_container(name: str, default: str) -> str:
+    """Resolve a container-name env var, validating it at the config boundary.
+
+    The value flows into `docker logs`/`docker exec` over SSH, so it's quoted at
+    the sink — but per the repo's two-layer convention it's also whitelist-checked
+    here. A malformed optional value falls back to `default` rather than crashing
+    daemon startup (mirrors `_env_int` for VLLM_PORT)."""
+    val = os.environ.get(name, "") or default
+    try:
+        return validate_container(val)
+    except ValueError:
+        log.warning("ignoring invalid %s=%r; using %r", name, val, default)
+        return default
+
+
+def _env_set(name: str) -> frozenset[str]:
+    """Parse a comma-separated env var into a lowercased frozenset of keys.
+
+    Used by DISABLED_SERVICES so an adopter whose cluster doesn't run a given
+    support service can switch its tile + probes off entirely (rather than have
+    the probe hit whatever else listens on that port — e.g. a vLLM sharing
+    Parakeet's default 8000)."""
+    raw = os.environ.get(name, "")
+    return frozenset(part.strip().lower() for part in raw.split(",") if part.strip())
+
+
 def _env_int(name: str, default: int) -> int:
     """Parse an int env var, falling back to `default` when unset, blank, or malformed.
     The StartOS Configure panel passes optional numeric fields as an
@@ -63,6 +94,8 @@ class Settings:
     ssh_known_hosts: str
     models_yaml: str
     vllm_port: int
+    vllm_container: str
+    disabled_services: frozenset[str]
     parakeet_port: int
     kokoro_port: int
     embed_port: int
@@ -116,6 +149,15 @@ class Settings:
             ssh_known_hosts=_env("SSH_KNOWN_HOSTS"),
             models_yaml=_resolve_models_yaml(),
             vllm_port=_env_int("VLLM_PORT", 8888),
+            # Container name for the swappable vLLM on Spark 1. Defaults to the
+            # bundled launch-cluster.sh container; override if you named yours
+            # something else (the swap log-tail and pre-flight validator exec
+            # into it by name).
+            vllm_container=_env_container("VLLM_CONTAINER", "vllm_node"),
+            # Built-in support-service keys (parakeet, kokoro, embeddings,
+            # qdrant) the deployment doesn't run — hidden from the dashboard and
+            # never probed.
+            disabled_services=_env_set("DISABLED_SERVICES"),
             parakeet_port=_env_int("PARAKEET_PORT", 8000),
             kokoro_port=_env_int("KOKORO_PORT", 8880),
             embed_port=_env_int("EMBED_PORT", 8088),
diff --git a/image/app/custom_services.py b/image/app/custom_services.py
index 3537ef8..18f88a9 100644
--- a/image/app/custom_services.py
+++ b/image/app/custom_services.py
@@ -10,6 +10,17 @@ Format:
       port: 8001
       health_path: /health
       image: nvcr.io/nim/nvidia/riva-multilingual:latest
+
+A `kind: vllm` entry monitors an additional vLLM on another Spark (read-only —
+the swap machinery only drives the primary Spark 1 vLLM). It gets a health tile
+probed via /v1/models plus container state and start/stop/restart:
+  custom:
+    - key: vllm-spark2
+      kind: vllm
+      host:
+      user:
+      container: vllm_node
+      port: 8000
 """
 from __future__ import annotations
 import os
diff --git a/image/app/deep_health.py b/image/app/deep_health.py
index bc15ef8..769d1ea 100644
--- a/image/app/deep_health.py
+++ b/image/app/deep_health.py
@@ -377,6 +377,10 @@ class DeepHealth:
     async def run_all(self) -> dict[str, ProbeResult]:
         results = {}
         for name in self.PROBES:
+            # Don't deep-probe a service the deployment switched off — its port
+            # may be answered by something else (e.g. a vLLM on Parakeet's 8000).
+            if name in self.settings.disabled_services:
+                continue
             results[name] = await self.run_one(name)
         return results
 
diff --git a/image/app/health.py b/image/app/health.py
index 1ddeb12..2ee6d89 100644
--- a/image/app/health.py
+++ b/image/app/health.py
@@ -6,17 +6,28 @@ from .config import Settings
 _TIMEOUT = 3.0
 
 
-async def check_vllm(settings: Settings) -> dict:
-    base_url = (
-        f"http://{settings.spark1_host}:{settings.vllm_port}/v1"
-        if settings.spark1_host
-        else None
-    )
-    if not settings.spark1_host:
-        return {"ok": False, "error": "spark1 not configured", "base_url": base_url}
+def _disabled(settings: Settings, key: str) -> dict | None:
+    """A clean 'disabled' verdict if `key` is in DISABLED_SERVICES, else None.
+
+    Lets an adopter who doesn't run a given support service switch its probe off
+    entirely — so the probe never hits whatever else listens on that port, and
+    the connectivity log doesn't record it as perpetually down."""
+    if key in settings.disabled_services:
+        return {"ok": False, "disabled": True, "error": "disabled", "base_url": None}
+    return None
+
+
+async def probe_vllm_endpoint(host: str, port: int) -> dict:
+    """Probe any OpenAI-compatible vLLM at host:port via /v1/models.
+
+    Shared by the primary (Spark 1) health check and any extra vLLM registered
+    as a custom service (kind: vllm) to monitor a second Spark."""
+    base_url = f"http://{host}:{port}/v1" if host else None
+    if not host:
+        return {"ok": False, "error": "vllm host not configured", "base_url": base_url}
     try:
         async with httpx.AsyncClient(timeout=_TIMEOUT) as c:
-            r = await c.get(f"http://{settings.spark1_host}:{settings.vllm_port}/v1/models")
+            r = await c.get(f"http://{host}:{port}/v1/models")
             r.raise_for_status()
             ids = [m["id"] for m in r.json().get("data", [])]
             return {
@@ -29,7 +40,15 @@ async def check_vllm(settings: Settings) -> dict:
         return {"ok": False, "error": str(e), "base_url": base_url}
 
 
+async def check_vllm(settings: Settings) -> dict:
+    if not settings.spark1_host:
+        return {"ok": False, "error": "spark1 not configured", "base_url": None}
+    return await probe_vllm_endpoint(settings.spark1_host, settings.vllm_port)
+
+
 async def check_parakeet(settings: Settings) -> dict:
+    if d := _disabled(settings, "parakeet"):
+        return d
     base_url = (
         f"http://{settings.parakeet_host}:{settings.parakeet_port}"
         if settings.parakeet_host
@@ -47,6 +66,8 @@ async def check_parakeet(settings: Settings) -> dict:
 
 
 async def check_kokoro(settings: Settings) -> dict:
+    if d := _disabled(settings, "kokoro"):
+        return d
     base_url = (
         f"http://{settings.kokoro_host}:{settings.kokoro_port}"
         if settings.kokoro_host
@@ -68,6 +89,8 @@ async def check_kokoro(settings: Settings) -> dict:
 
 
 async def check_embeddings(settings: Settings) -> dict:
+    if d := _disabled(settings, "embeddings"):
+        return d
     base_url = (
         f"http://{settings.embed_host}:{settings.embed_port}"
         if settings.embed_host
@@ -89,6 +112,8 @@ async def check_embeddings(settings: Settings) -> dict:
 
 
 async def check_qdrant(settings: Settings) -> dict:
+    if d := _disabled(settings, "qdrant"):
+        return d
     base_url = (
         f"http://{settings.qdrant_host}:{settings.qdrant_port}"
         if settings.qdrant_host
diff --git a/image/app/server.py b/image/app/server.py
index 93ff0a5..e8249ea 100644
--- a/image/app/server.py
+++ b/image/app/server.py
@@ -20,7 +20,7 @@ from .llm_proxy import build_router as build_llm_router
 from .embeddings_proxy import build_router as build_embeddings_router
 from .redaction_gateway import build_router as build_redaction_router, MapStore
 from .hardware import HardwareProbe
-from .health import check_kokoro, check_parakeet, check_vllm, check_embeddings, check_qdrant
+from .health import check_kokoro, check_parakeet, check_vllm, check_embeddings, check_qdrant, probe_vllm_endpoint
 from .matrix_bridge import MatrixBridgeManager
 from .models import ModelDef, load_catalog
 from .nim import SUGGESTED_NIMS, CATALOG_URL, NimManager
@@ -500,6 +500,10 @@ async def get_services() -> dict:
             http = await check_embeddings(settings)
         elif name == "qdrant":
             http = await check_qdrant(settings)
+        elif svc.kind == "vllm":
+            # An extra vLLM monitored on another Spark (registered as a custom
+            # service). Probe its own host/port, not the primary Spark 1 one.
+            http = await probe_vllm_endpoint(svc.host, svc.port)
         elif svc.kind == "bot":
             # No HTTP health endpoint (host networking, no port) — judged purely
             # by docker state. http_ready stays None so the badge isn't pinned
@@ -521,7 +525,7 @@ async def get_services() -> dict:
             # Prefer the check fn's own top-level model key (embeddings reports
             # it there); fall back to a model field inside detail for services
             # whose /health embeds it (parakeet).
-            "model": http.get("model") or ((http.get("detail") or {}).get("model") if isinstance(http.get("detail"), dict) else None),
+            "model": http.get("model") or http.get("current_model") or ((http.get("detail") or {}).get("model") if isinstance(http.get("detail"), dict) else None),
             "docker_state": docker.get("state"),
             "restart_count": docker.get("restart_count"),
             "started_at": docker.get("started_at"),
@@ -799,17 +803,20 @@ async def get_endpoints() -> dict:
             "base_url": vllm.get("base_url"),
             "model": vllm.get("current_model"),
             "openai_compat": True,
+            "disabled": bool(vllm.get("disabled")),
         },
         "parakeet": {
             "ready": bool(parakeet.get("ok")),
             "base_url": parakeet.get("base_url"),
             "kind": "stt",
             "model": (parakeet.get("detail") or {}).get("model") if isinstance(parakeet.get("detail"), dict) else None,
+            "disabled": bool(parakeet.get("disabled")),
         },
         "kokoro": {
             "ready": bool(kokoro.get("ok")),
             "base_url": kokoro.get("base_url"),
             "kind": "tts",
+            "disabled": bool(kokoro.get("disabled")),
         },
         "embeddings": {
             "ready": bool(embeddings.get("ok")),
@@ -818,12 +825,14 @@ async def get_endpoints() -> dict:
             "model": embeddings.get("model"),
             # The proxied OpenAI-compatible endpoints live on Spark Control itself.
             "openai_endpoints": ["/v1/embeddings", "/v1/rerank", "/api/search"],
+            "disabled": bool(embeddings.get("disabled")),
         },
         "qdrant": {
             "ready": bool(qdrant.get("ok")),
             "base_url": qdrant.get("base_url"),
             "kind": "vectordb",
             "collection": settings.qdrant_collection or None,
+            "disabled": bool(qdrant.get("disabled")),
         },
     }
 
@@ -837,12 +846,15 @@ async def get_status() -> dict:
         check_embeddings(settings),
         check_qdrant(settings),
     )
-    # Feed health into the connectivity log (deduped — only logs on transition)
-    record_state("vllm", bool(vllm.get("ok")))
-    record_state("parakeet", bool(parakeet.get("ok")))
-    record_state("kokoro", bool(kokoro.get("ok")))
-    record_state("embeddings", bool(embeddings.get("ok")))
-    record_state("qdrant", bool(qdrant.get("ok")))
+    # Feed health into the connectivity log (deduped — only logs on transition).
+    # Skip services switched off via DISABLED_SERVICES — they'd otherwise log as
+    # perpetually down.
+    for _name, _r in (
+        ("vllm", vllm), ("parakeet", parakeet), ("kokoro", kokoro),
+        ("embeddings", embeddings), ("qdrant", qdrant),
+    ):
+        if not _r.get("disabled"):
+            record_state(_name, bool(_r.get("ok")))
     current_key = _identify_current_model(vllm.get("current_model"))
     return {
         "configured": settings.configured,
diff --git a/image/app/services.py b/image/app/services.py
index 2c9b71b..01795bb 100644
--- a/image/app/services.py
+++ b/image/app/services.py
@@ -5,6 +5,7 @@ machinery. We just run `docker start|stop|restart ` via SSH on the appropriate
 host.
 """
 from __future__ import annotations
+import logging
 import time
 from dataclasses import dataclass
 from typing import Literal, Optional
@@ -13,6 +14,8 @@ from .config import Settings
 from .shellsafe import quote_arg
 from .ssh import ssh_run
 
+log = logging.getLogger(__name__)
+
 
 # Cache the "unreachable" verdict per (host, user) for a short period so that a
 # repeated docker_state call doesn't re-pay the 6 s SSH connect timeout each time.
@@ -103,7 +106,13 @@ def services_from_settings(s: Settings) -> dict[str, ServiceDef]:
     }
     for entry in load_custom_services():
         key = entry.get("key")
-        if not key or key in out:
+        if not key:
+            continue
+        if key in out:
+            # A custom entry can't shadow a built-in (parakeet/kokoro/…); warn so
+            # an adopter who picked a colliding key for, say, a second vLLM sees
+            # why no tile appeared instead of a silent no-op.
+            log.warning("custom service %r collides with a built-in name; ignoring", key)
             continue
         out[key] = ServiceDef(
             name=key,
@@ -113,7 +122,9 @@ def services_from_settings(s: Settings) -> dict[str, ServiceDef]:
             container=entry.get("container", key),
             port=int(entry.get("port", 0)),
         )
-    return out
+    # Drop services the deployment has switched off (DISABLED_SERVICES) so they
+    # show no tile and are never probed/auto-restarted.
+    return {k: v for k, v in out.items() if k not in s.disabled_services}
 
 
 async def docker_state(settings: Settings, svc: ServiceDef) -> dict:
diff --git a/image/app/static/app.js b/image/app/static/app.js
index ff96c14..7ea1778 100644
--- a/image/app/static/app.js
+++ b/image/app/static/app.js
@@ -932,6 +932,10 @@ function renderHealth(status) {
   function setDot(id, ok, payload) {
     const item = el(id);
     if (!item) return;
+    // A service switched off via DISABLED_SERVICES isn't part of this
+    // deployment — hide its indicator entirely rather than show it as down.
+    if (payload && payload.disabled) { item.classList.add('hidden'); return; }
+    item.classList.remove('hidden');
     const dot = item.querySelector('.dot');
     dot.classList.remove('ok', 'bad', 'warn');
     if (ok === true) dot.classList.add('ok');
diff --git a/image/app/swap.py b/image/app/swap.py
index 07d400a..49a1bc1 100644
--- a/image/app/swap.py
+++ b/image/app/swap.py
@@ -7,6 +7,7 @@ from typing import Optional
 
 from .config import Settings
 from .models import Catalog, build_launch_command
+from .shellsafe import quote_arg
 from .ssh import ssh_run, ssh_stream, StreamHandle
 
 
@@ -112,7 +113,7 @@ class SwapManager:
 
         # Step 3: tail logs until the ready marker (or timeout)
         job.state = "tailing"
-        tail_cmd = "docker logs -f --tail 50 vllm_node"
+        tail_cmd = f"docker logs -f --tail 50 {quote_arg(s.vllm_container)}"
         job.append(f"$ {tail_cmd}")
         timeout = max(model.expected_ready_seconds * 2, 600)
         handle = StreamHandle()
diff --git a/image/app/validate.py b/image/app/validate.py
index 983e267..548c81f 100644
--- a/image/app/validate.py
+++ b/image/app/validate.py
@@ -22,6 +22,7 @@ from typing import Any
 
 from .config import Settings
 from .models import Catalog, build_launch_command
+from .shellsafe import quote_arg
 from .ssh import ssh_run
 
 
@@ -114,7 +115,7 @@ async def validate_launch(key: str, catalog: Catalog, settings: Settings) -> dic
     # Pipe the JSON args list to a here-doc Python invocation. The validator
     # reads from stdin to avoid shell-escaping the args themselves.
     cmd = (
-        f"echo '{payload}' | docker exec -i vllm_node python3 -c "
+        f"echo '{payload}' | docker exec -i {quote_arg(settings.vllm_container)} python3 -c "
         + shlex.quote(_VALIDATOR_SCRIPT)
     )
 
diff --git a/image/tests/test_topology.py b/image/tests/test_topology.py
new file mode 100644
index 0000000..3e978ba
--- /dev/null
+++ b/image/tests/test_topology.py
@@ -0,0 +1,120 @@
+"""Configurable topology: DISABLED_SERVICES, vLLM container override, and the
+extra-vLLM probe. All offline — the disabled checks short-circuit before any
+network call, and the probes are exercised only on the not-configured path.
+"""
+import asyncio
+
+from app.config import Settings
+from app.health import (
+    check_embeddings,
+    check_kokoro,
+    check_parakeet,
+    check_qdrant,
+    check_vllm,
+    probe_vllm_endpoint,
+)
+from app.services import services_from_settings
+
+
+def _settings(monkeypatch, **env) -> Settings:
+    # Pin the topology env vars under test; default the rest to blank so a stray
+    # value in the real environment can't leak into the assertion.
+    keys = [
+        "SPARK1_HOST", "SPARK1_USER", "SPARK2_HOST", "SPARK2_USER",
+        "DISABLED_SERVICES", "VLLM_CONTAINER",
+    ]
+    for k in keys:
+        monkeypatch.delenv(k, raising=False)
+    for k, v in env.items():
+        monkeypatch.setenv(k, v)
+    return Settings.from_env()
+
+
+# ---- DISABLED_SERVICES parsing ----
+
+def test_disabled_services_parsed_lowercased_and_trimmed(monkeypatch):
+    s = _settings(monkeypatch, DISABLED_SERVICES="parakeet, Kokoro ,,")
+    assert s.disabled_services == frozenset({"parakeet", "kokoro"})
+
+
+def test_disabled_services_blank_is_empty(monkeypatch):
+    assert _settings(monkeypatch).disabled_services == frozenset()
+
+
+# ---- vLLM container override ----
+
+def test_vllm_container_defaults_to_vllm_node(monkeypatch):
+    assert _settings(monkeypatch).vllm_container == "vllm_node"
+
+
+def test_vllm_container_override(monkeypatch):
+    assert _settings(monkeypatch, VLLM_CONTAINER="vllm-gemma4").vllm_container == "vllm-gemma4"
+
+
+def test_vllm_container_invalid_falls_back(monkeypatch):
+    # A malformed value (space / shell metachar) is rejected at the boundary and
+    # falls back to the default rather than crashing startup or reaching a sink.
+ assert _settings(monkeypatch, VLLM_CONTAINER="bad name; rm -rf").vllm_container == "vllm_node" + + +# ---- services map honors the disable list ---- + +def test_services_from_settings_drops_disabled(monkeypatch): + s = _settings( + monkeypatch, + SPARK1_HOST="10.0.0.1", SPARK1_USER="u", + SPARK2_HOST="10.0.0.2", SPARK2_USER="u", + DISABLED_SERVICES="parakeet,qdrant", + ) + svcs = services_from_settings(s) + assert "parakeet" not in svcs and "qdrant" not in svcs + assert "kokoro" in svcs and "embeddings" in svcs + + +def test_custom_vllm_service_registered(monkeypatch): + from app import custom_services + monkeypatch.setattr(custom_services, "load_custom_services", lambda: [ + {"key": "vllm-spark2", "kind": "vllm", "host": "10.0.0.2", + "user": "u", "container": "vllm_node", "port": 8000}, + ]) + s = _settings(monkeypatch, SPARK1_HOST="10.0.0.1", SPARK1_USER="u", + SPARK2_HOST="10.0.0.2", SPARK2_USER="u") + svc = services_from_settings(s)["vllm-spark2"] + assert svc.kind == "vllm" and svc.port == 8000 and svc.container == "vllm_node" + + +def test_custom_service_colliding_with_builtin_is_ignored(monkeypatch): + # A custom entry can't shadow a built-in key — the built-in wins. 
+ from app import custom_services + monkeypatch.setattr(custom_services, "load_custom_services", lambda: [ + {"key": "parakeet", "kind": "vllm", "host": "10.0.0.9", "user": "u", "port": 8000}, + ]) + s = _settings(monkeypatch, SPARK1_HOST="10.0.0.1", SPARK1_USER="u", + SPARK2_HOST="10.0.0.2", SPARK2_USER="u") + assert services_from_settings(s)["parakeet"].kind == "stt" + + +# ---- disabled health checks short-circuit (no network) ---- + +def test_disabled_check_returns_disabled_verdict(monkeypatch): + s = _settings( + monkeypatch, + SPARK2_HOST="10.0.0.2", SPARK2_USER="u", # host set, but disable wins + DISABLED_SERVICES="parakeet,kokoro,embeddings,qdrant", + ) + for check in (check_parakeet, check_kokoro, check_embeddings, check_qdrant): + r = asyncio.run(check(s)) + assert r == {"ok": False, "disabled": True, "error": "disabled", "base_url": None} + + +# ---- vLLM probe: not-configured path is pure ---- + +def test_probe_vllm_endpoint_unconfigured(monkeypatch): + r = asyncio.run(probe_vllm_endpoint("", 8000)) + assert r["ok"] is False and "not configured" in r["error"] + + +def test_check_vllm_unconfigured_without_spark1(monkeypatch): + s = _settings(monkeypatch) # no SPARK1_HOST + r = asyncio.run(check_vllm(s)) + assert r["ok"] is False and "spark1 not configured" in r["error"] diff --git a/package/startos/actions/configureSparks.ts b/package/startos/actions/configureSparks.ts index abd8168..64d6610 100644 --- a/package/startos/actions/configureSparks.ts +++ b/package/startos/actions/configureSparks.ts @@ -49,6 +49,24 @@ const inputSpec = InputSpec.of({ placeholder: 'leave blank for 8888', masked: false, }), + vllm_container: Value.text({ + name: 'vLLM container name (optional)', + description: + 'Docker container name for the swappable vLLM on Spark 1. Defaults to "vllm_node" (what the bundled launch-cluster.sh creates). 
Change this only if you run your vLLM under a different container name — the model-swap log view and the pre-flight validator exec into it by name.', + required: false, + default: null, + placeholder: 'leave blank for vllm_node', + masked: false, + }), + disabled_services: Value.text({ + name: 'Services to hide (optional)', + description: + "Comma-separated list of built-in services your cluster doesn't run, so Spark Control hides their tiles and stops probing them. Valid names: parakeet, kokoro, embeddings, qdrant. Example: if you only run vLLM, set this to 'parakeet,kokoro,embeddings,qdrant'. Leave blank to monitor all of them. (Useful when, say, your vLLM shares port 8000 with Parakeet's default — hide Parakeet so its probe doesn't hit vLLM.)", + required: false, + default: null, + placeholder: 'e.g. parakeet,kokoro', + masked: false, + }), parakeet_host: Value.text({ name: 'Parakeet host (optional)', description: diff --git a/package/startos/fileModels/sparkConfig.yaml.ts b/package/startos/fileModels/sparkConfig.yaml.ts index 85a63b6..a1d1545 100644 --- a/package/startos/fileModels/sparkConfig.yaml.ts +++ b/package/startos/fileModels/sparkConfig.yaml.ts @@ -9,6 +9,11 @@ export const sparkConfigSchema = z.object({ spark2_user: z.string().catch(''), // Optional vLLM port override (Spark 1). Blank => 8888 (launch-cluster.sh default). vllm_port: z.string().catch(''), + // Optional vLLM container-name override (Spark 1). Blank => "vllm_node". + vllm_container: z.string().catch(''), + // Optional comma-separated list of built-in services to switch off + // (parakeet, kokoro, embeddings, qdrant). Blank => all enabled. + disabled_services: z.string().catch(''), // Optional per-service overrides. Blank => use spark2_host / spark2_user. 
parakeet_host: z.string().catch(''), parakeet_user: z.string().catch(''), diff --git a/package/startos/main.ts b/package/startos/main.ts index 9595fa6..96df6c6 100644 --- a/package/startos/main.ts +++ b/package/startos/main.ts @@ -14,6 +14,8 @@ export const main = sdk.setupMain(async ({ effects }) => { spark2_host: '', spark2_user: '', vllm_port: '', + vllm_container: '', + disabled_services: '', parakeet_host: '', parakeet_user: '', parakeet_container: '', @@ -52,6 +54,8 @@ export const main = sdk.setupMain(async ({ effects }) => { SPARK2_HOST: cfg.spark2_host, SPARK2_USER: cfg.spark2_user, VLLM_PORT: cfg.vllm_port, + VLLM_CONTAINER: cfg.vllm_container, + DISABLED_SERVICES: cfg.disabled_services, PARAKEET_HOST: cfg.parakeet_host, PARAKEET_USER: cfg.parakeet_user, PARAKEET_CONTAINER: cfg.parakeet_container, diff --git a/package/startos/versions/v0_1_0.ts b/package/startos/versions/v0_1_0.ts index 7607d14..9415853 100644 --- a/package/startos/versions/v0_1_0.ts +++ b/package/startos/versions/v0_1_0.ts @@ -1,10 +1,10 @@ import { VersionInfo, IMPOSSIBLE } from '@start9labs/start-sdk' export const v0_1_0 = VersionInfo.of({ - version: '0.23.0:0', + version: '0.24.0:0', releaseNotes: { en_US: - "v0.23.0:0 — local / fine-tuned model support. You can now add a model that lives as a directory on a Spark (e.g. a LoRA-merged fine-tune), not just a Hugging Face repo. Use the new \"+ Add local model\" button under LLM swap: give it the model's absolute path on the Spark, an optional chat-template path, and the usual launch knobs. On swap, Spark Control bind-mounts that directory into the vLLM container at the same path (via the launch script's existing VLLM_SPARK_EXTRA_DOCKER_ARGS hook — nothing to change on the Spark) and runs `vllm serve `. Local models show a \"local\" badge and their path instead of a Hugging Face link, and their weights are never offered for dashboard deletion (that directory is your own training output, not a re-downloadable cache). 
API: POST /api/models now accepts `local_path` (set exactly one of `repo` or `local_path`), validated against a strict path whitelist with no traversal.", + "v0.24.0:0 — configurable cluster topology. Spark Control no longer assumes our exact layout, so a cluster that's wired differently can be monitored without forking. Three new optional settings in Configure Sparks: (1) vLLM container name — defaults to \"vllm_node\"; set it if your swappable vLLM runs under a different container name (the swap log view and pre-flight validator exec into it by name). (2) Services to hide — a comma-separated list of built-in services your cluster doesn't run (parakeet, kokoro, embeddings, qdrant); hidden ones show no tile and are never probed, so e.g. a vLLM sharing Parakeet's default port 8000 no longer gets a confusing Parakeet probe. (3) Monitor a second vLLM — register a vLLM on another Spark as a custom service with kind \"vllm\" (in /data/services-overrides.yaml); it gets a read-only health tile (loaded model + container state + start/stop/restart) alongside the swappable one. API: /api/endpoints now reports a `disabled` flag per service.", }, migrations: { up: async ({ effects }) => {}, diff --git a/runbook.md b/runbook.md index 1f3f438..2eeb841 100644 --- a/runbook.md +++ b/runbook.md @@ -52,6 +52,26 @@ The **Update** button runs `git fetch && git reset --hard origin/ && doc 3. Spark Control's own package key must be authorized for that SSH user (Show Public Key → add to their `authorized_keys`) unless it's the same user Spark Control already uses for that Spark. +## Configurable topology (v0.24.0+) + +For a cluster wired differently from the reference layout, three optional knobs in **Configure Sparks** (no fork needed): + +- **vLLM container name** — defaults to `vllm_node`. Set it if your swappable vLLM on Spark 1 runs under a different container name; the swap log-tail and the pre-flight validator `docker exec` into it by name. 
+- **Services to hide** — comma-separated `parakeet,kokoro,embeddings,qdrant`. Hidden services show no tile and are never probed (status, deep-health, or connectivity log). Use this when a service you don't run would otherwise be probed at a port something else answers — e.g. a vLLM on port 8000 colliding with Parakeet's default.
+- **Monitor a second vLLM** — the swap machinery only drives the Spark 1 vLLM, but you can *monitor* a vLLM on another Spark by adding a custom service of `kind: vllm` to `/data/services-overrides.yaml`:
+
+  ```yaml
+  custom:
+    - key: vllm-spark2
+      kind: vllm
+      host:
+      user:
+      container: vllm_node
+      port: 8000
+  ```
+
+  It gets a read-only tile: loaded model (via `/v1/models`), container state, and start/stop/restart. (Spark Control's SSH key must be authorized for that user — Show Public Key.)
+
 ## Adding a new model

 1. Add an entry to `image/models.yaml`. Required fields: `display_name`, `repo`, `size_gb`, `mode` (`solo` or `cluster`), `vllm_args`. Optional but recommended: `description` (one paragraph — what the model is, what it's good for, how it differs from others; renders below the meta tags in each card), `capabilities` (tags like `[vision, reasoning, tools]`), `expected_ready_seconds`.
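Reviewer note: the `image/app/config.py` hunk is not part of this section, so the following is only a minimal sketch of the boundary behavior that `image/tests/test_topology.py` pins down (lowercased and trimmed `DISABLED_SERVICES`, and a `VLLM_CONTAINER` that falls back to `vllm_node` when malformed). The helper names, the container-name regex, and the use of `shlex.quote` in place of the project's `quote_arg` are all assumptions made for illustration, not the code in this patch.

```python
from __future__ import annotations

import re
import shlex

DEFAULT_VLLM_CONTAINER = "vllm_node"

# Assumed conservative pattern for a container name; the project's real
# validator lives in shellsafe and may accept a different character set.
_CONTAINER_NAME_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9_.-]*")


def parse_disabled_services(raw: str) -> frozenset[str]:
    """Split a comma-separated list, trimming, lowercasing, and dropping blanks."""
    return frozenset(
        part.strip().lower() for part in raw.split(",") if part.strip()
    )


def resolve_vllm_container(raw: str) -> str:
    """Return the override if it looks like a container name, else the default."""
    name = raw.strip()
    if name and _CONTAINER_NAME_RE.fullmatch(name):
        return name
    return DEFAULT_VLLM_CONTAINER


def build_tail_command(container: str) -> str:
    """Build the log-tail command with the container name shell-quoted.

    shlex.quote stands in here for the project's quote_arg helper.
    """
    return f"docker logs -f --tail 50 {shlex.quote(container)}"
```

Validating at the boundary and quoting again at the sink is deliberately redundant: a name that somehow bypasses validation still cannot break out of the SSH command string.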