v0.22.0:0 - configurable vllm port; gitea-release tooling; coexistence roadmap

- Configure Sparks gains a vLLM port field (blank => 8888, our launch-cluster.sh default); VLLM_PORT plumbed configureSparks -> sparkConfig.yaml -> main.ts env -> config.py. So an adopter whose vLLM listens elsewhere (e.g. 8000) can fix the "vLLM unreachable" health check without rebuilding the package.
- Harden numeric env parsing (config._env_int): a blank or malformed port now falls back to its default instead of crashing daemon startup (closes a P3 tech-debt item; the Configure panel passes unset optional fields as "").
- Add scripts/gitea-release.sh + `make release` to publish the built s9pk to Gitea Releases, so the OpenClaw adopter pulls updates with a read-only token instead of being hand-sent the package.
- Capture the OpenClaw/Johnny-5 coexistence epic and the "control plane, not a job runner" stance in ROADMAP.md and Current state.
2026-06-17 19:45:09 -05:00
parent c179389731
commit 136a4713a1
9 changed files with 104 additions and 10 deletions
@@ -63,4 +63,4 @@ Subsystem guidance lives in `docs/guides/` and loads when matching files are tou
 - **Known limits:** `/health` blips while the GPU is busy (mitigated client-side); dual-channel can miss a quiet local word under loud remote bleed; connectivity log misses sub-5s outages between 5s polls; diarizer caps at 4 speakers; matrix-bridge badge won't visibly flip on a fast `docker restart` (status re-checked only after the command returns).
 - **Infra gotcha (safety):** passwordless sudo is NOT configured on spark2 — design unprivileged probes for any Spark feature (the badge uses `ip`, not `sudo wg show`). spark2 sits on the `starttunnel` WireGuard subnet (`10.59.211.6/24`, survives reboot). Owner declined SSH-key rotation after the 2026-06-12 history scrub (only the key *name* leaked) — don't re-flag.
 - **Hosting:** self-hosted Gitea — remote `gitea`, branch `master`, over SSH; push after committing. (Wart: commit `8d839e3` is mislabeled `v0.13.0:4` but contains through v0.18.0:0.)
- **Next:** (1) audio concurrency sweep — only if the Signal Engine dev wants the measured knee; needs owner OK in a quiet window. (2) Otherwise pull from `ROADMAP.md`: local-path/fine-tuned model support (new) or P2 tech-debt. (matrix-bridge Phase 3 shipped v0.21.0:1; only open item is the optional Docker `HEALTHCHECK` if the bot dev asks. Parakeet long-audio guard deferred — rationale in ROADMAP.)
+- **Next — committed 2026-06-17: OpenClaw/Johnny-5 coexistence epic (full plan + design stance in `ROADMAP.md` → "Cluster coordination").** Stance: Spark Control = control plane / GPU arbiter, **not** a job runner; business cron jobs live in separate services that *call* its swap API (swaps are already API-driven via `POST /api/swap`). Sequence: (1) **configurable `VLLM_PORT`** — DONE in tree, staged as **v0.22.0:0** (Configure-Sparks field, blank ⇒ 8888; + `_env_int` hardening in `config.py` so a blank/bad port no longer crashes startup, killing a P3 tech-debt item). **Not yet built/installed/committed — awaiting go/no-go.** (2) local-path/fine-tuned models (in ROADMAP under Dashboard). (3) configurable topology (service→Spark→port map + container names). (4) coordination layer (swap lock + swap webhook + schedule visibility) — only when our own automation lands. Still-open older threads: audio concurrency sweep (only if the Signal Engine dev wants the knee; needs a quiet window); optional matrix-bridge Docker `HEALTHCHECK` if the bot dev asks; Parakeet long-audio guard deferred (rationale in ROADMAP).
@@ -2,6 +2,21 @@

 Longer-term backlog, roughly ordered. An item moves to "Current state" in CLAUDE.md when picked up.

+## Cluster coordination — OpenClaw coexistence (committed 2026-06-17, from Johnny 5 report 2026-06-16)
+
+Driven by the one other Spark Control adopter (a colleague running OpenClaw + cron jobs against his own dual Sparks; report at the date above). His cluster is configured differently from ours (vLLM on **both** Sparks, port 8000, raw `docker run`, container `vllm-gemma4`) and an automated cron physically swaps models — so his notes are partly *portability gaps* (the package hard-codes our layout) and partly *coordination gaps* (his dashboard and his crons fight over the GPU).
+
+**Design stance (decided):** Spark Control is the **control plane / GPU arbiter, not a job runner.** Recurring business pipelines (his "Daily Vol" generator; our own future scheduled jobs) live in *separate* application services that *call* Spark Control's swap API. The dividing line is what a scheduled job *does*: control-plane actions (swap a model, warm it, restart a service, run a health sweep) are in scope for an in-package scheduler; business logic (scrape / summarize / build / deploy) stays in the app layer. Swaps are already API-driven (`POST /api/swap` → `GET /api/swap/{id}` / `…/stream`, `POST /api/swap/{key}/validate`) and non-browser clients pass the CSRF guard, so an external scheduler can drive swaps **today** — the items below add the *safety* layer, not the capability.
+
+Sequenced:
+1. **Configurable `VLLM_PORT`** — DONE, v0.22.0:0. Field in Configure Sparks (blank ⇒ 8888); numeric-setting parsing hardened so a blank/bad value falls back instead of crashing startup. Was the immediate "vLLM unreachable" bug for an adopter on port 8000.
+2. **Local-path / fine-tuned model support** — see the dedicated item under "## Dashboard" below. Independently wanted; his merged `ten31-v2` (a directory, not an HF repo) is the motivating case.
+3. **Configurable topology** — make the service→Spark→port map and container names configurable so the package stops assuming our exact layout. Lets an adopter monitor vLLM on *both* Sparks, use a different container name, and stop the Parakeet probe from hitting a vLLM that shares its port — without forking. (Covers report P4 multi-Spark vLLM, P5 container name, and the Parakeet-port collision #6.)
+4. **Coordination layer** — build when our own automation actually lands (zero value until something other than the dashboard swaps models):
+   - **Swap lock** with holder + TTL (`POST` / `GET` / `DELETE /api/swap/lock`). An external scheduler acquires it before swapping; the dashboard then refuses manual swaps and shows who holds the GPU and until when. Enforced by the swap path, not advisory.
+   - **Swap-event webhook** (`swap_complete` / `swap_failed`) to a configurable URL, so downstream consumers update their provider config when the running model changes.
+   - **Schedule visibility** — read-only view the dashboard surfaces, *registered by* external schedulers (Spark Control does not own the schedule).
+
 ## Near term
 - parakeet-asr long-audio memory guard — **deferred 2026-06-15, low priority.** A duration cap on `/v1/audio/diarize`: Sortformer runs the whole file in one pass (`diarizer.py:128-135`) over Spark 2's *shared* 128 GB unified memory (also feeding Kokoro/embeddings/Qdrant), so one giant single file can thrash into swap. **Precautionary — no observed incident**, and the production consumer (Recap Relay) already chunks via `/diarize-chunk` (~5-min, already bounded), so the only exposed path is a consumer POSTing one huge file to the full `/diarize`. When picked up: add a configurable `MAX_DIARIZE_SECONDS` guard in `diarizer.py` right after `duration` is computed (~line 130) → raise → HTTP 413 in `main.py` (mirrors the existing `MAX_UPLOAD_MB` 413); ship via the Reapply-patches action (restarts the live parakeet-asr container → needs go/no-go). Leave transcription out of v1 (upstream/un-patched file; parakeet-TDT handles long audio better). Revisit only if a consumer starts sending long single files.
 - Controlled concurrency sweep of the audio endpoints in a quiet window — replace the reasoned in-flight cap (2, ceiling 3) with the measured knee.
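The roadmap hunk above describes the planned swap lock only in prose (holder + TTL, enforced by the swap path). As a rough illustration of that idea, here is a minimal in-memory sketch. It is not code from this commit: the `SwapLock` and `LockInfo` names, the method names, and the injectable clock are all assumptions made for the example, and the HTTP layer behind `/api/swap/lock` is left out.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class LockInfo:
    holder: str
    expires_at: float  # measured on the same clock SwapLock uses


class SwapLock:
    """Hypothetical single-holder lock with a TTL (illustration only)."""

    def __init__(self, now: Callable[[], float] = time.monotonic) -> None:
        self._now = now
        self._lock: Optional[LockInfo] = None

    def current(self) -> Optional[LockInfo]:
        """Return the live lock, dropping it first if its TTL has passed."""
        if self._lock is not None and self._now() >= self._lock.expires_at:
            self._lock = None
        return self._lock

    def acquire(self, holder: str, ttl_seconds: float) -> bool:
        """Take (or renew) the lock; return False if someone else holds it."""
        live = self.current()
        if live is not None and live.holder != holder:
            return False
        self._lock = LockInfo(holder, self._now() + ttl_seconds)
        return True

    def release(self, holder: str) -> bool:
        """Drop the lock; only the current holder may release it."""
        live = self.current()
        if live is None or live.holder != holder:
            return False
        self._lock = None
        return True

    def may_swap(self, requester: str) -> bool:
        """The check a swap path could run before starting a swap."""
        live = self.current()
        return live is None or live.holder == requester
```

In this sketch an external scheduler would call `acquire("daily-vol-cron", 600)` before swapping, and a manual swap from the dashboard would be refused while `may_swap("dashboard")` returns `False`. The TTL means a crashed scheduler cannot hold the GPU forever. The class is not thread-safe as written; a real implementation would need a mutex or a persistent store.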
@@ -8,6 +8,16 @@ def _env(name: str, default: str = "") -> str:
    return os.environ.get(name, default)


+def _env_int(name: str, default: int) -> int:
+    """Parse an int env var, falling back to `default` when unset, blank, or
+    malformed. The StartOS Configure panel passes optional numeric fields as an
+    empty string when left blank, so a bare int("") would crash daemon startup."""
+    try:
+        return int(os.environ.get(name, "") or default)
+    except (TypeError, ValueError):
+        return default
+
+
 def _resolve_models_yaml() -> str:
    if env := os.environ.get("MODELS_YAML"):
        return env
@@ -101,16 +111,16 @@ class Settings:
            matrix_bridge_branch=_env("MATRIX_BRIDGE_BRANCH") or "master",
            # Redaction gateway pseudonym-map store (server-held de-anon key).
            redaction_map_db=_env("REDACTION_MAP_DB", "/data/redaction_maps.db"),
-            redaction_map_ttl=int(_env("REDACTION_MAP_TTL", "7200")),
+            redaction_map_ttl=_env_int("REDACTION_MAP_TTL", 7200),
            ssh_key_path=_env("SSH_KEY_PATH"),
            ssh_known_hosts=_env("SSH_KNOWN_HOSTS"),
            models_yaml=_resolve_models_yaml(),
-            vllm_port=int(_env("VLLM_PORT", "8888")),
-            parakeet_port=int(_env("PARAKEET_PORT", "8000")),
-            kokoro_port=int(_env("KOKORO_PORT", "8880")),
-            embed_port=int(_env("EMBED_PORT", "8088")),
-            qdrant_port=int(_env("QDRANT_PORT", "6333")),
-            bind_port=int(_env("BIND_PORT", "9999")),
+            vllm_port=_env_int("VLLM_PORT", 8888),
+            parakeet_port=_env_int("PARAKEET_PORT", 8000),
+            kokoro_port=_env_int("KOKORO_PORT", 8880),
+            embed_port=_env_int("EMBED_PORT", 8088),
+            qdrant_port=_env_int("QDRANT_PORT", 6333),
+            bind_port=_env_int("BIND_PORT", 9999),
            open_webui_url=_env("OPEN_WEBUI_URL", ""),
            ngc_api_key=_env("NGC_API_KEY", ""),
        )
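To make the fallback behavior concrete, the sketch below repeats `_env_int` as it appears in the hunk above and runs it against unset, blank, malformed, and valid values. The `DEMO_PORT` variable name and the `demo` helper are made up for illustration.

```python
import os


def _env_int(name: str, default: int) -> int:
    """Same logic as the helper added in this commit."""
    try:
        return int(os.environ.get(name, "") or default)
    except (TypeError, ValueError):
        return default


def demo() -> dict:
    """Return what _env_int yields for a few representative inputs."""
    results = {}
    os.environ.pop("DEMO_PORT", None)  # unset
    results["unset"] = _env_int("DEMO_PORT", 8888)
    cases = [("blank", ""), ("malformed", "80a0"), ("valid", "8000"), ("padded", " 8000 ")]
    for label, raw in cases:
        os.environ["DEMO_PORT"] = raw
        results[label] = _env_int("DEMO_PORT", 8888)
    os.environ.pop("DEMO_PORT", None)  # leave the environment clean
    return results


if __name__ == "__main__":
    print(demo())
```

Note that the helper only guards parsing. A value such as `70000` parses fine and is returned as-is, so range checking, if it is ever wanted, would be a separate step.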
@@ -1,3 +1,14 @@
 ARCHES := x86
 # overrides to s9pk.mk must precede the include statement
 include s9pk.mk
+
+# Publish the built s9pk to Gitea Releases (adopters pull it with a read-only
+# token instead of being hand-sent the package). Needs GITEA_URL + GITEA_TOKEN;
+# the vX.Y.Z git tag must already be pushed. See ../scripts/gitea-release.sh.
+RELEASE_VERSION := $(shell sed -n "s/.*version: '\([^']*\)'.*/\1/p" startos/versions/v0_1_0.ts)
+
+.PHONY: release
+release:
+	@test -f "$(PACKAGE_ID)_x86_64.s9pk" || { echo "Build first: make x86"; exit 1; }
+	GITEA_URL="$(GITEA_URL)" GITEA_TOKEN="$(GITEA_TOKEN)" \
+	  ../scripts/gitea-release.sh "$(RELEASE_VERSION)" "$(PACKAGE_ID)_x86_64.s9pk"
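For readers less fluent in sed, the `RELEASE_VERSION` extraction above can be approximated in Python as follows. This is a sketch, not part of the commit: `extract_version` is a hypothetical name, and unlike the sed command (which prints one result per matching line) it returns only the first match.

```python
import re
from typing import Optional

# Mirrors the sed expression: capture whatever sits between the single
# quotes that follow "version: ".
VERSION_RE = re.compile(r"version: '([^']*)'")


def extract_version(ts_source: str) -> Optional[str]:
    """Return the first quoted version string, or None if there is none."""
    match = VERSION_RE.search(ts_source)
    return match.group(1) if match else None
```

Applied to the text of `startos/versions/v0_1_0.ts` after this commit, the function would return `0.22.0:0`, which is the value `make release` passes to the release script.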
@@ -40,6 +40,15 @@ const inputSpec = InputSpec.of({
    placeholder: 'your SSH username',
    masked: false,
  }),
+  vllm_port: Value.text({
+    name: 'vLLM port (optional)',
+    description:
+      "The port your vLLM server listens on, on Spark 1 — used by the health check and the chat proxy. Leave blank to use 8888, which is what the bundled launch-cluster.sh wrapper uses. Set this to 8000 (vLLM's own default) or another port if your vLLM listens elsewhere.",
+    required: false,
+    default: null,
+    placeholder: 'leave blank for 8888',
+    masked: false,
+  }),
  parakeet_host: Value.text({
    name: 'Parakeet host (optional)',
    description:
@@ -7,6 +7,8 @@ export const sparkConfigSchema = z.object({
  spark1_user: z.string().catch(''),
  spark2_host: z.string().catch(''),
  spark2_user: z.string().catch(''),
+  // Optional vLLM port override (Spark 1). Blank => 8888 (launch-cluster.sh default).
+  vllm_port: z.string().catch(''),
  // Optional per-service overrides. Blank => use spark2_host / spark2_user.
  parakeet_host: z.string().catch(''),
  parakeet_user: z.string().catch(''),
@@ -13,6 +13,7 @@ export const main = sdk.setupMain(async ({ effects }) => {
    spark1_user: '',
    spark2_host: '',
    spark2_user: '',
+    vllm_port: '',
    parakeet_host: '',
    parakeet_user: '',
    parakeet_container: '',
@@ -50,6 +51,7 @@ export const main = sdk.setupMain(async ({ effects }) => {
        SPARK1_USER: cfg.spark1_user,
        SPARK2_HOST: cfg.spark2_host,
        SPARK2_USER: cfg.spark2_user,
+        VLLM_PORT: cfg.vllm_port,
        PARAKEET_HOST: cfg.parakeet_host,
        PARAKEET_USER: cfg.parakeet_user,
        PARAKEET_CONTAINER: cfg.parakeet_container,
@@ -1,10 +1,10 @@
 import { VersionInfo, IMPOSSIBLE } from '@start9labs/start-sdk'

 export const v0_1_0 = VersionInfo.of({
-  version: '0.21.0:1',
+  version: '0.22.0:0',
  releaseNotes: {
    en_US:
-      "v0.21.0:1 — matrix-bridge bot tile. The Matrix bot container on Spark 2 now appears as a tile under \"Always-on services\" with a live status badge (judged by the container itself, since the bot has no health port). Buttons: Update (pulls the latest code, rebuilds the image, and recreates the container — long-running, with a streamed log and a generous timeout), Restart, Stop/Start, and View logs (last 100 lines). Everything fails loud: a non-zero exit or stderr shows in the panel rather than a silent stall. To enable it, set the bot's SSH user (the owner of ~/matrix-bridge, e.g. 'modelo') in the Configure Sparks action — leave it blank and no tile appears, so this stays out of the way on systems that don't run the bot. New endpoints (LAN-only, browser-driven): POST /api/matrix-bridge/update (+ /{id} and /{id}/stream for progress), GET /api/matrix-bridge/logs. One-time setup on the Spark (owner): make ~/matrix-bridge a git clone of your Gitea repo, and — unless that SSH user is the same as your Spark 2 user — authorize this package's SSH public key for it (Show Public Key, then add it to that user's authorized_keys). There is no passwordless sudo on the Spark, so commands run directly as that user rather than via sudo.",
+      "v0.22.0:0 — configurable vLLM port. The port Spark Control uses to reach vLLM on Spark 1 (the health check and the chat proxy) is now a field in the Configure Sparks action, so you can point it at a vLLM that listens on a non-default port without rebuilding the package. Leave it blank to keep the previous default of 8888 — what the bundled launch-cluster.sh wrapper uses; set it to 8000 (vLLM's own default) or any other port if your vLLM listens elsewhere. Also hardened numeric-setting parsing so a blank or malformed port value falls back to its default instead of crashing daemon startup.",
  },
  migrations: {
    up: async ({ effects }) => {},
@@ -0,0 +1,45 @@
+#!/usr/bin/env bash
+# Publish a built Spark Control s9pk to Gitea Releases, so adopters can pull the
+# latest package with a read-only token instead of being hand-sent the file.
+#
+#   GITEA_URL=https://gitea.example:3000 GITEA_TOKEN=<write-token> \
+#     scripts/gitea-release.sh 0.22.0:0 package/spark-control_x86_64.s9pk
+#
+# The git tag (vX.Y.Z, derived from the version) must already exist and be pushed
+# (`git tag v0.22.0 && git push gitea v0.22.0`). Re-running is idempotent: it
+# reuses an existing release for the tag and replaces a same-named asset.
+set -euo pipefail
+
+VERSION="${1:-}"; S9PK="${2:-}"
+[ -n "$VERSION" ] && [ -n "$S9PK" ] || {
+  echo "usage: GITEA_URL=.. GITEA_TOKEN=.. $0 <version e.g. 0.22.0:0> <s9pk path>" >&2; exit 2; }
+: "${GITEA_URL:?set GITEA_URL to your Gitea base URL, e.g. https://gitea.lan:3000}"
+: "${GITEA_TOKEN:?set GITEA_TOKEN to a token with repository write access}"
+[ -f "$S9PK" ] || { echo "s9pk not found: $S9PK" >&2; exit 1; }
+
+TAG="v${VERSION%%:*}"                      # 0.22.0:0 -> v0.22.0
+ASSET="$(basename "$S9PK")"
+SLUG="$(git remote get-url gitea | sed -E 's#.*[:/]([^/:]+/[^/]+)\.git$#\1#')"  # grant/spark-control
+API="${GITEA_URL%/}/api/v1/repos/${SLUG}"
+AUTH=(-H "Authorization: token ${GITEA_TOKEN}")
+
+echo "repo ${SLUG} | tag ${TAG} | asset ${ASSET} | ${GITEA_URL}"
+
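+# Reuse an existing release for this tag, otherwise create one. The `|| true`
+# keeps `set -e` + `pipefail` from aborting when the lookup 404s (no release yet).
+id="$(curl -fsS "${AUTH[@]}" "$API/releases/tags/$TAG" 2>/dev/null | jq -r '.id // empty' || true)"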
+if [ -z "$id" ]; then
+  id="$(curl -fsS -X POST "${AUTH[@]}" -H 'Content-Type: application/json' \
+    --data "$(jq -n --arg t "$TAG" --arg n "$VERSION" \
+      '{tag_name:$t, name:$n, body:("Spark Control "+$n+". See AGENTS.md / release notes.")}')" \
+    "$API/releases" | jq -r '.id')"
+fi
+[ -n "$id" ] && [ "$id" != null ] || { echo "could not obtain release id (check URL/token/tag)" >&2; exit 1; }
+
+# Replace a same-named asset so re-runs don't 409.
+old="$(curl -fsS "${AUTH[@]}" "$API/releases/$id/assets" | jq -r --arg n "$ASSET" '.[] | select(.name==$n) | .id')"
+[ -n "$old" ] && curl -fsS -X DELETE "${AUTH[@]}" "$API/releases/$id/assets/$old" >/dev/null || true
+
+curl -fsS -X POST "${AUTH[@]}" -F "attachment=@${S9PK};type=application/octet-stream" \
+  "$API/releases/$id/assets?name=$ASSET" >/dev/null
+
+echo "published: ${GITEA_URL%/}/${SLUG}/releases/tag/${TAG}"
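The script derives the git tag and the `owner/repo` slug with shell parameter expansion and a sed regex. The Python sketch below shows what those two derivations produce for a few inputs. The function names are hypothetical and the remote URLs are example values, not the project's real remote. One deliberate difference: when the URL does not end in `.git`, sed passes the input through unchanged, whereas this sketch raises an error.

```python
import re


def release_tag(version: str) -> str:
    """'0.22.0:0' -> 'v0.22.0' (drop everything from the first colon)."""
    return "v" + version.split(":", 1)[0]


def repo_slug(remote_url: str) -> str:
    """Pull 'owner/repo' out of a remote URL that ends in '.git'."""
    match = re.fullmatch(r".*[:/]([^/:]+/[^/]+)\.git", remote_url)
    if match is None:
        raise ValueError(f"cannot derive owner/repo from {remote_url!r}")
    return match.group(1)


if __name__ == "__main__":
    print(release_tag("0.22.0:0"))
    print(repo_slug("git@gitea.lan:grant/spark-control.git"))
    print(repo_slug("ssh://git@gitea.lan:2222/grant/spark-control.git"))
```

Both the scp-style and the `ssh://` forms resolve to `grant/spark-control`. A remote configured without the `.git` suffix is the case to watch for, since the shell version would then build the API path from the full URL.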