v0.18.0:1 - scrub owner-specific hostnames, ips, usernames, names from tracked files

Replace real cluster IPs/hosts/usernames and example names with neutral
placeholders across docs, ops notes, package install text, and the offline
redaction test; delete the obsolete build-time starter prompt. Closes the
portability audit's single blocker. No runtime behavior change.
Keysat
2026-06-12 15:07:34 -05:00
parent 5e6db2f63b
commit 98988057a2
8 changed files with 13 additions and 272 deletions
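A scrub like this one can regress quietly, so a small check over tracked text may help. The following is a minimal sketch, not the repo's actual leak test: `find_ip_leaks`, the allowlist, and the sample address `192.168.1.50` are all made up for illustration. It only covers IPv4 literals; hostnames, usernames, and personal names would need their own deny-list.

```python
import re

# Hypothetical allowlist: generic addresses that are not owner-specific.
ALLOWED_IPS = {"0.0.0.0", "127.0.0.1"}

# Four dot-separated groups of 1-3 digits; octet range is checked below.
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def find_ip_leaks(text: str) -> list[tuple[int, str]]:
    """Return (line_number, ip) for every IPv4 literal not on the allowlist."""
    leaks = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in IPV4_RE.finditer(line):
            ip = match.group(0)
            octets_valid = all(int(octet) <= 255 for octet in ip.split("."))
            if octets_valid and ip not in ALLOWED_IPS:
                leaks.append((lineno, ip))
    return leaks


if __name__ == "__main__":
    sample = (
        "ssh <spark-user>@<spark-1-ip>\n"                # placeholder, fine
        "curl http://192.168.1.50:8888/v1/models\n"      # made-up leaked address
        "--host 0.0.0.0\n"                               # allowlisted bind address
    )
    for lineno, ip in find_ip_leaks(sample):
        print(f"line {lineno}: possible owner-specific IP {ip}")
```

Placeholders such as `<spark-1-ip>` pass untouched because they contain no dotted quad, and three-part version strings such as `0.18.0` do not match the four-group pattern.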
+2 -1
@@ -46,7 +46,7 @@ Subsystem guidance lives in `docs/guides/` and loads when matching files are tou
 - **Always** use the Spark's **IP** for HTTP probes — `.local` mDNS names can resolve IPv6-first and hang httpx (vLLM and friends bind IPv4 only). Never trust `.local` hostnames inside HTTP client code.
 - **Always** pass `SSH_KEY_PATH` / `-i <key>` explicitly in scripted SSH; non-interactive shells have no ssh-agent identities.
 - **Never** route audio or transcripts to cloud services — speech stays on the LAN. (Scrubbed text via `/scrub` is the only sanctioned path toward frontier models.)
-- **Never** commit owner-specific hostnames, IPs, usernames, or names into package strings, UI text, or docs — this package gets shared; use placeholders (`<spark-1-ip>` style).
+- **Never** commit owner-specific hostnames, IPs, usernames, or names into package strings, UI text, or docs — this package gets shared; use placeholders. Canonical set: `<spark-1-ip>` / `<spark-2-ip>`, `<spark-1-host>` / `<spark-2-host>`, `<spark-user>`, and generic example names (`Alice`/`Bob`).
 - **Never** install `cuda-python` in `parakeet-asr` — crashes real decode on this GPU/CUDA-13 stack; full story in the audio-speech rule.
 ## Current state
@@ -55,6 +55,7 @@ Subsystem guidance lives in `docs/guides/` and loads when matching files are tou
 - **In progress — Signal Engine "flakiness":** diagnosed, not a server bug — transient 14s unresponsiveness while the single GPU is continuously busy. Remedy is client-side; a drafted message (in-flight cap 2, hard ceiling 3 global across audio endpoints, retry-with-backoff on timeout/503) is with the owner to forward to that dev.
 - **Decided, not implemented:** remote access stays WireGuard/Tailscale split-tunnel — no public interface, so no API auth built; an empirical concurrency sweep is offered but needs the owner's explicit OK in a quiet window.
 - **Known limits:** `/health` blips while the GPU is busy (mitigated client-side); dual-channel can miss a quiet local word under loud remote bleed; the connectivity log misses sub-5s outages between 5s polls; diarizer caps at 4 speakers.
+- **Portability:** audited 2026-06-12, now compliant — all owner-specific IPs/hostnames/usernames/names scrubbed from tracked files to placeholders; `claude-code-starter-prompt.md` deleted (old build-time prompt). Real cluster values live only in StartOS install config, shell env vars, and the gitignored `settings.local.json`.
 - **Repo wart:** commit `367d986` is labeled `v0.13.0:4` but actually contains everything through v0.18.0:0 — per-version commits for v0.14–v0.18 are missing. Keep commit messages accurate going forward.
 - **Hosting:** repo pushes to the owner's self-hosted Gitea — remote `gitea`, branch `master`, over SSH (host alias + key live in the local `~/.ssh/config`; no owner-specific details belong in the repo). Push there after committing.
 - **Next:** (1) owner forwards the concurrency note to the Signal Engine dev; (2) run the concurrency sweep if the dev wants the measured knee; (3) add the `--memory` cap to parakeet-asr via the Reapply-patches action; (4) pick the next item from ROADMAP.md.
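The client-side remedy named in the "In progress" item (an in-flight cap plus retry-with-backoff on timeout/503) could look roughly like the sketch below. It is an illustration under stated assumptions, not the drafted message's actual code: `TransientError`, `call_with_backoff`, the attempt count, and the delays are hypothetical, and only the cap of 2 is modeled (the global ceiling of 3 across audio endpoints is not).

```python
import threading
import time

MAX_IN_FLIGHT = 2  # in-flight cap taken from the drafted note

# Shared across all callers in this process.
_slots = threading.BoundedSemaphore(MAX_IN_FLIGHT)


class TransientError(Exception):
    """Hypothetical marker for a timeout or HTTP 503 from the audio server."""


def call_with_backoff(request, attempts=4, base_delay=0.5, sleep=time.sleep):
    """Run request() under the in-flight cap, retrying transient failures.

    `request` is any zero-argument callable that raises TransientError on a
    timeout or 503. Delays double each retry: base_delay, 2x, 4x, ...
    """
    for attempt in range(attempts):
        with _slots:  # hold a slot only while the request is actually running
            try:
                return request()
            except TransientError:
                if attempt == attempts - 1:
                    raise  # out of retries, surface the failure
        # Back off outside the semaphore so a waiting caller can use the slot.
        sleep(base_delay * 2 ** attempt)
```

Sleeping outside the semaphore is a deliberate choice here: a caller that is backing off should not keep other requests from using its slot.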
-260
@@ -1,260 +0,0 @@
# Project: spark-control — Model switcher web UI for dual DGX Spark cluster
> **Update 2026-05-12 — Direction change:** the web UI is being built as a
> **StartOS 0.4 package** (sideloaded onto Alice's existing Start9 server),
> **not** as a FastAPI service running directly on Spark 1. The Start9 server
> shares a LAN with the Sparks and SSHes into Spark 1 to invoke
> `launch-cluster.sh`. StartOS handles `.local` exposure and HTTPS; SSH
> credentials live in a per-install config file managed by a "Configure Sparks"
> action. See <https://docs.start9.com/packaging/0.4.0.x/> for the packaging
> model. Repo layout:
>
> - `image/` — Docker image source (FastAPI app, runs anywhere with `uvicorn`).
> - `package/` — StartOS 0.4 wrapper (manifest, main, interfaces, actions).
>
> The "Phase 4: Deploy" section below (systemd on Spark 1) is **superseded** by
> the StartOS sideload workflow. Other phases (models.yaml schema, swap script,
> FastAPI endpoints, frontend) still apply but live inside `image/`.
## Goal
I want to build a small web service that gives me a browser-based interface to:
1. See which LLM is currently loaded on my DGX Spark cluster
2. Click a button to swap to a different model
3. See real-time status as the swap progresses (stop → launch → ready)
4. See basic health info about supporting services (Parakeet STT, eventually Magpie TTS)
The UI should live at a stable URL on my LAN so I can bookmark it. I'll likely access it from my laptop and phone.
## Where this project lives
This repo lives on **my laptop** (macOS). The Sparks are servers — we control them remotely over SSH. Claude Code runs on my laptop, makes edits in the local repo, and executes commands on the Sparks via SSH.
The web UI itself, when deployed, will run on **Spark 1** (where it can directly invoke `launch-cluster.sh`), but development happens on my laptop. We'll deploy the code to Spark 1 via `rsync` or `scp` or `git pull` as needed.
## SSH setup
From my laptop I can SSH to either Spark directly:
```bash
ssh <spark-user>@<spark-1-ip> # Spark 1
ssh <spark-user>@<spark-2-ip> # Spark 2
```
(I can also use SSH key auth — set up earlier.)
When you need to run a command on a Spark, use this pattern:
```bash
ssh <spark-user>@<spark-1-ip> 'cd ~/spark-vllm-docker && ./launch-cluster.sh status'
```
For multi-line commands or scripts, you can pipe a heredoc or just SSH in directly and run them interactively. Either works — but always tell me what you're about to run so I can review.
For file transfers between my laptop and the Sparks, use `rsync`:
```bash
rsync -avz ~/Projects/spark-control/ <spark-user>@<spark-1-ip>:~/spark-control/
```
## My hardware and what's running
**Two NVIDIA DGX Spark units** networked together:
- **Spark 1** — hostname `<spark-1-host>`, LAN IP `<spark-1-ip>`, QSFP IP `<spark-1-qsfp-ip>`. Head node for the vLLM cluster.
- **Spark 2** — hostname `<spark-2-host>`, LAN IP `<spark-2-ip>`, QSFP IP `<spark-2-qsfp-ip>`. Worker node for vLLM cluster, also hosts standalone services.
Both run Ubuntu 24.04, NVIDIA driver 580.x, CUDA 13.0, Docker, and have 128 GB unified memory each. They share a QSFP cable for high-speed (200 Gb/s) inter-node networking.
Passwordless SSH works in both directions via `~/.ssh/<ssh-key>` key. My Linux username on both machines is `<spark-user>`.
**Currently running:**
- One LLM at a time on the cluster (via the `eugr/spark-vllm-docker` project — see below)
- `parakeet-asr` Docker container on Spark 2 (port 8000) — running 24/7 for speech-to-text, healthy for weeks
- `magpie-tts` Docker container on Spark 2 (port 9000) — was being set up; I'm not 100% sure of its current state; first task is to verify
- Open WebUI runs on a separate Start9 server on the LAN (not on the Sparks), accessing the LLM via HTTP
## The LLM cluster: how it works
I use the **`eugr/spark-vllm-docker`** community project (cloned to `~/spark-vllm-docker` on Spark 1). It manages a Ray-based vLLM cluster across both Sparks, with a wrapper script called `launch-cluster.sh` that handles starting/stopping Docker containers on both nodes.
Key commands (all run from `~/spark-vllm-docker` on Spark 1):
- `./launch-cluster.sh status` — see what's running on both nodes
- `./launch-cluster.sh stop` — stop the cluster
- `./launch-cluster.sh -d exec vllm serve ...` — launch in daemon mode with vLLM args
- `./launch-cluster.sh --solo -d exec vllm serve ...` — same but only on Spark 1 (for smaller models)
- `docker logs -f vllm_node` — tail vLLM logs
Container names: `vllm_node` (the main vLLM container), `ray_head` and `ray_worker` (Ray cluster), plus support containers.
The vLLM server binds to port **8888** and exposes an OpenAI-compatible API at `http://<spark-1-ip>:8888/v1`.
## Models I have on disk (both Sparks)
All weights live in `~/.cache/huggingface/hub/` on each Spark:
1. **`RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4`** (~135 GB) — flagship MoE, runs across both Sparks (-tp 2), has vision capability. Use for: maximum quality, vision input, multilingual.
2. **`RedHatAI/gemma-4-31B-it-NVFP4`** (~23 GB) — runs solo on Spark 1, has vision, has thinking-mode reasoning. Use for: math/reasoning-heavy tasks. Has a known vLLM Triton-attention slowdown bug (~15-20 tok/s vs theoretical 30-40).
3. **`RedHatAI/Qwen3.6-35B-A3B-NVFP4`** (~20 GB) — newer-generation Qwen MoE (35B total / 3B active), runs solo on Spark 1, expected to be the fastest (~70-100 tok/s) and my new daily driver. **Note: this may still be downloading or may not be downloaded yet — first task is to verify and download if needed.**
## Exact launch commands for each model
These are the commands my system needs to run when I click a swap button.
### Qwen3-VL-235B (uses both Sparks)
```bash
cd ~/spark-vllm-docker
./launch-cluster.sh stop
./launch-cluster.sh -d exec vllm serve \
RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4 \
--port 8888 --host 0.0.0.0 \
--gpu-memory-utilization 0.7 \
-tp 2 \
--distributed-executor-backend ray \
--max-model-len 32768
```
Expected ready time: ~3-5 min after stop completes.
### Gemma 4 31B (solo on Spark 1)
```bash
cd ~/spark-vllm-docker
./launch-cluster.sh stop
./launch-cluster.sh --solo -d exec vllm serve \
RedHatAI/gemma-4-31B-it-NVFP4 \
--port 8888 --host 0.0.0.0 \
--gpu-memory-utilization 0.8 \
--max-model-len 32768 \
--reasoning-parser gemma4 \
--tool-call-parser gemma4 \
--enable-auto-tool-choice
```
Expected ready time: ~3-4 min.
### Qwen3.6-35B-A3B (solo on Spark 1) — new daily driver
```bash
cd ~/spark-vllm-docker
./launch-cluster.sh stop
./launch-cluster.sh --solo -d exec vllm serve \
RedHatAI/Qwen3.6-35B-A3B-NVFP4 \
--port 8888 --host 0.0.0.0 \
--gpu-memory-utilization 0.85 \
--max-model-len 65536 \
--reasoning-parser qwen3 \
--moe_backend flashinfer_cutlass
```
Expected ready time: ~3-5 min.
Note: the `--moe_backend flashinfer_cutlass` flag is Blackwell-specific. If it errors on launch, fallback is to remove that flag.
### Common operations
- Stop everything: `./launch-cluster.sh stop`
- Status check: `./launch-cluster.sh status`
- See vLLM logs: `docker logs vllm_node` (add `-f` to follow)
- Hard reset if stuck: `./launch-cluster.sh stop && docker ps -aq | xargs -r docker rm -f`
- Health check (is API responding?): `curl -s http://<spark-1-ip>:8888/v1/models`
### "Ready" signal
The model is ready to serve when `docker logs vllm_node` contains the line `Application startup complete.` Until then, it's still loading weights or compiling CUDA graphs.
## Supporting services on Spark 2 (always-on, separate from cluster)
These don't get touched by model swaps:
- **`parakeet-asr`** — STT on port 8000. Already running 24/7. Verify with `curl http://<spark-2-ip>:8000/health` which should return `{"status":"ready",...}`.
- **`magpie-tts`** — TTS on port 9000. May or may not be running; verify with `docker ps` on Spark 2 and `curl http://<spark-2-ip>:9000/v1/health/ready`.
## What I want you to build
### Phase 1: Set up the project repo (start here)
Create a Git repo at `~/Projects/spark-control/` on **my laptop**. Initial structure:
```
spark-control/
├── README.md
├── models.yaml # Declarative config for each model
├── scripts/
│ ├── swap-model.sh # Universal swap script
│ ├── status.sh # Cluster + service status
│ └── health.sh # Health checks for everything
├── web-ui/
│ ├── server.py # FastAPI backend
│ ├── static/
│ │ ├── index.html # Toggle UI
│ │ ├── style.css
│ │ └── app.js # State management, polling
│ └── requirements.txt
├── runbook.md # Operating notes
└── known-issues.md # Gotchas, troubleshooting
```
### Phase 2: Build the universal swap script
`scripts/swap-model.sh <model-key>` should:
1. Read the launch command from `models.yaml` by key (e.g. `qwen3-vl`, `gemma4`, `qwen36`)
2. Stop the current cluster (via SSH to Spark 1)
3. Run the new launch command (via SSH to Spark 1)
4. Tail logs until "Application startup complete" appears or a timeout (~10 min) hits
5. Return exit code 0 on success, non-zero on failure
Two versions might be useful:
- The version that runs on **my laptop** — wraps everything in `ssh <spark-user>@<spark-1-ip> ...`
- A simpler version that lives on **Spark 1** — runs commands directly without SSH (used by the deployed web UI)
You can either share one script with a `--remote` flag, or make them two distinct files. Your call — propose the cleaner option.
### Phase 3: Build the web UI
FastAPI backend that:
- `GET /api/status` → JSON with `{current_model, ready, parakeet_health, magpie_health, last_swap_time}`
- `POST /api/swap` with `{model_key}` → starts swap, returns swap job ID
- `GET /api/swap/{job_id}/stream` → Server-Sent Events streaming swap progress
- `GET /` → serves the HTML UI
Frontend should:
- Show a card per model with a "Switch to this" button
- Highlight which model is currently loaded
- During a swap, show streaming log output and a spinner
- Show a green/red indicator for Parakeet and Magpie health
- Auto-refresh every 5 seconds
Keep the UI simple, clean, dark-themed. No frameworks needed — vanilla HTML/JS is fine.
### Phase 4: Deploy and make it persistent
The web UI runs on **Spark 1** so it can directly invoke `launch-cluster.sh` without SSH overhead. To deploy:
1. `rsync` the project code from my laptop to `~/spark-control/` on Spark 1
2. Set up a Python virtual environment on Spark 1 and install requirements
3. Create a systemd service file that starts the FastAPI server on boot
4. Service should listen on `0.0.0.0:9999` so I can hit it from any device on my LAN
5. Add a simple deploy script (`scripts/deploy.sh`) on my laptop that does the rsync + restart in one command for future iteration
## Working style
- Before making changes that affect the running cluster, please ask me first.
- When you write commands you want me to run, give them in clearly marked code blocks.
- Distinguish clearly when a command is meant to run on my laptop vs. on a Spark (which means via SSH).
- If you need information about the current state of the Sparks, ask me to run a diagnostic SSH command and paste the output — or run it yourself if you have shell access.
- Test things incrementally. Don't build the whole UI before validating the swap script works.
- I'm a layman — explain technical decisions briefly in plain English when they involve trade-offs.
- When making changes that modify files on a Spark, do them by editing in my laptop's repo first and then deploying — not by editing on the Spark directly. That keeps my laptop as the source of truth.
## First task
1. First, **verify SSH access to both Sparks** from my laptop:
- `ssh <spark-user>@<spark-1-ip> hostname` should return `<spark-1-host>`
- `ssh <spark-user>@<spark-2-ip> hostname` should return `<spark-2-host>`
2. Then **verify the current state of the cluster** via SSH:
- Confirm `~/spark-vllm-docker` exists on Spark 1 and `launch-cluster.sh` is there: `ssh <spark-user>@<spark-1-ip> 'ls ~/spark-vllm-docker/launch-cluster.sh'`
- Check which LLM (if any) is currently loaded: `ssh <spark-user>@<spark-1-ip> 'cd ~/spark-vllm-docker && ./launch-cluster.sh status'` and `ssh <spark-user>@<spark-1-ip> 'curl -s http://localhost:8888/v1/models'`
- Verify which models are downloaded: `ssh <spark-user>@<spark-1-ip> 'ls ~/.cache/huggingface/hub/ | grep -iE "qwen|gemma"'`
- Specifically check if `Qwen3.6-35B-A3B-NVFP4` is downloaded; if not, that's the prerequisite step (run the `hf-download.sh` command on Spark 1)
- Check what's running on Spark 2: `ssh <spark-user>@<spark-2-ip> 'docker ps'` (looking for parakeet-asr and possibly magpie-tts)
3. Then create the repo structure on my laptop at `~/Projects/spark-control/`
4. Then propose the design for `models.yaml` and the swap script before implementing
Ask me anything that's unclear before starting.
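The "Ready" signal and step 4 of the swap script in the removed prompt boil down to scanning log lines for a marker until a deadline passes. A minimal sketch, assuming the lines come from some iterable such as the output of `docker logs -f vllm_node`; `wait_for_ready` and its parameters are hypothetical names, not part of the project:

```python
import time
from typing import Callable, Iterable

READY_MARKER = "Application startup complete."


def wait_for_ready(
    lines: Iterable[str],
    timeout_s: float = 600.0,  # the prompt's "~10 min" timeout
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Return True once a line contains the ready marker, False otherwise.

    Limitation: the deadline is only checked when a new line arrives, so a
    completely silent log stream would block until the stream ends.
    """
    deadline = clock() + timeout_s
    for line in lines:
        if READY_MARKER in line:
            return True
        if clock() > deadline:
            return False
    return False  # stream ended without the marker
```

In practice the iterable would likely be the `stdout` of a `subprocess.Popen` running `docker logs -f vllm_node`, with stderr merged in; that wiring is left out because it depends on how the swap is invoked.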
+1 -1
@@ -20,4 +20,4 @@ cd image
 Keep the leak test green against the vendored `scrub.py` after any re-vendor.
-Policy context: scrubbed text via `/scrub` is the **only** sanctioned path toward frontier/cloud models — see the whole-repo privacy rule in CLAUDE.md.
+Policy context: scrubbed text via `/scrub` is the **only** sanctioned path toward frontier/cloud models — see the whole-repo privacy rule in AGENTS.md.
+1 -1
@@ -38,7 +38,7 @@ After the eugr/spark-vllm-docker update, vLLM became stricter about multimodal t
 ## Two SSH paths to Spark 1 from the laptop
-`ssh <spark-user>@<spark-1-ip>` does NOT work from the laptop because the NVIDIA Sync ssh_config only has a Host entry for `<spark-1-host>.local`. Always use the `.local` hostname or `<spark-2-ip>`-style entries that ARE matched.
+`ssh <spark-user>@<spark-1-ip>` does NOT work from the laptop because the NVIDIA Sync ssh_config only has a Host entry for the Spark's `.local` mDNS name, not its bare IP. Always SSH via the `<spark-1-host>.local` hostname (or another entry that the ssh_config actually matches) rather than the raw IP.
 ## Older models in `models.yaml`
+1 -1
@@ -19,7 +19,7 @@ This package SSHes into your Spark server to run cluster commands, so it needs a
 ```bash
 echo "<paste-pubkey-here>" >> ~/.ssh/authorized_keys
 ```
-3. **Open Actions → Configure Sparks.** Enter the LAN hostnames or IPs for Spark 1 and Spark 2, plus the SSH username (usually `<spark-user>`).
+3. **Open Actions → Configure Sparks.** Enter the LAN hostnames or IPs for Spark 1 and Spark 2, plus the SSH username you log into each Spark with.
 4. **Open the Web UI.** It will hit each Spark to confirm. If both indicators are green you're done.
 ## Using Spark Control
+1 -1
@@ -19,7 +19,7 @@ This package SSHes into your Spark server to run cluster commands, so it needs a
 ```bash
 echo "<paste-pubkey-here>" >> ~/.ssh/authorized_keys
 ```
-3. **Open Actions → Configure Sparks.** Enter the LAN hostnames or IPs for Spark 1 and Spark 2, plus the SSH username (usually `<spark-user>`).
+3. **Open Actions → Configure Sparks.** Enter the LAN hostnames or IPs for Spark 1 and Spark 2, plus the SSH username you log into each Spark with.
 4. **Open the Web UI.** It will hit each Spark to confirm. If both indicators are green you're done.
 ## Using Spark Control
+2 -2
@@ -1,10 +1,10 @@
 import { VersionInfo, IMPOSSIBLE } from '@start9labs/start-sdk'
 export const v0_1_0 = VersionInfo.of({
-version: '0.18.0:0',
+version: '0.18.0:1',
 releaseNotes: {
 en_US:
-'v0.18.0 — dual-channel mode for POST /api/audio/label-merge. Instead of one mixed-mono file, a caller (Ten31 Transcripts) can send two sample-aligned tracks: mic_file (the local user) + system_file (everyone else, from screen capture). Rather than force the diarizer to re-disentangle a mono mix (which over-segments — proven: a stereo clip of 2 clean voices returned 3 speakers), we split the problem so each model gets the easiest mono input. The mic track yields the local user\'s words, gated to windows where the mic is genuinely the user speaking (mic louder than system — a self-VAD computed server-side per-window, or supplied via self_vad); this gate is load-bearing because the mic picks up the remote audio as quiet bleed. The system track is diarized (only has to separate the remote people) and named via the visual timeline + voiceprints. The user\'s clean voiceprint is enrolled from the mic track and injected into the voiceprint library, so a system cluster that is the user dialed in from a second device (dual-login) resolves to the user, not a stranger. Validated on a real misattributing call: fixes both mono-mix misattributions, recovers the dropped-to-Unknown local line, and correctly splits overlapping speech (two people saying "Hello" at once) that the coarse ground truth itself conflated. New form fields: mic_file + system_file (dual mode), self_name, self_vad (optional). The mono file path is unchanged and fully backward-compatible. Response gains a "mode" field (mono | dual_channel). Known limit: if loud remote bleed masks a quiet local word, the mic-track ASR may miss it — mitigated by a cleaner mic (headphones) or future echo-cancellation. See docs/AUDIO_API.md.',
+'v0.18.0:1 – portability cleanup, no runtime behavior change. Scrubbed owner-specific hostnames, IPs, usernames, and personal names out of all shared docs, ops notes, and the offline redaction test, replacing them with neutral placeholders (<spark-1-ip>, <spark-user>, Alice/Bob, etc.) so the package is safe to share with another dual-Spark operator. The bundled install instructions and README no longer suggest a specific SSH username. Removed an obsolete build-time prompt file. Real cluster values now live only in your StartOS install config and local env — nothing identifying is committed.',
 },
 migrations: {
 up: async ({ effects }) => {},
+5 -5
@@ -37,7 +37,7 @@ These take effect on the **next swap to that model**. If a swap fails after this
 ## Adding a new model
 1. Add an entry to `image/models.yaml`. Required fields: `display_name`, `repo`, `size_gb`, `mode` (`solo` or `cluster`), `vllm_args`. Optional but recommended: `description` (one paragraph — what the model is, what it's good for, how it differs from others; renders below the meta tags in each card), `capabilities` (tags like `[vision, reasoning, tools]`), `expected_ready_seconds`.
-2. Confirm the weights are on the Spark: `ssh <spark-user>@<spark-1-host>.local 'ls ~/.cache/huggingface/hub/'`. If not, download with `./hf-download.sh <repo>` on Spark 1.
+2. Confirm the weights are on the Spark: `ssh <spark-user>@<spark-1-host> 'ls ~/.cache/huggingface/hub/'`. If not, download with `./hf-download.sh <repo>` on Spark 1.
 3. Rebuild + redeploy the package: `cd package && make x86 && make install`.
 If `description` is omitted, the card simply hides that section — no need to populate it for every model. Keep descriptions generic (not user-specific) so the catalog stays portable.
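The required fields listed in step 1 lend themselves to a quick pre-build check. The sketch below assumes each `models.yaml` entry has already been parsed into a plain dict; `validate_model_entry` is a hypothetical helper, and the value types (for instance whether `vllm_args` is a list or a string) are not specified by the runbook, so only presence and the `mode` value are checked.

```python
# Field names and allowed modes are taken from the runbook step above.
REQUIRED_FIELDS = ("display_name", "repo", "size_gb", "mode", "vllm_args")
VALID_MODES = ("solo", "cluster")


def validate_model_entry(key: str, entry: dict) -> list[str]:
    """Return a list of problems for one models.yaml entry (empty if none)."""
    problems = [
        f"{key}: missing required field '{name}'"
        for name in REQUIRED_FIELDS
        if name not in entry
    ]
    mode = entry.get("mode")
    if mode is not None and mode not in VALID_MODES:
        problems.append(f"{key}: mode must be 'solo' or 'cluster', got {mode!r}")
    return problems


if __name__ == "__main__":
    # Illustrative entry; the vllm_args shape shown here is an assumption.
    entry = {
        "display_name": "Gemma 4 31B",
        "repo": "RedHatAI/gemma-4-31B-it-NVFP4",
        "size_gb": 23,
        "mode": "solo",
        "vllm_args": ["--max-model-len", "32768"],
    }
    print(validate_model_entry("gemma4", entry) or "entry looks complete")
```

Optional fields such as `description` are deliberately ignored, matching the note that the card hides that section when it is omitted.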
@@ -47,7 +47,7 @@ If `description` is omitted, the card simply hides that section — no need to p
 If the UI is unavailable and you need to swap by hand:
 ```bash
-ssh <spark-user>@<spark-1-host>.local
+ssh <spark-user>@<spark-1-host>
 cd ~/spark-vllm-docker
 ./launch-cluster.sh stop
 ./launch-cluster.sh --solo -d exec vllm serve RedHatAI/gemma-4-31B-it-NVFP4 \
@@ -64,10 +64,10 @@ docker logs -f vllm_node # wait for "Application startup complete."
 curl -s http://<spark-1-ip>:8888/v1/models | jq .
 # Cluster status (containers up?)
-ssh <spark-user>@<spark-1-host>.local 'cd ~/spark-vllm-docker && ./launch-cluster.sh status'
+ssh <spark-user>@<spark-1-host> 'cd ~/spark-vllm-docker && ./launch-cluster.sh status'
 # Tail current model's logs
-ssh <spark-user>@<spark-1-host>.local 'docker logs --tail 200 -f vllm_node'
+ssh <spark-user>@<spark-1-host> 'docker logs --tail 200 -f vllm_node'
 # Parakeet
 curl -s http://<spark-2-ip>:8000/health
@@ -81,7 +81,7 @@ curl -s http://<spark-2-ip>:8880/health
 If launch-cluster.sh gets stuck:
 ```bash
-ssh <spark-user>@<spark-1-host>.local
+ssh <spark-user>@<spark-1-host>
 cd ~/spark-vllm-docker
 ./launch-cluster.sh stop
 docker ps -aq | xargs -r docker rm -f