Initial scaffold: image/ FastAPI app, models.yaml, docs

- image/ FastAPI app: /api/status, /api/swap, /api/swap/{id}/stream, /api/test-connection
- models.yaml: 5-model catalog (qwen3-vl, gemma4, qwen36, qwen3-235b-fp8, qwen25-72b)
- README, runbook, known-issues
- Dry-run swap verified against live Spark 1 (gemma4 currently loaded)
Grant
2026-05-12 09:29:13 -05:00
commit ae8efa1754
19 changed files with 1500 additions and 0 deletions
+14
@@ -0,0 +1,14 @@
__pycache__/
*.py[cod]
*.egg-info/
.venv/
venv/
.env
.env.*
!.env.example
node_modules/
*.s9pk
dist/
build/
.DS_Store
.claude/
+55
@@ -0,0 +1,55 @@
# spark-control
A browser-based control panel for a dual-DGX-Spark vLLM cluster. Designed to run as a [StartOS 0.4](https://docs.start9.com/packaging/0.4.0.x/) package on a Start9 server on the same LAN as the Sparks.
## What it does
- Shows which LLM is currently loaded on the cluster (`:8888/v1/models`).
- Click to swap to a different model — stops the current one, launches the new one, streams logs to the UI until `Application startup complete.` appears.
- Surfaces health for Parakeet (STT, `:8000`) and Magpie (TTS, `:9000`) on Spark 2.
## Architecture
```
[Browser/phone] ──► [StartOS reverse proxy] ──► [spark-control container]
                                                    │ (SSH over LAN)
                                                [Spark 1] ──► launch-cluster.sh
                                                [Spark 2]
```
Two layers in this repo:
- `image/` — a self-contained FastAPI app + static UI. Runs anywhere with `uvicorn` and an SSH client. Useful for development.
- `package/` — a thin StartOS 0.4 wrapper that packages the image, exposes the UI on the LAN, and gives the user actions to configure SSH access to the Sparks.
## Quick start (local dev, no StartOS yet)
```bash
cd image
python -m venv .venv && source .venv/bin/activate
pip install -e .
export SPARK1_HOST=<spark-1-host>.local
export SPARK1_USER=<spark-user>
export SPARK2_HOST=<spark-2-ip>
export SPARK2_USER=<spark-user>
export SSH_KEY_PATH="$HOME/Library/Application Support/NVIDIA/Sync/config/nvsync.key"
uvicorn app.server:app --host 0.0.0.0 --port 9999 --reload
```
Open <http://localhost:9999>.
## Repo layout
- `models.yaml` — model catalog (also bundled into the image)
- `image/` — Docker image source
- `package/` — StartOS 0.4 package source
- `scripts/build-s9pk.sh` — convenience wrapper around the StartOS build
- `runbook.md` — operating notes
- `known-issues.md` — known quirks and workarounds
## Status
v0.1 — local-only, single-cluster, no auth (trusts LAN).
+244
@@ -0,0 +1,244 @@
# Project: spark-control — Model switcher web UI for dual DGX Spark cluster
## Goal
I want to build a small web service that gives me a browser-based interface to:
1. See which LLM is currently loaded on my DGX Spark cluster
2. Click a button to swap to a different model
3. See real-time status as the swap progresses (stop → launch → ready)
4. See basic health info about supporting services (Parakeet STT, eventually Magpie TTS)
The UI should live at a stable URL on my LAN so I can bookmark it. I'll likely access it from my laptop and phone.
## Where this project lives
This repo lives on **my laptop** (macOS). The Sparks are servers — we control them remotely over SSH. Claude Code runs on my laptop, makes edits in the local repo, and executes commands on the Sparks via SSH.
The web UI itself, when deployed, will run on **Spark 1** (where it can directly invoke `launch-cluster.sh`), but development happens on my laptop. We'll deploy the code to Spark 1 via `rsync` or `scp` or `git pull` as needed.
## SSH setup
From my laptop I can SSH to either Spark directly:
```bash
ssh <spark-user>@<spark-1-ip> # Spark 1
ssh <spark-user>@<spark-2-ip> # Spark 2
```
(I can also use SSH key auth — set up earlier.)
When you need to run a command on a Spark, use this pattern:
```bash
ssh <spark-user>@<spark-1-ip> 'cd ~/spark-vllm-docker && ./launch-cluster.sh status'
```
For multi-line commands or scripts, you can pipe a heredoc or just SSH in directly and run them interactively. Either works — but always tell me what you're about to run so I can review.
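As a minimal sketch of the SSH pattern above (not part of the original brief), the helper below builds the `ssh` argument list for a command run inside a working directory and renders a copy-pasteable preview, so the command can be reviewed before it runs. The function name `build_ssh_argv` and the example user and host are hypothetical; `BatchMode=yes` is a standard OpenSSH option that disables interactive prompts.

```python
import shlex


def build_ssh_argv(user: str, host: str, workdir: str, command: str) -> list[str]:
    """Return the argv for running `command` inside `workdir` on a remote host."""
    # Joining with && means a missing directory aborts before the command runs.
    # The workdir is left unquoted so the remote shell can still expand `~`.
    remote = f"cd {workdir} && {command}"
    return ["ssh", "-o", "BatchMode=yes", f"{user}@{host}", remote]


# Hypothetical user and host, for illustration only.
argv = build_ssh_argv("user", "10.0.0.1", "~/spark-vllm-docker", "./launch-cluster.sh status")

# shlex.join produces a shell-safe string that can be shown for review first.
preview = shlex.join(argv)
print(preview)
```

Passing `argv` as a list to `subprocess.run` (rather than a single string with `shell=True`) keeps local shell quoting out of the picture; only the remote shell interprets the final argument.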
For file transfers between my laptop and the Sparks, use `rsync`:
```bash
rsync -avz ~/Projects/spark-control/ <spark-user>@<spark-1-ip>:~/spark-control/
```
## My hardware and what's running
**Two NVIDIA DGX Spark units** networked together:
- **Spark 1** — hostname `<spark-1-host>`, LAN IP `<spark-1-ip>`, QSFP IP `<spark-1-qsfp-ip>`. Head node for the vLLM cluster.
- **Spark 2** — hostname `<spark-2-host>`, LAN IP `<spark-2-ip>`, QSFP IP `<spark-2-qsfp-ip>`. Worker node for vLLM cluster, also hosts standalone services.
Both run Ubuntu 24.04, NVIDIA driver 580.x, CUDA 13.0, Docker, and have 128 GB unified memory each. They share a QSFP cable for high-speed (200 Gb/s) inter-node networking.
Passwordless SSH works in both directions using the `~/.ssh/<ssh-key>` key. My Linux username on both machines is `<spark-user>`.
**Currently running:**
- One LLM at a time on the cluster (via the `eugr/spark-vllm-docker` project — see below)
- `parakeet-asr` Docker container on Spark 2 (port 8000) — running 24/7 for speech-to-text, healthy for weeks
- `magpie-tts` Docker container on Spark 2 (port 9000) — was being set up; I'm not 100% sure of its current state; first task is to verify
- Open WebUI runs on a separate Start9 server on the LAN (not on the Sparks), accessing the LLM via HTTP
## The LLM cluster: how it works
I use the **`eugr/spark-vllm-docker`** community project (cloned to `~/spark-vllm-docker` on Spark 1). It manages a Ray-based vLLM cluster across both Sparks, with a wrapper script called `launch-cluster.sh` that handles starting/stopping Docker containers on both nodes.
Key commands (all run from `~/spark-vllm-docker` on Spark 1):
- `./launch-cluster.sh status` — see what's running on both nodes
- `./launch-cluster.sh stop` — stop the cluster
- `./launch-cluster.sh -d exec vllm serve ...` — launch in daemon mode with vLLM args
- `./launch-cluster.sh --solo -d exec vllm serve ...` — same but only on Spark 1 (for smaller models)
- `docker logs -f vllm_node` — tail vLLM logs
Container names: `vllm_node` (the main vLLM container), `ray_head` and `ray_worker` (Ray cluster), plus support containers.
The vLLM server binds to port **8888** and exposes an OpenAI-compatible API at `http://<spark-1-ip>:8888/v1`.
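To illustrate how the currently loaded model can be read from that endpoint, here is a small sketch that parses a `/v1/models` response body. It assumes the usual OpenAI-style list shape (`{"object": "list", "data": [{"id": ...}]}`); the sample payload is made up for the example, and `loaded_model_ids` is a hypothetical helper name.

```python
import json


def loaded_model_ids(payload: str) -> list[str]:
    """Extract model IDs from an OpenAI-style `/v1/models` response body."""
    data = json.loads(payload)
    # A missing "data" key is treated as "no models loaded" rather than an error.
    return [m["id"] for m in data.get("data", [])]


# Illustrative payload only; a real response carries more fields per model.
sample = '{"object": "list", "data": [{"id": "RedHatAI/gemma-4-31B-it-NVFP4", "object": "model"}]}'
ids = loaded_model_ids(sample)
current = ids[0] if ids else None
print(current)
```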
## Models I have on disk (both Sparks)
All weights live in `~/.cache/huggingface/hub/` on each Spark:
1. **`RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4`** (~135 GB) — flagship MoE, runs across both Sparks (-tp 2), has vision capability. Use for: maximum quality, vision input, multilingual.
2. **`RedHatAI/gemma-4-31B-it-NVFP4`** (~23 GB) — runs solo on Spark 1, has vision, has thinking-mode reasoning. Use for: math/reasoning-heavy tasks. Has a known vLLM Triton-attention slowdown bug (~15-20 tok/s vs theoretical 30-40).
3. **`RedHatAI/Qwen3.6-35B-A3B-NVFP4`** (~20 GB) — newer-generation Qwen MoE (35B total / 3B active), runs solo on Spark 1, expected to be the fastest (~70-100 tok/s) and my new daily driver. **Note: this may still be downloading or may not be downloaded yet — first task is to verify and download if needed.**
## Exact launch commands for each model
These are the commands my system needs to run when I click a swap button.
### Qwen3-VL-235B (uses both Sparks)
```bash
cd ~/spark-vllm-docker
./launch-cluster.sh stop
./launch-cluster.sh -d exec vllm serve \
RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4 \
--port 8888 --host 0.0.0.0 \
--gpu-memory-utilization 0.7 \
-tp 2 \
--distributed-executor-backend ray \
--max-model-len 32768
```
Expected ready time: ~3-5 min after stop completes.
### Gemma 4 31B (solo on Spark 1)
```bash
cd ~/spark-vllm-docker
./launch-cluster.sh stop
./launch-cluster.sh --solo -d exec vllm serve \
RedHatAI/gemma-4-31B-it-NVFP4 \
--port 8888 --host 0.0.0.0 \
--gpu-memory-utilization 0.8 \
--max-model-len 32768 \
--reasoning-parser gemma4 \
--tool-call-parser gemma4 \
--enable-auto-tool-choice
```
Expected ready time: ~3-4 min.
### Qwen3.6-35B-A3B (solo on Spark 1) — new daily driver
```bash
cd ~/spark-vllm-docker
./launch-cluster.sh stop
./launch-cluster.sh --solo -d exec vllm serve \
RedHatAI/Qwen3.6-35B-A3B-NVFP4 \
--port 8888 --host 0.0.0.0 \
--gpu-memory-utilization 0.85 \
--max-model-len 65536 \
--reasoning-parser qwen3 \
--moe_backend flashinfer_cutlass
```
Expected ready time: ~3-5 min.
Note: the `--moe_backend flashinfer_cutlass` flag is Blackwell-specific. If the launch fails because of it, the fallback is to remove the flag and relaunch.
### Common operations
- Stop everything: `./launch-cluster.sh stop`
- Status check: `./launch-cluster.sh status`
- See vLLM logs: `docker logs vllm_node` (add `-f` to follow)
- Hard reset if stuck: `./launch-cluster.sh stop && docker ps -aq | xargs -r docker rm -f`
- Health check (is API responding?): `curl -s http://<spark-1-ip>:8888/v1/models`
### "Ready" signal
The model is ready to serve when `docker logs vllm_node` contains the line `Application startup complete.` Until then, it's still loading weights or compiling CUDA graphs.
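The ready check can be sketched as a scan over log lines with a deadline. This is an illustration under stated assumptions, not the project's implementation: `wait_for_ready` is a hypothetical name, the 600-second default mirrors the roughly 10-minute timeout mentioned elsewhere in this brief, and the deadline is only checked when a new line arrives, so a completely silent log would need an outer timeout as well.

```python
import time
from typing import Callable, Iterable

READY_MARKER = "Application startup complete."


def wait_for_ready(
    lines: Iterable[str],
    timeout_s: float = 600.0,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Return True once the ready marker appears, False on timeout or end of stream."""
    deadline = clock() + timeout_s
    for line in lines:
        if READY_MARKER in line:
            return True
        if clock() > deadline:
            return False
    # The log stream ended without the marker (e.g. the container exited).
    return False


# Illustrative log lines; real vLLM output is far more verbose.
fake_log = ["Loading weights...", "INFO:     Application startup complete."]
print(wait_for_ready(fake_log))
```

Injecting `clock` keeps the function testable without real waiting.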
## Supporting services on Spark 2 (always-on, separate from cluster)
These don't get touched by model swaps:
- **`parakeet-asr`** — STT on port 8000. Already running 24/7. Verify with `curl http://<spark-2-ip>:8000/health` which should return `{"status":"ready",...}`.
- **`magpie-tts`** — TTS on port 9000. May or may not be running; verify with `docker ps` on Spark 2 and `curl http://<spark-2-ip>:9000/v1/health/ready`.
## What I want you to build
### Phase 1: Set up the project repo (start here)
Create a Git repo at `~/Projects/spark-control/` on **my laptop**. Initial structure:
```
spark-control/
├── README.md
├── models.yaml # Declarative config for each model
├── scripts/
│ ├── swap-model.sh # Universal swap script
│ ├── status.sh # Cluster + service status
│ └── health.sh # Health checks for everything
├── web-ui/
│ ├── server.py # FastAPI backend
│ ├── static/
│ │ ├── index.html # Toggle UI
│ │ ├── style.css
│ │ └── app.js # State management, polling
│ └── requirements.txt
├── runbook.md # Operating notes
└── known-issues.md # Gotchas, troubleshooting
```
### Phase 2: Build the universal swap script
`scripts/swap-model.sh <model-key>` should:
1. Read the launch command from `models.yaml` by key (e.g. `qwen3-vl`, `gemma4`, `qwen36`)
2. Stop the current cluster (via SSH to Spark 1)
3. Run the new launch command (via SSH to Spark 1)
4. Tail logs until "Application startup complete" appears or a timeout (~10 min) hits
5. Return exit code 0 on success, non-zero on failure
Two versions might be useful:
- The version that runs on **my laptop** — wraps everything in `ssh <spark-user>@<spark-1-ip> ...`
- A simpler version that lives on **Spark 1** — runs commands directly without SSH (used by the deployed web UI)
You can either share one script with a `--remote` flag, or make them two distinct files. Your call — propose the cleaner option.
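The stop, launch, wait sequence and its exit codes can be sketched as below. This is only one possible shape: the `run` and `follow_logs` callables are hypothetical injection points (for example, thin wrappers around `ssh` on the laptop or direct calls on Spark 1), which is one way to share a single code path between the remote and local variants. The specific non-zero codes are arbitrary choices for the example.

```python
from typing import Callable, Iterable


def swap(
    stop_cmd: str,
    launch_cmd: str,
    run: Callable[[str], int],
    follow_logs: Callable[[], Iterable[str]],
    ready_marker: str = "Application startup complete.",
) -> int:
    """Run stop, then launch, then watch logs. Returns 0 on success, non-zero otherwise."""
    if run(stop_cmd) != 0:
        return 1  # stopping the current cluster failed
    if run(launch_cmd) != 0:
        return 2  # the launch command itself failed
    for line in follow_logs():
        if ready_marker in line:
            return 0
    return 3  # logs ended without the ready marker


# Fake runner and log source, for illustration only.
executed: list[str] = []


def fake_run(cmd: str) -> int:
    executed.append(cmd)
    return 0


rc = swap(
    "./launch-cluster.sh stop",
    "./launch-cluster.sh --solo -d exec vllm serve RedHatAI/gemma-4-31B-it-NVFP4",
    run=fake_run,
    follow_logs=lambda: ["loading", "Application startup complete."],
)
print(rc, executed)
```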
### Phase 3: Build the web UI
FastAPI backend that:
- `GET /api/status` → JSON with `{current_model, ready, parakeet_health, magpie_health, last_swap_time}`
- `POST /api/swap` with `{model_key}` → starts swap, returns swap job ID
- `GET /api/swap/{job_id}/stream` → Server-Sent Events streaming swap progress
- `GET /` → serves the HTML UI
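For the streaming endpoint, each Server-Sent Events message is plain text: an optional `event:` line, a `data:` line, and a blank line to end the message. The helper below is a hedged sketch of that framing; `sse_frame` is a hypothetical name, and the payload keys are illustrative.

```python
import json
from typing import Optional


def sse_frame(data: dict, event: Optional[str] = None) -> str:
    """Format one Server-Sent Events message with a JSON payload."""
    lines = []
    if event:
        # Named events are received via addEventListener(name) in the browser;
        # unnamed ones go to EventSource.onmessage.
        lines.append(f"event: {event}")
    # json.dumps escapes newlines, so the payload always fits on one data line.
    lines.append(f"data: {json.dumps(data)}")
    # A blank line terminates the message.
    return "\n".join(lines) + "\n\n"


progress = sse_frame({"line": "hi"})
done = sse_frame({"returncode": 0}, event="done")
print(repr(progress))
print(repr(done))
```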
Frontend should:
- Show a card per model with a "Switch to this" button
- Highlight which model is currently loaded
- During a swap, show streaming log output and a spinner
- Show a green/red indicator for Parakeet and Magpie health
- Auto-refresh every 5 seconds
Keep the UI simple, clean, dark-themed. No frameworks needed — vanilla HTML/JS is fine.
### Phase 4: Deploy and make it persistent
The web UI runs on **Spark 1** so it can directly invoke `launch-cluster.sh` without SSH overhead. To deploy:
1. `rsync` the project code from my laptop to `~/spark-control/` on Spark 1
2. Set up a Python virtual environment on Spark 1 and install requirements
3. Create a systemd service file that starts the FastAPI server on boot
4. Service should listen on `0.0.0.0:9999` so I can hit it from any device on my LAN
5. Add a simple deploy script (`scripts/deploy.sh`) on my laptop that does the rsync + restart in one command for future iteration
## Working style
- Before making changes that affect the running cluster, please ask me first.
- When you write commands you want me to run, give them in clearly marked code blocks.
- Distinguish clearly when a command is meant to run on my laptop vs. on a Spark (which means via SSH).
- If you need information about the current state of the Sparks, ask me to run a diagnostic SSH command and paste the output — or run it yourself if you have shell access.
- Test things incrementally. Don't build the whole UI before validating the swap script works.
- I'm a layman — explain technical decisions briefly in plain English when they involve trade-offs.
- When making changes that modify files on a Spark, do them by editing in my laptop's repo first and then deploying — not by editing on the Spark directly. That keeps my laptop as the source of truth.
## First task
1. First, **verify SSH access to both Sparks** from my laptop:
- `ssh <spark-user>@<spark-1-ip> hostname` should return `<spark-1-host>`
- `ssh <spark-user>@<spark-2-ip> hostname` should return `<spark-2-host>`
2. Then **verify the current state of the cluster** via SSH:
- Confirm `~/spark-vllm-docker` exists on Spark 1 and `launch-cluster.sh` is there: `ssh <spark-user>@<spark-1-ip> 'ls ~/spark-vllm-docker/launch-cluster.sh'`
- Check which LLM (if any) is currently loaded: `ssh <spark-user>@<spark-1-ip> 'cd ~/spark-vllm-docker && ./launch-cluster.sh status'` and `ssh <spark-user>@<spark-1-ip> 'curl -s http://localhost:8888/v1/models'`
- Verify which models are downloaded: `ssh <spark-user>@<spark-1-ip> 'ls ~/.cache/huggingface/hub/ | grep -iE "qwen|gemma"'`
- Specifically check if `Qwen3.6-35B-A3B-NVFP4` is downloaded; if not, that's the prerequisite step (run the `hf-download.sh` command on Spark 1)
- Check what's running on Spark 2: `ssh <spark-user>@<spark-2-ip> 'docker ps'` (looking for parakeet-asr and possibly magpie-tts)
3. Then create the repo structure on my laptop at `~/Projects/spark-control/`
4. Then propose the design for `models.yaml` and the swap script before implementing
Ask me anything that's unclear before starting.
+21
@@ -0,0 +1,21 @@
FROM python:3.12-slim
RUN apt-get update \
&& apt-get install -y --no-install-recommends openssh-client curl ca-certificates \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /app
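# Paths are relative to the repo root, which is the build context (see note below).
COPY image/pyproject.toml /app/
COPY image/app /app/app
COPY image/entrypoint.sh /app/entrypoint.sh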
RUN chmod +x /app/entrypoint.sh
# models.yaml is mounted in or copied at build via build-context root.
# For Docker, build from the repo root: `docker build -f image/Dockerfile .`
COPY models.yaml /app/models.yaml
RUN pip install --no-cache-dir -e .
ENV BIND_PORT=9999
EXPOSE 9999
ENTRYPOINT ["/app/entrypoint.sh"]
+58
@@ -0,0 +1,58 @@
from __future__ import annotations

import os
from dataclasses import dataclass
from pathlib import Path


def _env(name: str, default: str = "") -> str:
    return os.environ.get(name, default)


def _resolve_models_yaml() -> str:
    if env := os.environ.get("MODELS_YAML"):
        return env
    here = Path(__file__).resolve().parent  # app/
    candidates = [
        here.parent / "models.yaml",  # image/models.yaml (Docker)
        here.parent.parent / "models.yaml",  # <repo>/models.yaml (dev)
        Path("/app/models.yaml"),  # explicit container path
    ]
    for p in candidates:
        if p.exists():
            return str(p)
    return str(candidates[0])  # let load fail with a clear path


@dataclass(frozen=True)
class Settings:
    spark1_host: str
    spark1_user: str
    spark2_host: str
    spark2_user: str
    ssh_key_path: str
    ssh_known_hosts: str
    models_yaml: str
    vllm_port: int
    parakeet_port: int
    magpie_port: int
    bind_port: int

    @classmethod
    def from_env(cls) -> "Settings":
        return cls(
            spark1_host=_env("SPARK1_HOST"),
            spark1_user=_env("SPARK1_USER", "<spark-user>"),
            spark2_host=_env("SPARK2_HOST"),
            spark2_user=_env("SPARK2_USER", "<spark-user>"),
            ssh_key_path=_env("SSH_KEY_PATH"),
            ssh_known_hosts=_env("SSH_KNOWN_HOSTS"),
            models_yaml=_resolve_models_yaml(),
            vllm_port=int(_env("VLLM_PORT", "8888")),
            parakeet_port=int(_env("PARAKEET_PORT", "8000")),
            magpie_port=int(_env("MAGPIE_PORT", "9000")),
            bind_port=int(_env("BIND_PORT", "9999")),
        )

    @property
    def configured(self) -> bool:
        return bool(self.spark1_host)
+43
@@ -0,0 +1,43 @@
from __future__ import annotations

import httpx

from .config import Settings

_TIMEOUT = 3.0


async def check_vllm(settings: Settings) -> dict:
    if not settings.spark1_host:
        return {"ok": False, "error": "spark1 not configured"}
    try:
        async with httpx.AsyncClient(timeout=_TIMEOUT) as c:
            r = await c.get(f"http://{settings.spark1_host}:{settings.vllm_port}/v1/models")
            r.raise_for_status()
            ids = [m["id"] for m in r.json().get("data", [])]
            return {"ok": True, "current_model": ids[0] if ids else None, "all": ids}
    except Exception as e:
        return {"ok": False, "error": str(e)}


async def check_parakeet(settings: Settings) -> dict:
    if not settings.spark2_host:
        return {"ok": False, "error": "spark2 not configured"}
    try:
        async with httpx.AsyncClient(timeout=_TIMEOUT) as c:
            r = await c.get(f"http://{settings.spark2_host}:{settings.parakeet_port}/health")
            r.raise_for_status()
            return {"ok": True, "detail": r.json()}
    except Exception as e:
        return {"ok": False, "error": str(e)}


async def check_magpie(settings: Settings) -> dict:
    if not settings.spark2_host:
        return {"ok": False, "error": "spark2 not configured"}
    try:
        async with httpx.AsyncClient(timeout=_TIMEOUT) as c:
            r = await c.get(f"http://{settings.spark2_host}:{settings.magpie_port}/v1/health/ready")
            r.raise_for_status()
            return {"ok": True, "detail": r.json() if r.headers.get("content-type", "").startswith("application/json") else r.text}
    except Exception as e:
        return {"ok": False, "error": str(e)}
+40
@@ -0,0 +1,40 @@
from __future__ import annotations

from typing import Literal

import yaml
from pydantic import BaseModel, Field


class ModelDef(BaseModel):
    display_name: str
    repo: str
    size_gb: float
    mode: Literal["solo", "cluster"]
    capabilities: list[str] = Field(default_factory=list)
    expected_ready_seconds: int = 300
    vllm_args: list[str] = Field(default_factory=list)


class Defaults(BaseModel):
    port: int = 8888
    host: str = "0.0.0.0"


class Catalog(BaseModel):
    defaults: Defaults = Field(default_factory=Defaults)
    models: dict[str, ModelDef]


def load_catalog(path: str) -> Catalog:
    with open(path) as f:
        data = yaml.safe_load(f)
    return Catalog.model_validate(data)


def build_launch_command(key: str, model: ModelDef, defaults: Defaults) -> str:
    """Return the shell command to launch `model` on Spark 1.

    Assumes cwd will be `~/spark-vllm-docker` (we cd in the SSH wrapper).
    """
    solo = "--solo " if model.mode == "solo" else ""
    args = [f"--port={defaults.port}", f"--host={defaults.host}", *model.vllm_args]
    return f"./launch-cluster.sh {solo}-d exec vllm serve {model.repo} {' '.join(args)}"
+155
@@ -0,0 +1,155 @@
from __future__ import annotations

import asyncio
import json
from pathlib import Path

from fastapi import FastAPI, HTTPException
from fastapi.responses import FileResponse, JSONResponse, StreamingResponse
from fastapi.staticfiles import StaticFiles
from pydantic import BaseModel

from .config import Settings
from .health import check_magpie, check_parakeet, check_vllm
from .models import load_catalog
from .ssh import ssh_run
from .swap import SwapManager

settings = Settings.from_env()
catalog = load_catalog(settings.models_yaml)
swap_manager = SwapManager(settings, catalog)

app = FastAPI(title="spark-control", version="0.1.0")

_STATIC_DIR = Path(__file__).resolve().parent / "static"
app.mount("/static", StaticFiles(directory=_STATIC_DIR), name="static")


@app.get("/", include_in_schema=False)
async def index() -> FileResponse:
    return FileResponse(_STATIC_DIR / "index.html")


@app.get("/api/config")
async def get_config() -> dict:
    return {
        "configured": settings.configured,
        "spark1_host": settings.spark1_host,
        "spark2_host": settings.spark2_host,
        "vllm_port": settings.vllm_port,
    }


@app.get("/api/models")
async def get_models() -> dict:
    return {
        "defaults": catalog.defaults.model_dump(),
        "models": {k: v.model_dump() for k, v in catalog.models.items()},
    }


@app.get("/api/status")
async def get_status() -> dict:
    vllm, parakeet, magpie = await asyncio.gather(
        check_vllm(settings),
        check_parakeet(settings),
        check_magpie(settings),
    )
    current_key = _identify_current_model(vllm.get("current_model"))
    return {
        "configured": settings.configured,
        "vllm": vllm,
        "parakeet": parakeet,
        "magpie": magpie,
        "current_model_key": current_key,
        "current_swap_job": swap_manager.current_job_id,
    }


def _identify_current_model(repo: str | None) -> str | None:
    if not repo:
        return None
    for key, m in catalog.models.items():
        if m.repo == repo:
            return key
    return None


class SwapRequest(BaseModel):
    model_key: str
    dry_run: bool = False


@app.post("/api/swap")
async def post_swap(req: SwapRequest) -> dict:
    if not settings.configured and not req.dry_run:
        raise HTTPException(503, "spark1 not configured")
    try:
        job = await swap_manager.trigger(req.model_key, dry_run=req.dry_run)
    except KeyError:
        raise HTTPException(404, f"unknown model: {req.model_key}")
    except RuntimeError as e:
        raise HTTPException(409, str(e))
    return {"job_id": job.id, "model_key": job.model_key, "state": job.state}


@app.get("/api/swap/{job_id}")
async def get_swap(job_id: str) -> dict:
    job = swap_manager.get(job_id)
    if job is None:
        raise HTTPException(404, "no such job")
    return {
        "id": job.id,
        "model_key": job.model_key,
        "state": job.state,
        "started_at": job.started_at,
        "finished_at": job.finished_at,
        "returncode": job.returncode,
        "dry_run": job.dry_run,
        "lines": job.lines,
    }


@app.get("/api/swap/{job_id}/stream")
async def stream_swap(job_id: str):
    job = swap_manager.get(job_id)
    if job is None:
        raise HTTPException(404, "no such job")

    async def gen():
        sent = 0
        while True:
            n = len(job.lines)
            if n > sent:
                for line in job.lines[sent:n]:
                    payload = json.dumps({"line": line, "state": job.state})
                    yield f"data: {payload}\n\n"
                sent = n
            if job.returncode is not None and sent >= len(job.lines):
                payload = json.dumps({
                    "state": job.state,
                    "returncode": job.returncode,
                    "finished_at": job.finished_at,
                })
                yield f"event: done\ndata: {payload}\n\n"
                return
            await asyncio.sleep(0.4)

    return StreamingResponse(gen(), media_type="text/event-stream")


@app.post("/api/test-connection")
async def test_connection() -> dict:
    """Probe both Sparks with a `hostname` command. Useful for the StartOS setup flow."""
    results: dict[str, dict] = {}
    if settings.spark1_host:
        rc, out, err = await ssh_run(settings.spark1_host, settings.spark1_user, "hostname && docker ps --format '{{.Names}}'", settings, timeout=10)
        results["spark1"] = {"ok": rc == 0, "rc": rc, "stdout": out.strip(), "stderr": err.strip()}
    else:
        results["spark1"] = {"ok": False, "error": "not configured"}
    if settings.spark2_host:
        rc, out, err = await ssh_run(settings.spark2_host, settings.spark2_user, "hostname && docker ps --format '{{.Names}}'", settings, timeout=10)
        results["spark2"] = {"ok": rc == 0, "rc": rc, "stdout": out.strip(), "stderr": err.strip()}
    else:
        results["spark2"] = {"ok": False, "error": "not configured"}
    return results
+91
@@ -0,0 +1,91 @@
"""Async wrappers around the system `ssh` client.

We shell out rather than use Paramiko/asyncssh so that:
  - Host key + auth behavior is identical to what a user would see at the shell.
  - The same ssh config file (`~/.ssh/config`) and key files work in dev.
  - We don't pull in a heavy crypto dependency for the container image.
"""
from __future__ import annotations

import asyncio
from typing import AsyncIterator

from .config import Settings


def _base_args(settings: Settings) -> list[str]:
    args = [
        "ssh",
        "-o", "BatchMode=yes",
        "-o", "StrictHostKeyChecking=accept-new",
        "-o", "ServerAliveInterval=15",
        "-o", "ServerAliveCountMax=4",
    ]
    if settings.ssh_key_path:
        args += ["-i", settings.ssh_key_path]
    if settings.ssh_known_hosts:
        args += ["-o", f"UserKnownHostsFile={settings.ssh_known_hosts}"]
    return args


async def ssh_run(
    host: str,
    user: str,
    command: str,
    settings: Settings,
    timeout: float = 30.0,
) -> tuple[int, str, str]:
    """Run a one-shot SSH command. Returns (rc, stdout, stderr)."""
    args = _base_args(settings) + [f"{user}@{host}", command]
    proc = await asyncio.create_subprocess_exec(
        *args,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        stdout_b, stderr_b = await asyncio.wait_for(proc.communicate(), timeout=timeout)
    except asyncio.TimeoutError:
        proc.kill()
        await proc.wait()
        return 124, "", f"timeout after {timeout}s"
    assert proc.returncode is not None
    return proc.returncode, stdout_b.decode(errors="replace"), stderr_b.decode(errors="replace")


class StreamHandle:
    """Holds the final returncode once an `ssh_stream()` generator completes."""

    def __init__(self) -> None:
        self.returncode: int | None = None


async def ssh_stream(
    host: str,
    user: str,
    command: str,
    settings: Settings,
    handle: StreamHandle | None = None,
) -> AsyncIterator[str]:
    """Yield stdout (and merged stderr) lines from a long-running SSH command.

    The generator may be aborted by closing it (e.g. `break` in `async for`);
    the child SSH process is terminated and waited on in the `finally` block.
    """
    args = _base_args(settings) + [f"{user}@{host}", command]
    proc = await asyncio.create_subprocess_exec(
        *args,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.STDOUT,
    )
    assert proc.stdout is not None
    try:
        async for raw in proc.stdout:
            yield raw.decode(errors="replace").rstrip("\r\n")
    finally:
        if proc.returncode is None:
            proc.terminate()
            try:
                await asyncio.wait_for(proc.wait(), timeout=5)
            except asyncio.TimeoutError:
                proc.kill()
                await proc.wait()
        if handle is not None:
            handle.returncode = proc.returncode
+195
@@ -0,0 +1,195 @@
// spark-control front-end
// - polls /api/status every 5s for current model + health
// - lists models from /api/models as cards
// - POST /api/swap to start a swap, then opens SSE /api/swap/{id}/stream
const state = {
models: {},
defaults: {},
current_model_key: null,
swap_job_id: null,
swap_eventsource: null,
configured: true,
};
function el(sel) { return document.querySelector(sel); }
function $(sel) { return document.querySelectorAll(sel); }
async function fetchJSON(url, opts) {
const r = await fetch(url, opts);
if (!r.ok) {
const text = await r.text().catch(() => "");
throw new Error(`${r.status} ${r.statusText}: ${text}`);
}
return r.json();
}
function renderCards() {
const root = el("#cards");
root.innerHTML = "";
const keys = Object.keys(state.models);
for (const key of keys) {
const m = state.models[key];
const isActive = key === state.current_model_key;
const isSwapping = !!state.swap_job_id;
const card = document.createElement("div");
card.className = "card" + (isActive ? " active" : "");
card.innerHTML = `
<div class="name">${m.display_name}</div>
<div class="meta">
<span class="tag mode-${m.mode}">${m.mode}</span>
<span class="tag">${m.size_gb} GB</span>
${(m.capabilities || []).map(c => `<span class="tag cap">${c}</span>`).join("")}
</div>
<div class="muted small" style="word-break:break-all">${m.repo}</div>
<div class="spacer"></div>
<button class="btn ${isActive ? "" : "primary"}" data-key="${key}" ${isActive || isSwapping ? "disabled" : ""}>
${isActive ? "Current" : "Switch to this"}
</button>
`;
root.appendChild(card);
}
for (const btn of $(".card .btn")) {
btn.addEventListener("click", () => triggerSwap(btn.dataset.key));
}
}
function renderCurrent(status) {
const c = el("#current");
if (!status.configured) {
c.innerHTML = `<span class="muted">not configured</span>`;
return;
}
if (status.current_swap_job) {
c.innerHTML = `<span class="muted">swap in progress</span>`;
return;
}
const v = status.vllm || {};
if (!v.ok) {
c.innerHTML = `<span class="muted">vLLM unreachable</span>`;
return;
}
const key = status.current_model_key;
const m = key ? state.models[key] : null;
const label = m ? m.display_name : (v.current_model || "(unknown)");
c.innerHTML = `<strong>${label}</strong>`;
}
function renderHealth(status) {
function setDot(id, ok) {
const item = el(id);
if (!item) return;
const dot = item.querySelector(".dot");
dot.classList.remove("ok", "bad", "warn");
if (ok === true) dot.classList.add("ok");
else if (ok === false) dot.classList.add("bad");
else dot.classList.add("warn");
item.title = JSON.stringify(status[id.replace("#h-", "")] || {}, null, 2);
}
setDot("#h-vllm", status.vllm && status.vllm.ok);
setDot("#h-parakeet", status.parakeet && status.parakeet.ok);
setDot("#h-magpie", status.magpie && status.magpie.ok);
el("#updated").textContent = `updated ${new Date().toLocaleTimeString()}`;
}
function renderBanner(status) {
el("#setup-banner").classList.toggle("hidden", !!status.configured);
}
async function pollStatus() {
try {
const status = await fetchJSON("/api/status");
state.current_model_key = status.current_model_key;
state.configured = status.configured;
renderBanner(status);
renderCurrent(status);
renderHealth(status);
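if (status.current_swap_job && (status.current_swap_job !== state.swap_job_id || !state.swap_eventsource)) {
// new job, or our SSE stream dropped (see es.onerror): (re)attach and replay the log
attachToSwap(status.current_swap_job);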
} else if (!status.current_swap_job && state.swap_job_id && !state.swap_eventsource) {
// someone else's swap finished; clear local
state.swap_job_id = null;
el("#swap-panel").classList.add("hidden");
}
renderCards();
} catch (e) {
console.error("status poll failed", e);
}
}
async function loadModels() {
const data = await fetchJSON("/api/models");
state.defaults = data.defaults || {};
state.models = data.models || {};
}
async function triggerSwap(modelKey) {
if (state.swap_job_id) return;
try {
const r = await fetchJSON("/api/swap", {
method: "POST",
headers: { "content-type": "application/json" },
body: JSON.stringify({ model_key: modelKey }),
});
attachToSwap(r.job_id);
} catch (e) {
alert("Failed to start swap: " + e.message);
}
}
function attachToSwap(jobId) {
if (state.swap_eventsource) {
state.swap_eventsource.close();
state.swap_eventsource = null;
}
state.swap_job_id = jobId;
el("#swap-panel").classList.remove("hidden");
el("#swap-log").textContent = "";
el("#swap-state").textContent = "starting";
const es = new EventSource(`/api/swap/${jobId}/stream`);
state.swap_eventsource = es;
es.onmessage = (ev) => {
try {
const d = JSON.parse(ev.data);
if (d.state) el("#swap-state").textContent = d.state;
if (d.line) appendLog(d.line);
} catch {}
};
es.addEventListener("done", (ev) => {
try {
const d = JSON.parse(ev.data);
el("#swap-state").textContent = d.state + ` (rc=${d.returncode})`;
} catch {}
es.close();
state.swap_eventsource = null;
state.swap_job_id = null;
setTimeout(() => {
el("#swap-panel").classList.add("hidden");
pollStatus();
}, 4000);
pollStatus();
});
es.onerror = () => {
// SSE drops happen on tab background; reconnect on next poll
es.close();
state.swap_eventsource = null;
};
renderCards();
}
function appendLog(line) {
const log = el("#swap-log");
log.textContent += line + "\n";
log.scrollTop = log.scrollHeight;
}
async function init() {
await loadModels();
await pollStatus();
setInterval(pollStatus, 5000);
}
init();
+51
@@ -0,0 +1,51 @@
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, viewport-fit=cover">
<meta name="color-scheme" content="dark">
<title>spark-control</title>
<link rel="stylesheet" href="/static/style.css">
</head>
<body>
<header class="topbar">
<div class="brand">
<span class="logo-dot"></span>
<span>spark-control</span>
</div>
<div class="current" id="current">
<span class="muted">connecting…</span>
</div>
</header>
<main>
<section id="setup-banner" class="banner hidden">
<strong>Configuration needed.</strong>
<span>Run the <em>Configure Sparks</em> action in StartOS to set hostnames, then run <em>Test Connection</em>.</span>
</section>
<section id="swap-panel" class="swap-panel hidden">
<div class="swap-header">
<span class="spinner"></span>
<span id="swap-title">Swapping…</span>
<span class="spacer"></span>
<span class="muted small" id="swap-state"></span>
</div>
<pre id="swap-log" class="log"></pre>
</section>
<section id="cards" class="cards"></section>
<footer class="footer">
<div class="health">
<span class="health-item" id="h-vllm"><span class="dot"></span> vLLM</span>
<span class="health-item" id="h-parakeet"><span class="dot"></span> Parakeet</span>
<span class="health-item" id="h-magpie"><span class="dot"></span> Magpie</span>
</div>
<div class="muted small" id="updated"></div>
</footer>
</main>
<script src="/static/app.js"></script>
</body>
</html>
+170
View File
@@ -0,0 +1,170 @@
:root {
--bg: #0a0a0d;
--surface: #15151a;
--surface-2: #1c1c22;
--border: #25252c;
--text: #e6e6ea;
--muted: #7e7e8a;
--accent: #4ade80;
--warn: #f59e0b;
--error: #ef4444;
--info: #60a5fa;
--radius: 10px;
}
* { box-sizing: border-box; }
html, body { margin: 0; padding: 0; }
body {
background: var(--bg);
color: var(--text);
font: 15px/1.5 -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, sans-serif;
min-height: 100vh;
-webkit-font-smoothing: antialiased;
}
.muted { color: var(--muted); }
.small { font-size: 13px; }
.hidden { display: none !important; }
.spacer { flex: 1; }
.topbar {
position: sticky;
top: 0;
background: rgba(10, 10, 13, 0.85);
backdrop-filter: saturate(160%) blur(10px);
-webkit-backdrop-filter: saturate(160%) blur(10px);
border-bottom: 1px solid var(--border);
display: flex;
align-items: center;
gap: 16px;
padding: 12px 20px;
z-index: 10;
}
.brand { display: flex; align-items: center; gap: 10px; font-weight: 600; }
.logo-dot { width: 10px; height: 10px; border-radius: 50%; background: var(--accent); box-shadow: 0 0 12px var(--accent); }
.current { flex: 1; text-align: right; font-size: 14px; }
.current strong { color: var(--accent); }
main {
max-width: 880px;
margin: 0 auto;
padding: 24px 20px 80px;
}
.banner {
background: var(--surface);
border: 1px solid var(--warn);
color: var(--warn);
padding: 12px 16px;
border-radius: var(--radius);
margin-bottom: 16px;
font-size: 14px;
}
.banner em { font-style: normal; background: rgba(245, 158, 11, 0.15); padding: 2px 6px; border-radius: 4px; }
.swap-panel {
background: var(--surface);
border: 1px solid var(--info);
border-radius: var(--radius);
padding: 14px 16px;
margin-bottom: 20px;
}
.swap-header { display: flex; align-items: center; gap: 10px; }
.swap-header #swap-title { font-weight: 600; color: var(--info); }
.spinner {
width: 14px; height: 14px;
border: 2px solid var(--info);
border-right-color: transparent;
border-radius: 50%;
animation: spin 0.8s linear infinite;
}
@keyframes spin { to { transform: rotate(360deg); } }
.log {
background: #08080b;
border: 1px solid var(--border);
border-radius: 6px;
padding: 10px 12px;
margin: 10px 0 0;
font: 12px/1.55 ui-monospace, SFMono-Regular, "SF Mono", Menlo, monospace;
color: #c7c7d1;
max-height: 280px;
overflow: auto;
white-space: pre-wrap;
word-break: break-word;
}
.cards {
display: grid;
gap: 14px;
grid-template-columns: repeat(auto-fill, minmax(280px, 1fr));
}
.card {
background: var(--surface);
border: 1px solid var(--border);
border-radius: var(--radius);
padding: 16px;
display: flex;
flex-direction: column;
gap: 12px;
transition: border-color 0.15s, transform 0.15s;
}
.card.active {
border-color: var(--accent);
box-shadow: 0 0 0 1px var(--accent) inset, 0 0 24px rgba(74, 222, 128, 0.08);
}
.card .name { font-weight: 600; font-size: 15px; }
.card .meta { display: flex; flex-wrap: wrap; gap: 6px; font-size: 12px; color: var(--muted); }
.tag {
background: var(--surface-2);
border: 1px solid var(--border);
padding: 2px 8px;
border-radius: 999px;
font-size: 11px;
}
.tag.mode-cluster { color: var(--info); border-color: rgba(96, 165, 250, 0.4); }
.tag.mode-solo { color: var(--accent); border-color: rgba(74, 222, 128, 0.4); }
.tag.cap { color: var(--muted); }
.btn {
appearance: none;
border: 1px solid var(--border);
background: var(--surface-2);
color: var(--text);
padding: 8px 14px;
border-radius: 8px;
cursor: pointer;
font: inherit;
font-weight: 500;
transition: background 0.15s, border-color 0.15s, opacity 0.15s;
}
.btn:hover:not(:disabled) { background: #24242c; border-color: #34343c; }
.btn.primary { background: var(--accent); color: #052e16; border-color: var(--accent); }
.btn.primary:hover:not(:disabled) { background: #6ee19a; }
.btn:disabled { opacity: 0.45; cursor: not-allowed; }
.card.active .btn { background: rgba(74, 222, 128, 0.12); color: var(--accent); border-color: rgba(74, 222, 128, 0.4); }
.footer {
margin-top: 28px;
padding-top: 16px;
border-top: 1px solid var(--border);
display: flex;
align-items: center;
gap: 14px;
flex-wrap: wrap;
}
.health { display: flex; gap: 14px; flex-wrap: wrap; }
.health-item { display: inline-flex; align-items: center; gap: 6px; font-size: 13px; color: var(--muted); }
.dot { width: 9px; height: 9px; border-radius: 50%; background: var(--muted); display: inline-block; }
.dot.ok { background: var(--accent); box-shadow: 0 0 8px rgba(74, 222, 128, 0.7); }
.dot.bad { background: var(--error); box-shadow: 0 0 8px rgba(239, 68, 68, 0.7); }
.dot.warn { background: var(--warn); }
@media (max-width: 640px) {
.topbar { padding: 10px 14px; }
main { padding: 16px 14px 80px; }
.cards { grid-template-columns: 1fr; }
}
+140
View File
@@ -0,0 +1,140 @@
from __future__ import annotations

import asyncio
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

from .config import Settings
from .models import Catalog, build_launch_command
from .ssh import ssh_run, ssh_stream, StreamHandle

READY_MARKER = "Application startup complete."
MAX_LINES = 500


@dataclass
class SwapJob:
    id: str
    model_key: str
    started_at: str
    state: str = "starting"  # starting|stopping|launching|tailing|ready|failed
    lines: list[str] = field(default_factory=list)
    returncode: Optional[int] = None
    finished_at: Optional[str] = None
    dry_run: bool = False

    def append(self, line: str) -> None:
        self.lines.append(line)
        if len(self.lines) > MAX_LINES:
            del self.lines[: len(self.lines) - MAX_LINES]


class SwapManager:
    def __init__(self, settings: Settings, catalog: Catalog) -> None:
        self.settings = settings
        self.catalog = catalog
        self.lock = asyncio.Lock()
        self.jobs: dict[str, SwapJob] = {}
        self.current_job_id: Optional[str] = None
        # Strong references to running swap tasks; the event loop itself only
        # holds weak references, so an unreferenced task can be garbage-collected.
        self._tasks: set[asyncio.Task[None]] = set()

    def get(self, job_id: str) -> SwapJob | None:
        return self.jobs.get(job_id)

    def reload_catalog(self, catalog: Catalog) -> None:
        self.catalog = catalog

    async def trigger(self, model_key: str, *, dry_run: bool = False) -> SwapJob:
        if model_key not in self.catalog.models:
            raise KeyError(model_key)
        # The lock is only acquired once _run() starts, so also check
        # current_job_id to reject a trigger that arrives before then.
        if self.lock.locked() or self.current_job_id is not None:
            raise RuntimeError("A swap is already in progress")
        job = SwapJob(
            id=uuid.uuid4().hex[:8],
            model_key=model_key,
            started_at=datetime.now(timezone.utc).isoformat(),
            dry_run=dry_run,
        )
        self.jobs[job.id] = job
        self.current_job_id = job.id
        task = asyncio.create_task(self._run(job))
        self._tasks.add(task)
        task.add_done_callback(self._tasks.discard)
        return job

    async def _run(self, job: SwapJob) -> None:
        async with self.lock:
            try:
                await self._do(job)
                if job.state != "failed":
                    job.state = "ready"
                    job.returncode = 0
            except Exception as e:
                job.append(f"[error] {type(e).__name__}: {e}")
                job.state = "failed"
                if job.returncode is None:
                    job.returncode = 1
            finally:
                job.finished_at = datetime.now(timezone.utc).isoformat()
                if self.current_job_id == job.id:
                    self.current_job_id = None

    async def _do(self, job: SwapJob) -> None:
        model = self.catalog.models[job.model_key]
        s = self.settings

        # Step 1: stop
        job.state = "stopping"
        stop_cmd = "cd ~/spark-vllm-docker && ./launch-cluster.sh stop"
        job.append(f"$ {stop_cmd}")
        if not job.dry_run:
            rc, out, err = await ssh_run(s.spark1_host, s.spark1_user, stop_cmd, s, timeout=180)
            for line in (out + err).splitlines():
                job.append(line)
            if rc != 0:
                job.returncode = rc
                job.state = "failed"
                return

        # Step 2: launch
        job.state = "launching"
        launch = build_launch_command(job.model_key, model, self.catalog.defaults)
        launch_cmd = f"cd ~/spark-vllm-docker && {launch}"
        job.append(f"$ {launch_cmd}")
        if job.dry_run:
            # A dry run only records the commands; nothing is launched or tailed.
            return
        rc, out, err = await ssh_run(s.spark1_host, s.spark1_user, launch_cmd, s, timeout=60)
        for line in (out + err).splitlines():
            job.append(line)
        if rc != 0:
            job.returncode = rc
            job.state = "failed"
            return

        # Step 3: tail logs until the ready marker (or timeout)
        job.state = "tailing"
        tail_cmd = "docker logs -f --tail 50 vllm_node"
        job.append(f"$ {tail_cmd}")
        timeout = max(model.expected_ready_seconds * 2, 600)
        handle = StreamHandle()
        loop = asyncio.get_running_loop()
        deadline = loop.time() + timeout
        ready = False

        async def _tail() -> bool:
            async for line in ssh_stream(s.spark1_host, s.spark1_user, tail_cmd, s, handle=handle):
                job.append(line)
                if READY_MARKER in line:
                    return True
                # The deadline is only checked when a line arrives; the
                # wait_for() below covers a stream that goes silent.
                if loop.time() > deadline:
                    return False
            return False

        try:
            ready = await asyncio.wait_for(_tail(), timeout=timeout + 30)
        except asyncio.TimeoutError:
            ready = False
        if not ready:
            job.append(f"[error] did not see '{READY_MARKER}' within {timeout}s")
            job.state = "failed"
            job.returncode = 124
+20
View File
@@ -0,0 +1,20 @@
#!/bin/sh
set -eu
# Persist ssh state on the StartOS volume (mounted at /data when packaged).
SSH_DIR="${SSH_DIR:-/data/ssh}"
mkdir -p "$SSH_DIR"
chmod 700 "$SSH_DIR"
if [ ! -f "$SSH_DIR/id_ed25519" ]; then
echo "[entrypoint] Generating ed25519 keypair for SSH to Sparks..."
ssh-keygen -t ed25519 -N "" -f "$SSH_DIR/id_ed25519" -C "spark-control@start9" >/dev/null
fi
chmod 600 "$SSH_DIR/id_ed25519"
chmod 644 "$SSH_DIR/id_ed25519.pub"
touch "$SSH_DIR/known_hosts" && chmod 600 "$SSH_DIR/known_hosts"
export SSH_KEY_PATH="${SSH_KEY_PATH:-$SSH_DIR/id_ed25519}"
export SSH_KNOWN_HOSTS="${SSH_KNOWN_HOSTS:-$SSH_DIR/known_hosts}"
exec uvicorn app.server:app --host 0.0.0.0 --port "${BIND_PORT:-9999}"
+22
View File
@@ -0,0 +1,22 @@
[project]
name = "spark-control"
version = "0.1.0"
description = "Web UI to swap vLLM models on a DGX Spark cluster"
requires-python = ">=3.11"
dependencies = [
"fastapi>=0.115",
"uvicorn[standard]>=0.32",
"pydantic>=2.9",
"pyyaml>=6.0",
"httpx>=0.27",
]
[build-system]
requires = ["setuptools>=68"]
build-backend = "setuptools.build_meta"
[tool.setuptools]
packages = ["app"]
[tool.setuptools.package-data]
app = ["static/*", "../models.yaml"]
+40
View File
@@ -0,0 +1,40 @@
# Known issues
## magpie-tts crash loop (Spark 2)
The `magpie-tts` container at `nvcr.io/nim/nvidia/magpie-tts-multilingual:latest` is in a restart loop and `:9000` is not reachable. **Status as of 2026-05-12: unfixed. UI surfaces a red dot.**
**Root cause (from `docker logs magpie-tts`):**
```
nimlib.exceptions.ManifestDownloadError: Error downloading manifest:
I/O error Permission denied (os error 13)
```
The container exits 1 from `nimutils.download_models()` when fetching `nim/nvidia/magpie-tts-multilingual` model files from NGC. The "permission denied" is a local filesystem error — the container can't write the model cache where it expects to.
**To diagnose further:**
```bash
ssh <spark-user>@<spark-2-ip>
docker inspect magpie-tts | jq '.[].HostConfig.Mounts, .[].Config.Env'
# Look for: the mount path for the model cache, and whether NGC_API_KEY is set.
```
**Likely fixes (untried):**
1. Chown the bind-mounted cache directory on Spark 2 to the UID the container runs as.
2. Set an `NGC_API_KEY` env var (NIM containers need this for non-public artifacts).
3. Confirm there's free disk space.
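
Fixes 1 and 3 can be checked quickly on Spark 2 before changing anything. The sketch below is an assumption-laden helper, not part of spark-control: `cache_dir_report` is a hypothetical name, and the path you pass should be the host side of the cache mount reported by `docker inspect`. The `writable_by_me` field reflects the user running the script, not the UID inside the container, so compare `owner_uid` against the container's UID yourself.

```python
import os
import shutil


def cache_dir_report(path: str) -> dict:
    """Summarise ownership, writability and free space for a directory."""
    st = os.stat(path)
    usage = shutil.disk_usage(path)
    return {
        "owner_uid": st.st_uid,
        # Writable by the *current* user, which may differ from the container UID.
        "writable_by_me": os.access(path, os.W_OK),
        "free_gb": round(usage.free / 1e9, 1),
    }


# "." is only a stand-in; pass the real cache directory on Spark 2.
print(cache_dir_report("."))
```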
## Qwen3.6-35B-A3B `--moe_backend=flashinfer_cutlass` may fail on launch
This flag is Blackwell-specific. If vLLM in the container reports `unrecognized arguments: --moe_backend` or similar, edit `models.yaml` for `qwen36` and drop that flag. The swap UI does NOT auto-fallback in v0.1 — failure surfaces in the log stream.
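
Dropping the flag amounts to filtering one entry out of the `vllm_args` list for `qwen36`. A minimal sketch of that filter, assuming the arguments are already loaded into a plain Python list and use the `--flag=value` form seen in `models.yaml`; `drop_flag` is a hypothetical helper, not something spark-control provides:

```python
def drop_flag(vllm_args: list[str], flag: str) -> list[str]:
    """Return vllm_args without `flag`, whether bare or in `--flag=value` form."""
    return [a for a in vllm_args if a != flag and not a.startswith(flag + "=")]


# Arguments copied from the qwen36 entry in models.yaml.
qwen36_args = [
    "--gpu-memory-utilization=0.85",
    "--max-model-len=65536",
    "--reasoning-parser=qwen3",
    "--moe_backend=flashinfer_cutlass",
]

for arg in drop_flag(qwen36_args, "--moe_backend"):
    print(arg)
```

The filter matches on the exact flag name, so an unrelated flag that merely shares a prefix is left alone.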
## Two SSH paths to Spark 1 from the laptop
`ssh <spark-user>@<spark-1-ip>` does NOT work from the laptop because the NVIDIA Sync `ssh_config` has a `Host` entry only for `<spark-1-host>.local`. For Spark 1, always use the `.local` hostname; a bare IP works only where a matching `Host` entry exists, as with `<spark-2-ip>`.
## Older models in `models.yaml`
The `qwen3-235b-fp8` and `qwen25-72b` catalog entries use conservative guesses for vLLM flags: the weights are on disk, but these models were never the focus of this project. The first launch of either may fail or perform poorly; once you find flags that work, record them here.
+80
View File
@@ -0,0 +1,80 @@
# spark-control model catalog
#
# Edit this file (or override at runtime via the StartOS "Edit Model Catalog"
# action) to add or change available models.
#
# Each model entry produces this command on Spark 1:
#   cd ~/spark-vllm-docker
#   ./launch-cluster.sh [--solo] -d exec vllm serve <repo> \
#     --port=<defaults.port> --host=<defaults.host> <vllm_args...>

defaults:
  port: 8888
  host: 0.0.0.0

models:
  qwen3-vl:
    display_name: "Qwen3-VL 235B (vision)"
    repo: RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4
    size_gb: 135
    mode: cluster
    capabilities: [vision, multilingual]
    expected_ready_seconds: 300
    vllm_args:
      - --gpu-memory-utilization=0.7
      - -tp=2
      - --distributed-executor-backend=ray
      - --max-model-len=32768

  gemma4:
    display_name: "Gemma 4 31B"
    repo: RedHatAI/gemma-4-31B-it-NVFP4
    size_gb: 23
    mode: solo
    capabilities: [vision, reasoning, tools]
    expected_ready_seconds: 240
    vllm_args:
      - --gpu-memory-utilization=0.8
      - --max-model-len=32768
      - --reasoning-parser=gemma4
      - --tool-call-parser=gemma4
      - --enable-auto-tool-choice

  qwen36:
    display_name: "Qwen3.6 35B-A3B (daily driver)"
    repo: RedHatAI/Qwen3.6-35B-A3B-NVFP4
    size_gb: 20
    mode: solo
    capabilities: [reasoning]
    expected_ready_seconds: 300
    vllm_args:
      - --gpu-memory-utilization=0.85
      - --max-model-len=65536
      - --reasoning-parser=qwen3
      - --moe_backend=flashinfer_cutlass

  qwen3-235b-fp8:
    display_name: "Qwen3 235B-A22B FP8 (legacy)"
    repo: Qwen/Qwen3-235B-A22B-FP8
    size_gb: 220
    mode: cluster
    capabilities: []
    expected_ready_seconds: 360
    vllm_args:
      - --gpu-memory-utilization=0.7
      - -tp=2
      - --distributed-executor-backend=ray
      - --max-model-len=32768

  qwen25-72b:
    display_name: "Qwen2.5 72B (legacy)"
    repo: Qwen/Qwen2.5-72B-Instruct
    size_gb: 145
    mode: cluster
    capabilities: []
    expected_ready_seconds: 360
    vllm_args:
      - --gpu-memory-utilization=0.7
      - -tp=2
      - --distributed-executor-backend=ray
      - --max-model-len=32768
+61
View File
@@ -0,0 +1,61 @@
# spark-control runbook
Operating notes for running and maintaining the cluster via spark-control.
## Day-to-day
- The UI lives at `http://<your-start9>.local:9999` once the StartOS package is installed and configured.
- Status auto-refreshes every 5 s.
- A swap takes 3–6 minutes depending on the model. Avoid closing the tab, but if you do, the swap continues; reopen it and you'll re-attach to the log stream.
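
How long a swap may run before it is marked failed follows from each model's `expected_ready_seconds`. The sketch below mirrors the rule in the swap manager (twice the expected time, never less than 600 s); `tail_timeout_seconds` is an illustrative name, not a function in the repo, and the inputs are the three values used in the shipped catalog.

```python
def tail_timeout_seconds(expected_ready_seconds: int) -> int:
    """Seconds to wait for the ready marker: 2x the expected time, floor of 600."""
    return max(expected_ready_seconds * 2, 600)


for expected in (240, 300, 360):
    print(expected, "->", tail_timeout_seconds(expected))
# 240 -> 600
# 300 -> 600
# 360 -> 720
```

If the ready marker has not appeared by then, the job fails with return code 124.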
## Adding a new model
1. Add an entry to `models.yaml` (in the image source) or, post-install, via the "Edit Model Catalog" action in StartOS.
2. Confirm the weights are on the Spark: `ssh <spark-user>@<spark-1-host>.local 'ls ~/.cache/huggingface/hub/'`. If not, download with `./hf-download.sh <repo>` on Spark 1.
3. The new model appears in the UI on next refresh.
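
Before launching a new entry, it can help to see the command it should produce. The sketch below follows the template in the `models.yaml` header comment. It is not the project's `build_launch_command`: `sketch_launch_command` is a hypothetical stand-in, and mapping `mode: solo` to the `--solo` flag is an assumption based on that template and the manual example in this runbook.

```python
import shlex


def sketch_launch_command(entry: dict, defaults: dict) -> str:
    """Assemble the launch command for one catalog entry (illustrative only)."""
    parts = ["./launch-cluster.sh"]
    if entry.get("mode") == "solo":  # assumption: solo mode maps to --solo
        parts.append("--solo")
    parts += ["-d", "exec", "vllm", "serve", entry["repo"]]
    parts += [f"--port={defaults['port']}", f"--host={defaults['host']}"]
    parts += entry.get("vllm_args", [])
    # Quote each token so unusual characters survive the remote shell.
    return " ".join(shlex.quote(str(p)) for p in parts)


defaults = {"port": 8888, "host": "0.0.0.0"}
gemma4 = {
    "repo": "RedHatAI/gemma-4-31B-it-NVFP4",
    "mode": "solo",
    "vllm_args": [
        "--gpu-memory-utilization=0.8",
        "--max-model-len=32768",
        "--reasoning-parser=gemma4",
        "--tool-call-parser=gemma4",
        "--enable-auto-tool-choice",
    ],
}
print(sketch_launch_command(gemma4, defaults))
```

A dry-run swap records the command the application itself would run, so compare against that output when the two disagree.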
## Manual swap fallback
If the UI is unavailable and you need to swap by hand:
```bash
ssh <spark-user>@<spark-1-host>.local
cd ~/spark-vllm-docker
./launch-cluster.sh stop
./launch-cluster.sh --solo -d exec vllm serve RedHatAI/gemma-4-31B-it-NVFP4 \
--port 8888 --host 0.0.0.0 --gpu-memory-utilization 0.8 \
--max-model-len 32768 --reasoning-parser gemma4 \
--tool-call-parser gemma4 --enable-auto-tool-choice
docker logs -f vllm_node # wait for "Application startup complete."
```
## Diagnostics
```bash
# Is vLLM serving?
curl -s http://<spark-1-ip>:8888/v1/models | jq .
# Cluster status (containers up?)
ssh <spark-user>@<spark-1-host>.local 'cd ~/spark-vllm-docker && ./launch-cluster.sh status'
# Tail current model's logs
ssh <spark-user>@<spark-1-host>.local 'docker logs --tail 200 -f vllm_node'
# Parakeet
curl -s http://<spark-2-ip>:8000/health
# Magpie (see known-issues.md)
curl -s http://<spark-2-ip>:9000/v1/health/ready
```
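
The three HTTP checks above can also be run from one script. This is a minimal sketch using only the standard library; `health_urls` and `is_up` are hypothetical names, and the ports and paths are the ones from the `curl` commands in this section.

```python
import http.client
import urllib.error
import urllib.request


def health_urls(spark1: str, spark2: str) -> dict[str, str]:
    """Map each service to the endpoint probed in the diagnostics above."""
    return {
        "vllm": f"http://{spark1}:8888/v1/models",
        "parakeet": f"http://{spark2}:8000/health",
        "magpie": f"http://{spark2}:9000/v1/health/ready",
    }


def is_up(url: str, timeout: float = 3.0) -> bool:
    """True if the URL answers with a 2xx status within `timeout` seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Connection refused, timeout, HTTP error status or malformed URL.
        return False
```

To use it, call `health_urls` with your Spark 1 and Spark 2 addresses and pass each URL to `is_up`.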
## Hard reset
If `launch-cluster.sh` gets stuck:
```bash
ssh <spark-user>@<spark-1-host>.local
cd ~/spark-vllm-docker
./launch-cluster.sh stop
# WARNING: force-removes every container on Spark 1, not only the vLLM ones
docker ps -aq | xargs -r docker rm -f
# then relaunch your preferred model
```