# Project: spark-control — Model switcher web UI for dual DGX Spark cluster

> **Update 2026-05-12 — Direction change:** the web UI is being built as a
> **StartOS 0.4 package** (sideloaded onto Alice's existing Start9 server),
> **not** as a FastAPI service running directly on Spark 1. The Start9 server
> shares a LAN with the Sparks and SSHes into Spark 1 to invoke
> `launch-cluster.sh`. StartOS handles `.local` exposure and HTTPS; SSH
> credentials live in a per-install config file managed by a "Configure Sparks"
> action. See <https://docs.start9.com/packaging/0.4.0.x/> for the packaging
> model. Repo layout:
>
> - `image/` — Docker image source (FastAPI app, runs anywhere with `uvicorn`).
> - `package/` — StartOS 0.4 wrapper (manifest, main, interfaces, actions).
>
> The "Phase 4: Deploy" section below (systemd on Spark 1) is **superseded** by
> the StartOS sideload workflow. Other phases (models.yaml schema, swap script,
> FastAPI endpoints, frontend) still apply but live inside `image/`.

## Goal

I want to build a small web service that gives me a browser-based interface to:

1. See which LLM is currently loaded on my DGX Spark cluster
2. Click a button to swap to a different model
3. See real-time status as the swap progresses (stop → launch → ready)
4. See basic health info about supporting services (Parakeet STT, eventually Magpie TTS)

The UI should live at a stable URL on my LAN so I can bookmark it. I'll likely access it from my laptop and phone.

## Where this project lives

This repo lives on **my laptop** (macOS). The Sparks are servers — we control them remotely over SSH. Claude Code runs on my laptop, makes edits in the local repo, and executes commands on the Sparks via SSH.

Development happens on my laptop. In the original plan, the deployed web UI ran on **Spark 1** (where it can directly invoke `launch-cluster.sh`), with code deployed to Spark 1 via `rsync`, `scp`, or `git pull` as needed. Per the update at the top, the UI now ships as a StartOS package on the Start9 server and reaches Spark 1 over SSH.

## SSH setup

From my laptop I can SSH to either Spark directly:

```bash
ssh <spark-user>@<spark-1-ip>  # Spark 1
ssh <spark-user>@<spark-2-ip>  # Spark 2
```

(I can also use SSH key auth — set up earlier.)

When you need to run a command on a Spark, use this pattern:

```bash
ssh <spark-user>@<spark-1-ip> 'cd ~/spark-vllm-docker && ./launch-cluster.sh status'
```

For multi-line commands or scripts, you can pipe a heredoc or just SSH in directly and run them interactively. Either works — but always tell me what you're about to run so I can review.

For file transfers between my laptop and the Sparks, use `rsync`:

```bash
rsync -avz ~/Projects/spark-control/ <spark-user>@<spark-1-ip>:~/spark-control/
```

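
The SSH command pattern above could be wrapped in a small Python helper for the backend, since the Start9-hosted app has to reach Spark 1 over SSH. This is only a sketch under stated assumptions: `ssh_argv`, `in_repo`, and `run_on_spark` are hypothetical names, and `BatchMode=yes` is an added assumption so that a missing key fails fast instead of prompting for a password.

```python
import shlex
import subprocess


def ssh_argv(user: str, host: str, remote_command: str) -> list[str]:
    """Build the argv for `ssh user@host '<remote_command>'`."""
    # BatchMode=yes makes ssh fail rather than prompt, which suits
    # unattended use with key auth (an assumption, not from the brief).
    return ["ssh", "-o", "BatchMode=yes", f"{user}@{host}", remote_command]


def in_repo(command: list[str], repo_dir: str = "~/spark-vllm-docker") -> str:
    """Quote a command and prefix it with `cd <repo_dir> &&`."""
    # repo_dir is deliberately left unquoted so the remote shell expands `~`.
    return f"cd {repo_dir} && {shlex.join(command)}"


def run_on_spark(user: str, host: str, command: list[str],
                 timeout: float = 60.0) -> subprocess.CompletedProcess:
    """Run a command inside the repo directory on a Spark and capture output."""
    return subprocess.run(
        ssh_argv(user, host, in_repo(command)),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
```

A status check would then look like `run_on_spark(user, host, ["./launch-cluster.sh", "status"])`, with `user` and `host` read from the per-install config rather than hard-coded.
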
## My hardware and what's running

**Two NVIDIA DGX Spark units** networked together:

- **Spark 1** — hostname `<spark-1-host>`, LAN IP `<spark-1-ip>`, QSFP IP `<spark-1-qsfp-ip>`. Head node for the vLLM cluster.
- **Spark 2** — hostname `<spark-2-host>`, LAN IP `<spark-2-ip>`, QSFP IP `<spark-2-qsfp-ip>`. Worker node for vLLM cluster, also hosts standalone services.

Both run Ubuntu 24.04, NVIDIA driver 580.x, CUDA 13.0, Docker, and have 128 GB unified memory each. They share a QSFP cable for high-speed (200 Gb/s) inter-node networking.

Passwordless SSH works in both directions via `~/.ssh/<ssh-key>` key. My Linux username on both machines is `<spark-user>`.

**Currently running:**
- One LLM at a time on the cluster (via the `eugr/spark-vllm-docker` project — see below)
- `parakeet-asr` Docker container on Spark 2 (port 8000) — running 24/7 for speech-to-text, healthy for weeks
- `magpie-tts` Docker container on Spark 2 (port 9000) — was being set up; I'm not 100% sure of its current state; first task is to verify
- Open WebUI runs on a separate Start9 server on the LAN (not on the Sparks), accessing the LLM via HTTP

## The LLM cluster: how it works

I use the **`eugr/spark-vllm-docker`** community project (cloned to `~/spark-vllm-docker` on Spark 1). It manages a Ray-based vLLM cluster across both Sparks, with a wrapper script called `launch-cluster.sh` that handles starting/stopping Docker containers on both nodes.

Key commands (all run from `~/spark-vllm-docker` on Spark 1):
- `./launch-cluster.sh status` — see what's running on both nodes
- `./launch-cluster.sh stop` — stop the cluster
- `./launch-cluster.sh -d exec vllm serve ...` — launch in daemon mode with vLLM args
- `./launch-cluster.sh --solo -d exec vllm serve ...` — same but only on Spark 1 (for smaller models)
- `docker logs -f vllm_node` — tail vLLM logs

Container names: `vllm_node` (the main vLLM container), `ray_head` and `ray_worker` (Ray cluster), plus support containers.

The vLLM server binds to port **8888** and exposes an OpenAI-compatible API at `http://<spark-1-ip>:8888/v1`.

## Models I have on disk (both Sparks)

All weights live in `~/.cache/huggingface/hub/` on each Spark:

1. **`RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4`** (~135 GB) — flagship MoE, runs across both Sparks (-tp 2), has vision capability. Use for: maximum quality, vision input, multilingual.

2. **`RedHatAI/gemma-4-31B-it-NVFP4`** (~23 GB) — runs solo on Spark 1, has vision, has thinking-mode reasoning. Use for: math/reasoning-heavy tasks. Has a known vLLM Triton-attention slowdown bug (~15-20 tok/s vs theoretical 30-40).

3. **`RedHatAI/Qwen3.6-35B-A3B-NVFP4`** (~20 GB) — newer-generation Qwen MoE (35B total / 3B active), runs solo on Spark 1, expected to be the fastest (~70-100 tok/s) and my new daily driver. **Note: this may still be downloading or may not be downloaded yet — first task is to verify and download if needed.**

## Exact launch commands for each model

These are the commands my system needs to run when I click a swap button.

### Qwen3-VL-235B (uses both Sparks)
```bash
cd ~/spark-vllm-docker
./launch-cluster.sh stop
./launch-cluster.sh -d exec vllm serve \
  RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4 \
  --port 8888 --host 0.0.0.0 \
  --gpu-memory-utilization 0.7 \
  -tp 2 \
  --distributed-executor-backend ray \
  --max-model-len 32768
```
Expected ready time: ~3-5 min after stop completes.

### Gemma 4 31B (solo on Spark 1)
```bash
cd ~/spark-vllm-docker
./launch-cluster.sh stop
./launch-cluster.sh --solo -d exec vllm serve \
  RedHatAI/gemma-4-31B-it-NVFP4 \
  --port 8888 --host 0.0.0.0 \
  --gpu-memory-utilization 0.8 \
  --max-model-len 32768 \
  --reasoning-parser gemma4 \
  --tool-call-parser gemma4 \
  --enable-auto-tool-choice
```
Expected ready time: ~3-4 min.

### Qwen3.6-35B-A3B (solo on Spark 1) — new daily driver
```bash
cd ~/spark-vllm-docker
./launch-cluster.sh stop
./launch-cluster.sh --solo -d exec vllm serve \
  RedHatAI/Qwen3.6-35B-A3B-NVFP4 \
  --port 8888 --host 0.0.0.0 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 65536 \
  --reasoning-parser qwen3 \
  --moe_backend flashinfer_cutlass
```
Expected ready time: ~3-5 min.

Note: the `--moe_backend flashinfer_cutlass` flag is Blackwell-specific. If the launch fails with that flag, the fallback is to remove it and relaunch.

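
The fallback described in the note (drop the flag and relaunch) could be automated with a small argv filter. This is a sketch, not part of `launch-cluster.sh` or vLLM: `without_flag` is a hypothetical helper, and it assumes the flag takes exactly one value.

```python
def without_flag(argv: list[str], flag: str) -> list[str]:
    """Return argv with `flag` removed, in `--flag value` or `--flag=value` form."""
    result: list[str] = []
    skip_next = False
    for arg in argv:
        if skip_next:
            # This is the value belonging to the flag we just dropped.
            skip_next = False
            continue
        if arg == flag:
            skip_next = True
            continue
        if arg.startswith(flag + "="):
            continue
        result.append(arg)
    return result


# Example: the Qwen3.6 arguments with the Blackwell-specific flag stripped.
QWEN36_ARGS = [
    "--max-model-len", "65536",
    "--reasoning-parser", "qwen3",
    "--moe_backend", "flashinfer_cutlass",
]
FALLBACK_ARGS = without_flag(QWEN36_ARGS, "--moe_backend")
```

A swap routine would try the full argument list first and retry with `FALLBACK_ARGS` only if the first launch fails.
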
### Common operations
- Stop everything: `./launch-cluster.sh stop`
- Status check: `./launch-cluster.sh status`
- See vLLM logs: `docker logs vllm_node` (add `-f` to follow)
- Hard reset if stuck: `./launch-cluster.sh stop && docker ps -aq | xargs -r docker rm -f`
- Health check (is API responding?): `curl -s http://<spark-1-ip>:8888/v1/models`

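
The health-check response can also tell the UI which model is loaded. The sketch below assumes the usual OpenAI-style list shape for `/v1/models` (an object with a `data` array whose entries carry an `id`); `loaded_model_ids` is a hypothetical name, and anything that does not match that shape is treated as "nothing loaded".

```python
import json


def loaded_model_ids(body: str) -> list[str]:
    """Extract model IDs from a /v1/models response body.

    Returns an empty list for empty, malformed, or unexpected responses.
    """
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return []
    if not isinstance(payload, dict):
        return []
    data = payload.get("data")
    if not isinstance(data, list):
        return []
    return [
        entry["id"]
        for entry in data
        if isinstance(entry, dict) and isinstance(entry.get("id"), str)
    ]


# Illustrative response body (shape assumed, not captured from the cluster).
SAMPLE = '{"object": "list", "data": [{"id": "RedHatAI/Qwen3.6-35B-A3B-NVFP4", "object": "model"}]}'
```

Matching the returned ID against the repo names in `models.yaml` would give the `current_model` value for the status endpoint.
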
### "Ready" signal
The model is ready to serve when `docker logs vllm_node` contains the line `Application startup complete.` Until then, it's still loading weights or compiling CUDA graphs.

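
A minimal sketch of detecting that ready line in a stream of log lines, with a timeout. `wait_for_ready` and the injectable `clock` parameter are hypothetical (the clock exists only to make the function testable), and the 600-second default is an assumption based on the ~10 min timeout mentioned for the swap script.

```python
import time
from typing import Callable, Iterable

READY_MARKER = "Application startup complete."


def wait_for_ready(
    lines: Iterable[str],
    timeout_s: float = 600.0,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Return True once a line contains READY_MARKER, False on timeout or end of stream.

    The deadline is only checked when a new line arrives, so a stream that
    goes completely silent needs an outer watchdog as well.
    """
    deadline = clock() + timeout_s
    for line in lines:
        if READY_MARKER in line:
            return True
        if clock() > deadline:
            return False
    # The stream ended (e.g. the container exited) without the marker.
    return False
```

In practice `lines` could be the output of a `docker logs -f vllm_node` subprocess with stderr merged into stdout, since `docker logs` replays the container's stderr on its own stderr.
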
## Supporting services on Spark 2 (always-on, separate from cluster)

These don't get touched by model swaps:

- **`parakeet-asr`** — STT on port 8000. Already running 24/7. Verify with `curl http://<spark-2-ip>:8000/health` which should return `{"status":"ready",...}`.
- **`magpie-tts`** — TTS on port 9000. May or may not be running; verify with `docker ps` on Spark 2 and `curl http://<spark-2-ip>:9000/v1/health/ready`.

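
The two health checks above might be expressed in the backend roughly as follows. This is a sketch: `fetch`, `parakeet_healthy`, and `magpie_healthy` are hypothetical names, and because the brief does not say what the Magpie readiness endpoint returns, the Magpie check assumes that an HTTP 200 alone means ready.

```python
import json
import urllib.error
import urllib.request


def fetch(url: str, timeout: float = 3.0) -> tuple[int, str]:
    """GET a URL and return (status_code, body); (0, "") if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as exc:
        return exc.code, ""
    except OSError:
        # Covers URLError, connection refused, and timeouts.
        return 0, ""


def parakeet_healthy(status: int, body: str) -> bool:
    """Healthy when the JSON body reports {"status": "ready", ...}."""
    if status != 200:
        return False
    try:
        return json.loads(body).get("status") == "ready"
    except (json.JSONDecodeError, AttributeError):
        return False


def magpie_healthy(status: int, body: str) -> bool:
    """Assumption: HTTP 200 from /v1/health/ready means ready; body is ignored."""
    return status == 200
```

Keeping the network call (`fetch`) separate from the interpretation makes the checks easy to test without a live Spark 2.
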
## What I want you to build

### Phase 1: Set up the project repo (start here)

Create a Git repo at `~/Projects/spark-control/` on **my laptop**. Initial structure:

```
spark-control/
├── README.md
├── models.yaml              # Declarative config for each model
├── scripts/
│   ├── swap-model.sh        # Universal swap script
│   ├── status.sh            # Cluster + service status
│   └── health.sh            # Health checks for everything
├── web-ui/
│   ├── server.py            # FastAPI backend
│   ├── static/
│   │   ├── index.html       # Toggle UI
│   │   ├── style.css
│   │   └── app.js           # State management, polling
│   └── requirements.txt
├── runbook.md               # Operating notes
└── known-issues.md          # Gotchas, troubleshooting
```

### Phase 2: Build the universal swap script

`scripts/swap-model.sh <model-key>` should:
1. Read the launch command from `models.yaml` by key (e.g. `qwen3-vl`, `gemma4`, `qwen36`)
2. Stop the current cluster (via SSH to Spark 1)
3. Run the new launch command (via SSH to Spark 1)
4. Tail logs until "Application startup complete" appears or a timeout (~10 min) hits
5. Return exit code 0 on success, non-zero on failure

Two versions might be useful:
- The version that runs on **my laptop** — wraps everything in `ssh <spark-user>@<spark-1-ip> ...`
- A simpler version that lives on **Spark 1** — runs commands directly without SSH (used by the deployed web UI)

You can either share one script with a `--remote` flag, or make them two distinct files. Your call — propose the cleaner option.

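
Step 1 above (looking up a launch command by key) can be sketched in Python with the declarative data held in a dict. The schema here is an assumption, since `models.yaml` has not been designed yet: the `repo`, `solo`, and `args` fields and the `launch_argv` / `launch_command` helpers are hypothetical, while the flag values are copied from the launch commands listed earlier.

```python
import shlex

# Flags shared by every launch command in this document.
COMMON_ARGS = ["--port", "8888", "--host", "0.0.0.0"]

# Hypothetical in-memory equivalent of models.yaml.
MODELS = {
    "qwen3-vl": {
        "repo": "RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4",
        "solo": False,
        "args": ["--gpu-memory-utilization", "0.7", "-tp", "2",
                 "--distributed-executor-backend", "ray",
                 "--max-model-len", "32768"],
    },
    "gemma4": {
        "repo": "RedHatAI/gemma-4-31B-it-NVFP4",
        "solo": True,
        "args": ["--gpu-memory-utilization", "0.8", "--max-model-len", "32768",
                 "--reasoning-parser", "gemma4", "--tool-call-parser", "gemma4",
                 "--enable-auto-tool-choice"],
    },
    "qwen36": {
        "repo": "RedHatAI/Qwen3.6-35B-A3B-NVFP4",
        "solo": True,
        "args": ["--gpu-memory-utilization", "0.85", "--max-model-len", "65536",
                 "--reasoning-parser", "qwen3",
                 "--moe_backend", "flashinfer_cutlass"],
    },
}


def launch_argv(key: str) -> list[str]:
    """Build the launch-cluster.sh argv for a model key (KeyError if unknown)."""
    model = MODELS[key]
    argv = ["./launch-cluster.sh"]
    if model["solo"]:
        argv.append("--solo")
    argv += ["-d", "exec", "vllm", "serve", model["repo"]]
    return argv + COMMON_ARGS + model["args"]


def launch_command(key: str) -> str:
    """Shell-quoted single-line form, suitable for passing over SSH."""
    return shlex.join(launch_argv(key))
```

Rejecting unknown keys up front (here via `KeyError`) means the swap never stops the running cluster for a model that cannot be launched.
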
### Phase 3: Build the web UI

FastAPI backend that:
- `GET /api/status` → JSON with `{current_model, ready, parakeet_health, magpie_health, last_swap_time}`
- `POST /api/swap` with `{model_key}` → starts swap, returns swap job ID
- `GET /api/swap/{job_id}/stream` → Server-Sent Events streaming swap progress
- `GET /` → serves the HTML UI

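
For the streaming endpoint, each progress message has to be framed as a Server-Sent Event: optional `event:` line, one `data:` line per line of payload, then a blank line. The sketch below shows only that framing, using the standard library; `sse_event` and the event names `log` and `state` are hypothetical, and wiring it into a FastAPI streaming response is left out.

```python
import json
from typing import Optional


def sse_event(data: str, event: Optional[str] = None) -> str:
    """Frame a message as one Server-Sent Event.

    Multi-line payloads are split into one `data:` line each, and the
    event is terminated by a blank line, as the SSE format requires.
    """
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    for part in data.splitlines() or [""]:
        lines.append(f"data: {part}")
    return "\n".join(lines) + "\n\n"


# A log line and a state change, as the swap stream might emit them.
LOG_FRAME = sse_event("Stopping cluster...", event="log")
STATE_FRAME = sse_event(json.dumps({"state": "ready"}), event="state")
```

On the frontend, an `EventSource` pointed at `/api/swap/{job_id}/stream` can listen for these named events and append log frames to the progress view.
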
Frontend should:
- Show a card per model with a "Switch to this" button
- Highlight which model is currently loaded
- During a swap, show streaming log output and a spinner
- Show a green/red indicator for Parakeet and Magpie health
- Auto-refresh every 5 seconds

Keep the UI simple, clean, dark-themed. No frameworks needed — vanilla HTML/JS is fine.

### Phase 4: Deploy and make it persistent

The web UI runs on **Spark 1** so it can directly invoke `launch-cluster.sh` without SSH overhead. To deploy:

1. `rsync` the project code from my laptop to `~/spark-control/` on Spark 1
2. Set up a Python virtual environment on Spark 1 and install requirements
3. Create a systemd service file that starts the FastAPI server on boot
4. Service should listen on `0.0.0.0:9999` so I can hit it from any device on my LAN
5. Add a simple deploy script (`scripts/deploy.sh`) on my laptop that does the rsync + restart in one command for future iteration

## Working style

- Before making changes that affect the running cluster, please ask me first.
- When you write commands you want me to run, give them in clearly marked code blocks.
- Distinguish clearly when a command is meant to run on my laptop vs. on a Spark (which means via SSH).
- If you need information about the current state of the Sparks, ask me to run a diagnostic SSH command and paste the output — or run it yourself if you have shell access.
- Test things incrementally. Don't build the whole UI before validating the swap script works.
- I'm a layman — explain technical decisions briefly in plain English when they involve trade-offs.
- When making changes that modify files on a Spark, do them by editing in my laptop's repo first and then deploying — not by editing on the Spark directly. That keeps my laptop as the source of truth.

## First task

1. First, **verify SSH access to both Sparks** from my laptop:
   - `ssh <spark-user>@<spark-1-ip> hostname` should return `<spark-1-host>`
   - `ssh <spark-user>@<spark-2-ip> hostname` should return `<spark-2-host>`
2. Then **verify the current state of the cluster** via SSH:
   - Confirm `~/spark-vllm-docker` exists on Spark 1 and `launch-cluster.sh` is there: `ssh <spark-user>@<spark-1-ip> 'ls ~/spark-vllm-docker/launch-cluster.sh'`
   - Check which LLM (if any) is currently loaded: `ssh <spark-user>@<spark-1-ip> 'cd ~/spark-vllm-docker && ./launch-cluster.sh status'` and `ssh <spark-user>@<spark-1-ip> 'curl -s http://localhost:8888/v1/models'`
   - Verify which models are downloaded: `ssh <spark-user>@<spark-1-ip> 'ls ~/.cache/huggingface/hub/ | grep -iE "qwen|gemma"'`
   - Specifically check if `Qwen3.6-35B-A3B-NVFP4` is downloaded; if not, that's the prerequisite step (run the `hf-download.sh` command on Spark 1)
   - Check what's running on Spark 2: `ssh <spark-user>@<spark-2-ip> 'docker ps'` (looking for parakeet-asr and possibly magpie-tts)
3. Then create the repo structure on my laptop at `~/Projects/spark-control/`
4. Then propose the design for `models.yaml` and the swap script before implementing

Ask me anything that's unclear before starting.