spark-control/claude-code-starter-prompt.md

# Project: spark-control — Model switcher web UI for dual DGX Spark cluster

> **Update 2026-05-12 — Direction change:** the web UI is being built as a
> **StartOS 0.4 package** (sideloaded onto Alice's existing Start9 server),
> **not** as a FastAPI service running directly on Spark 1. The Start9 server
> shares a LAN with the Sparks and SSHes into Spark 1 to invoke
> `launch-cluster.sh`. StartOS handles `.local` exposure and HTTPS; SSH
> credentials live in a per-install config file managed by a "Configure Sparks"
> action. See <https://docs.start9.com/packaging/0.4.0.x/> for the packaging
> model. Repo layout:
>
> - `image/` — Docker image source (FastAPI app, runs anywhere with `uvicorn`).
> - `package/` — StartOS 0.4 wrapper (manifest, main, interfaces, actions).
>
> The "Phase 4: Deploy" section below (systemd on Spark 1) is **superseded** by
> the StartOS sideload workflow. Other phases (models.yaml schema, swap script,
> FastAPI endpoints, frontend) still apply but live inside `image/`.

## Goal

I want to build a small web service that gives me a browser-based interface to:

1. See which LLM is currently loaded on my DGX Spark cluster
2. Click a button to swap to a different model
3. See real-time status as the swap progresses (stop → launch → ready)
4. See basic health info about supporting services (Parakeet STT, eventually Magpie TTS)

The UI should live at a stable URL on my LAN so I can bookmark it. I'll likely access it from my laptop and phone.

## Where this project lives

This repo lives on **my laptop** (macOS). The Sparks are servers — we control them remotely over SSH. Claude Code runs on my laptop, makes edits in the local repo, and executes commands on the Sparks via SSH.

In the original plan, the deployed web UI runs on **Spark 1** (where it can invoke `launch-cluster.sh` directly), with code pushed there via `rsync`, `scp`, or `git pull` as needed. Per the update at the top, it now ships as a StartOS package that SSHes into Spark 1 instead. Either way, development happens on my laptop.

## SSH setup

From my laptop I can SSH to either Spark directly:

```bash
ssh <spark-user>@<spark-1-ip>   # Spark 1
ssh <spark-user>@<spark-2-ip>    # Spark 2
```

(SSH key auth was set up earlier, so passwordless login also works.)

When you need to run a command on a Spark, use this pattern:

```bash
ssh <spark-user>@<spark-1-ip> 'cd ~/spark-vllm-docker && ./launch-cluster.sh status'
```
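
The same pattern can be sketched in Python for the backend or a helper script. This is a minimal sketch, not a fixed design: the function names are hypothetical, and `BatchMode=yes` is an added assumption (it makes `ssh` fail rather than prompt when key auth is unavailable).

```python
import shlex
from typing import List


def cluster_command(subcommand: str, repo_dir: str = "~/spark-vllm-docker") -> str:
    """Compose the remote shell line for one launch-cluster.sh subcommand."""
    # repo_dir stays unquoted so the remote shell expands the leading tilde.
    return f"cd {repo_dir} && ./launch-cluster.sh {shlex.quote(subcommand)}"


def ssh_argv(user: str, host: str, remote_command: str) -> List[str]:
    """Build the argv that runs one command string on a Spark over SSH."""
    return ["ssh", "-o", "BatchMode=yes", f"{user}@{host}", remote_command]


# Example with placeholder values; substitute the real user and address.
argv = ssh_argv("spark-user", "spark-1-ip", cluster_command("status"))
print(shlex.join(argv))  # shows the command for review before it is run
```

Passing `argv` to `subprocess.run(argv, capture_output=True, text=True)` would execute it. Printing first matches the "tell me what you're about to run" rule.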

For multi-line commands or scripts, you can pipe a heredoc or just SSH in directly and run them interactively. Either works — but always tell me what you're about to run so I can review.

For file transfers between my laptop and the Sparks, use `rsync`:

```bash
rsync -avz ~/Projects/spark-control/ <spark-user>@<spark-1-ip>:~/spark-control/
```

## My hardware and what's running

**Two NVIDIA DGX Spark units** networked together:

- **Spark 1** — hostname `<spark-1-host>`, LAN IP `<spark-1-ip>`, QSFP IP `<spark-1-qsfp-ip>`. Head node for the vLLM cluster.
- **Spark 2** — hostname `<spark-2-host>`, LAN IP `<spark-2-ip>`, QSFP IP `<spark-2-qsfp-ip>`. Worker node for vLLM cluster, also hosts standalone services.

Both run Ubuntu 24.04, NVIDIA driver 580.x, CUDA 13.0, Docker, and have 128 GB unified memory each. They share a QSFP cable for high-speed (200 Gb/s) inter-node networking.

Passwordless SSH works in both directions via the `~/.ssh/<ssh-key>` key. My Linux username on both machines is `<spark-user>`.

**Currently running:**
- One LLM at a time on the cluster (via the `eugr/spark-vllm-docker` project — see below)
- `parakeet-asr` Docker container on Spark 2 (port 8000) — running 24/7 for speech-to-text, healthy for weeks
- `magpie-tts` Docker container on Spark 2 (port 9000) — setup was in progress and its current state is unknown, so verifying it is part of the first task
- Open WebUI runs on a separate Start9 server on the LAN (not on the Sparks), accessing the LLM via HTTP

## The LLM cluster: how it works

I use the **`eugr/spark-vllm-docker`** community project (cloned to `~/spark-vllm-docker` on Spark 1). It manages a Ray-based vLLM cluster across both Sparks, with a wrapper script called `launch-cluster.sh` that handles starting/stopping Docker containers on both nodes.

Key commands (all run from `~/spark-vllm-docker` on Spark 1):
- `./launch-cluster.sh status` — see what's running on both nodes
- `./launch-cluster.sh stop` — stop the cluster
- `./launch-cluster.sh -d exec vllm serve ...` — launch in daemon mode with vLLM args
- `./launch-cluster.sh --solo -d exec vllm serve ...` — same but only on Spark 1 (for smaller models)
- `docker logs -f vllm_node` — tail vLLM logs

Container names: `vllm_node` (the main vLLM container), `ray_head` and `ray_worker` (Ray cluster), plus support containers.

The vLLM server binds to port **8888** and exposes an OpenAI-compatible API at `http://<spark-1-ip>:8888/v1`.

## Models I have on disk (both Sparks)

All weights live in `~/.cache/huggingface/hub/` on each Spark:

1. **`RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4`** (~135 GB) — flagship MoE, runs across both Sparks (-tp 2), has vision capability. Use for: maximum quality, vision input, multilingual.

2. **`RedHatAI/gemma-4-31B-it-NVFP4`** (~23 GB) — runs solo on Spark 1, has vision, has thinking-mode reasoning. Use for: math/reasoning-heavy tasks. Has a known vLLM Triton-attention slowdown bug (~15-20 tok/s vs theoretical 30-40).

3. **`RedHatAI/Qwen3.6-35B-A3B-NVFP4`** (~20 GB) — newer-generation Qwen MoE (35B total / 3B active), runs solo on Spark 1, expected to be the fastest (~70-100 tok/s) and my new daily driver. **Note: this may still be downloading or may not be downloaded yet — first task is to verify and download if needed.**
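
To check which of these are present, it helps to know that the Hugging Face hub cache stores each repo in a folder named `models--<org>--<name>`. The sketch below maps repo IDs to those folder names and compares them with the output of `ls ~/.cache/huggingface/hub/`. The function names are hypothetical, and a folder that exists may still hold a partial download.

```python
from typing import Iterable, List


def cache_dir_name(repo_id: str) -> str:
    """Map 'org/name' to the folder name used by the Hugging Face hub cache."""
    return "models--" + repo_id.replace("/", "--")


def missing_models(repo_ids: Iterable[str], hub_listing: str) -> List[str]:
    """Return repo IDs whose cache folder is absent from an `ls` listing."""
    present = set(hub_listing.split())
    return [r for r in repo_ids if cache_dir_name(r) not in present]


REPOS = [
    "RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4",
    "RedHatAI/gemma-4-31B-it-NVFP4",
    "RedHatAI/Qwen3.6-35B-A3B-NVFP4",
]

# Illustrative listing only; real input would come from the Spark over SSH.
sample_listing = (
    "models--RedHatAI--Qwen3-VL-235B-A22B-Instruct-NVFP4\n"
    "models--RedHatAI--gemma-4-31B-it-NVFP4\n"
)
print(missing_models(REPOS, sample_listing))
```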

## Exact launch commands for each model

These are the commands my system needs to run when I click a swap button.

### Qwen3-VL-235B (uses both Sparks)
```bash
cd ~/spark-vllm-docker
./launch-cluster.sh stop
./launch-cluster.sh -d exec vllm serve \
  RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4 \
  --port 8888 --host 0.0.0.0 \
  --gpu-memory-utilization 0.7 \
  -tp 2 \
  --distributed-executor-backend ray \
  --max-model-len 32768
```
Expected ready time: ~3-5 min after stop completes.

### Gemma 4 31B (solo on Spark 1)
```bash
cd ~/spark-vllm-docker
./launch-cluster.sh stop
./launch-cluster.sh --solo -d exec vllm serve \
  RedHatAI/gemma-4-31B-it-NVFP4 \
  --port 8888 --host 0.0.0.0 \
  --gpu-memory-utilization 0.8 \
  --max-model-len 32768 \
  --reasoning-parser gemma4 \
  --tool-call-parser gemma4 \
  --enable-auto-tool-choice
```
Expected ready time: ~3-4 min.

### Qwen3.6-35B-A3B (solo on Spark 1) — new daily driver
```bash
cd ~/spark-vllm-docker
./launch-cluster.sh stop
./launch-cluster.sh --solo -d exec vllm serve \
  RedHatAI/Qwen3.6-35B-A3B-NVFP4 \
  --port 8888 --host 0.0.0.0 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 65536 \
  --reasoning-parser qwen3 \
  --moe_backend flashinfer_cutlass
```
Expected ready time: ~3-5 min.

Note: the `--moe_backend flashinfer_cutlass` flag is Blackwell-specific. If the launch errors on it, the fallback is to remove that flag and relaunch.
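
One way to keep these commands declarative is a small registry that the backend turns into command lines. The sketch below is a Python stand-in for whatever `models.yaml` ends up holding. The `ModelSpec` shape, the key names, and `drop_flag` are assumptions for illustration. Only two of the three models are shown; Gemma would follow the same pattern.

```python
import shlex
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class ModelSpec:
    repo_id: str
    solo: bool                   # True adds --solo (Spark 1 only)
    extra_args: Tuple[str, ...]  # model-specific vLLM flags


# Flags shared by every launch command in this document.
COMMON_ARGS = ("--port", "8888", "--host", "0.0.0.0")

MODELS: Dict[str, ModelSpec] = {
    "qwen3-vl": ModelSpec(
        "RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4", False,
        ("--gpu-memory-utilization", "0.7", "-tp", "2",
         "--distributed-executor-backend", "ray", "--max-model-len", "32768"),
    ),
    "qwen36": ModelSpec(
        "RedHatAI/Qwen3.6-35B-A3B-NVFP4", True,
        ("--gpu-memory-utilization", "0.85", "--max-model-len", "65536",
         "--reasoning-parser", "qwen3", "--moe_backend", "flashinfer_cutlass"),
    ),
}


def launch_command(spec: ModelSpec) -> str:
    """Render the launch-cluster.sh line for one model."""
    parts = ["./launch-cluster.sh"]
    if spec.solo:
        parts.append("--solo")
    parts += ["-d", "exec", "vllm", "serve", spec.repo_id, *COMMON_ARGS, *spec.extra_args]
    return shlex.join(parts)


def drop_flag(args: Tuple[str, ...], flag: str) -> Tuple[str, ...]:
    """Remove a flag and the single value that follows it."""
    out, skip = [], False
    for arg in args:
        if skip:
            skip = False
        elif arg == flag:
            skip = True
        else:
            out.append(arg)
    return tuple(out)
```

The fallback then becomes a retry with `drop_flag(spec.extra_args, "--moe_backend")` instead of a hand-edited command.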

### Common operations
- Stop everything: `./launch-cluster.sh stop`
- Status check: `./launch-cluster.sh status`
- See vLLM logs: `docker logs vllm_node` (add `-f` to follow)
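- Hard reset if stuck: `./launch-cluster.sh stop && docker ps -aq | xargs -r docker rm -f` (the second half force-removes **every** container on the node it runs on, so use it on Spark 1 only; on Spark 2 it would also remove `parakeet-asr` and `magpie-tts`. Because of the `&&`, the cleanup runs only if `stop` exits successfully.)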
- Health check (is API responding?): `curl -s http://<spark-1-ip>:8888/v1/models`
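
The health check also reveals which model is loaded, because an OpenAI-compatible `/v1/models` response lists served models under `data[].id`. A minimal parsing sketch follows (the function name is hypothetical). It treats an empty or malformed body, as `curl -s` gives when nothing is listening, as "no model loaded".

```python
import json
from typing import Optional


def loaded_model(body: str) -> Optional[str]:
    """Return the first model id from a /v1/models response body, or None."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return None  # empty or non-JSON body: API not up yet
    if not isinstance(payload, dict):
        return None
    data = payload.get("data")
    if not isinstance(data, list) or not data:
        return None
    first = data[0]
    return first.get("id") if isinstance(first, dict) else None


# Illustrative body only, trimmed to the fields this function reads.
sample = '{"object": "list", "data": [{"id": "RedHatAI/gemma-4-31B-it-NVFP4"}]}'
print(loaded_model(sample))
```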

### "Ready" signal
The model is ready to serve when `docker logs vllm_node` contains the line `Application startup complete.` Until then, it's still loading weights or compiling CUDA graphs.
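
A sketch of how that signal could be detected from a stream of log lines. The 600-second default mirrors the ~10 min timeout mentioned in Phase 2, and the function name and injectable `clock` are assumptions. One limitation: the deadline is only checked when a new line arrives, so a silently stalled stream needs an outer timeout as well.

```python
import time
from typing import Callable, Iterable

READY_MARKER = "Application startup complete."


def wait_for_ready(
    lines: Iterable[str],
    timeout_s: float = 600.0,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Return True once the ready marker appears, False on timeout or end of stream."""
    deadline = clock() + timeout_s
    for line in lines:
        if READY_MARKER in line:
            return True
        if clock() > deadline:
            return False
    return False  # stream ended without the marker


# Illustrative lines only; real input would be `docker logs -f vllm_node`.
print(wait_for_ready(["loading weights", "INFO: Application startup complete."]))
```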

## Supporting services on Spark 2 (always-on, separate from cluster)

These don't get touched by model swaps:

- **`parakeet-asr`** — STT on port 8000. Already running 24/7. Verify with `curl http://<spark-2-ip>:8000/health` which should return `{"status":"ready",...}`.
- **`magpie-tts`** — TTS on port 9000. May or may not be running; verify with `docker ps` on Spark 2 and `curl http://<spark-2-ip>:9000/v1/health/ready`.
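
For the UI's green/red indicators, the two checks can be reduced to booleans. This sketch assumes the Parakeet body is JSON with `"status": "ready"` as stated above. Treating an HTTP 200 from Magpie's readiness endpoint as healthy is an assumption, since its response format is not described here. All function names are hypothetical.

```python
import json


def parakeet_ready(body: str) -> bool:
    """True when the /health body is JSON with status == 'ready'."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and payload.get("status") == "ready"


def magpie_ready(http_status: int) -> bool:
    """Assumption: a 200 from /v1/health/ready means the service is up."""
    return http_status == 200


def indicator(ok: bool) -> str:
    """Map a health boolean to the UI colour."""
    return "green" if ok else "red"
```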

## What I want you to build

### Phase 1: Set up the project repo (start here)

Create a Git repo at `~/Projects/spark-control/` on **my laptop**. Initial structure:

```
spark-control/
├── README.md
├── models.yaml              # Declarative config for each model
├── scripts/
│   ├── swap-model.sh        # Universal swap script
│   ├── status.sh            # Cluster + service status
│   └── health.sh            # Health checks for everything
├── web-ui/
│   ├── server.py            # FastAPI backend
│   ├── static/
│   │   ├── index.html       # Toggle UI
│   │   ├── style.css
│   │   └── app.js           # State management, polling
│   └── requirements.txt
├── runbook.md               # Operating notes
└── known-issues.md          # Gotchas, troubleshooting
```

### Phase 2: Build the universal swap script

`scripts/swap-model.sh <model-key>` should:
1. Read the launch command from `models.yaml` by key (e.g. `qwen3-vl`, `gemma4`, `qwen36`)
2. Stop the current cluster (via SSH to Spark 1)
3. Run the new launch command (via SSH to Spark 1)
4. Tail logs until "Application startup complete" appears or a timeout (~10 min) hits
5. Return exit code 0 on success, non-zero on failure

Two versions might be useful:
- The version that runs on **my laptop** — wraps everything in `ssh <spark-user>@<spark-1-ip> ...`
- A simpler version that lives on **Spark 1** — runs commands directly without SSH (used by the deployed web UI)

You can either share one script with a `--remote` flag, or make them two distinct files. Your call — propose the cleaner option.
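
To help weigh the two options, here is a sketch of the five steps with the transport factored out. The swap logic takes a `run` callable, so the same function serves the laptop (SSH) and Spark 1 (local). Everything here is hypothetical: the names, the exit codes, and the choice of Python over shell.

```python
from typing import Callable, Dict, Iterable, List, Optional

REPO = "cd ~/spark-vllm-docker && "
READY_MARKER = "Application startup complete"


def argv_for(command: str, remote: Optional[str] = None) -> List[str]:
    """Wrap a shell line for local execution, or for SSH when remote is set."""
    if remote:
        return ["ssh", remote, command]
    return ["bash", "-c", command]


def swap_model(
    model_key: str,
    launch_commands: Dict[str, str],
    run: Callable[[str], int],
    log_lines: Callable[[], Iterable[str]],
) -> int:
    """Run the swap steps; return 0 on success, non-zero on failure."""
    launch = launch_commands.get(model_key)      # step 1: look up by key
    if launch is None:
        return 2
    if run(REPO + "./launch-cluster.sh stop") != 0:  # step 2: stop cluster
        return 3
    if run(REPO + launch) != 0:                  # step 3: launch new model
        return 4
    for line in log_lines():                     # step 4: watch the logs
        if READY_MARKER in line:
            return 0                             # step 5: success
    return 5  # log stream ended (or was cut by a timeout) before ready
```

Here `run` would call `subprocess.run(argv_for(cmd, remote)).returncode`, with `remote` set to `<spark-user>@<spark-1-ip>` on the laptop and left as `None` on Spark 1.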

### Phase 3: Build the web UI

FastAPI backend that:
- `GET /api/status` → JSON with `{current_model, ready, parakeet_health, magpie_health, last_swap_time}`
- `POST /api/swap` with `{model_key}` → starts swap, returns swap job ID
- `GET /api/swap/{job_id}/stream` → Server-Sent Events streaming swap progress
- `GET /` → serves the HTML UI
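
The payload side of these endpoints can be sketched without committing to FastAPI details. The helpers below build the status JSON with the field names listed above, mint a swap job ID, and frame one Server-Sent Events message (each `data:` line carries one line of text, and a blank line ends the message). The function names and the `log` event name are assumptions.

```python
import json
import uuid
from typing import Optional


def status_payload(
    current_model: Optional[str],
    ready: bool,
    parakeet_health: bool,
    magpie_health: bool,
    last_swap_time: Optional[str],
) -> str:
    """Serialize the body for GET /api/status."""
    return json.dumps({
        "current_model": current_model,
        "ready": ready,
        "parakeet_health": parakeet_health,
        "magpie_health": magpie_health,
        "last_swap_time": last_swap_time,
    })


def new_job_id() -> str:
    """Generate an opaque ID to return from POST /api/swap."""
    return uuid.uuid4().hex


def sse_event(data: str, event: Optional[str] = None) -> str:
    """Frame one Server-Sent Events message for the swap stream."""
    lines = []
    if event:
        lines.append(f"event: {event}")
    # Multi-line data must be split into one `data:` field per line.
    for part in data.splitlines() or [""]:
        lines.append(f"data: {part}")
    return "\n".join(lines) + "\n\n"
```

In the browser, an `EventSource` pointed at `/api/swap/{job_id}/stream` would receive each framed message as it is written.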

Frontend should:
- Show a card per model with a "Switch to this" button
- Highlight which model is currently loaded
- During a swap, show streaming log output and a spinner
- Show a green/red indicator for Parakeet and Magpie health
- Auto-refresh every 5 seconds

Keep the UI simple, clean, dark-themed. No frameworks needed — vanilla HTML/JS is fine.

### Phase 4: Deploy and make it persistent

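**Superseded:** per the update at the top, the StartOS sideload workflow replaces this phase. The original plan is kept here for reference. In that plan, the web UI runs on **Spark 1** so it can invoke `launch-cluster.sh` directly, without SSH overhead. To deploy: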

1. `rsync` the project code from my laptop to `~/spark-control/` on Spark 1
2. Set up a Python virtual environment on Spark 1 and install requirements
3. Create a systemd service file that starts the FastAPI server on boot
4. Service should listen on `0.0.0.0:9999` so I can hit it from any device on my LAN
5. Add a simple deploy script (`scripts/deploy.sh`) on my laptop that does the rsync + restart in one command for future iteration

## Working style

- Before making changes that affect the running cluster, please ask me first.
- When you write commands you want me to run, give them in clearly marked code blocks.
- Distinguish clearly when a command is meant to run on my laptop vs. on a Spark (which means via SSH).
- If you need information about the current state of the Sparks, ask me to run a diagnostic SSH command and paste the output — or run it yourself if you have shell access.
- Test things incrementally. Don't build the whole UI before validating the swap script works.
- I'm a layman — explain technical decisions briefly in plain English when they involve trade-offs.
- When making changes that modify files on a Spark, do them by editing in my laptop's repo first and then deploying — not by editing on the Spark directly. That keeps my laptop as the source of truth.

## First task

1. First, **verify SSH access to both Sparks** from my laptop:
   - `ssh <spark-user>@<spark-1-ip> hostname` should return `<spark-1-host>`
   - `ssh <spark-user>@<spark-2-ip> hostname` should return `<spark-2-host>`
2. Then **verify the current state of the cluster** via SSH:
   - Confirm `~/spark-vllm-docker` exists on Spark 1 and `launch-cluster.sh` is there: `ssh <spark-user>@<spark-1-ip> 'ls ~/spark-vllm-docker/launch-cluster.sh'`
   - Check which LLM (if any) is currently loaded: `ssh <spark-user>@<spark-1-ip> 'cd ~/spark-vllm-docker && ./launch-cluster.sh status'` and `ssh <spark-user>@<spark-1-ip> 'curl -s http://localhost:8888/v1/models'`
   - Verify which models are downloaded: `ssh <spark-user>@<spark-1-ip> 'ls ~/.cache/huggingface/hub/ | grep -iE "qwen|gemma"'`
   - Specifically check if `Qwen3.6-35B-A3B-NVFP4` is downloaded; if not, that's the prerequisite step (run the `hf-download.sh` command on Spark 1)
   - Check what's running on Spark 2: `ssh <spark-user>@<spark-2-ip> 'docker ps'` (looking for parakeet-asr and possibly magpie-tts)
3. Then create the repo structure on my laptop at `~/Projects/spark-control/`
4. Then propose the design for `models.yaml` and the swap script before implementing

Ask me anything that's unclear before starting.