spark-control/claude-code-starter-prompt.md
Grant ae8efa1754 Initial scaffold: image/ FastAPI app, models.yaml, docs
- image/ FastAPI app: /api/status, /api/swap, /api/swap/{id}/stream, /api/test-connection
- models.yaml: 5-model catalog (qwen3-vl, gemma4, qwen36, qwen3-235b-fp8, qwen25-72b)
- README, runbook, known-issues
- Dry-run swap verified against live Spark 1 (gemma4 currently loaded)
2026-05-12 09:29:13 -05:00

Project: spark-control — Model switcher web UI for dual DGX Spark cluster

Goal

I want to build a small web service that gives me a browser-based interface to:

  1. See which LLM is currently loaded on my DGX Spark cluster
  2. Click a button to swap to a different model
  3. See real-time status as the swap progresses (stop → launch → ready)
  4. See basic health info about supporting services (Parakeet STT, eventually Magpie TTS)

The UI should live at a stable URL on my LAN so I can bookmark it. I'll likely access it from my laptop and phone.

Where this project lives

This repo lives on my laptop (macOS). The Sparks are servers — we control them remotely over SSH. Claude Code runs on my laptop, makes edits in the local repo, and executes commands on the Sparks via SSH.

The web UI itself, when deployed, will run on Spark 1 (where it can directly invoke launch-cluster.sh), but development happens on my laptop. We'll deploy the code to Spark 1 via rsync or scp or git pull as needed.

SSH setup

From my laptop I can SSH to either Spark directly:

ssh <spark-user>@<spark-1-ip>   # Spark 1
ssh <spark-user>@<spark-2-ip>   # Spark 2

(SSH key auth is already set up, so these connect without a password prompt.)

When you need to run a command on a Spark, use this pattern:

ssh <spark-user>@<spark-1-ip> 'cd ~/spark-vllm-docker && ./launch-cluster.sh status'

For multi-line commands or scripts, either pipe a heredoc to ssh or open an interactive SSH session and run them there. Either works, but always tell me what you're about to run so I can review.
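
A minimal Python sketch of the SSH pattern above, in case the laptop-side tooling ends up in Python rather than shell. The function names `ssh_argv` and `run_on_spark` are hypothetical, and the user and host values stand in for the placeholders used in this document. `BatchMode=yes` is a standard OpenSSH option that makes ssh fail instead of prompting for a password, which is reasonable here because key auth is already in place.

```python
import shlex
import subprocess


def ssh_argv(user: str, host: str, remote_command: str) -> list[str]:
    """Build the argv for running one command on a Spark over SSH."""
    # BatchMode=yes: never prompt for a password; fail fast if key auth breaks.
    return ["ssh", "-o", "BatchMode=yes", f"{user}@{host}", remote_command]


def run_on_spark(user: str, host: str, remote_command: str, timeout: float = 60.0):
    """Print the command for review, then run it and capture its output."""
    argv = ssh_argv(user, host, remote_command)
    print("About to run:", shlex.join(argv))
    return subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
```

For example, `run_on_spark("<spark-user>", "<spark-1-ip>", "cd ~/spark-vllm-docker && ./launch-cluster.sh status")` would print the full ssh command before executing it, which keeps the "tell me what you're about to run" rule visible in the output.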

For file transfers between my laptop and the Sparks, use rsync:

rsync -avz ~/Projects/spark-control/ <spark-user>@<spark-1-ip>:~/spark-control/

My hardware and what's running

Two NVIDIA DGX Spark units networked together:

  • Spark 1 — hostname <spark-1-host>, LAN IP <spark-1-ip>, QSFP IP <spark-1-qsfp-ip>. Head node for the vLLM cluster.
  • Spark 2 — hostname <spark-2-host>, LAN IP <spark-2-ip>, QSFP IP <spark-2-qsfp-ip>. Worker node for vLLM cluster, also hosts standalone services.

Both run Ubuntu 24.04, NVIDIA driver 580.x, CUDA 13.0, Docker, and have 128 GB unified memory each. They share a QSFP cable for high-speed (200 Gb/s) inter-node networking.

Passwordless SSH works in both directions between the Sparks using the key at ~/.ssh/<ssh-key>. My Linux username on both machines is <spark-user>.

Currently running:

  • One LLM at a time on the cluster (via the eugr/spark-vllm-docker project — see below)
  • parakeet-asr Docker container on Spark 2 (port 8000) — running 24/7 for speech-to-text, healthy for weeks
  • magpie-tts Docker container on Spark 2 (port 9000) — was being set up; I'm not 100% sure of its current state; first task is to verify
  • Open WebUI runs on a separate Start9 server on the LAN (not on the Sparks), accessing the LLM via HTTP

The LLM cluster: how it works

I use the eugr/spark-vllm-docker community project (cloned to ~/spark-vllm-docker on Spark 1). It manages a Ray-based vLLM cluster across both Sparks, with a wrapper script called launch-cluster.sh that handles starting/stopping Docker containers on both nodes.

Key commands (all run from ~/spark-vllm-docker on Spark 1):

  • ./launch-cluster.sh status — see what's running on both nodes
  • ./launch-cluster.sh stop — stop the cluster
  • ./launch-cluster.sh -d exec vllm serve ... — launch in daemon mode with vLLM args
  • ./launch-cluster.sh --solo -d exec vllm serve ... — same but only on Spark 1 (for smaller models)
  • docker logs -f vllm_node — tail vLLM logs

Container names: vllm_node (the main vLLM container), ray_head and ray_worker (Ray cluster), plus support containers.

The vLLM server binds to port 8888 and exposes an OpenAI-compatible API at http://<spark-1-ip>:8888/v1.
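
As a rough sketch of how the backend could tell which model is loaded, the snippet below parses the response of the OpenAI-compatible `/v1/models` endpoint, which returns a JSON object with a `data` list of entries that each carry an `id`. The function names are hypothetical, and it is an assumption that the `id` vLLM reports matches the repo name passed to `vllm serve` (the default when no served-model name is set).

```python
import json
from urllib.request import urlopen


def parse_model_ids(body: str) -> list[str]:
    """Extract model ids from an OpenAI-style /v1/models JSON body."""
    payload = json.loads(body)
    return [entry["id"] for entry in payload.get("data", []) if "id" in entry]


def fetch_model_ids(base_url: str, timeout: float = 3.0) -> list[str]:
    """Return the served model ids, or [] if the API is not answering."""
    try:
        with urlopen(f"{base_url}/models", timeout=timeout) as resp:
            return parse_model_ids(resp.read().decode("utf-8"))
    except (OSError, ValueError):
        # OSError covers refused connections, timeouts and HTTP errors;
        # ValueError covers a body that is not valid JSON.
        return []
```

Calling `fetch_model_ids("http://<spark-1-ip>:8888/v1")` during a swap would simply return an empty list until vLLM starts answering, which the UI can treat as "nothing loaded yet".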

Models I have on disk (both Sparks)

All weights live in ~/.cache/huggingface/hub/ on each Spark:

  1. RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4 (~135 GB) — flagship MoE, runs across both Sparks (-tp 2), has vision capability. Use for: maximum quality, vision input, multilingual.

  2. RedHatAI/gemma-4-31B-it-NVFP4 (~23 GB) — runs solo on Spark 1, has vision, has thinking-mode reasoning. Use for: math/reasoning-heavy tasks. Has a known vLLM Triton-attention slowdown bug (~15-20 tok/s vs theoretical 30-40).

  3. RedHatAI/Qwen3.6-35B-A3B-NVFP4 (~20 GB) — newer-generation Qwen MoE (35B total / 3B active), runs solo on Spark 1, expected to be the fastest (~70-100 tok/s) and my new daily driver. Note: this may still be downloading or may not be downloaded yet — first task is to verify and download if needed.

Exact launch commands for each model

These are the commands my system needs to run when I click a swap button.

Qwen3-VL-235B (uses both Sparks)

cd ~/spark-vllm-docker
./launch-cluster.sh stop
./launch-cluster.sh -d exec vllm serve \
  RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4 \
  --port 8888 --host 0.0.0.0 \
  --gpu-memory-utilization 0.7 \
  -tp 2 \
  --distributed-executor-backend ray \
  --max-model-len 32768

Expected ready time: ~3-5 min after stop completes.

Gemma 4 31B (solo on Spark 1)

cd ~/spark-vllm-docker
./launch-cluster.sh stop
./launch-cluster.sh --solo -d exec vllm serve \
  RedHatAI/gemma-4-31B-it-NVFP4 \
  --port 8888 --host 0.0.0.0 \
  --gpu-memory-utilization 0.8 \
  --max-model-len 32768 \
  --reasoning-parser gemma4 \
  --tool-call-parser gemma4 \
  --enable-auto-tool-choice

Expected ready time: ~3-4 min.

Qwen3.6-35B-A3B (solo on Spark 1) — new daily driver

cd ~/spark-vllm-docker
./launch-cluster.sh stop
./launch-cluster.sh --solo -d exec vllm serve \
  RedHatAI/Qwen3.6-35B-A3B-NVFP4 \
  --port 8888 --host 0.0.0.0 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 65536 \
  --reasoning-parser qwen3 \
  --moe_backend flashinfer_cutlass

Expected ready time: ~3-5 min.

Note: the --moe_backend flashinfer_cutlass flag is Blackwell-specific. If it errors on launch, fallback is to remove that flag.
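
The fallback described in the note could be sketched like this, assuming the launch arguments are kept as a list of tokens and the flag is written as a separate token followed by exactly one value (as in the command above). The helper name `drop_flag` is hypothetical.

```python
def drop_flag(argv: list[str], flag: str) -> list[str]:
    """Return a copy of argv without `flag` and the single value after it."""
    result = []
    skip_next = False
    for token in argv:
        if skip_next:
            # This token is the value belonging to the removed flag.
            skip_next = False
            continue
        if token == flag:
            skip_next = True
            continue
        result.append(token)
    return result


QWEN36_ARGS = [
    "--port", "8888", "--host", "0.0.0.0",
    "--gpu-memory-utilization", "0.85",
    "--max-model-len", "65536",
    "--reasoning-parser", "qwen3",
    "--moe_backend", "flashinfer_cutlass",
]

# Arguments to retry with if the first launch fails on the MoE backend flag.
FALLBACK_ARGS = drop_flag(QWEN36_ARGS, "--moe_backend")
```

The original list is left untouched, so the swap logic can try `QWEN36_ARGS` first and only use `FALLBACK_ARGS` on a second attempt.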

Common operations

  • Stop everything: ./launch-cluster.sh stop
  • Status check: ./launch-cluster.sh status
  • See vLLM logs: docker logs vllm_node (add -f to follow)
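  • Hard reset if stuck: ./launch-cluster.sh stop && docker ps -aq | xargs -r docker rm -f (this force-removes every Docker container on the node where it runs, so run it on Spark 1 only, never on Spark 2 where parakeet-asr lives)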
  • Health check (is API responding?): curl -s http://<spark-1-ip>:8888/v1/models

"Ready" signal

The model is ready to serve when docker logs vllm_node contains the line Application startup complete. Until then, it's still loading weights or compiling CUDA graphs.
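
A minimal sketch of the ready check, assuming log lines arrive as an iterable of strings. The function name `wait_for_ready` and the `clock` parameter (there only to make the timeout testable) are hypothetical. One limitation to be aware of: the deadline is only checked when a new line arrives, so a log stream that goes completely silent would need an outer timeout as well.

```python
import time
from typing import Callable, Iterable

READY_MARKER = "Application startup complete"


def wait_for_ready(
    lines: Iterable[str],
    timeout_s: float = 600.0,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Return True once the ready marker appears, False on timeout or EOF."""
    deadline = clock() + timeout_s
    for line in lines:
        if READY_MARKER in line:
            return True
        if clock() > deadline:
            return False
    # The stream ended (for example the container exited) without the marker.
    return False
```

In practice `lines` could be the output of `docker logs -f vllm_node` started with `subprocess.Popen(..., stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)`; merging stderr matters because `docker logs` replays the container's stderr stream separately from stdout.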

Supporting services on Spark 2 (always-on, separate from cluster)

These don't get touched by model swaps:

  • parakeet-asr — STT on port 8000. Already running 24/7. Verify with curl http://<spark-2-ip>:8000/health which should return {"status":"ready",...}.
  • magpie-tts — TTS on port 9000. May or may not be running; verify with docker ps on Spark 2 and curl http://<spark-2-ip>:9000/v1/health/ready.
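
The two checks above could feed the green/red indicators roughly as follows. This is a sketch under assumptions: only the `status` field of the Parakeet response is known from this document, and since the Magpie response body is not described here, that endpoint is treated as healthy whenever it answers with HTTP 200. Both function names are hypothetical.

```python
import json
from urllib.request import urlopen


def parakeet_is_ready(body: str) -> bool:
    """True if the /health body is JSON with status == "ready"."""
    try:
        payload = json.loads(body)
    except ValueError:
        return False
    return isinstance(payload, dict) and payload.get("status") == "ready"


def http_ok(url: str, timeout: float = 3.0) -> bool:
    """True if a GET to `url` returns HTTP 200 within the timeout."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Refused connection, timeout, or a 4xx/5xx response.
        return False
```

`http_ok("http://<spark-2-ip>:9000/v1/health/ready")` returning False then covers both "container not running" and "running but not ready", which is enough for a red indicator.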

What I want you to build

Phase 1: Set up the project repo (start here)

Create a Git repo at ~/Projects/spark-control/ on my laptop. Initial structure:

spark-control/
├── README.md
├── models.yaml              # Declarative config for each model
├── scripts/
│   ├── swap-model.sh        # Universal swap script
│   ├── status.sh            # Cluster + service status
│   └── health.sh            # Health checks for everything
├── web-ui/
│   ├── server.py            # FastAPI backend
│   ├── static/
│   │   ├── index.html       # Toggle UI
│   │   ├── style.css
│   │   └── app.js           # State management, polling
│   └── requirements.txt
├── runbook.md               # Operating notes
└── known-issues.md          # Gotchas, troubleshooting

Phase 2: Build the universal swap script

scripts/swap-model.sh <model-key> should:

  1. Read the launch command from models.yaml by key (e.g. qwen3-vl, gemma4, qwen36)
  2. Stop the current cluster (via SSH to Spark 1)
  3. Run the new launch command (via SSH to Spark 1)
  4. Tail logs until "Application startup complete" appears or a timeout (~10 min) hits
  5. Return exit code 0 on success, non-zero on failure

Two versions might be useful:

  • The version that runs on my laptop — wraps everything in ssh <spark-user>@<spark-1-ip> ...
  • A simpler version that lives on Spark 1 — runs commands directly without SSH (used by the deployed web UI)

You can either share one script with a --remote flag, or make them two distinct files. Your call — propose the cleaner option.
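
To make the trade-off concrete, here is one possible shape for the shared logic, written in Python for illustration rather than as the final script. It assumes each models.yaml entry boils down to a repo name, a solo flag, and a list of vLLM arguments; the `ModelSpec` fields and function names are hypothetical, and the sketch does not settle the one-script-versus-two question.

```python
import shlex
from dataclasses import dataclass, field
from typing import Optional

CLUSTER_DIR = "~/spark-vllm-docker"


@dataclass
class ModelSpec:
    key: str                      # short name, e.g. "gemma4"
    repo: str                     # Hugging Face repo passed to `vllm serve`
    solo: bool                    # True: Spark 1 only; False: both Sparks
    vllm_args: list[str] = field(default_factory=list)


def launch_command(spec: ModelSpec) -> str:
    """Build the launch-cluster.sh command line for one model."""
    parts = ["./launch-cluster.sh"]
    if spec.solo:
        parts.append("--solo")
    parts += ["-d", "exec", "vllm", "serve", spec.repo, *spec.vllm_args]
    return shlex.join(parts)


def swap_steps(spec: ModelSpec, ssh_target: Optional[str] = None) -> list[list[str]]:
    """Return the stop and launch steps as argv lists.

    With ssh_target set (laptop mode) each step is wrapped in ssh;
    without it (running on Spark 1) each step runs through a local shell.
    """
    steps = [
        f"cd {CLUSTER_DIR} && ./launch-cluster.sh stop",
        f"cd {CLUSTER_DIR} && {launch_command(spec)}",
    ]
    if ssh_target is None:
        return [["bash", "-c", step] for step in steps]
    return [["ssh", ssh_target, step] for step in steps]
```

The only difference between the two modes is the wrapper around each step, which suggests a single implementation with a remote switch is feasible if that turns out to be the cleaner option.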

Phase 3: Build the web UI

FastAPI backend that:

  • GET /api/status → JSON with {current_model, ready, parakeet_health, magpie_health, last_swap_time}
  • POST /api/swap with {model_key} → starts swap, returns swap job ID
  • GET /api/swap/{job_id}/stream → Server-Sent Events streaming swap progress
  • GET / → serves the HTML UI
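
For the streaming endpoint, each progress update has to be framed as a Server-Sent Events message: an optional `event:` line, a `data:` line, and a blank line to end the message. A small framework-independent sketch of that framing is below; the function name `format_sse` and the `stage` and `line` field names in the example are hypothetical.

```python
import json
from typing import Optional


def format_sse(data: dict, event: Optional[str] = None) -> str:
    """Frame one JSON payload as a Server-Sent Events message."""
    lines = []
    if event:
        lines.append(f"event: {event}")
    # json.dumps without indent emits a single line, so one data: field suffices.
    lines.append(f"data: {json.dumps(data)}")
    # A blank line terminates the message.
    return "\n".join(lines) + "\n\n"


print(format_sse({"stage": "launch", "line": "Loading weights"}, event="progress"), end="")
```

A FastAPI route could yield strings like these from a generator wrapped in a streaming response with media type `text/event-stream`, and the browser side can read them with the standard `EventSource` API.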

Frontend should:

  • Show a card per model with a "Switch to this" button
  • Highlight which model is currently loaded
  • During a swap, show streaming log output and a spinner
  • Show a green/red indicator for Parakeet and Magpie health
  • Auto-refresh every 5 seconds

Keep the UI simple, clean, dark-themed. No frameworks needed — vanilla HTML/JS is fine.

Phase 4: Deploy and make it persistent

The web UI runs on Spark 1 so it can directly invoke launch-cluster.sh without SSH overhead. To deploy:

  1. rsync the project code from my laptop to ~/spark-control/ on Spark 1
  2. Set up a Python virtual environment on Spark 1 and install requirements
  3. Create a systemd service file that starts the FastAPI server on boot
  4. Service should listen on 0.0.0.0:9999 so I can hit it from any device on my LAN
  5. Add a simple deploy script (scripts/deploy.sh) on my laptop that does the rsync + restart in one command for future iteration
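
The rsync-plus-restart step could look roughly like this, shown in Python for illustration. Several things here are assumptions rather than facts about my setup: the systemd unit name `spark-control`, the excluded directories, and that `sudo systemctl restart` works over SSH without a password prompt. The function name `deploy_commands` is hypothetical.

```python
import os


def deploy_commands(
    local_dir: str,
    ssh_target: str,
    remote_dir: str = "~/spark-control",
    service: str = "spark-control",   # hypothetical unit name
) -> list[list[str]]:
    """Return the commands for one deploy: sync the repo, restart the service."""
    # Expand ~ locally because argv lists are not passed through a shell.
    # The trailing slash makes rsync copy the directory's contents, not the
    # directory itself.
    src = os.path.expanduser(local_dir).rstrip("/") + "/"
    return [
        ["rsync", "-avz", "--exclude", ".git", "--exclude", ".venv",
         src, f"{ssh_target}:{remote_dir}/"],
        ["ssh", ssh_target, f"sudo systemctl restart {service}"],
    ]
```

Each returned list can be printed for review and then passed to `subprocess.run(cmd, check=True)`, so a failed rsync stops the deploy before the service is restarted.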

Working style

  • Before making changes that affect the running cluster, please ask me first.
  • When you write commands you want me to run, give them in clearly marked code blocks.
  • Distinguish clearly when a command is meant to run on my laptop vs. on a Spark (which means via SSH).
  • If you need information about the current state of the Sparks, ask me to run a diagnostic SSH command and paste the output — or run it yourself if you have shell access.
  • Test things incrementally. Don't build the whole UI before validating the swap script works.
  • I'm a layman — explain technical decisions briefly in plain English when they involve trade-offs.
  • When making changes that modify files on a Spark, do them by editing in my laptop's repo first and then deploying — not by editing on the Spark directly. That keeps my laptop as the source of truth.

First task

  1. First, verify SSH access to both Sparks from my laptop:
    • ssh <spark-user>@<spark-1-ip> hostname should return <spark-1-host>
    • ssh <spark-user>@<spark-2-ip> hostname should return <spark-2-host>
  2. Then verify the current state of the cluster via SSH:
    • Confirm ~/spark-vllm-docker exists on Spark 1 and launch-cluster.sh is there: ssh <spark-user>@<spark-1-ip> 'ls ~/spark-vllm-docker/launch-cluster.sh'
    • Check which LLM (if any) is currently loaded: ssh <spark-user>@<spark-1-ip> 'cd ~/spark-vllm-docker && ./launch-cluster.sh status' and ssh <spark-user>@<spark-1-ip> 'curl -s http://localhost:8888/v1/models'
    • Verify which models are downloaded: ssh <spark-user>@<spark-1-ip> 'ls ~/.cache/huggingface/hub/ | grep -iE "qwen|gemma"'
    • Specifically check if Qwen3.6-35B-A3B-NVFP4 is downloaded; if not, that's the prerequisite step (run the hf-download.sh command on Spark 1)
    • Check what's running on Spark 2: ssh <spark-user>@<spark-2-ip> 'docker ps' (looking for parakeet-asr and possibly magpie-tts)
  3. Then create the repo structure on my laptop at ~/Projects/spark-control/
  4. Then propose the design for models.yaml and the swap script before implementing

Ask me anything that's unclear before starting.