# spark-control runbook

Operating notes for running and maintaining the cluster via spark-control.

## Prerequisites (per Spark)

spark-control is a **controller**, not a runtime. Each Spark in your cluster must already have the upstream `eugr/spark-vllm-docker` project set up:

1. Clone `https://github.com/eugr/spark-vllm-docker` to `~/spark-vllm-docker` on Spark 1 (the head node).
2. Build the vLLM container: `./build-and-copy.sh -c` (on a cluster) or `./build-and-copy.sh` (solo).
3. Pre-download any models you want in the catalog: `./hf-download.sh <repo> -c --copy-parallel`.
4. Verify that `./launch-cluster.sh status` runs cleanly and reports a sensible cluster state.
5. Set up passwordless SSH from your Start9 server's spark-control container to each Spark (use the Show Public Key action; see README.md "Post-install setup").

To share this package with someone who has a similar dual-DGX-Spark setup: they complete the same per-Spark prerequisites, then sideload the `.s9pk` on their Start9 and run the setup actions.

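
Step 5 of the prerequisites can be checked from any shell that holds the same private key. The sketch below is an illustration only: `check_spark_ssh` is a hypothetical helper, not part of spark-control, and it assumes a standard OpenSSH client.

```bash
# Sketch: confirm key-based SSH works without a password prompt.
# check_spark_ssh is a hypothetical helper name, not part of spark-control.
check_spark_ssh() {
  local user="$1"; shift
  local host failed=0
  for host in "$@"; do
    # BatchMode=yes makes ssh fail instead of prompting for a password.
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "${user}@${host}" true; then
      echo "ok   ${host}"
    else
      echo "FAIL ${host}"
      failed=1
    fi
  done
  return "$failed"
}
```

Call it as `check_spark_ssh <spark-user> <spark-1-host>.local <spark-2-host>.local`. A non-zero exit status means at least one Spark still asks for a password or is unreachable.
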
## Recent successful swaps

- **2026-05-12: gemma4 → qwen36** via `POST /api/swap` from the laptop dev server. About 5 min 30 s to "Application startup complete." Inference works (`/v1/chat/completions` returns reasoning content in the `reasoning` field). `--moe_backend=flashinfer_cutlass` was confirmed valid by vLLM (logged "Using 'FLASHINFER_CUTLASS' NvFp4 MoE backend").

## Optimization flags (added 2026-05-12)

Aligned the `vllm_args` for `gemma4` and `qwen36` with the project's sibling recipes (`qwen3.5-35b-a3b-fp8.yaml`, `gemma4-26b-a4b.yaml`):

- `--load-format=fastsafetensors`: faster cold-start weight load.
- `--enable-prefix-caching`: reuse cached prefix tokens (e.g. the system prompt) across requests.
- `--kv-cache-dtype=fp8`: store the KV cache in 8-bit floating point, which roughly halves KV-cache memory per active context compared with a 16-bit cache.

These take effect on the **next swap to that model**. If a swap fails after this change with errors mentioning fastsafetensors, prefix caching, or fp8, revert the entry in `models.yaml` and retry.

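
The memory effect of `--kv-cache-dtype=fp8` can be estimated with simple arithmetic. For a standard attention layout, the KV cache per token is about 2 (K and V) × layers × KV heads × head dimension × bytes per element. The dimensions below are hypothetical placeholders, not the real values for gemma4 or qwen36, and models with sliding-window or hybrid attention will deviate from this estimate.

```bash
# Bytes of KV cache per token: 2 (K and V) * layers * kv_heads * head_dim * bytes/element.
kv_bytes_per_token() {
  local layers="$1" kv_heads="$2" head_dim="$3" bytes_per_elem="$4"
  echo $(( 2 * layers * kv_heads * head_dim * bytes_per_elem ))
}

# Hypothetical model dimensions, for illustration only.
LAYERS=48 KV_HEADS=8 HEAD_DIM=128 CONTEXT=32768

bf16=$(kv_bytes_per_token "$LAYERS" "$KV_HEADS" "$HEAD_DIM" 2)  # 16-bit cache
fp8=$(kv_bytes_per_token "$LAYERS" "$KV_HEADS" "$HEAD_DIM" 1)   # 8-bit cache

echo "16-bit: $(( bf16 * CONTEXT / 1024 / 1024 )) MiB per ${CONTEXT}-token context"
echo "fp8:    $(( fp8 * CONTEXT / 1024 / 1024 )) MiB per ${CONTEXT}-token context"
```

With these placeholder numbers the script prints 6144 MiB for the 16-bit cache and 3072 MiB for fp8. Substitute the values from the model's `config.json` to get a figure for a real catalog entry.
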
## Day-to-day

- The UI lives at `http://<your-start9>.local:9999` once the StartOS package is installed and configured.
- Status auto-refreshes every 5 s.
- A swap takes 3–6 minutes depending on the model. Keep the tab open if you can; if you close it, the swap continues, and reopening the UI re-attaches you to the log stream.

## Adding a new model

1. Add an entry to `image/models.yaml`. Required fields: `display_name`, `repo`, `size_gb`, `mode` (`solo` or `cluster`), `vllm_args`. Optional but recommended: `description` (one paragraph covering what the model is, what it's good for, and how it differs from the others; it renders below the meta tags in each card), `capabilities` (tags like `[vision, reasoning, tools]`), `expected_ready_seconds`.
2. Confirm the weights are on the Spark: `ssh <spark-user>@<spark-1-host>.local 'ls ~/.cache/huggingface/hub/'`. If not, download with `./hf-download.sh <repo>` on Spark 1.
3. Rebuild and redeploy the package: `cd package && make x86 && make install`.

If `description` is omitted, the card simply hides that section, so there is no need to populate it for every model. Keep descriptions generic (not user-specific) so the catalog stays portable.

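
For the weights check in step 2, it helps to know what to look for in the `ls` output: the Hugging Face hub cache stores each repo under a directory named `models--<org>--<name>`. The helpers below are a minimal sketch with hypothetical names (`hf_cache_dir`, `weights_present`). They only test that the directory exists, not that the download completed.

```bash
# Map a Hugging Face repo id to its hub cache directory name,
# e.g. "org/name" -> "models--org--name".
hf_cache_dir() {
  local repo="$1"
  echo "models--${repo//\//--}"
}

# Succeeds if the repo's directory exists under the hub cache root
# (second argument; defaults to the standard per-user location).
weights_present() {
  local repo="$1" hub_root="${2:-$HOME/.cache/huggingface/hub}"
  [ -d "${hub_root}/$(hf_cache_dir "$repo")" ]
}
```

Run on Spark 1, `weights_present RedHatAI/gemma-4-31B-it-NVFP4 || ./hf-download.sh RedHatAI/gemma-4-31B-it-NVFP4` would download only when the directory is missing. Both helpers need bash because of the `${var//pattern/replacement}` expansion.
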
## Manual swap fallback

If the UI is unavailable and you need to swap by hand:

```bash
ssh <spark-user>@<spark-1-host>.local
cd ~/spark-vllm-docker
./launch-cluster.sh stop
./launch-cluster.sh --solo -d exec vllm serve RedHatAI/gemma-4-31B-it-NVFP4 \
  --port 8888 --host 0.0.0.0 --gpu-memory-utilization 0.8 \
  --max-model-len 32768 --reasoning-parser gemma4 \
  --tool-call-parser gemma4 --enable-auto-tool-choice
docker logs -f vllm_node  # wait for "Application startup complete."
```

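
Instead of watching the log by eye, readiness can be polled over HTTP. This is a sketch, not part of spark-control: `wait_for_vllm` is a hypothetical helper, and it assumes that a successful response from `/v1/models` on port 8888 means the server has finished starting.

```bash
# Poll the OpenAI-compatible /v1/models endpoint until it answers or we time out.
# Arguments: base URL, timeout in seconds (default 600), poll interval in seconds (default 10).
wait_for_vllm() {
  local base_url="$1" timeout="${2:-600}" interval="${3:-10}"
  local waited=0
  # -s: quiet, -f: treat HTTP errors as failure, --max-time: cap each attempt.
  until curl -sf --max-time 5 "${base_url}/v1/models" >/dev/null; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "timed out after ${waited}s waiting for ${base_url}" >&2
      return 1
    fi
    sleep "$interval"
    waited=$(( waited + interval ))
  done
  echo "vLLM is serving at ${base_url}"
}
```

For example, `wait_for_vllm http://<spark-1-ip>:8888 600` waits up to ten minutes, which leaves headroom over the 3–6 minute swaps noted above.
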
## Diagnostics

```bash
# Is vLLM serving?
curl -s http://<spark-1-ip>:8888/v1/models | jq .

# Cluster status (containers up?)
ssh <spark-user>@<spark-1-host>.local 'cd ~/spark-vllm-docker && ./launch-cluster.sh status'

# Tail current model's logs
ssh <spark-user>@<spark-1-host>.local 'docker logs --tail 200 -f vllm_node'

# Parakeet
curl -s http://<spark-2-ip>:8000/health

# Magpie (see known-issues.md)
curl -s http://<spark-2-ip>:9000/v1/health/ready
```

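
When a swap fails after a flag change, it can help to narrow the log to lines that mention fastsafetensors, prefix caching, or fp8. The filter below is a rough sketch with a hypothetical name (`filter_flag_errors`). It also matches ordinary informational lines that mention these features, so treat it as a way to shorten the log, not as an error detector.

```bash
# Read log text on stdin and print only the lines that mention
# fastsafetensors, prefix caching, or fp8 (case-insensitive).
filter_flag_errors() {
  # "|| true" keeps the exit status at 0 when nothing matches.
  grep -Ei 'fastsafetensors|prefix[-_ ]?caching|fp8' || true
}
```

A possible invocation is `ssh <spark-user>@<spark-1-host>.local 'docker logs --tail 500 vllm_node 2>&1' | filter_flag_errors`. The `2>&1` matters because `docker logs` replays the container's stderr on stderr.
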
## Hard reset

If `launch-cluster.sh` gets stuck:

```bash
ssh <spark-user>@<spark-1-host>.local
cd ~/spark-vllm-docker
./launch-cluster.sh stop
# Force-removes ALL containers on this host, not only the vLLM ones.
docker ps -aq | xargs -r docker rm -f
# then relaunch your preferred model
```
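
If the Spark also runs containers that should survive a reset, a narrower removal is possible. The sketch below assumes that every container to be removed has `vllm` in its name. Only `vllm_node` is confirmed by this runbook, so check `docker ps -a` first. `remove_vllm_containers` is a hypothetical helper name.

```bash
# Remove only containers whose name contains "vllm" (e.g. vllm_node),
# leaving unrelated containers on the host untouched.
remove_vllm_containers() {
  local ids
  ids="$(docker ps -aq --filter "name=vllm")"
  if [ -z "$ids" ]; then
    echo "no vllm containers found"
    return 0
  fi
  # Word splitting is intentional here: one container ID per argument.
  # shellcheck disable=SC2086
  docker rm -f $ids
}
```

Run it on the Spark in place of the `docker ps -aq | xargs -r docker rm -f` line, then relaunch the model as usual.
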