Grant ee8c2406b8 v0.6.0 - Service-level connectivity tracking + passive failure-report endpoint
connectivity.py:
- Generalized 'spark' subject to any string; renamed 'spark' field to 'subject'
- Legacy v0.5 events with the old 'spark' field are migrated transparently on read (kind defaults to 'transition')
- New record_report(subject, ok, source, detail, latency_ms): always appends an event with kind='report'; does NOT mutate the current state (only active polling is authoritative)
- summary() returns events normalized to the new schema

Wiring:
- /api/status now calls record_state for vllm/parakeet/magpie (dedup on no-change)
- /api/services calls record_state for each service after its http check
- Result: dashboard observes service-level transitions automatically with no extra polling

Passive endpoint:
- POST /api/health-event with {service, ok, source?, error?, ms?}
- Useful for external apps (e.g. Open WebUI) to surface sub-poll-interval failures the dashboard would otherwise miss

UI:
- Connectivity dialog groups events by subject (hosts ordered first, then services)
- Per-subject summary shows transition count, down count, report count, failed-report count
- Transitions and reports render inline with distinct styling; reports show source app + error + latency
- Legacy v0.5 events render unchanged

Docs:
- README documents /api/health-event with a curl example

Package: bump to 0.6.0:0
2026-05-12 13:19:27 -05:00

# spark-control
A browser-based control panel for a dual-DGX-Spark vLLM cluster. Designed to run as a [StartOS 0.4](https://docs.start9.com/packaging/0.4.0.x/) package on a Start9 server on the same LAN as the Sparks.
## What it does
- Shows which LLM is currently loaded on the cluster (`:8888/v1/models`).
- Click to swap to a different model — stops the current one, launches the new one, streams logs to the UI until `Application startup complete.` appears.
- Surfaces health for Parakeet (STT, `:8000`) and Magpie (TTS, `:9000`) on Spark 2.
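
The log streaming described above stops once the ready marker shows up. A minimal sketch of that idea, assuming log lines arrive as an iterable of strings; `stream_until_ready` is a hypothetical helper for illustration, not the function the app actually uses:

```python
from typing import Iterable, Iterator

# Marker text taken from the README; vLLM's server prints it once it is serving.
READY_MARKER = "Application startup complete."


def stream_until_ready(lines: Iterable[str]) -> Iterator[str]:
    """Yield log lines one by one, stopping right after the ready marker.

    Hypothetical helper: the real app may detect readiness differently.
    """
    for line in lines:
        yield line
        if READY_MARKER in line:
            return


if __name__ == "__main__":
    fake_log = [
        "INFO: loading weights",
        "INFO: Application startup complete.",
        "INFO: this line is never forwarded",
    ]
    for line in stream_until_ready(fake_log):
        print(line)
```
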
## Architecture
```
[Browser/phone] ──► [StartOS reverse proxy] ──► [spark-control container]
                                                            │ (SSH over LAN)
                                                [Spark 1] ──► launch-cluster.sh
                                                [Spark 2]
```
Two layers in this repo:
- `image/` — a self-contained FastAPI app + static UI. Runs anywhere with `uvicorn` and an SSH client. Useful for development.
- `package/` — a thin StartOS 0.4 wrapper that packages the image, exposes the UI on the LAN, and gives the user actions to configure SSH access to the Sparks.
## Quick start (local dev, no StartOS yet)
```bash
cd image
python3 -m venv .venv && source .venv/bin/activate
pip install -e .
export SPARK1_HOST=<spark-1-ip>
export SPARK1_USER=<spark-user>
export SPARK2_HOST=<spark-2-ip>
export SPARK2_USER=<spark-user>
export SSH_KEY_PATH="$HOME/Library/Application Support/NVIDIA/Sync/config/nvsync.key"
uvicorn app.server:app --host 0.0.0.0 --port 9999 --reload
```
Open <http://localhost:9999>.
> **Note:** use the **IP** `<spark-1-ip>` for Spark 1, not `<spark-1-host>.local`. mDNS resolves to IPv6 first and `httpx` hangs on it because vLLM only binds IPv4.
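
If you only know the Spark's hostname, one way to find the IPv4 address to put in `SPARK1_HOST` is to ask the resolver for IPv4 results only. This is a sketch using the standard library; `resolve_ipv4` is a hypothetical helper and is not part of spark-control:

```python
import socket


def resolve_ipv4(host: str) -> str:
    """Return the first IPv4 address for `host`.

    Restricting the lookup to AF_INET skips any IPv6 (AAAA) answers.
    Raises socket.gaierror if the name has no IPv4 address.
    """
    infos = socket.getaddrinfo(
        host, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for AF_INET, sockaddr is an (address, port) pair.
    return infos[0][4][0]


if __name__ == "__main__":
    # An IPv4 literal resolves to itself; pass your Spark's hostname instead.
    print(resolve_ipv4("127.0.0.1"))
```
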
## Build the StartOS package
```bash
cd package
npm i # one-time
make x86 # produces spark-control_x86_64.s9pk (~55 MB)
```
Requires [`start-cli`](https://docs.start9.com/latest/developer-guide/sdk/installing-the-sdk), Node ≥ 22, Docker. The build runs `tsc` + `ncc` for the TS bundle, then `docker build` on `image/Dockerfile`, then `start-cli s9pk pack` to produce the `.s9pk`.
To sideload onto your Start9: `make install` (needs `host:` set in `~/.startos/config.yaml`), or upload the `.s9pk` via the Start9 web UI's sideload feature.
## Post-install setup (one-time per Start9 install)
1. Open the Spark Control service → **Actions** → **Show Public Key** → copy the line.
2. SSH to each Spark and append the line to `~/.ssh/authorized_keys` for the `<spark-user>` user.
3. **Actions** → **Configure Sparks** → enter `<spark-1-ip>` / `<spark-user>` for Spark 1 and `<spark-2-ip>` / `<spark-user>` for Spark 2.
4. Start the service. Open the Web UI — current model + health should show within ~5 s.
## Repo layout
- `image/` — Docker image source (FastAPI app + `models.yaml`)
- `package/` — StartOS 0.4 package source
- `runbook.md` — operating notes
- `known-issues.md` — known quirks and workarounds
- `LICENSE` — MIT
## Service discovery API
Other services on your LAN can hit `GET /api/endpoints` to learn where the current model lives without hardcoding Spark IPs. Stable JSON shape:
```json
{
  "vllm":     { "ready": true,  "base_url": "http://<spark-1-ip>:8888/v1", "model": "RedHatAI/Qwen3.6-35B-A3B-NVFP4", "openai_compat": true },
  "parakeet": { "ready": true,  "base_url": "http://<spark-2-ip>:8000", "kind": "stt", "model": "nvidia/parakeet-tdt-0.6b-v3" },
  "magpie":   { "ready": false, "base_url": "http://<spark-2-ip>:9000", "kind": "tts" }
}
```
`base_url` is filled in once Configure Sparks has been completed, even if the underlying service isn't currently up. Route traffic to a `base_url` only when the same entry also reports `ready: true`.
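
A client might apply that rule roughly as follows. This is a sketch built only on the JSON shape shown above; `fetch_endpoints` and `pick_base_url` are hypothetical names, and the `192.0.2.x` addresses are documentation placeholders rather than real Spark IPs:

```python
import json
import urllib.request
from typing import Optional


def fetch_endpoints(dashboard_url: str, timeout: float = 3.0) -> dict:
    """GET /api/endpoints from the dashboard and decode the JSON body."""
    url = dashboard_url.rstrip("/") + "/api/endpoints"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def pick_base_url(endpoints: dict, name: str) -> Optional[str]:
    """Return the service's base_url only if it is also marked ready."""
    entry = endpoints.get(name) or {}
    if entry.get("ready") and entry.get("base_url"):
        return entry["base_url"]
    return None


if __name__ == "__main__":
    # Sample payload in the documented shape, with placeholder addresses.
    sample = {
        "vllm": {"ready": True, "base_url": "http://192.0.2.10:8888/v1"},
        "magpie": {"ready": False, "base_url": "http://192.0.2.11:9000"},
    }
    print(pick_base_url(sample, "vllm"))
    print(pick_base_url(sample, "magpie"))
```

Returning `None` for a configured but not-ready service lets the caller fall back or retry instead of sending requests to a URL that would fail.
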
## Reporting failures from external apps
Spark Control polls every 5 s, so a brief blip in Parakeet/Magpie/vLLM availability can slip between polls and never make it into the connectivity log. To capture short failures, an external app (e.g. Open WebUI) can POST whenever a call fails (or succeeds):
```bash
curl -X POST http://<dashboard-url>/api/health-event \
  -H 'content-type: application/json' \
  -d '{
    "service": "parakeet",
    "ok": false,
    "source": "open-webui",
    "error": "HTTP 503",
    "ms": 420
  }'
```
Fields: `service` (required), `ok` (required), `source` (optional, free-form), `error` (optional), `ms` (optional latency). Each POST appends a `report` event to the connectivity log alongside the polling-based transition events.
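
The same report can be sent from Python. The sketch below uses only the field names listed above; `build_health_event` and `post_health_event` are hypothetical helpers, and since the endpoint's response body is not documented here, the code only checks for a 2xx status:

```python
import json
import urllib.request
from typing import Optional


def build_health_event(
    service: str,
    ok: bool,
    source: Optional[str] = None,
    error: Optional[str] = None,
    ms: Optional[int] = None,
) -> dict:
    """Build the JSON body, leaving out optional fields that were not given."""
    payload = {"service": service, "ok": ok}
    for key, value in (("source", source), ("error", error), ("ms", ms)):
        if value is not None:
            payload[key] = value
    return payload


def post_health_event(dashboard_url: str, payload: dict, timeout: float = 2.0) -> bool:
    """POST the report; return True on a 2xx response, False on any failure."""
    req = urllib.request.Request(
        dashboard_url.rstrip("/") + "/api/health-event",
        data=json.dumps(payload).encode("utf-8"),
        headers={"content-type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:
        # URLError and HTTPError are OSError subclasses; a failed report
        # should never break the calling app.
        return False


if __name__ == "__main__":
    print(build_health_event("parakeet", False, source="open-webui",
                             error="HTTP 503", ms=420))
```

The short timeout and swallowed errors are a deliberate choice in this sketch: reporting is best-effort, so the reporting app keeps working even when the dashboard itself is unreachable.
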
## Status
**v0.2.3** — installed and verified on a Start9 server. Five bundled LLMs in the catalog (qwen3-vl, gemma4, qwen36, qwen3-235b-fp8, qwen2.5-72b), plus any custom models added through the UI.
### What v0.2 added on top of v0.1
- **Service discovery API** (`/api/endpoints`) for other LAN services
- **Magpie crash fix** documented (chown the model-cache volume to uid 1000)
- **Always-on services panel** with Start/Stop/Restart for Parakeet + Magpie, plus per-service host configuration in Configure Sparks (so Parakeet/Magpie can live on Spark 1, Spark 2, or anywhere)
- **Model download** from the dashboard — paste an HF repo, pick solo or cluster, watch percent progress with bytes/rate/ETA. After completion, an "Add to catalog" dialog appears pre-filled.
- **spark-vllm-docker update check** — banner shows "N commits behind upstream"; Apply Update runs `git pull && ./build-and-copy.sh -c` over SSH with a streamed log
- **Per-model Advanced settings** — knobs for max context, GPU memory %, and three optimization toggles (fastsafetensors, prefix caching, FP8 KV cache). Persisted to `/data/models-overrides.yaml` so they survive package updates. Bundled and custom models alike.
v0.3+ roadmap (loose): richer dashboard (SSH/GPU/tokens-per-sec), Open WebUI deep-link integration, optional auth, multi-cluster.