Files
Keysat 8d839e3714 v0.13.0:4 - redaction gateway, embeddings proxy, expanded audio API
- Add redaction gateway (redaction_gateway.py, redaction/ scrub + tests)
- Add embeddings proxy and spark_embed service (Dockerfile + main.py)
- Expand audio_proxy with speaker-aware handling; deep_health/health/server updates
- Package: configureSparks action + sparkConfig model updates, manifest/main wiring
- Docs: AUDIO_API, EMBEDDINGS, REDACTION_GATEWAY; HANDOFF and runbook/known-issues refresh
2026-06-11 17:45:57 -05:00

169 lines
11 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Spark Control — handoff guide
You've received a `spark-control.s9pk` file. This guide gets you from "fresh install" to "working dashboard" in about an hour, most of which is waiting for downloads.
## What this is
Spark Control is a StartOS 0.4 package that runs on your Start9 server and gives you a browser dashboard for a **dual-DGX-Spark vLLM cluster**. From the dashboard you can:
- See which LLM is currently loaded
- Swap to a different LLM with one click (live log streaming until ready)
- Download new LLM weights from HuggingFace
- Install and monitor audio services (Parakeet STT, Kokoro TTS, Sortformer diarization)
- Expose OpenAI-compatible endpoints (`/v1/chat/completions`, `/v1/audio/transcriptions`, `/v1/audio/speech`, etc.) to other apps on your LAN through a single trusted host
It does **not** run any models itself — it's a controller. The actual GPU work happens on your two Sparks. Spark Control SSHes into Spark 1 to invoke `launch-cluster.sh`, and HTTP-polls both Sparks for health.
---
## Prerequisites before installing the s9pk
You need all of the following set up **first**. The s9pk assumes they exist.
### Hardware
- A **Start9 server** running StartOS 0.4.x with sideload-install enabled.
- **Two NVIDIA DGX Sparks** (or similar boxes with NVIDIA GPUs + Docker). One will be "Spark 1" (head node) and one will be "Spark 2" (worker node + audio services). They must be on the same LAN as the Start9 server.
### Spark 1 (the head node)
- A Linux user account you can SSH into (any username — `ubuntu`, `nvidia`, your own — just be consistent). Note the username; you'll enter it later.
- **Docker + NVIDIA Container Toolkit** installed and working.
- **`~/spark-vllm-docker/`** cloned from the community repo:
```bash
git clone https://github.com/eugr/spark-vllm-docker ~/spark-vllm-docker
cd ~/spark-vllm-docker
./build-and-copy.sh -c # builds the vLLM container image
```
> **The path matters.** Spark Control hardcodes `~/spark-vllm-docker` as the working directory for cluster commands. If you clone it elsewhere, the dashboard's swap and download actions will silently fail.
- A HuggingFace cache at `~/.cache/huggingface/hub/`. Either pre-download one model now, or use the dashboard's "Download a new model" button after install.
### Spark 2 (the worker node)
- Same Linux user account as Spark 1, with passwordless SSH from Spark 1 working.
- **Docker + NVIDIA Container Toolkit** installed.
- That's it — the rest can be installed through the Spark Control dashboard once it's running.
### Optional but recommended
- An **NVIDIA NGC personal API key** if you want to install Parakeet (STT) from `nvcr.io`. Free: <https://ngc.nvidia.com/setup/personal-key>. Starts with `nvapi-...`. (Not needed for Kokoro — it's Apache 2.0 and pulls from a public GitHub Container Registry image with no auth.)
---
## Install steps
### 1. Sideload the s9pk
In your Start9 web UI, go to **Sideload Service** and upload the `spark-control_*.s9pk` file (x86_64 or aarch64 depending on your Start9). Install it.
### 2. Start the service once
The first start generates an ed25519 SSH keypair inside the package volume. Wait until the service shows "Running" status — should take only a few seconds.
### 3. Show the public key and install it on both Sparks
- Open Spark Control → **Actions → Show Public Key**.
- If you haven't run Configure Sparks yet, you'll just see the raw key. Skip to step 4, then come back here.
- Once Configure Sparks is filled in, this action produces a **ready-to-paste install command** (a multi-line `ssh ... 'echo ... >> authorized_keys'` block). Copy the entire block.
- Run it in a terminal on a machine that already has SSH access to your Sparks. You'll be prompted for each Spark's SSH password once. After it completes, the Start9 server can SSH into both Sparks.
### 4. Configure Sparks
- Open Spark Control → **Actions → Configure Sparks**.
- Fill in:
- **Spark 1 hostname or IP** — prefer the **IP** (e.g. `192.168.1.x`) over `.local` hostnames; vLLM only binds IPv4 and mDNS can resolve to IPv6 first.
- **Spark 1 SSH user** — whatever username you set up on Spark 1.
- **Spark 2 hostname or IP** + **SSH user** — same idea.
- Optional Parakeet/Kokoro overrides — leave blank if those services run on Spark 2 (the normal case).
- Optional **Open WebUI URL** — paste your Open WebUI LAN URL to get a deep-link button in the dashboard next to the current model.
- Optional **NGC API key** — paste it here if you have one.
Save.
### 5. Re-run Show Public Key (if you skipped earlier)
Now that hosts are configured, Show Public Key will give you the paste-ready install command. Run it as described in step 3.
### 6. Open the Web UI
From the Spark Control service page, click the Web UI button. You should see:
- A **top status bar** with the currently loaded LLM (or "no model loaded" if Spark 1's vLLM container is fresh).
- An **LLM tab** with cards for each model in the bundled catalog. Models you've downloaded show "on disk" badges; others show "not downloaded".
- An **Audio / Speech tab** with health status and Install / Start / Stop / Restart buttons for Parakeet and Kokoro.
If the dashboard loads and both Spark hardware cards show CPU/RAM/GPU stats, **you're in**.
### 7. Load your first LLM
Click **"Switch to this"** on any model card. The dashboard will:
1. SSH into Spark 1, stop any running vLLM container.
2. Run `launch-cluster.sh` with the model's bundled flags.
3. Stream `docker logs -f` back to your browser until `Application startup complete.` appears.
4. Mark the new model as active.
Typical times: solo-mode models (Qwen3.6, Gemma 4) take ~35 min. Cluster-mode models (Qwen3-VL 235B) take ~58 min — they have to coordinate across both Sparks via Ray.
### 8. (Optional) install audio services
From the Audio / Speech tab, click **Install Parakeet**. This pulls and starts the parakeet-asr container on Spark 2 with appropriate settings. Takes ~23 min for the first install.
For diarization with speaker fingerprints, also click **Reapply patches** — that overlays Sortformer + TitaNet support onto the parakeet container. The patches survive `docker restart` but are wiped by `docker rm`; if you ever recreate the container, re-run Reapply patches.
Kokoro TTS is similar — pull `ghcr.io/remsky/kokoro-fastapi-gpu:latest` on Spark 2 and run with `--gpus all -p 8880:8880`. No NGC key required (Kokoro is Apache 2.0). Boots in ~5 seconds and uses only ~1.3 GB of GPU memory. (A one-click Kokoro install action is planned for a near-future release; for now you can install it manually or Spark Control will pick it up automatically once it's running on port 8880.)
---
## Endpoints exposed to your other apps
Once Spark Control is healthy, your other LAN apps can hit it as a single trusted backend:
| Path | Backend | Notes |
|---|---|---|
| `GET /api/endpoints` | (self) | Service discovery — JSON of base_urls + ready flags. Hit this first so you don't have to hardcode Spark IPs in other apps. |
| `POST /v1/chat/completions` | vLLM on Spark 1 | OpenAI-compatible; supports `stream: true` |
| `POST /v1/completions` | vLLM on Spark 1 | Legacy OpenAI completions |
| `POST /v1/audio/transcriptions` | Parakeet on Spark 2 | OpenAI-compatible STT |
| `POST /v1/audio/speech` | Kokoro on Spark 2 | OpenAI-compatible TTS. Default voice `bm_george`; pass `voice` to pick any of Kokoro's 67 voices. Reliable at any input length (no chunking/retry needed). |
| `POST /api/audio/diarize-chunk` | Sortformer + TitaNet | Per-chunk diarization with voice fingerprints for cross-chunk re-clustering |
| `POST /api/audio/transcribe-with-speakers` | Parakeet + Sortformer | One-shot transcribe + diarize, merged |
All of these inherit Spark Control's TLS cert and StartOS access controls. You only need one allowlist entry in downstream apps.
---
## Operational notes
- **vLLM does not auto-load a model after a power loss.** When your Sparks reboot, the dashboard will show "no model loaded" — you click "Switch to this" on whichever LLM you want. Parakeet/Kokoro auto-restart with their containers (Kokoro is `--restart unless-stopped` and Parakeet runs the same way).
- **Single-slot chunked workflows.** If you're calling `/v1/audio/transcriptions` or `/api/audio/diarize-chunk` in chunked workflows, send chunks **sequentially**, not in parallel. Parallel requests can trigger a known cuFFT race on the Spark 2 GPU that returns a 503 + Retry-After. Spark Control recovers automatically but each retry costs ~60s.
- **Context window**: the bundled Qwen3.6 entry runs at 64K total tokens (input + output combined). Adjust per-model via the Advanced button on each card.
- **Update path**: model-catalog overrides and custom services live in `/data/*` inside the volume; they survive s9pk updates.
---
## Resources
- `README.md` — repo overview, build instructions, dev environment
- `runbook.md` — model-swap recipes and operating notes
- `known-issues.md` — debugging fixes (Mamba block-size, vision token budget, historical Magpie notes, etc.)
- Source: `image/` is the FastAPI app; `package/` is the StartOS wrapper. The s9pk build is `cd package && make x86` (or `aarch64`).
---
## If you're an AI agent helping with this install
A few things worth knowing:
- The codebase is **two halves**: `image/` is a standalone FastAPI app you can run with `uvicorn app.server:app` for local dev. `package/` is the StartOS wrapper. Changes to either should be coordinated.
- **All connection info** comes from environment variables in `image/app/config.py`, populated from `package/startos/fileModels/sparkConfig.yaml.ts` via the Configure Sparks action. No IPs, usernames, or paths are hardcoded in runtime code.
- The **path `~/spark-vllm-docker`** *is* hardcoded in `swap.py`, `download.py`, `updates.py`, and `models.py`. If the user has cloned the upstream repo elsewhere, either fix the path or symlink it.
- **Persistent state** lives at `/data/` inside the container: `config.yaml`, `models-overrides.yaml`, `services-overrides.yaml`, `connectivity.json`, `ssh/`. These survive package updates.
- The dashboard polls every 5 s; check `image/app/health.py` and `image/app/connectivity.py` for the probing logic. External apps can also POST failures to `/api/health-event` to log between-poll blips.
- Debugging audio issues: SSH into Spark 2 and run `docker logs --tail 100 parakeet-asr`. cuFFT errors usually mean parallel requests; see the operational note above.
- Debugging LLM swaps: the swap log is streamed in the browser, but the underlying `docker logs -f vllm_node` on Spark 1 is the ground truth.
- The package supports both `x86_64` and `aarch64` builds. Match your Start9 server architecture.