Files
spark-control/HANDOFF.md
T
Keysat 8d839e3714 v0.13.0:4 - redaction gateway, embeddings proxy, expanded audio API
- Add redaction gateway (redaction_gateway.py, redaction/ scrub + tests)
- Add embeddings proxy and spark_embed service (Dockerfile + main.py)
- Expand audio_proxy with speaker-aware handling; deep_health/health/server updates
- Package: configureSparks action + sparkConfig model updates, manifest/main wiring
- Docs: AUDIO_API, EMBEDDINGS, REDACTION_GATEWAY; HANDOFF and runbook/known-issues refresh
2026-06-11 17:45:57 -05:00

11 KiB
Raw Permalink Blame History

Spark Control — handoff guide

You've received a spark-control.s9pk file. This guide gets you from "fresh install" to "working dashboard" in about an hour, most of which is waiting for downloads.

What this is

Spark Control is a StartOS 0.4 package that runs on your Start9 server and gives you a browser dashboard for a dual-DGX-Spark vLLM cluster. From the dashboard you can:

  • See which LLM is currently loaded
  • Swap to a different LLM with one click (live log streaming until ready)
  • Download new LLM weights from HuggingFace
  • Install and monitor audio services (Parakeet STT, Kokoro TTS, Sortformer diarization)
  • Expose OpenAI-compatible endpoints (/v1/chat/completions, /v1/audio/transcriptions, /v1/audio/speech, etc.) to other apps on your LAN through a single trusted host

It does not run any models itself — it's a controller. The actual GPU work happens on your two Sparks. Spark Control SSHes into Spark 1 to invoke launch-cluster.sh, and HTTP-polls both Sparks for health.


Prerequisites before installing the s9pk

You need all of the following set up first. The s9pk assumes they exist.

Hardware

  • A Start9 server running StartOS 0.4.x with sideload-install enabled.
  • Two NVIDIA DGX Sparks (or similar boxes with NVIDIA GPUs + Docker). One will be "Spark 1" (head node) and one will be "Spark 2" (worker node + audio services). They must be on the same LAN as the Start9 server.

Spark 1 (the head node)

  • A Linux user account you can SSH into (any username — ubuntu, nvidia, your own — just be consistent). Note the username; you'll enter it later.

  • Docker + NVIDIA Container Toolkit installed and working.

  • ~/spark-vllm-docker/ cloned from the community repo:

    git clone https://github.com/eugr/spark-vllm-docker ~/spark-vllm-docker
    cd ~/spark-vllm-docker
    ./build-and-copy.sh -c    # builds the vLLM container image
    

    The path matters. Spark Control hardcodes ~/spark-vllm-docker as the working directory for cluster commands. If you clone it elsewhere, the dashboard's swap and download actions will silently fail.

  • A HuggingFace cache at ~/.cache/huggingface/hub/. Either pre-download one model now, or use the dashboard's "Download a new model" button after install.

Spark 2 (the worker node)

  • Same Linux user account as Spark 1, with passwordless SSH from Spark 1 working.
  • Docker + NVIDIA Container Toolkit installed.
  • That's it — the rest can be installed through the Spark Control dashboard once it's running.
  • An NVIDIA NGC personal API key if you want to install Parakeet (STT) from nvcr.io. Free: https://ngc.nvidia.com/setup/personal-key. Starts with nvapi-.... (Not needed for Kokoro — it's Apache 2.0 and pulls from a public GitHub Container Registry image with no auth.)

Install steps

1. Sideload the s9pk

In your Start9 web UI, go to Sideload Service and upload the spark-control_*.s9pk file (x86_64 or aarch64 depending on your Start9). Install it.

2. Start the service once

The first start generates an ed25519 SSH keypair inside the package volume. Wait until the service shows "Running" status — should take only a few seconds.

3. Show the public key and install it on both Sparks

  • Open Spark Control → Actions → Show Public Key.
  • If you haven't run Configure Sparks yet, you'll just see the raw key. Skip to step 4, then come back here.
  • Once Configure Sparks is filled in, this action produces a ready-to-paste install command (a multi-line ssh ... 'echo ... >> authorized_keys' block). Copy the entire block.
  • Run it in a terminal on a machine that already has SSH access to your Sparks. You'll be prompted for each Spark's SSH password once. After it completes, the Start9 server can SSH into both Sparks.

4. Configure Sparks

  • Open Spark Control → Actions → Configure Sparks.
  • Fill in:
    • Spark 1 hostname or IP — prefer the IP (e.g. 192.168.1.x) over .local hostnames; vLLM only binds IPv4 and mDNS can resolve to IPv6 first.
    • Spark 1 SSH user — whatever username you set up on Spark 1.
    • Spark 2 hostname or IP + SSH user — same idea.
    • Optional Parakeet/Kokoro overrides — leave blank if those services run on Spark 2 (the normal case).
    • Optional Open WebUI URL — paste your Open WebUI LAN URL to get a deep-link button in the dashboard next to the current model.
    • Optional NGC API key — paste it here if you have one.

Save.

5. Re-run Show Public Key (if you skipped earlier)

Now that hosts are configured, Show Public Key will give you the paste-ready install command. Run it as described in step 3.

6. Open the Web UI

From the Spark Control service page, click the Web UI button. You should see:

  • A top status bar with the currently loaded LLM (or "no model loaded" if Spark 1's vLLM container is fresh).
  • An LLM tab with cards for each model in the bundled catalog. Models you've downloaded show "on disk" badges; others show "not downloaded".
  • An Audio / Speech tab with health status and Install / Start / Stop / Restart buttons for Parakeet and Kokoro.

If the dashboard loads and both Spark hardware cards show CPU/RAM/GPU stats, you're in.

7. Load your first LLM

Click "Switch to this" on any model card. The dashboard will:

  1. SSH into Spark 1, stop any running vLLM container.
  2. Run launch-cluster.sh with the model's bundled flags.
  3. Stream docker logs -f back to your browser until Application startup complete. appears.
  4. Mark the new model as active.

Typical times: solo-mode models (Qwen3.6, Gemma 4) take ~35 min. Cluster-mode models (Qwen3-VL 235B) take ~58 min — they have to coordinate across both Sparks via Ray.

8. (Optional) install audio services

From the Audio / Speech tab, click Install Parakeet. This pulls and starts the parakeet-asr container on Spark 2 with appropriate settings. Takes ~23 min for the first install.

For diarization with speaker fingerprints, also click Reapply patches — that overlays Sortformer + TitaNet support onto the parakeet container. The patches survive docker restart but are wiped by docker rm; if you ever recreate the container, re-run Reapply patches.

Kokoro TTS is similar — pull ghcr.io/remsky/kokoro-fastapi-gpu:latest on Spark 2 and run with --gpus all -p 8880:8880. No NGC key required (Kokoro is Apache 2.0). Boots in ~5 seconds and uses only ~1.3 GB of GPU memory. (A one-click Kokoro install action is planned for a near-future release; for now you can install it manually or Spark Control will pick it up automatically once it's running on port 8880.)


Endpoints exposed to your other apps

Once Spark Control is healthy, your other LAN apps can hit it as a single trusted backend:

Path Backend Notes
GET /api/endpoints (self) Service discovery — JSON of base_urls + ready flags. Hit this first so you don't have to hardcode Spark IPs in other apps.
POST /v1/chat/completions vLLM on Spark 1 OpenAI-compatible; supports stream: true
POST /v1/completions vLLM on Spark 1 Legacy OpenAI completions
POST /v1/audio/transcriptions Parakeet on Spark 2 OpenAI-compatible STT
POST /v1/audio/speech Kokoro on Spark 2 OpenAI-compatible TTS. Default voice bm_george; pass voice to pick any of Kokoro's 67 voices. Reliable at any input length (no chunking/retry needed).
POST /api/audio/diarize-chunk Sortformer + TitaNet Per-chunk diarization with voice fingerprints for cross-chunk re-clustering
POST /api/audio/transcribe-with-speakers Parakeet + Sortformer One-shot transcribe + diarize, merged

All of these inherit Spark Control's TLS cert and StartOS access controls. You only need one allowlist entry in downstream apps.


Operational notes

  • vLLM does not auto-load a model after a power loss. When your Sparks reboot, the dashboard will show "no model loaded" — you click "Switch to this" on whichever LLM you want. Parakeet/Kokoro auto-restart with their containers (Kokoro is --restart unless-stopped and Parakeet runs the same way).
  • Single-slot chunked workflows. If you're calling /v1/audio/transcriptions or /api/audio/diarize-chunk in chunked workflows, send chunks sequentially, not in parallel. Parallel requests can trigger a known cuFFT race on the Spark 2 GPU that returns a 503 + Retry-After. Spark Control recovers automatically but each retry costs ~60s.
  • Context window: the bundled Qwen3.6 entry runs at 64K total tokens (input + output combined). Adjust per-model via the Advanced button on each card.
  • Update path: model-catalog overrides and custom services live in /data/* inside the volume; they survive s9pk updates.

Resources

  • README.md — repo overview, build instructions, dev environment
  • runbook.md — model-swap recipes and operating notes
  • known-issues.md — debugging fixes (Mamba block-size, vision token budget, historical Magpie notes, etc.)
  • Source: image/ is the FastAPI app; package/ is the StartOS wrapper. The s9pk build is cd package && make x86 (or aarch64).

If you're an AI agent helping with this install

A few things worth knowing:

  • The codebase is two halves: image/ is a standalone FastAPI app you can run with uvicorn app.server:app for local dev. package/ is the StartOS wrapper. Changes to either should be coordinated.
  • All connection info comes from environment variables in image/app/config.py, populated from package/startos/fileModels/sparkConfig.yaml.ts via the Configure Sparks action. No IPs, usernames, or paths are hardcoded in runtime code.
  • The path ~/spark-vllm-docker is hardcoded in swap.py, download.py, updates.py, and models.py. If the user has cloned the upstream repo elsewhere, either fix the path or symlink it.
  • Persistent state lives at /data/ inside the container: config.yaml, models-overrides.yaml, services-overrides.yaml, connectivity.json, ssh/. These survive package updates.
  • The dashboard polls every 5 s; check image/app/health.py and image/app/connectivity.py for the probing logic. External apps can also POST failures to /api/health-event to log between-poll blips.
  • Debugging audio issues: SSH into Spark 2 and run docker logs --tail 100 parakeet-asr. cuFFT errors usually mean parallel requests; see the operational note above.
  • Debugging LLM swaps: the swap log is streamed in the browser, but the underlying docker logs -f vllm_node on Spark 1 is the ground truth.
  • The package supports both x86_64 and aarch64 builds. Match your Start9 server architecture.