Add safe optimization flags to gemma4 + qwen36 (fastsafetensors, prefix-caching, fp8 kv)
Aligned with sibling recipes in eugr/spark-vllm-docker. Applies on next swap to each model. First real swap gemma4 -> qwen36 succeeded in 5:30 with --moe_backend=flashinfer_cutlass.
This commit is contained in:
@@ -1,5 +1,21 @@
|
||||
# Project: spark-control — Model switcher web UI for dual DGX Spark cluster
|
||||
|
||||
> **Update 2026-05-12 — Direction change:** the web UI is being built as a
|
||||
> **StartOS 0.4 package** (sideloaded onto Alice's existing Start9 server),
|
||||
> **not** as a FastAPI service running directly on Spark 1. The Start9 server
|
||||
> shares a LAN with the Sparks and SSHes into Spark 1 to invoke
|
||||
> `launch-cluster.sh`. StartOS handles `.local` exposure and HTTPS; SSH
|
||||
> credentials live in a per-install config file managed by a "Configure Sparks"
|
||||
> action. See <https://docs.start9.com/packaging/0.4.0.x/> for the packaging
|
||||
> model. Repo layout:
|
||||
>
|
||||
> - `image/` — Docker image source (FastAPI app, runs anywhere with `uvicorn`).
|
||||
> - `package/` — StartOS 0.4 wrapper (manifest, main, interfaces, actions).
|
||||
>
|
||||
> The "Phase 4: Deploy" section below (systemd on Spark 1) is **superseded** by
|
||||
> the StartOS sideload workflow. Other phases (models.yaml schema, swap script,
|
||||
> FastAPI endpoints, frontend) still apply but live inside `image/`.
|
||||
|
||||
## Goal
|
||||
|
||||
I want to build a small web service that gives me a browser-based interface to:
|
||||
|
||||
@@ -2,6 +2,8 @@
|
||||
|
||||
## magpie-tts crash loop (Spark 2)
|
||||
|
||||
**What Magpie is:** NVIDIA's multilingual text-to-speech (TTS) model, served via the NIM (NVIDIA Inference Microservices) framework — a Riva Speech Server container that converts text into spoken audio. It's the counterpart to Parakeet (which is speech-to-text / STT). When working, it exposes `/v1/audio/speech` on port 9000 and is used by clients like Open WebUI for the "read aloud" feature.
|
||||
|
||||
The `magpie-tts` container at `nvcr.io/nim/nvidia/magpie-tts-multilingual:latest` is in a restart loop and `:9000` is not reachable. **Status as of 2026-05-12: unfixed. UI surfaces a red dot.**
|
||||
|
||||
**Root cause (from `docker logs magpie-tts`):**
|
||||
|
||||
@@ -39,6 +39,9 @@ models:
|
||||
- --reasoning-parser=gemma4
|
||||
- --tool-call-parser=gemma4
|
||||
- --enable-auto-tool-choice
|
||||
- --load-format=fastsafetensors
|
||||
- --enable-prefix-caching
|
||||
- --kv-cache-dtype=fp8
|
||||
|
||||
qwen36:
|
||||
display_name: "Qwen3.6 35B-A3B (daily driver)"
|
||||
@@ -52,6 +55,9 @@ models:
|
||||
- --max-model-len=65536
|
||||
- --reasoning-parser=qwen3
|
||||
- --moe_backend=flashinfer_cutlass
|
||||
- --load-format=fastsafetensors
|
||||
- --enable-prefix-caching
|
||||
- --kv-cache-dtype=fp8
|
||||
|
||||
qwen3-235b-fp8:
|
||||
display_name: "Qwen3 235B-A22B FP8 (legacy)"
|
||||
|
||||
+14
@@ -2,6 +2,20 @@
|
||||
|
||||
Operating notes for running and maintaining the cluster via spark-control.
|
||||
|
||||
## Recent successful swaps
|
||||
|
||||
- **2026-05-12 — gemma4 → qwen36** via `POST /api/swap` from laptop dev server. ~5:30 to "Application startup complete." Inference works (`/v1/chat/completions` returns reasoning content via `reasoning` field). `--moe_backend=flashinfer_cutlass` confirmed valid by vLLM (logged "Using 'FLASHINFER_CUTLASS' NvFp4 MoE backend").
|
||||
|
||||
## Optimization flags (added 2026-05-12)
|
||||
|
||||
Aligned `gemma4` and `qwen36` `vllm_args` with the project's sibling recipes (`qwen3.5-35b-a3b-fp8.yaml`, `gemma4-26b-a4b.yaml`):
|
||||
|
||||
- `--load-format=fastsafetensors` — faster cold-start weight load.
|
||||
- `--enable-prefix-caching` — reuse cached prefix tokens (e.g. system prompt) across requests.
|
||||
- `--kv-cache-dtype=fp8` — store KV cache in 8-bit FP; halves memory used per active context.
|
||||
|
||||
These take effect on the **next swap to that model**. If a swap fails after this change with errors mentioning fastsafetensors/prefix-caching/fp8, revert the entry in `models.yaml` and retry.
|
||||
|
||||
## Day-to-day
|
||||
|
||||
- The UI lives at `http://<your-start9>.local:9999` once the StartOS package is installed and configured.
|
||||
|
||||
Reference in New Issue
Block a user