98988057a2
Replace real cluster IPs/hosts/usernames and example names with neutral placeholders across docs, ops notes, package install text, and the offline redaction test; delete the obsolete build-time starter prompt. Closes the portability audit's single blocker. No runtime behavior change.
46 lines
4.5 KiB
Markdown
46 lines
4.5 KiB
Markdown
# Known issues
|
|
|
|
## Magpie removed in v0.14.0 (2026-06-03)
|
|
|
|
**Why**: Magpie/Riva's TTS decoder had a structural defect — ~30% truncation rate at short inputs, ~50%+ at multi-sentence inputs, fresh-container restart did not help. Reproduced server-side and confirmed in Riva's own logs (status:0 with implausibly short audio_duration). Switching to Riva's streaming endpoint did not help — same failure rate. Even with v0.13.0:5's retry layer and v0.13.0:6's server-side chunking, end-to-end reliability capped at ~85%.
|
|
|
|
**What replaced it**: Kokoro-82M (Apache 2.0) via `ghcr.io/remsky/kokoro-fastapi-gpu`. 24/24 successful renders across the same input lengths that broke Magpie 13/24 times, ~1s wallclock per call, 1.3 GB GPU memory (vs Magpie's 49 GB). No retry/chunking layer needed in the proxy. Default voice `bm_george`; curated quick-picks include `bf_emma`, `am_michael`, `af_heart`.
|
|
|
|
The old chunking/retry workaround in `audio_proxy.py` and the Magpie sections in the dashboard, config, services, and deep_health modules were all removed in v0.14.0. Migration: existing users need to pull and run the Kokoro container on Spark 2 (one `docker run` command), then either let Spark Control auto-discover it or update Configure Sparks if running on a non-default host.
|
|
|
|
## ~~magpie-tts crash loop (Spark 2)~~ — RESOLVED 2026-05-12, then Magpie removed entirely 2026-06-03
|
|
|
|
**What Magpie is:** NVIDIA's multilingual text-to-speech (TTS) model, served via the NIM (NVIDIA Inference Microservices) framework — a Riva Speech Server container that converts text into spoken audio. It's the counterpart to Parakeet (which is speech-to-text / STT). When working, it exposes `/v1/audio/speech` on port 9000 and is used by clients like Open WebUI for the "read aloud" feature.
|
|
|
|
**Root cause:** The Docker volume `magpie-model-cache` (mounted at `/opt/nim/.cache` in the container) was owned by `root:root` on the host. The container drops to uid 1000 (`riva-server`) and tries to download model manifests into that cache — got `I/O error Permission denied (os error 13)` — exited 1 — restarted infinitely (24,000+ restarts before the fix).
|
|
|
|
**Fix:**
|
|
|
|
```bash
|
|
ssh <spark-user>@<spark-2-host> 'docker run --rm -v magpie-model-cache:/cache alpine chown -R 1000:1000 /cache && docker restart magpie-tts'
|
|
```
|
|
|
|
The trick is the `docker run --rm alpine chown` — it runs as root inside the throwaway container, which is enough to chown the bind-mounted volume on the host, without needing `sudo` on the host itself. After the chown + restart, magpie downloaded its ~3 GB model from NGC into the cache and came up healthy on `:9000`.
|
|
|
|
**If it ever recurs** (e.g. after wiping the volume), re-run that one command.
|
|
|
|
## Qwen3.6-35B-A3B `--moe_backend=flashinfer_cutlass` may fail on launch
|
|
|
|
This flag is Blackwell-specific. If vLLM in the container reports `unrecognized arguments: --moe_backend` or similar, edit `models.yaml` for `qwen36` and drop that flag. The swap UI does NOT auto-fallback in v0.1 — failure surfaces in the log stream.
|
|
|
|
## Qwen3.6 Mamba block-size assertion (fixed in v0.6.0:1)
|
|
|
|
Qwen3.6 uses a Mamba-attention hybrid that requires `--max-num-batched-tokens >= 2096`. vLLM's default is 2048, which trips `AssertionError: In Mamba cache align mode, block_size (2096) must be <= max_num_batched_tokens (2048)`. Fix: bake `--max-num-batched-tokens=16384` into the bundled qwen36 entry — matches the upstream qwen3.5-35b-a3b-fp8 recipe.
|
|
|
|
## Multimodal token budget for vision models (fixed in v0.8.0:1)
|
|
|
|
After the eugr/spark-vllm-docker update, vLLM became stricter about multimodal token budgets. Vision-capable models like Gemma 4 31B and Qwen3-VL crash at engine init with `ValueError: Chunked MM input disabled but max_tokens_per_mm_item (2496) is larger than max_num_batched_tokens (2048)`. Fix: bake `--max-num-batched-tokens=16384` into every model that has the `vision` capability. Now applied to qwen3-vl, gemma4, and qwen36 (which was already set for the Mamba issue).
|
|
|
|
## Two SSH paths to Spark 1 from the laptop
|
|
|
|
`ssh <spark-user>@<spark-1-ip>` does NOT work from the laptop because the NVIDIA Sync ssh_config only has a Host entry for the Spark's `.local` mDNS name, not its bare IP. Always SSH via the `<spark-1-host>.local` hostname (or another entry that the ssh_config actually matches) rather than the raw IP.
|
|
|
|
## Older models in `models.yaml`
|
|
|
|
The `qwen3-235b-fp8` and `qwen25-72b` catalog entries are conservative guesses for vLLM flags — they're on disk but were never the focus of this project. First launch of either may fail or be suboptimal; capture working flags here.
|