Known issues
magpie-tts crash loop (Spark 2) — RESOLVED 2026-05-12
What Magpie is: NVIDIA's multilingual text-to-speech (TTS) model, served via the NIM (NVIDIA Inference Microservices) framework — a Riva Speech Server container that converts text into spoken audio. It's the counterpart to Parakeet (which is speech-to-text / STT). When working, it exposes /v1/audio/speech on port 9000 and is used by clients like Open WebUI for the "read aloud" feature.
Root cause: The Docker volume magpie-model-cache (mounted at /opt/nim/.cache in the container) was owned by root:root on the host. The container drops to uid 1000 (riva-server) and tries to download model manifests into that cache. The write failed with I/O error Permission denied (os error 13), the process exited with code 1, and the container was restarted in an endless loop (24,000+ restarts before the fix).
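To confirm this diagnosis before changing anything, both symptoms (volume owner and restart count) can be read directly from Docker. This is a minimal sketch, assuming it runs on Spark 2 with access to the Docker daemon. The function names are illustrative and not part of the project, and the 1000:1000 target is taken from the fix command in this section.

```shell
#!/bin/sh
# Illustrative helpers (hypothetical names, not part of the project).

# Print "uid:gid" of the volume root, as seen from a throwaway container.
# $1 = volume name
volume_owner() {
  docker run --rm -v "$1:/cache" alpine stat -c '%u:%g' /cache
}

# Print how many times Docker has restarted the container.
# $1 = container name
restart_count() {
  docker inspect --format '{{.RestartCount}}' "$1"
}

# Succeed when the actual owner differs from the expected one.
# $1 = actual "uid:gid", $2 = expected "uid:gid"
owner_mismatch() {
  [ "$1" != "$2" ]
}
```

On the broken setup, `volume_owner magpie-model-cache` would be expected to print `0:0` (root), so `owner_mismatch "$(volume_owner magpie-model-cache)" 1000:1000` succeeds, and `restart_count magpie-tts` shows how far the crash loop has run.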
Fix:
ssh <spark-user>@<spark-2-host> 'docker run --rm -v magpie-model-cache:/cache alpine chown -R 1000:1000 /cache && docker restart magpie-tts'
The trick is the docker run --rm alpine chown: it runs as root inside the throwaway container, which is enough to chown the contents of the named volume, without needing sudo on the host itself. After the chown and restart, Magpie downloaded its ~3 GB model from NGC into the cache and came up healthy on :9000.
If it ever recurs (e.g. after wiping the volume), re-run that one command.
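Because the first start after the fix includes the ~3 GB download, the service takes a while to answer on :9000. A small polling helper can wait for it instead of checking by hand. This is a sketch under assumptions: the function name is made up for illustration, and the /v1/health/ready path in the usage line is an assumed readiness endpoint that should be checked against the container's documentation.

```shell
#!/bin/sh
# Illustrative helper (hypothetical name, not part of the project).

# Poll a URL until it answers with a non-error HTTP status.
# $1 = URL, $2 = maximum attempts, $3 = seconds to wait between attempts
wait_for_ready() {
  url=$1
  attempts=$2
  delay=$3
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f makes curl fail on HTTP errors; a refused connection also fails.
    if curl -fsS -o /dev/null "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

For example, `wait_for_ready http://localhost:9000/v1/health/ready 60 30` would retry for roughly 30 minutes and return non-zero if the service never comes up.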
Qwen3.6-35B-A3B --moe_backend=flashinfer_cutlass may fail on launch
This flag is Blackwell-specific. If vLLM in the container reports unrecognized arguments: --moe_backend or similar, edit models.yaml for qwen36 and drop that flag. The swap UI does NOT fall back automatically in v0.1; the failure surfaces in the log stream.
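Before or after editing, a plain text search shows whether the flag is still present in the catalog. The sketch below makes no assumption about the models.yaml schema and only looks for the literal flag text. The function name is hypothetical.

```shell
#!/bin/sh
# Illustrative helper (hypothetical name, not part of the project).

# Succeed when the file contains the given flag text anywhere.
# $1 = file path, $2 = literal flag text (may start with "--")
has_flag() {
  # -F treats the flag as a fixed string; -e keeps a leading "--"
  # from being parsed as a grep option.
  grep -qF -e "$2" "$1"
}
```

Usage: `has_flag models.yaml '--moe_backend' && echo "flag still present"`. Running `grep -nF -e '--moe_backend' models.yaml` prints the matching line numbers for editing.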
Two SSH paths to Spark 1 from the laptop
ssh <spark-user>@<spark-1-ip> does NOT work from the laptop because the NVIDIA Sync ssh_config only has a Host entry for <spark-1-host>.local. Always use the .local hostname, or an address such as <spark-2-ip> that a Host entry DOES match.
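To see which settings ssh would apply to a given target without connecting, `ssh -G` prints the effective configuration after Host matching. A minimal sketch, with a hypothetical function name, that keeps only the lines most relevant here:

```shell
#!/bin/sh
# Illustrative helper (hypothetical name, not part of the project).

# Print the hostname, user, and identity file(s) ssh would use for a target.
# $1 = host name or address exactly as it would be typed after "ssh"
effective_ssh_settings() {
  ssh -G "$1" | awk '$1 == "hostname" || $1 == "user" || $1 == "identityfile"'
}
```

Comparing `effective_ssh_settings <spark-1-host>.local` with `effective_ssh_settings <spark-1-ip>` should show which settings the Host entry contributes and which are missing when the bare IP is used.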
Older models in models.yaml
The qwen3-235b-fp8 and qwen25-72b catalog entries are conservative guesses for vLLM flags — they're on disk but were never the focus of this project. First launch of either may fail or be suboptimal; capture working flags here.