# Known issues
## ~~magpie-tts crash loop (Spark 2)~~ — RESOLVED 2026-05-12
**What Magpie is:** NVIDIA's multilingual text-to-speech (TTS) model, served via the NIM (NVIDIA Inference Microservices) framework — a Riva Speech Server container that converts text into spoken audio. It's the counterpart to Parakeet (which is speech-to-text / STT). When working, it exposes `/v1/audio/speech` on port 9000 and is used by clients like Open WebUI for the "read aloud" feature.
**Root cause:** The Docker volume `magpie-model-cache` (mounted at `/opt/nim/.cache` in the container) was owned by `root:root` on the host. The container drops to uid 1000 (`riva-server`) and tries to download model manifests into that cache. The write failed with `I/O error Permission denied (os error 13)`, the process exited with status 1, and the container was restarted in an endless loop (24,000+ restarts before the fix).
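
To check whether a container is stuck in the same failure mode, something like the following sketch may help. It assumes it is run on the Docker host and that the error text matches the message quoted above; the helper names `restart_summary` and `has_permission_error` are hypothetical and not part of this project.

```bash
# Sketch only: run on the Docker host (Spark 2 in this document).
CONTAINER="magpie-tts"   # container name taken from this document

# Print "<restart count> <state>" for a container.
restart_summary() {
  docker inspect -f '{{.RestartCount}} {{.State.Status}}' "$1"
}

# Succeed if the last N log lines (default 50) contain the permission error.
has_permission_error() {
  docker logs --tail "${2:-50}" "$1" 2>&1 |
    grep -qF 'Permission denied (os error 13)'
}

# Usage:
#   restart_summary "$CONTAINER"
#   has_permission_error "$CONTAINER" && echo "cache is not writable by the container user"
```

A restart count that keeps climbing together with the permission error in the logs points at the volume ownership problem rather than at the model download itself.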
**Fix:**
```bash
ssh <spark-user>@<spark-2-host> 'docker run --rm -v magpie-model-cache:/cache alpine chown -R 1000:1000 /cache && docker restart magpie-tts'
```
The trick is the `docker run --rm alpine chown` step: the command runs as root inside the throwaway container, which is enough to change ownership of the named volume's contents on the host without needing `sudo` on the host itself. After the chown and restart, magpie downloaded its ~3 GB model from NGC into the cache and came up healthy on `:9000`.
**If it ever recurs** (e.g. after wiping the volume), re-run that one command.
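
If the fix needs to be repeated, a guarded variant can avoid an unnecessary chown and restart. The sketch below is an assumption-laden illustration, not the command that was actually run: it assumes it is executed directly on Spark 2, that the `alpine` image is available, and that gid 1000 is correct (taken from the `1000:1000` in the fix above). The function names `volume_owner` and `fix_cache_owner` are hypothetical.

```bash
# Sketch only: a re-runnable version of the fix, run on the Docker host.
VOLUME="magpie-model-cache"   # volume name from this document
CONTAINER="magpie-tts"        # container name from this document
WANT="1000:1000"              # uid:gid used by the chown in the fix above

# Print "uid:gid" of the volume root as seen from a throwaway container.
volume_owner() {
  docker run --rm -v "$1":/cache alpine stat -c '%u:%g' /cache
}

# Chown and restart only when the owner differs from $WANT.
fix_cache_owner() {
  local have
  have="$(volume_owner "$VOLUME")" || return 1
  if [ "$have" = "$WANT" ]; then
    echo "ok: $VOLUME already owned by $WANT"
    return 0
  fi
  echo "fixing: $VOLUME owned by $have, want $WANT"
  docker run --rm -v "$VOLUME":/cache alpine chown -R "$WANT" /cache &&
    docker restart "$CONTAINER"
}

# Usage:
#   fix_cache_owner
```

Checking the owner first keeps the command safe to run when nothing is wrong, since it then leaves the running container untouched.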
## Qwen3.6-35B-A3B `--moe_backend=flashinfer_cutlass` may fail on launch
This flag is Blackwell-specific. If vLLM in the container reports `unrecognized arguments: --moe_backend` or similar, edit the `qwen36` entry in `models.yaml` and drop that flag. The swap UI does NOT fall back automatically in v0.1; the failure surfaces in the log stream.
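
Because the layout of `models.yaml` is not described here, the sketch below only locates the flag and checks a saved log for the rejection message; the actual edit is left to be done by hand. The helper names and the idea of saving the log stream to a file are assumptions made for illustration.

```bash
# Sketch only: locate the flag and confirm that vLLM rejected it.

# Print every line of a file that mentions the flag, with line numbers.
find_moe_flag() {
  grep -n -e '--moe_backend' "$1"
}

# Succeed if a saved copy of the log shows the flag being rejected.
log_rejects_flag() {
  grep -q -e 'unrecognized arguments.*moe_backend' "$1"
}

# Usage (the log path is hypothetical):
#   log_rejects_flag /tmp/qwen36-launch.log && find_moe_flag models.yaml
```

The reported line numbers show where the flag lives, so it can be removed from the `qwen36` entry without touching other catalog entries.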
## Two SSH paths to Spark 1 from the laptop
`ssh <spark-user>@<spark-1-ip>` does NOT work from the laptop because the NVIDIA Sync ssh_config only has a Host entry for `<spark-1-host>.local`. Always use the `.local` hostname, or an address such as `<spark-2-ip>` that does have a matching Host entry.
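
One way to see which settings apply to a given destination is `ssh -G`, which prints the resolved client configuration without connecting. The sketch below assumes OpenSSH 6.8 or newer; the helper name `effective_ssh_settings` is hypothetical, and the fields it filters on are only a guess at what the Host entry supplies.

```bash
# Sketch only: show the settings ssh would use for a destination.
# "ssh -G" resolves the client config and prints it without connecting.
effective_ssh_settings() {
  ssh -G "$1" | grep -E '^(hostname|user|port|identityfile) '
}

# Usage: compare the matched and unmatched forms side by side.
#   effective_ssh_settings '<spark-1-host>.local'
#   effective_ssh_settings '<spark-1-ip>'
```

If the two outputs differ, for example in `user` or `identityfile`, the difference shows what the bare IP form is missing.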
## Older models in `models.yaml`
The `qwen3-235b-fp8` and `qwen25-72b` catalog entries use conservative guesses for their vLLM flags. Both models are on disk but were never the focus of this project. The first launch of either may fail or perform suboptimally; when you find working flags, record them here.