known-issues: mark magpie crash loop RESOLVED with chown fix recipe
Volume magpie-model-cache was owned by root, but the container drops to uid 1000. Fix: docker run --rm -v magpie-model-cache:/cache alpine chown -R 1000:1000 /cache, then docker restart magpie-tts. After the ~3 GB NGC model download, the service is healthy on :9000.
+6
-19
@@ -1,33 +1,20 @@
# Known issues
## magpie-tts crash loop (Spark 2)
## ~~magpie-tts crash loop (Spark 2)~~ — RESOLVED 2026-05-12
**What Magpie is:** NVIDIA's multilingual text-to-speech (TTS) model, served via the NIM (NVIDIA Inference Microservices) framework — a Riva Speech Server container that converts text into spoken audio. It's the counterpart to Parakeet (which is speech-to-text / STT). When working, it exposes `/v1/audio/speech` on port 9000 and is used by clients like Open WebUI for the "read aloud" feature.
The `magpie-tts` container at `nvcr.io/nim/nvidia/magpie-tts-multilingual:latest` is in a restart loop and `:9000` is not reachable. **Status as of 2026-05-12: unfixed. UI surfaces a red dot.**
**Root cause:** The Docker volume `magpie-model-cache` (mounted at `/opt/nim/.cache` in the container) was owned by `root:root` on the host. The container drops to uid 1000 (`riva-server`) and tries to download model manifests into that cache. The write failed with `I/O error Permission denied (os error 13)`, the container exited with code 1, and Docker restarted it indefinitely (24,000+ restarts before the fix).
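
A crash loop of this kind can be confirmed from the Docker CLI before touching the volume. The following is a minimal sketch and was not part of the original fix: the helper names `restart_summary` and `permission_errors` are hypothetical, and the 200-line log tail is an arbitrary choice. It relies only on the `docker inspect` template fields `.State.Status` and `.RestartCount` and on `docker logs`.

```bash
# Hypothetical helpers (names are illustrative, not from the original notes).

# Print the container's current state and how many times Docker restarted it.
restart_summary() {
  docker inspect -f 'status={{.State.Status}} restarts={{.RestartCount}}' "$1"
}

# Print recent log lines that mention a permission failure.
# The 200-line tail is an assumption; widen it if the error scrolled away.
permission_errors() {
  docker logs --tail 200 "$1" 2>&1 | grep -i 'permission denied'
}
```

For example, `restart_summary magpie-tts` followed by `permission_errors magpie-tts` on the affected host should show a high restart count alongside the `os error 13` line if the same problem is present.
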
**Root cause (from `docker logs magpie-tts`):**
```
nimlib.exceptions.ManifestDownloadError: Error downloading manifest:
I/O error Permission denied (os error 13)
```
The container exits 1 from `nimutils.download_models()` when fetching `nim/nvidia/magpie-tts-multilingual` model files from NGC. The "permission denied" is a local filesystem error — the container can't write the model cache where it expects to.
**To diagnose further:**
**Fix:**
```bash
ssh <spark-user>@<spark-2-ip>
docker inspect magpie-tts | jq '.[].HostConfig.Mounts, .[].Config.Env'
# Look for: the mount path for the model cache, and whether NGC_API_KEY is set.
ssh <spark-user>@<spark-2-host> 'docker run --rm -v magpie-model-cache:/cache alpine chown -R 1000:1000 /cache && docker restart magpie-tts'
```
**Likely fixes (untried):**
The trick is the `docker run --rm ... alpine chown` step: the throwaway container runs as root, which is enough to change ownership of the files in the `magpie-model-cache` volume without needing `sudo` on the host itself. After the chown and restart, magpie downloaded its ~3 GB model from NGC into the cache and came up healthy on `:9000`.
1. Chown the bind-mounted cache directory on Spark 2 to the UID the container runs as.
2. Set an `NGC_API_KEY` env var (NIM containers need this for non-public artifacts).
3. Confirm there's free disk space.
**If it ever recurs** (e.g. after wiping the volume), re-run that one command.
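
If the fix has to be repeated, it may be convenient to wrap it so that the chown and restart only happen when ownership is actually wrong. This is a sketch under stated assumptions, not the command that was run: `fix_cache_owner` is a hypothetical name, the default `1000:1000` comes from the uid noted above, and only the volume root is checked with BusyBox `stat -c '%u:%g'` (files deeper in the cache are not inspected).

```bash
# Hypothetical wrapper around the documented fix (name is illustrative).
# Usage: fix_cache_owner <volume> <container> [uid:gid]
fix_cache_owner() {
  local volume="$1" container="$2" owner="${3:-1000:1000}"
  local current

  # Read the numeric owner of the volume root from a throwaway container.
  current="$(docker run --rm -v "${volume}:/cache" alpine stat -c '%u:%g' /cache)" || return 1

  if [ "$current" = "$owner" ]; then
    echo "ok: ${volume} already owned by ${owner}"
    return 0
  fi

  # Same chown + restart as the documented one-liner.
  docker run --rm -v "${volume}:/cache" alpine chown -R "$owner" /cache &&
    docker restart "$container" >/dev/null &&
    echo "fixed: ${volume} was ${current}, now ${owner}"
}
```

On the Spark 2 host this would be invoked as `fix_cache_owner magpie-model-cache magpie-tts`. Running it when ownership is already correct leaves the container untouched, which avoids an unnecessary restart.
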
## Qwen3.6-35B-A3B `--moe_backend=flashinfer_cutlass` may fail on launch