Grant ae8efa1754 Initial scaffold: image/ FastAPI app, models.yaml, docs

- image/ FastAPI app: /api/status, /api/swap, /api/swap/{id}/stream, /api/test-connection
- models.yaml: 5-model catalog (qwen3-vl, gemma4, qwen36, qwen3-235b-fp8, qwen25-72b)
- README, runbook, known-issues
- Dry-run swap verified against live Spark 1 (gemma4 currently loaded)

2026-05-12 09:29:13 -05:00

Known issues

magpie-tts crash loop (Spark 2)

The magpie-tts container (image nvcr.io/nim/nvidia/magpie-tts-multilingual:latest) is stuck in a restart loop, so its port :9000 is not reachable. Status as of 2026-05-12: unfixed. The UI surfaces this as a red dot.

Root cause (from docker logs magpie-tts):

nimlib.exceptions.ManifestDownloadError: Error downloading manifest:
  I/O error Permission denied (os error 13)

The container exits with status 1 from nimutils.download_models() while fetching the nim/nvidia/magpie-tts-multilingual model files from NGC. The "Permission denied (os error 13)" is EACCES, a local filesystem error: the container can't write the model cache where it expects to.

To diagnose further:

ssh <spark-user>@<spark-2-ip>
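# Top-level .Mounts lists every mount (binds and volumes); HostConfig.Mounts is only populated for --mount.
docker inspect magpie-tts | jq '.[] | {Mounts, User: .Config.User, Env: .Config.Env}'
# Look for: the mount path for the model cache, the user the container runs as, and whether NGC_API_KEY is set.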
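
If the raw `docker inspect` JSON is hard to scan, the same check can be sketched in Python. This is a minimal illustration, not part of the project: `summarize_inspect` is a hypothetical helper, and it only relies on the standard `Mounts`, `Config.Env`, and `Config.User` fields of `docker inspect` output. It reports whether NGC_API_KEY is present without echoing its value.

```python
import json


def summarize_inspect(inspect_json: str) -> list[dict]:
    """Summarize mounts, user, and NGC_API_KEY presence per inspected container.

    `inspect_json` is the text printed by `docker inspect <name>`, which is a
    JSON array with one object per container.
    """
    summaries = []
    for container in json.loads(inspect_json):
        config = container.get("Config") or {}
        env = config.get("Env") or []
        # Env entries look like "NAME=value"; report presence only, never the value.
        has_key = any(entry.split("=", 1)[0] == "NGC_API_KEY" for entry in env)
        mounts = [
            {
                "source": m.get("Source"),            # path on the host
                "destination": m.get("Destination"),  # path inside the container
                "rw": m.get("RW"),
            }
            for m in (container.get("Mounts") or [])
        ]
        summaries.append(
            {
                "name": container.get("Name"),
                # An empty User means the image default applies.
                "user": config.get("User") or "(image default)",
                "ngc_api_key_set": has_key,
                "mounts": mounts,
            }
        )
    return summaries
```

One way to use it is to save the output with `docker inspect magpie-tts > inspect.json` and pass the file contents to `summarize_inspect`; the host-side `source` of the cache mount is the directory whose ownership matters.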

Likely fixes (untried):

- Chown the bind-mounted cache directory on Spark 2 to the UID the container runs as.
- Set an NGC_API_KEY env var (NIM containers need this for non-public artifacts).
- Confirm there's free disk space.
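
Before trying any of these, the ownership and disk-space checks can be run on Spark 2 against the host side of the cache mount. The sketch below is an untested-on-Spark illustration: `check_cache_dir` is a hypothetical helper, and the cache path, expected UID, and free-space threshold are all values you would have to supply from the `docker inspect` output.

```python
import os
import shutil


def check_cache_dir(path: str, expected_uid: int, min_free_bytes: int) -> list[str]:
    """Return a list of problems found with a bind-mounted cache directory.

    An empty list means the owner UID matches, the owner has write and
    search permission, and at least `min_free_bytes` are free.
    """
    if not os.path.isdir(path):
        return [f"{path} does not exist or is not a directory"]

    problems = []
    st = os.stat(path)
    if st.st_uid != expected_uid:
        problems.append(f"owner uid is {st.st_uid}, expected {expected_uid}")
    # Creating files in a directory needs both the write (0o200) and
    # search (0o100) bits for the owner.
    if st.st_mode & 0o300 != 0o300:
        problems.append(f"owner lacks write/search permission (mode {oct(st.st_mode & 0o777)})")
    free = shutil.disk_usage(path).free
    if free < min_free_bytes:
        problems.append(f"only {free} bytes free, need {min_free_bytes}")
    return problems
```

A call such as `check_cache_dir("/path/to/cache", 1000, 20 * 1024**3)` uses placeholder values; an "owner uid" finding points at the chown fix, while a free-space finding points at the disk.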

Qwen3.6-35B-A3B `--moe_backend=flashinfer_cutlass` may fail on launch

This flag is Blackwell-specific. If vLLM in the container reports "unrecognized arguments: --moe_backend" or similar, edit the qwen36 entry in models.yaml and drop that flag. The swap UI does NOT auto-fallback in v0.1; the failure surfaces in the log stream.
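
The manual edit amounts to removing one option from the model's argument list. As a rough sketch, assuming the qwen36 entry's vLLM flags can be treated as a list of strings (the real models.yaml schema is not shown here, and both function names are hypothetical), the log check and the flag removal might look like this:

```python
def is_unrecognized_flag_error(log_line: str, flag: str) -> bool:
    """True if a log line looks like an argparse rejection of `flag`."""
    return "unrecognized arguments" in log_line and flag in log_line


def drop_flag(args: list[str], flag: str) -> list[str]:
    """Return a copy of `args` without `flag`.

    Handles both the `--flag=value` form and the `--flag value` form.
    """
    cleaned = []
    skip_next = False
    for i, arg in enumerate(args):
        if skip_next:
            # This token is the value of the flag dropped just before it.
            skip_next = False
            continue
        if arg.startswith(flag + "="):
            continue
        if arg == flag:
            # Space-separated form: drop the following value too, unless the
            # next token is itself another option.
            nxt = args[i + 1] if i + 1 < len(args) else None
            skip_next = nxt is not None and not nxt.startswith("-")
            continue
        cleaned.append(arg)
    return cleaned


# Illustrative argument list only; not the actual qwen36 entry.
example_args = ["--max-model-len", "32768", "--moe_backend=flashinfer_cutlass"]
fallback_args = drop_flag(example_args, "--moe_backend")
```

Returning a new list rather than mutating in place keeps the original arguments available for comparison when recording what actually worked.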

Two SSH paths to Spark 1 from the laptop

ssh <spark-user>@<spark-1-ip> does NOT work from the laptop because the NVIDIA Sync ssh_config only has a Host entry for <spark-1-host>.local, and a bare IP does not match it. Always use the .local hostname, or a target such as the <spark-2-ip>-style entries that a Host entry DOES match.
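
The mismatch comes from how OpenSSH selects Host blocks: it compares the patterns against the name typed on the command line, so an IP address never matches a block written for a .local hostname. The sketch below approximates that rule with `fnmatch`; it is a simplification (OpenSSH patterns support only `*` and `?`, while `fnmatch` also accepts bracket sets), and the hostnames and address used with it are placeholders, not the real Spark values.

```python
from fnmatch import fnmatchcase


def host_entry_matches(patterns: str, target: str) -> bool:
    """Approximate whether an ssh_config `Host` line applies to `target`.

    `patterns` is the whitespace-separated pattern list after `Host`.
    A matching negated pattern (`!pat`) rules the entry out; otherwise
    at least one positive pattern must match.
    """
    matched = False
    for pattern in patterns.split():
        if pattern.startswith("!"):
            if fnmatchcase(target, pattern[1:]):
                return False
        elif fnmatchcase(target, pattern):
            matched = True
    return matched


# Placeholder names: a block for "spark-1.local" does not cover a bare IP.
applies_to_hostname = host_entry_matches("spark-1.local", "spark-1.local")
applies_to_ip = host_entry_matches("spark-1.local", "192.0.2.10")
```

Running `ssh -G <target>` prints the configuration OpenSSH would actually apply to a target, which is a more authoritative check than this approximation.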

Older models in `models.yaml`

The qwen3-235b-fp8 and qwen25-72b catalog entries use conservative guesses for vLLM flags. Both models are on disk but were never the focus of this project. First launch of either may fail or run suboptimally; once a working set of flags is found, capture it here.
