spark-control/docs/guides/audio-speech.md
Keysat 6a6112a15f restructure: AGENTS.md canonical + docs/guides with .claude/rules symlinks
Rename CLAUDE.md -> AGENTS.md (cross-vendor standard) with a relative
CLAUDE.md symlink so Claude Code still loads it. Move each .claude/rules
file into docs/guides/ (paths: frontmatter preserved) and replace the
rules file with a relative symlink into the guide. Repoint the AGENTS.md
index paragraph at docs/guides/ so non-Claude agents find the guides.
2026-06-12 14:27:17 -05:00


paths
image/app/audio_proxy.py
image/app/speech_models.py
image/app/deep_health.py
image/parakeet_patches/**
scripts/test-audio-with-speakers.sh
docs/AUDIO_API.md

Audio / speech stack (Parakeet STT + Sortformer diarizer + Kokoro TTS on Spark 2)

Changing the parakeet-asr container

  • image/parakeet_patches/ (main.py, diarizer.py) is an overlay copied into the parakeet-asr container by the "Reapply speech-model patches" dashboard action (image/app/speech_models.py). This is the only durable way to change that container: anything done via docker exec or pip inside it is lost on docker rm.
  • Never install cuda-python in parakeet-asr to "fix" the startup warning about CUDA graphs being disabled. The warning is harmless; enabling the graph path crashes real decode with illegal memory access on this GPU/CUDA-13 stack (GB10/sm_121). The slow path served 11k+ requests with zero failures — leave it alone.
  • Pin/constrain torch versions when pip-installing anything into NGC-based containers on the Sparks (ABI breaks otherwise); expect ARM64 wheel gaps and source builds (--no-build-isolation for torchaudio). Applies to spark_embed too.
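
One way to apply the pinning advice above is to freeze the torch packages already in the container into a pip constraints file before installing anything else. The sketch below is illustrative only: the package list, the helper names, and the output path are assumptions, not part of this repo. NGC builds may carry local version suffixes, so check that pip accepts the generated pins in your container.

```python
from importlib import metadata

# Assumed list of packages to freeze; adjust per container.
PINNED = ("torch", "torchaudio", "torchvision")


def installed_versions(names=PINNED) -> dict[str, str]:
    """Look up installed versions, skipping packages that are absent."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            continue
    return found


def build_constraints(installed: dict[str, str]) -> str:
    """Return pip constraints text pinning each package to its installed version."""
    lines = [f"{name}=={version}" for name, version in sorted(installed.items())]
    return "\n".join(lines) + "\n"


# Example use inside the container (hypothetical path):
#   open("/tmp/torch-constraints.txt", "w").write(build_constraints(installed_versions()))
#   pip install -c /tmp/torch-constraints.txt <package>
```

With the constraints file in place, pip refuses to resolve to a different torch version instead of silently replacing the NGC build.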

Testing audio endpoints

  • Test with real speech (e.g. say -o /tmp/t.wav --data-format=LEI16@16000 "<a couple of sentences>"), not tones/silence — zero-token audio skips the decoder paths where crashes live.
  • Send audio requests to Spark 2 sequentially in tests/scripts. Parallel audio requests can race (cuFFT → 503), and the single GPU serializes them anyway.
  • End-to-end suite (hits the LIVE cluster):
    ./scripts/test-audio-with-speakers.sh <audio-file>   # from repo root

SPARK_CONTROL defaults to http://127.0.0.1:9999 (a running local dev server); point it at the installed package URL otherwise.
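
For ad-hoc test scripts, the "send sequentially" rule above can be sketched as a small driver that never has more than one audio request in flight. Everything here is hypothetical: `send` stands in for whatever function posts one file to an audio endpoint (see docs/AUDIO_API.md for the real routes) and returns the HTTP status. The retry on 503 is an added assumption, not existing behaviour of the test suite.

```python
import time
from typing import Callable, Iterable


def send_sequentially(
    paths: Iterable[str],
    send: Callable[[str], int],
    retries: int = 2,
    backoff_s: float = 1.0,
) -> dict[str, int]:
    """Call send(path) for one file at a time and return the final status per file.

    A 503 is retried up to `retries` times with a growing pause;
    any other status is recorded as-is.
    """
    results = {}
    for path in paths:
        status = send(path)
        attempt = 0
        while status == 503 and attempt < retries:
            attempt += 1
            time.sleep(backoff_s * attempt)
            status = send(path)
        results[path] = status
    return results
```

Because the loop only moves to the next file after the previous call returns, two requests can never overlap on the GPU, which is the condition the cuFFT race needs.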

API quirk

Spark Control's /v1/models lists audio models (STT model + Kokoro voices) by design — not the loaded LLM. Discover the LLM via /api/status (vllm.current_model).
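
A minimal sketch of that lookup, assuming /api/status returns a JSON object in which vllm.current_model is a nested key (status["vllm"]["current_model"]). The helper names and the timeout are illustrative. The SPARK_CONTROL default is the one given in this guide.

```python
import json
import os
from urllib.request import urlopen

BASE_URL = os.environ.get("SPARK_CONTROL", "http://127.0.0.1:9999")


def current_llm(status: dict):
    """Return status["vllm"]["current_model"], or None if either key is missing."""
    vllm = status.get("vllm")
    if not isinstance(vllm, dict):
        return None
    return vllm.get("current_model")


def fetch_status(base_url: str = BASE_URL, timeout: float = 5.0) -> dict:
    """GET /api/status and decode the JSON body."""
    with urlopen(f"{base_url.rstrip('/')}/api/status", timeout=timeout) as resp:
        return json.load(resp)


# Example: print(current_llm(fetch_status()))
```

Returning None rather than raising lets a caller tell "no LLM loaded" apart from a network failure, which still surfaces as an exception from fetch_status.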

Diarizer caps at 4 speakers (Sortformer diar_sortformer_4spk-v1).