spark-control

Author	SHA1	Message	Date
Keysat	e775906caa	v0.13.0:1 - per-chunk diarization worker with TitaNet voice fingerprints	2026-05-19 11:37:05 -05:00

    Spark Control now exposes a per-chunk worker designed for Recap Relay to orchestrate against. Recap Relay does the chunking + global speaker clustering (consistent with how it already handles the Gemini path); Spark Control handles the GPU-bound per-chunk work.

    Parakeet container:
    - diarizer.py: now also loads NVIDIA TitaNet speaker-verification model (~25 MB, NeMo-native, no torchaudio). New diarize_chunk() method runs Sortformer + extracts one 192-dim voice fingerprint per detected local speaker (concatenating each speaker's audio across the chunk and running TitaNet's get_embedding).
    - main.py: new POST /v1/audio/diarize-chunk endpoint that returns segments + speakers_detected + fingerprints + models in one shot.

    Spark Control:
    - new POST /api/audio/diarize-chunk that proxies to parakeet's new endpoint. Same CUDA-wedge recovery (503 + deep-health probe + 60s retry-after) as the other audio endpoints. Returns the raw JSON upstream because Recap Relay is the consumer; no merging needed.

    Response shape Recap Relay receives per chunk:
      {
        "duration": 300.0,
        "segments": [{"start_s","end_s","speaker"}, ...],   # LOCAL labels
        "speakers_detected": ["Speaker_0","Speaker_1",...],
        "fingerprints": {"Speaker_0":[192 floats], ...},
        "models": {"diarization":"...","embedding":"..."}
      }

    Recap Relay's job:
    1. Chunk audio (existing chunking infrastructure)
    2. POST each chunk to /api/audio/diarize-chunk in parallel
    3. Collect all fingerprints from all chunks
    4. sklearn AgglomerativeClustering(distance_threshold=0.7, metric=cosine)
    5. Re-label segments with global cluster IDs
    6. Concatenate transcripts (from a separate parallel call to /v1/audio/transcriptions) with timestamp offsets and merge with re-labeled diar segments

    After installing v0.13.0:1, click "Reapply patches" on the Speech Models card to push the updated diarizer.py + main.py into the parakeet container; TitaNet will download (~25 MB) on first call.

    Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
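Steps 3 to 5 of "Recap Relay's job" can be sketched without external dependencies. This is an illustrative stand-in for the sklearn call named in the commit, not Recap Relay's actual code: the pure-Python clustering below assumes average linkage (the commit does not state the linkage), and `cluster_fingerprints`, `relabel_segments`, the `Global_N` label format, and the fixed `chunk_seconds` offset are all hypothetical.

```python
import math
from itertools import combinations


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def cluster_fingerprints(fingerprints, threshold=0.7):
    """Average-linkage agglomerative clustering (assumed linkage).

    `fingerprints` maps (chunk_index, local_label) -> embedding vector.
    Returns {(chunk_index, local_label): global_cluster_id}.
    """
    clusters = [[key] for key in fingerprints]

    def linkage(c1, c2):
        dists = [cosine_distance(fingerprints[a], fingerprints[b])
                 for a in c1 for b in c2]
        return sum(dists) / len(dists)

    while len(clusters) > 1:
        pairs = combinations(range(len(clusters)), 2)
        (i, j), best = min(
            (((i, j), linkage(clusters[i], clusters[j])) for i, j in pairs),
            key=lambda item: item[1],
        )
        if best >= threshold:  # stop once the closest pair is too far apart
            break
        clusters[i] += clusters.pop(j)  # j > i, so index i stays valid
    return {key: gid for gid, members in enumerate(clusters) for key in members}


def relabel_segments(chunks, chunk_seconds, threshold=0.7):
    """Replace per-chunk LOCAL labels with global IDs and offset timestamps.

    Assumes fixed-length, non-overlapping chunks listed in order.
    """
    fingerprints = {
        (idx, label): vec
        for idx, chunk in enumerate(chunks)
        for label, vec in chunk["fingerprints"].items()
    }
    global_ids = cluster_fingerprints(fingerprints, threshold)
    merged = []
    for idx, chunk in enumerate(chunks):
        offset = idx * chunk_seconds
        for seg in chunk["segments"]:
            merged.append({
                "start_s": seg["start_s"] + offset,
                "end_s": seg["end_s"] + offset,
                "speaker": f"Global_{global_ids[(idx, seg['speaker'])]}",
            })
    return merged
```

Because labels are local to each chunk, the same person can be `Speaker_0` in one chunk and `Speaker_1` in the next; clustering on the fingerprints is what ties them back together.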
Keysat	95524f4983	v0.13.0:0 - revert WhisperX migration; back to Parakeet + Sortformer	2026-05-19 08:03:19 -05:00

    After five hotfix iterations on the WhisperX install (v0.12.0:0–:4) we never got a working docker build. The fundamental constraint isn't patchable from outside NVIDIA: NGC PyTorch on ARM64 (the only base that runs on Spark 2's GB10 Blackwell) ships a custom-versioned torch 2.10.0a0+b558c98 that has no pre-built torchaudio match anywhere. WhisperX → pyannote → torchaudio is a hard dependency chain we couldn't satisfy without rebuilding torchaudio against torch 2.10's alpha API. Walking away cleanly is better than another night of chasing.

    Removed from the codebase:
    - image/whisperx_container/* (Dockerfile + requirements + app/main.py)
    - image/app/whisperx_install.py (install manager + SSH ship-context logic)
    - image/Dockerfile COPY whisperx_container
    - WHISPERX_* config keys in config.py
    - whisperx service entry in services.py
    - WhisperX-preferred branch in audio_proxy.py
    - /api/whisperx/* endpoints in server.py
    - install banner + progress dialog in index.html
    - render + handlers in app.js
    - .whisperx-install styles in style.css

    Spark 2 cleaned in tandem (user-authorized): container removed, ~/whisperx-build/ removed, 5.4 GB of dangling image layers + 1.3 GB of builder cache reclaimed. parakeet-asr and magpie-tts unaffected and healthy throughout.

    The audio path is back to exactly what shipped in v0.11.0:3:
      POST /api/audio/transcribe-with-speakers
        → Parakeet (transcription) + Sortformer (diarization) in parallel
        → merged by timestamp into speaker-labeled blocks

    v0.13.0:1+ will add the actually-needed fixes that the WhisperX detour was meant to address:
    1. memory cap on the parakeet-asr container so a long-audio crash can't swap-thrash Spark 2 again
    2. a chunking proxy in /api/audio/transcribe-with-speakers that splits inputs >10 min before Sortformer

    Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Keysat	5a0bfba6a3	v0.12.0:0 - WhisperX as a one-click dashboard install + managed service	2026-05-18 21:02:26 -05:00

    Replaces the manual rsync+build+run with a proper spark-control feature. First feature in the audio path that doesn't require shell access on Spark 2.

    What's in the box
    ─────────────────
    * image/whisperx_container/ - the build context (Dockerfile, requirements, app/main.py FastAPI wrapper). Mainline pipeline: faster-whisper for STT + pyannote 3.1 for diarization + wav2vec2 forced alignment. Single endpoint /v1/audio/transcribe-with-speakers returns the exact same shape spark-control's existing endpoint does, so the recap-relay PR spec needs no changes when we cut over.
    * image/app/whisperx_install.py - install manager. Ships build context to Spark 2 over SSH, runs `docker build`, runs `docker run` with a 40 GB memory cap (vs Sortformer's unbounded memory, which thrashed Spark 2 on a 90-min file), polls /health until both Whisper + pyannote report loaded.
    * Audio proxy: /api/audio/transcribe-with-speakers now prefers WhisperX when its /health reports diarizer_loaded=true, falls back to the legacy Parakeet + Sortformer path otherwise. Same response shape either way. Clean cutover, easy rollback (`docker rm whisperx-asr`).
    * Dashboard (Audio / Speech tab):
      - "Add WhisperX" banner appears when not installed, with a primary "Install WhisperX" button. One click triggers the install.
      - Build progress dialog with phase + elapsed timer + live build log via SSE (`/api/whisperx/install/{job_id}/stream`).
      - After install, WhisperX auto-registers as a managed service alongside Parakeet and Magpie (Start/Restart/Stop, deep-check, auto-restart).
      - Banner self-hides once /api/whisperx/status reports healthy.

    New endpoints
    ─────────────
    GET  /api/whisperx/status
    POST /api/whisperx/install
    GET  /api/whisperx/install/{job_id}
    GET  /api/whisperx/install/{job_id}/stream (SSE phase + log)

    Config additions (env)
    ──────────────────────
    WHISPERX_HOST (defaults to spark2_host)
    WHISPERX_USER (defaults to spark2_user)
    WHISPERX_CONTAINER (default: whisperx-asr)
    WHISPERX_PORT (default: 8002)
    WHISPERX_MODEL (default: medium; tiny/base/small/medium/large-v3)

    Dockerfile
    ──────────
    Added COPY whisperx_container /app/whisperx_container so the runtime install manager can read the build context from inside the spark-control image and ship it over SSH.

    Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
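The env-based config additions listed for v0.12.0:0 (later removed in v0.13.0:0) could be resolved roughly as follows. This is only a sketch of the documented defaults: `load_whisperx_config`, the returned dict layout, and the placeholder `spark2_host` / `spark2_user` values are assumptions, not the contents of the project's config.py.

```python
import os

# Model sizes listed in the commit message.
VALID_MODELS = ("tiny", "base", "small", "medium", "large-v3")


def load_whisperx_config(env=None, spark2_host="spark2.example", spark2_user="user"):
    """Resolve WHISPERX_* settings, falling back to the documented defaults.

    `spark2_host` and `spark2_user` stand in for the existing Spark 2
    settings; the default values here are placeholders.
    """
    env = os.environ if env is None else env
    model = env.get("WHISPERX_MODEL", "medium")
    if model not in VALID_MODELS:
        raise ValueError(f"unsupported WHISPERX_MODEL: {model!r}")
    return {
        "host": env.get("WHISPERX_HOST", spark2_host),
        "user": env.get("WHISPERX_USER", spark2_user),
        "container": env.get("WHISPERX_CONTAINER", "whisperx-asr"),
        "port": int(env.get("WHISPERX_PORT", "8002")),
        "model": model,
    }
```

Passing `env` explicitly keeps the function testable without touching the process environment.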
Keysat	fda23088fe	v0.10.0:1 - hotfix: merge function now joins words with proper spacing	2026-05-18 15:42:04 -05:00

    Smoke testing v0.10.0:0 against a real anarlog audio.mp3 showed the output running words together: "I'mrecordingrightnow", "don'tyoutry".

    Root cause: _merge_words_with_speakers was doing "".join(cur_words), assuming Parakeet returns words with leading whitespace (which the hyprnote local Parakeet does, but the Spark-hosted Parakeet does not).

    Rewrote the join with a small helper that:
    - Strips each token (handles both leading-space and no-leading-space word formats)
    - Joins with a single space
    - Keeps punctuation tight: no space before period/comma/colon/etc.

    Verified post-install with the same test audio:
      [00:06] Speaker_0: I'm I'm recording right now.
      [00:18] Speaker_1: you're you're on your computer and your phone, right?

    No other changes: Parakeet container patches and the endpoint shape stay identical.

    Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
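A helper with the three properties described in this hotfix might look like the sketch below. The name `join_words` and the exact punctuation set are assumptions; the commit only says "period/comma/colon/etc." and does not show the real helper.

```python
import re

# Punctuation that should attach to the preceding word (assumed set).
_TIGHT_PUNCT = re.compile(r"\s+([.,:;!?])")


def join_words(tokens):
    """Join ASR word tokens with single spaces, keeping punctuation tight.

    Works whether or not the tokens carry leading whitespace.
    """
    words = [t.strip() for t in tokens if t.strip()]
    return _TIGHT_PUNCT.sub(r"\1", " ".join(words))
```

For example, `join_words([" I'm", " recording", " right", " now", "."])` and `join_words(["I'm", "recording", "right", "now", "."])` both yield `I'm recording right now.`, which is the behaviour the original `"".join(cur_words)` only delivered for the leading-space format.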
Keysat	713cd09cc2	v0.10.0:0 - speaker diarization via Sortformer + merged transcribe-with-speakers	2026-05-18 15:14:48 -05:00

    Adds a new pipeline for diarized transcription that any client (recap-relay, ad-hoc curl, future Mac-side tools) can call. Pure data pipeline, no LLM or UI included; name resolution / analysis happen downstream where prompts and rendering are configurable.

    Architecture:

    Spark 2 / parakeet-asr container:
      + /opt/parakeet/app/diarizer.py (new: SortformerDiarizer class)
      + /opt/parakeet/app/main.py (patched: loads diarizer, adds /v1/audio/diarize endpoint)
      Model: nvidia/diar_sortformer_4spk-v1 (~150 MB, ungated, NeMo native)

    Spark Control:
      + POST /api/audio/transcribe-with-speakers
        Body: multipart file
        Returns: { duration, language, speakers_detected, segments: [{start_ms, end_ms, speaker, text}, ...], models: {transcription, diarization} }
        Runs Parakeet ASR + Sortformer in parallel, merges words to speaker turns by timestamp, groups into speaker-change blocks (breaks also on >1.5s silence gaps).
      + If Parakeet 500s mid-pipeline, kicks deep-health probe and returns 503/Retry-After: 60 (same wedge-recovery pattern as v0.9.0:2).

    Apply Sortformer patches to the running Parakeet container with:
      bash image/parakeet_patches/apply.sh <spark2-host> <ssh-user>

    Patches are reversible: apply.sh backs up the original main.py inside the container at main.py.pre-sortformer before overwriting. Restore by copying that file back and removing diarizer.py, then docker restart.

    v0.11 follow-up: dashboard "Speech Models" panel to swap/update model versions from the UI instead of needing to re-run apply.sh.

    Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
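The merge step described for /api/audio/transcribe-with-speakers (words assigned to speaker turns by timestamp, then grouped into blocks that break on a speaker change or a silence gap above 1.5 s) can be illustrated as follows. This is a sketch only: the input field names (`word`, `start`, `end` in seconds), the largest-overlap assignment rule, and the function names are assumptions rather than the project's actual merge code.

```python
def assign_speaker(word, turns):
    """Pick the diarization turn that overlaps the word the most (assumed rule)."""
    best, best_overlap = "unknown", 0.0
    for turn in turns:
        overlap = min(word["end"], turn["end"]) - max(word["start"], turn["start"])
        if overlap > best_overlap:
            best, best_overlap = turn["speaker"], overlap
    return best


def group_blocks(words, turns, max_gap_s=1.5):
    """Group timestamped words into speaker blocks.

    A new block starts on a speaker change or when the silence since the
    previous word exceeds `max_gap_s`.
    """
    blocks = []
    for word in words:
        speaker = assign_speaker(word, turns)
        start_ms = round(word["start"] * 1000)
        end_ms = round(word["end"] * 1000)
        last = blocks[-1] if blocks else None
        if (last is not None and last["speaker"] == speaker
                and start_ms - last["end_ms"] <= max_gap_s * 1000):
            last["text"] += " " + word["word"].strip()
            last["end_ms"] = end_ms
        else:
            blocks.append({
                "start_ms": start_ms,
                "end_ms": end_ms,
                "speaker": speaker,
                "text": word["word"].strip(),
            })
    return blocks
```

The output dictionaries mirror the `{start_ms, end_ms, speaker, text}` segment shape listed in the commit.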
Keysat	197655a62b	v0.9.0:2 - audio proxy: turn Parakeet wedge 500 into clean 503 + immediate auto-restart	2026-05-17 18:07:35 -05:00

    Parakeet's recurring CUDA wedge (CUBLAS_STATUS_*_ERROR mid-attention) fires reliably on Open WebUI's WebM/Opus->MP3 audio. Previously the proxy relayed the upstream 500 verbatim, Open WebUI showed "Server connection error" with no signal to retry, and recovery took up to 5 minutes (waiting for the next periodic deep-health probe).

    Now the proxy:
    1. Detects 500 from /v1/audio/transcriptions
    2. Fires deep_health.run_one("parakeet") as a background asyncio task (which contains the same wedge-detect + rate-limited auto-restart logic, but runs immediately instead of waiting for the next tick)
    3. Returns 503 with a clear detail message and Retry-After: 60

    The client (Open WebUI, Home Assistant, etc.) gets a proper retry signal; the auto-restart triggers within seconds; the next attempt ~60s later succeeds. Rate-limiting (3 restarts per 30 min) is inherited from the deep-health module so this can't cause restart storms.

    server.py: pass deep_health into build_audio_router().
    audio_proxy.py: new 503-with-restart branch; signature now accepts deep_health as an optional dependency.

    Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
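The two mechanisms in this commit, mapping an upstream 500 to a 503 with Retry-After and capping restarts at 3 per 30 minutes, can be sketched in isolation. `RestartLimiter` and `map_upstream_status` are hypothetical names, and the sliding-window approach is an assumption about how the deep-health module's rate limit might work; the real proxy fires `deep_health.run_one("parakeet")` as an asyncio task rather than returning a flag.

```python
import time
from collections import deque


class RestartLimiter:
    """Sliding-window limiter: at most `max_restarts` per `window_s` seconds."""

    def __init__(self, max_restarts=3, window_s=30 * 60, clock=time.monotonic):
        self.max_restarts = max_restarts
        self.window_s = window_s
        self.clock = clock
        self._stamps = deque()

    def allow(self):
        """Record and permit a restart unless the window is already full."""
        now = self.clock()
        # Drop restart timestamps that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self.window_s:
            self._stamps.popleft()
        if len(self._stamps) >= self.max_restarts:
            return False
        self._stamps.append(now)
        return True


def map_upstream_status(status, limiter):
    """Translate an upstream status into (status, headers, restart_requested)."""
    if status != 500:
        return status, {}, False
    # A wedge-style 500 becomes a retryable 503; the restart itself is rate-limited.
    return 503, {"Retry-After": "60"}, limiter.allow()
```

Note that the client still receives 503 with Retry-After even when the limiter refuses another restart, so the retry signal never depends on the restart budget.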
Keysat	f44e7f8b03	v0.9.0:0 - OpenAI-compatible audio proxy for Open WebUI / Home Assistant	2026-05-17 16:41:48 -05:00

    Adds three new endpoints to spark-control that translate OpenAI's audio API shapes to the Parakeet (STT) and Magpie (TTS, NVIDIA Riva) services on the Sparks:

      GET  /v1/models                - STT model + Magpie's 60+ voices
      POST /v1/audio/speech          - OpenAI body -> Magpie multipart synthesize (returns audio/wav passthrough)
      POST /v1/audio/transcriptions  - relay to Parakeet (already compatible)

    Verified shapes against the live services:
    - Parakeet returns OpenAI-style {"text": "..."} or verbose_json with segments+words. Already a perfect drop-in for OpenAI clients.
    - Magpie returns raw WAV bytes with Content-Type: audio/wav. NOT base64-wrapped JSON as one might assume. The proxy is literally a body-translation on the request side; response is passthrough.

    Voice language is auto-derived from the voice name (e.g. Magpie-Multilingual.EN-US.Mia -> language=en-US) so clients don't need to set it explicitly.

    Open WebUI / Home Assistant / Recap Relay can now all point at one URL, https://<spark-control>.local/v1, and get LLM, STT, TTS behind a single identity. No shim service to deploy.

    Pure addition: no existing routes touched; the dashboard, /api/*, download flow, deep-health, hardware probes are all unchanged.

    Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
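The voice-name-to-language derivation mentioned in this commit can be sketched as below. The parsing rule (second dot-separated field holds `LANG-REGION`) is inferred from the single example `Magpie-Multilingual.EN-US.Mia`; the function name and the `en-US` fallback for names that do not match are assumptions.

```python
def language_from_voice(voice, default="en-US"):
    """Derive a language tag such as "en-US" from a Magpie-style voice name.

    Expects names shaped like "<family>.<LANG>-<REGION>.<speaker>"; anything
    else falls back to `default` (an assumed behaviour).
    """
    parts = voice.split(".")
    if len(parts) >= 3:
        lang, _, region = parts[1].partition("-")
        if lang and region:
            return f"{lang.lower()}-{region.upper()}"
    return default
```

With this rule, `language_from_voice("Magpie-Multilingual.EN-US.Mia")` returns `"en-US"`, matching the example in the commit message.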

7 Commits