spark-control

Author	SHA1	Message	Date
Keysat	fda23088fe	v0.10.0:1 - hotfix: merge function now joins words with proper spacing Smoke testing v0.10.0:0 against a real anarlog audio.mp3 showed the output running words together: "I'mrecordingrightnow", "don'tyoutry". Root cause: _merge_words_with_speakers was doing "".join(cur_words), assuming Parakeet returns words with leading whitespace (which the hyprnote local Parakeet does, but the Spark-hosted Parakeet does not). Rewrote the join with a small helper that: - Strips each token (handles both leading-space and no-leading-space word formats) - Joins with a single space - Keeps punctuation tight — no space before period/comma/colon/etc. Verified post-install with the same test audio: [00:06] Speaker_0: I'm I'm recording right now. [00:18] Speaker_1: you're you're on your computer and your phone, right? No other changes — Parakeet container patches and the endpoint shape stay identical. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 15:42:04 -05:00
Keysat	713cd09cc2	v0.10.0:0 - speaker diarization via Sortformer + merged transcribe-with-speakers Adds a new pipeline for diarized transcription that any client (recap-relay, ad-hoc curl, future Mac-side tools) can call. Pure data pipeline, no LLM or UI included — name resolution / analysis happen downstream where prompts and rendering are configurable. Architecture: Spark 2 / parakeet-asr container: + /opt/parakeet/app/diarizer.py (new: SortformerDiarizer class) + /opt/parakeet/app/main.py (patched: loads diarizer, adds /v1/audio/diarize endpoint) Model: nvidia/diar_sortformer_4spk-v1 (~150 MB, ungated, NeMo native) Spark Control: + POST /api/audio/transcribe-with-speakers Body: multipart file Returns: { duration, language, speakers_detected, segments: [{start_ms, end_ms, speaker, text}, ...], models: {transcription, diarization} } Runs Parakeet ASR + Sortformer in parallel, merges words to speaker turns by timestamp, groups into speaker-change blocks (breaks also on >1.5s silence gaps). + If Parakeet 500s mid-pipeline, kicks deep-health probe and returns 503/Retry-After: 60 — same wedge-recovery pattern as v0.9.0:2. Apply Sortformer patches to the running Parakeet container with: bash image/parakeet_patches/apply.sh <spark2-host> <ssh-user> Patches are reversible — apply.sh backs up the original main.py inside the container at main.py.pre-sortformer before overwriting. Restore by copying that file back and removing diarizer.py, then docker restart. v0.11 follow-up: dashboard "Speech Models" panel to swap/update model versions from the UI instead of needing to re-run apply.sh. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 15:14:48 -05:00
Keysat	197655a62b	v0.9.0:2 - audio proxy: turn Parakeet wedge 500 into clean 503 + immediate auto-restart Parakeet's recurring CUDA wedge (CUBLAS_STATUS_*_ERROR mid-attention) fires reliably on Open WebUI's WebM/Opus->MP3 audio. Previously the proxy relayed the upstream 500 verbatim, Open WebUI showed "Server connection error" with no signal to retry, and recovery took up to 5 minutes (waiting for the next periodic deep-health probe). Now the proxy: 1. Detects 500 from /v1/audio/transcriptions 2. Fires deep_health.run_one("parakeet") as a background asyncio task (which contains the same wedge-detect + rate-limited auto-restart logic, but runs immediately instead of waiting for the next tick) 3. Returns 503 with a clear detail message and Retry-After: 60 The client (Open WebUI, Home Assistant, etc.) gets a proper retry signal; the auto-restart triggers inside seconds; the next attempt ~60s later succeeds. Rate-limiting (3 restarts per 30 min) is inherited from the deep-health module so this can't cause restart storms. server.py: pass deep_health into build_audio_router(). audio_proxy.py: new 503-with-restart branch; signature now accepts deep_health as an optional dependency. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-17 18:07:35 -05:00
Keysat	f44e7f8b03	v0.9.0:0 - OpenAI-compatible audio proxy for Open WebUI / Home Assistant Adds three new endpoints to spark-control that translate OpenAI's audio API shapes to the Parakeet (STT) and Magpie (TTS, NVIDIA Riva) services on the Sparks: GET /v1/models — STT model + Magpie's 60+ voices POST /v1/audio/speech — OpenAI body -> Magpie multipart synthesize (returns audio/wav passthrough) POST /v1/audio/transcriptions — relay to Parakeet (already compatible) Verified shapes against the live services: - Parakeet returns OpenAI-style {"text": "..."} or verbose_json with segments+words. Already a perfect drop-in for OpenAI clients. - Magpie returns raw WAV bytes with Content-Type: audio/wav. NOT base64-wrapped JSON as one might assume. The proxy is literally a body-translation on the request side; response is passthrough. Voice language is auto-derived from the voice name (e.g. Magpie-Multilingual.EN-US.Mia -> language=en-US) so clients don't need to set it explicitly. Open WebUI / Home Assistant / Recap Relay can now all point at one URL — https://<spark-control>.local/v1 — and get LLM, STT, TTS behind a single identity. No shim service to deploy. Pure addition: no existing routes touched; the dashboard, /api/*, download flow, deep-health, hardware probes are all unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-17 16:41:48 -05:00

4 Commits