8 Commits

Author SHA1 Message Date
Keysat 8d839e3714 v0.13.0:4 - redaction gateway, embeddings proxy, expanded audio API
- Add redaction gateway (redaction_gateway.py, redaction/ scrub + tests)
- Add embeddings proxy and spark_embed service (Dockerfile + main.py)
- Expand audio_proxy with speaker-aware handling; deep_health/health/server updates
- Package: configureSparks action + sparkConfig model updates, manifest/main wiring
- Docs: AUDIO_API, EMBEDDINGS, REDACTION_GATEWAY; HANDOFF and runbook/known-issues refresh
2026-06-11 17:45:57 -05:00
Keysat e775906caa v0.13.0:1 - per-chunk diarization worker with TitaNet voice fingerprints
Spark Control now exposes a per-chunk worker designed for Recap Relay
to orchestrate against. Recap Relay does the chunking + global speaker
clustering (consistent with how it already handles the Gemini path);
Spark Control handles the GPU-bound per-chunk work.

Parakeet container:
  - diarizer.py: now also loads NVIDIA TitaNet speaker-verification model
    (~25 MB, NeMo-native, no torchaudio). New diarize_chunk() method
    runs Sortformer + extracts one 192-dim voice fingerprint per detected
    local speaker (concatenating each speaker's audio across the chunk
    and running TitaNet's get_embedding).
  - main.py: new POST /v1/audio/diarize-chunk endpoint that returns
    segments + speakers_detected + fingerprints + models in one shot.

Spark Control:
  - new POST /api/audio/diarize-chunk that proxies to parakeet's new
    endpoint. Same CUDA-wedge recovery (503 + deep-health probe + 60s
    retry-after) as the other audio endpoints. Returns the raw JSON
    upstream because Recap Relay is the consumer; no merging needed.

Response shape Recap Relay receives per chunk:
  {
    "duration": 300.0,
    "segments":  [{"start_s","end_s","speaker"}, ...],   # LOCAL labels
    "speakers_detected": ["Speaker_0","Speaker_1",...],
    "fingerprints": {"Speaker_0":[192 floats], ...},
    "models": {"diarization":"...","embedding":"..."}
  }

Recap Relay's job:
  1. Chunk audio (existing chunking infrastructure)
  2. POST each chunk to /api/audio/diarize-chunk in parallel
  3. Collect all fingerprints from all chunks
  4. sklearn AgglomerativeClustering(distance_threshold=0.7, metric=cosine)
  5. Re-label segments with global cluster IDs
  6. Concatenate transcripts (from a separate parallel call to
     /v1/audio/transcriptions) with timestamp offsets and merge with
     re-labeled diar segments

After installing v0.13.0:1, click "Reapply patches" on the Speech Models
card to push the updated diarizer.py + main.py into the parakeet
container — TitaNet will download (~25 MB) on first call.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 11:37:05 -05:00
Keysat 95524f4983 v0.13.0:0 - revert WhisperX migration; back to Parakeet + Sortformer
After five hotfix iterations on the WhisperX install (v0.12.0:0–:4) we
never got a working docker build. The fundamental constraint isn't
patchable from outside NVIDIA: NGC PyTorch on ARM64 (the only base that
runs on Spark 2's GB10 Blackwell) ships a custom-versioned torch
2.10.0a0+b558c98 that has no pre-built torchaudio match anywhere.
WhisperX → pyannote → torchaudio is a hard dependency chain we couldn't
satisfy without rebuilding torchaudio against torch 2.10's alpha API.
Walking away cleanly is better than another night of chasing.

Removed from the codebase:
  - image/whisperx_container/* (Dockerfile + requirements + app/main.py)
  - image/app/whisperx_install.py (install manager + SSH ship-context logic)
  - image/Dockerfile COPY whisperx_container
  - WHISPERX_* config keys in config.py
  - whisperx service entry in services.py
  - WhisperX-preferred branch in audio_proxy.py
  - /api/whisperx/* endpoints in server.py
  - install banner + progress dialog in index.html
  - render + handlers in app.js
  - .whisperx-install styles in style.css

Spark 2 cleaned in tandem (user-authorized): container removed,
~/whisperx-build/ removed, 5.4 GB of dangling image layers + 1.3 GB of
builder cache reclaimed. parakeet-asr and magpie-tts unaffected and
healthy throughout.

The audio path is back to exactly what shipped in v0.11.0:3:
  POST /api/audio/transcribe-with-speakers
    → Parakeet (transcription) + Sortformer (diarization) in parallel
    → merged by timestamp into speaker-labeled blocks

v0.13.0:1+ will add the actually-needed fixes that the WhisperX detour
was meant to address:
  1. memory cap on the parakeet-asr container so a long-audio crash
     can't swap-thrash Spark 2 again
  2. a chunking proxy in /api/audio/transcribe-with-speakers that
     splits inputs >10 min before Sortformer

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 08:03:19 -05:00
Keysat 5a0bfba6a3 v0.12.0:0 - WhisperX as a one-click dashboard install + managed service
Replaces the manual rsync+build+run with a proper spark-control feature.
First in the audio path that doesn't require shell access on Spark 2.

What's in the box
─────────────────
* image/whisperx_container/   - the build context (Dockerfile, requirements,
  app/main.py FastAPI wrapper). Mainline pipeline: faster-whisper for STT +
  pyannote 3.1 for diarization + wav2vec2 forced alignment. Single endpoint
  /v1/audio/transcribe-with-speakers returns the exact same shape spark-
  control's existing endpoint does, so the recap-relay PR spec needs no
  changes when we cut over.

* image/app/whisperx_install.py - install manager. ships build context to
  Spark 2 over SSH, runs `docker build`, runs `docker run` with 40 GB
  memory cap (vs Sortformer's unbounded which thrashed Spark 2 on a 90-min
  file), polls /health until both Whisper + pyannote report loaded.

* Audio proxy: /api/audio/transcribe-with-speakers now prefers WhisperX
  when its /health reports diarizer_loaded=true, falls back to the legacy
  Parakeet + Sortformer path otherwise. Same response shape either way.
  Clean cutover, easy rollback (`docker rm whisperx-asr`).

* Dashboard (Audio / Speech tab):
  - "Add WhisperX" banner appears when not installed, with a primary
    "Install WhisperX" button. One click triggers the install.
  - Build progress dialog with phase + elapsed timer + live build log via
    SSE (`/api/whisperx/install/{job_id}/stream`).
  - After install, WhisperX auto-registers as a managed service alongside
    Parakeet and Magpie (Start/Restart/Stop, deep-check, auto-restart).
  - Banner self-hides once /api/whisperx/status reports healthy.

New endpoints
─────────────
  GET  /api/whisperx/status
  POST /api/whisperx/install
  GET  /api/whisperx/install/{job_id}
  GET  /api/whisperx/install/{job_id}/stream  (SSE phase + log)

Config additions (env)
──────────────────────
  WHISPERX_HOST       (defaults to spark2_host)
  WHISPERX_USER       (defaults to spark2_user)
  WHISPERX_CONTAINER  (default: whisperx-asr)
  WHISPERX_PORT       (default: 8002)
  WHISPERX_MODEL      (default: medium; tiny/base/small/medium/large-v3)

Dockerfile
──────────
Added COPY whisperx_container /app/whisperx_container so the runtime
install manager can read the build context from inside the spark-control
image and ship it over SSH.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 21:02:26 -05:00
Keysat fda23088fe v0.10.0:1 - hotfix: merge function now joins words with proper spacing
Smoke testing v0.10.0:0 against a real anarlog audio.mp3 showed the
output running words together: "I'mrecordingrightnow", "don'tyoutry".

Root cause: _merge_words_with_speakers was doing "".join(cur_words),
assuming Parakeet returns words with leading whitespace (which the
hyprnote local Parakeet does, but the Spark-hosted Parakeet does not).

Rewrote the join with a small helper that:
  - Strips each token (handles both leading-space and no-leading-space
    word formats)
  - Joins with a single space
  - Keeps punctuation tight — no space before period/comma/colon/etc.

Verified post-install with the same test audio:
  [00:06] Speaker_0: I'm I'm recording right now.
  [00:18] Speaker_1: you're you're on your computer and your phone, right?

No other changes — Parakeet container patches and the endpoint shape
stay identical.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 15:42:04 -05:00
Keysat 713cd09cc2 v0.10.0:0 - speaker diarization via Sortformer + merged transcribe-with-speakers
Adds a new pipeline for diarized transcription that any client (recap-relay,
ad-hoc curl, future Mac-side tools) can call. Pure data pipeline, no LLM
or UI included — name resolution / analysis happen downstream where prompts
and rendering are configurable.

Architecture:
  Spark 2 / parakeet-asr container:
    + /opt/parakeet/app/diarizer.py        (new: SortformerDiarizer class)
    + /opt/parakeet/app/main.py            (patched: loads diarizer, adds
                                            /v1/audio/diarize endpoint)
    Model: nvidia/diar_sortformer_4spk-v1  (~150 MB, ungated, NeMo native)

  Spark Control:
    + POST /api/audio/transcribe-with-speakers
      Body: multipart file
      Returns: {
        duration, language, speakers_detected,
        segments: [{start_ms, end_ms, speaker, text}, ...],
        models: {transcription, diarization}
      }
      Runs Parakeet ASR + Sortformer in parallel, merges words to speaker
      turns by timestamp, groups into speaker-change blocks (breaks also
      on >1.5s silence gaps).
    + If Parakeet 500s mid-pipeline, kicks deep-health probe and returns
      503/Retry-After: 60 — same wedge-recovery pattern as v0.9.0:2.

Apply Sortformer patches to the running Parakeet container with:
  bash image/parakeet_patches/apply.sh <spark2-host> <ssh-user>

Patches are reversible — apply.sh backs up the original main.py inside the
container at main.py.pre-sortformer before overwriting. Restore by copying
that file back and removing diarizer.py, then docker restart.

v0.11 follow-up: dashboard "Speech Models" panel to swap/update model
versions from the UI instead of needing to re-run apply.sh.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 15:14:48 -05:00
Keysat 197655a62b v0.9.0:2 - audio proxy: turn Parakeet wedge 500 into clean 503 + immediate auto-restart
Parakeet's recurring CUDA wedge (CUBLAS_STATUS_*_ERROR mid-attention)
fires reliably on Open WebUI's WebM/Opus->MP3 audio. Previously the
proxy relayed the upstream 500 verbatim, Open WebUI showed "Server
connection error" with no signal to retry, and recovery took up to
5 minutes (waiting for the next periodic deep-health probe).

Now the proxy:
  1. Detects 500 from /v1/audio/transcriptions
  2. Fires deep_health.run_one("parakeet") as a background asyncio task
     (which contains the same wedge-detect + rate-limited auto-restart
     logic, but runs immediately instead of waiting for the next tick)
  3. Returns 503 with a clear detail message and Retry-After: 60

The client (Open WebUI, Home Assistant, etc.) gets a proper retry
signal; the auto-restart triggers inside seconds; the next attempt
~60s later succeeds. Rate-limiting (3 restarts per 30 min) is
inherited from the deep-health module so this can't cause restart
storms.

server.py: pass deep_health into build_audio_router().
audio_proxy.py: new 503-with-restart branch; signature now accepts
                deep_health as an optional dependency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 18:07:35 -05:00
Keysat f44e7f8b03 v0.9.0:0 - OpenAI-compatible audio proxy for Open WebUI / Home Assistant
Adds three new endpoints to spark-control that translate OpenAI's
audio API shapes to the Parakeet (STT) and Magpie (TTS, NVIDIA Riva)
services on the Sparks:

  GET  /v1/models                — STT model + Magpie's 60+ voices
  POST /v1/audio/speech          — OpenAI body -> Magpie multipart synthesize
                                    (returns audio/wav passthrough)
  POST /v1/audio/transcriptions  — relay to Parakeet (already compatible)

Verified shapes against the live services:
  - Parakeet returns OpenAI-style {"text": "..."} or verbose_json with
    segments+words. Already a perfect drop-in for OpenAI clients.
  - Magpie returns raw WAV bytes with Content-Type: audio/wav. NOT
    base64-wrapped JSON as one might assume. The proxy is literally a
    body-translation on the request side; response is passthrough.

Voice language is auto-derived from the voice name (e.g.
Magpie-Multilingual.EN-US.Mia -> language=en-US) so clients don't
need to set it explicitly.

Open WebUI / Home Assistant / Recap Relay can now all point at one
URL — https://<spark-control>.local/v1 — and get LLM, STT, TTS
behind a single identity. No shim service to deploy.

Pure addition: no existing routes touched; the dashboard, /api/*,
download flow, deep-health, hardware probes are all unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 16:41:48 -05:00