Recap Relay dev asked: can the diarization output include a confidence
level per segment so the UI can render "Speaker_0?" for uncertain
assignments rather than confidently mislabeling?
Answer: yes. Sortformer's diarize() with include_tensor_outputs=True
returns the per-frame, per-speaker sigmoid scores (shape [B, T, 4], one
column per speaker slot, ~12.6 frames per second). The previous code
argmaxed those into segment strings and threw the raw scores away. Each
output segment now gets a confidence in [0, 1]: the mean probability of
the assigned speaker across the segment's frames.
Implementation:
- diarizer.py: diarize_chunk() now calls diarize() with
include_tensor_outputs=True, and a new _attach_confidence() helper
derives the per-segment mean probability after parsing the segment
strings. The frame rate is computed from the tensor shape and the audio
duration, so the model's stride does not need to be hard-coded.
- All failure paths return confidence=None gracefully — Recap Relay
can treat None as "no info" or fall back to a default threshold.
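The per-segment confidence described above can be sketched roughly as
follows. This is a minimal illustration, not the actual
_attach_confidence() code: the function name segment_confidence, its
argument layout, and the use of plain nested lists instead of a tensor
are all assumptions made for the example.

```python
from typing import Optional, Sequence


def segment_confidence(
    probs: Sequence[Sequence[float]],  # [T][num_speakers] sigmoid scores for one chunk
    duration_s: float,                 # audio duration of the chunk in seconds
    start_s: float,
    end_s: float,
    speaker_idx: int,                  # column of the assigned speaker
) -> Optional[float]:
    """Mean probability of the assigned speaker over the segment's frames.

    Returns None on any failure path (empty scores, bad bounds), mirroring
    the "confidence=None" behaviour described in the commit message.
    """
    if not probs or duration_s <= 0 or end_s <= start_s:
        return None
    if not 0 <= speaker_idx < len(probs[0]):
        return None
    # Frame rate derived from tensor shape vs. audio duration (no fixed stride).
    fps = len(probs) / duration_s
    first = max(0, int(start_s * fps))
    last = min(len(probs), max(first + 1, int(round(end_s * fps))))
    frames = probs[first:last]
    if not frames:
        return None
    return sum(frame[speaker_idx] for frame in frames) / len(frames)
```

With a [63, 4] score matrix for a 5s chunk, the derived rate is 12.6 fps,
so a segment from 0.0s to 1.0s averages frames 0 through 12.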
Endpoint shape change: each entry in segments[] now has an optional
`confidence` field in [0, 1] (or None). All other fields are unchanged.
Existing callers that ignore the field aren't affected.
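On the consumer side, the "Speaker_0?" rendering could look something
like the sketch below. The helper name display_label and the 0.6
threshold are hypothetical; the commit does not prescribe a threshold,
and here None is treated as "no info" and rendered as a plain label.

```python
from typing import Optional


def display_label(
    speaker: str,
    confidence: Optional[float],
    threshold: float = 0.6,  # assumed value, tune against real recordings
) -> str:
    """Append '?' to the speaker label when the assignment is uncertain."""
    if confidence is not None and confidence < threshold:
        return f"{speaker}?"
    # None means the server could not compute a score: show the label as-is.
    return speaker
```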
Verified with a 5s test signal: the tensor has shape [1, 63, 4]
(63 frames / 5s = 12.6 fps) with values in [0, 1] (sigmoid outputs,
independent per speaker, so overlap detection works). Real speech should
score much higher than the near-zero values produced by the pure-tone
test signal.
Reapply patches on the Speech Models card after installing v0.13.0:2
to pick up the updated diarizer.py + main.py in the parakeet container.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Spark Control now exposes a per-chunk worker designed for Recap Relay
to orchestrate against. Recap Relay does the chunking + global speaker
clustering (consistent with how it already handles the Gemini path);
Spark Control handles the GPU-bound per-chunk work.
Parakeet container:
- diarizer.py: now also loads the NVIDIA TitaNet speaker-verification
model (~25 MB, NeMo-native, no torchaudio). The new diarize_chunk()
method runs Sortformer, then extracts one 192-dim voice fingerprint per
detected local speaker by concatenating that speaker's audio across the
chunk and running TitaNet's get_embedding on it.
- main.py: new POST /v1/audio/diarize-chunk endpoint that returns
segments + speakers_detected + fingerprints + models in one shot.
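The "concatenate each speaker's audio across the chunk" step can be
sketched as below. This only illustrates the slicing logic on a plain
list of samples; the function name concat_speaker_audio and the
(start_s, end_s, speaker) tuple layout are assumptions, and the actual
embedding call is intentionally left out.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def concat_speaker_audio(
    samples: Sequence[float],
    sample_rate: int,
    segments: Sequence[Tuple[float, float, str]],  # (start_s, end_s, local speaker)
) -> Dict[str, List[float]]:
    """Collect each local speaker's audio into one contiguous buffer."""
    per_speaker: Dict[str, List[float]] = defaultdict(list)
    for start_s, end_s, speaker in segments:
        lo = max(0, int(start_s * sample_rate))
        hi = min(len(samples), int(end_s * sample_rate))
        if hi > lo:  # skip empty or out-of-range segments
            per_speaker[speaker].extend(samples[lo:hi])
    return dict(per_speaker)
```

Each resulting buffer would then be passed to the embedding model once,
which yields one fingerprint per local speaker rather than one per
segment.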
Spark Control:
- new POST /api/audio/diarize-chunk that proxies to parakeet's new
endpoint. Same CUDA-wedge recovery (503 + deep-health probe + 60s
Retry-After) as the other audio endpoints. Passes the upstream JSON
through unmodified because Recap Relay is the consumer; no merging needed.
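A caller can honour the 503 + Retry-After contract with a small retry
wrapper. The sketch below is client-side and hypothetical: the name
call_with_wedge_retry, the (status, headers, body) tuple returned by the
injected send callable, and the attempt limit are assumptions, and the
header lookup expects the exact key "Retry-After".

```python
import time
from typing import Callable, Mapping, Tuple

Response = Tuple[int, Mapping[str, str], bytes]  # (status, headers, body)


def call_with_wedge_retry(
    send: Callable[[], Response],
    max_attempts: int = 3,
    sleep: Callable[[float], None] = time.sleep,
    default_wait_s: float = 60.0,
) -> Response:
    """Retry a request while the server answers 503, waiting Retry-After seconds."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    response = send()
    for _ in range(max_attempts - 1):
        status, headers, _body = response
        if status != 503:
            break
        try:
            wait_s = float(headers.get("Retry-After", default_wait_s))
        except (TypeError, ValueError):
            # Retry-After may also be an HTTP date; fall back to the default wait.
            wait_s = default_wait_s
        sleep(wait_s)
        response = send()
    return response
```

Injecting send and sleep keeps the wrapper independent of any particular
HTTP library and makes it easy to exercise without a real server.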
Response shape Recap Relay receives per chunk:
  {
    "duration": 300.0,
    "segments": [{"start_s","end_s","speaker"}, ...],   # LOCAL labels
    "speakers_detected": ["Speaker_0","Speaker_1",...],
    "fingerprints": {"Speaker_0":[192 floats], ...},
    "models": {"diarization":"...","embedding":"..."}
  }
Recap Relay's job:
1. Chunk audio (existing chunking infrastructure)
2. POST each chunk to /api/audio/diarize-chunk in parallel
3. Collect all fingerprints from all chunks
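4. sklearn AgglomerativeClustering(n_clusters=None, distance_threshold=0.7,
metric="cosine", linkage="average"); n_clusters=None is required when
distance_threshold is set, and cosine needs a non-ward linkage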
5. Re-label segments with global cluster IDs
6. Concatenate transcripts (from a separate parallel call to
/v1/audio/transcriptions) with timestamp offsets and merge with
re-labeled diar segments
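Steps 3 to 5 above can be sketched without third-party dependencies as
below. This is a stdlib stand-in for sklearn's AgglomerativeClustering
(average linkage, cosine distance, stop at the distance threshold), not
the Recap Relay implementation; the (chunk index, local label) key, the
"Global_N" label format, and both function names are assumptions.

```python
import math
from typing import Dict, List, Sequence, Tuple

Key = Tuple[int, str]  # (chunk index, local speaker label)


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def cluster_fingerprints(
    fingerprints: Dict[Key, Sequence[float]],
    distance_threshold: float = 0.7,
) -> Dict[Key, str]:
    """Average-linkage agglomerative clustering over all chunk fingerprints."""
    clusters: List[List[Key]] = [[key] for key in sorted(fingerprints)]

    def linkage(c1: List[Key], c2: List[Key]) -> float:
        total = sum(
            cosine_distance(fingerprints[a], fingerprints[b]) for a in c1 for b in c2
        )
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        dist, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if dist >= distance_threshold:  # no pair is close enough to merge
            break
        clusters[i].extend(clusters.pop(j))
    return {key: f"Global_{n}" for n, members in enumerate(clusters) for key in members}


def relabel_segments(
    chunk_index: int,
    segments: Sequence[dict],
    labels: Dict[Key, str],
    offset_s: float,
) -> List[dict]:
    """Shift chunk-local timestamps and swap local labels for global ones."""
    return [
        {
            "start_s": seg["start_s"] + offset_s,
            "end_s": seg["end_s"] + offset_s,
            "speaker": labels[(chunk_index, seg["speaker"])],
        }
        for seg in segments
    ]
```

Keying fingerprints by (chunk index, local label) matters because
"Speaker_0" in one chunk is not necessarily the same person as
"Speaker_0" in the next.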
After installing v0.13.0:1, click "Reapply patches" on the Speech Models
card to push the updated diarizer.py + main.py into the parakeet
container — TitaNet will download (~25 MB) on first call.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a new pipeline for diarized transcription that any client (recap-relay,
ad-hoc curl, future Mac-side tools) can call. Pure data pipeline, no LLM
or UI included — name resolution / analysis happen downstream where prompts
and rendering are configurable.
Architecture:
  Spark 2 / parakeet-asr container:
    + /opt/parakeet/app/diarizer.py  (new: SortformerDiarizer class)
    + /opt/parakeet/app/main.py      (patched: loads diarizer, adds
                                      /v1/audio/diarize endpoint)
    Model: nvidia/diar_sortformer_4spk-v1 (~150 MB, ungated, NeMo native)
  Spark Control:
    + POST /api/audio/transcribe-with-speakers
      Body: multipart file
      Returns: {
        duration, language, speakers_detected,
        segments: [{start_ms, end_ms, speaker, text}, ...],
        models: {transcription, diarization}
      }
      Runs Parakeet ASR + Sortformer in parallel, merges words to speaker
      turns by timestamp, groups into speaker-change blocks (breaks also
      on >1.5s silence gaps).
    + If Parakeet 500s mid-pipeline, kicks deep-health probe and returns
      503/Retry-After: 60, the same wedge-recovery pattern as v0.9.0:2.
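The word-to-speaker merge can be sketched roughly as follows. This is an
illustration of the described behaviour, not the Spark Control code: the
function names, the dict-based word and turn records, the midpoint-based
lookup, and the "Unknown" fallback are assumptions.

```python
from typing import Dict, List, Sequence


def speaker_at(time_ms: float, turns: Sequence[Dict]) -> str:
    """Speaker of the turn containing time_ms, else of the nearest turn."""
    if not turns:
        return "Unknown"  # hypothetical fallback when diarization is empty

    def distance(turn: Dict) -> float:
        return max(turn["start_ms"] - time_ms, time_ms - turn["end_ms"], 0)

    return min(turns, key=distance)["speaker"]


def merge_words(
    words: Sequence[Dict],   # {"start_ms", "end_ms", "text"} from ASR
    turns: Sequence[Dict],   # {"start_ms", "end_ms", "speaker"} from diarization
    max_gap_ms: int = 1500,  # break a block on silence longer than 1.5s
) -> List[Dict]:
    """Group words into blocks, breaking on speaker change or long silence."""
    blocks: List[Dict] = []
    for word in sorted(words, key=lambda w: w["start_ms"]):
        midpoint = (word["start_ms"] + word["end_ms"]) / 2
        speaker = speaker_at(midpoint, turns)
        last = blocks[-1] if blocks else None
        same_block = (
            last is not None
            and last["speaker"] == speaker
            and word["start_ms"] - last["end_ms"] <= max_gap_ms
        )
        if same_block:
            last["text"] += " " + word["text"]
            last["end_ms"] = word["end_ms"]
        else:
            blocks.append(
                {
                    "start_ms": word["start_ms"],
                    "end_ms": word["end_ms"],
                    "speaker": speaker,
                    "text": word["text"],
                }
            )
    return blocks
```

Using the word midpoint keeps a word that straddles a turn boundary
attached to the turn that covers most of it.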
Apply Sortformer patches to the running Parakeet container with:
bash image/parakeet_patches/apply.sh <spark2-host> <ssh-user>
Patches are reversible — apply.sh backs up the original main.py inside the
container at main.py.pre-sortformer before overwriting. Restore by copying
that file back and removing diarizer.py, then docker restart.
v0.11 follow-up: dashboard "Speech Models" panel to swap/update model
versions from the UI instead of needing to re-run apply.sh.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>