v0.13.0:2 - per-segment confidence in diarize-chunk response

Recap Relay dev asked: can the diarization output include a confidence
level per segment so the UI can render "Speaker_0?" for uncertain
assignments rather than confidently mislabeling?

Answer: yes. Sortformer's diarize() with include_tensor_outputs=True
returns the per-frame, per-speaker sigmoid scores (shape [B, T, 4spk],
~12.6 frames per second). Until now the code argmaxed those into segment
strings and threw the raw scores away. Now, for each output segment, it
computes the mean probability of the assigned speaker across the
segment's frames, which gives a confidence in [0, 1].

Implementation:
- diarizer.py: diarize_chunk() now calls diarize() with
  include_tensor_outputs=True, and a new _attach_confidence() helper
  derives the per-segment mean probability after parsing the segment
  strings. The frame rate is computed from the tensor shape and the
  audio duration, so the model's stride does not need to be hard-coded.
- All failure paths return confidence=None gracefully, so Recap Relay
  can treat None as "no info" or fall back to a default threshold.
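
The derivation described above can be sketched roughly as follows. This is a minimal illustration, not the actual _attach_confidence() from diarizer.py: the function name attach_confidence, the dict-based segment layout, the "Speaker_<index>" label parsing, the floor/ceil frame window, and the use of plain nested lists in place of the model's tensor are all assumptions made for the example.

```python
import math
from typing import Optional, Sequence


def attach_confidence(
    segments: list[dict],
    frame_probs: Sequence[Sequence[float]],
    duration_s: float,
) -> list[dict]:
    """Return copies of `segments` with a `confidence` key added.

    frame_probs[t][k] is the sigmoid score of speaker k at frame t
    (hypothetical layout: one chunk, batch dimension already removed).
    """
    n_frames = len(frame_probs)
    # Derive the frame rate from the data instead of hard-coding the stride.
    fps = n_frames / duration_s if duration_s > 0 else 0.0
    out = []
    for seg in segments:
        confidence: Optional[float] = None
        try:
            # Assumed label format: "Speaker_<index>".
            spk = int(seg["speaker"].rsplit("_", 1)[1])
            if spk < 0:
                raise ValueError("negative speaker index")
            # Frames overlapped by the segment, clamped to the tensor length.
            lo = max(0, math.floor(seg["start_s"] * fps))
            hi = min(n_frames, math.ceil(seg["end_s"] * fps))
            scores = [frame_probs[t][spk] for t in range(lo, hi)]
            if scores:
                confidence = sum(scores) / len(scores)
        except (KeyError, IndexError, ValueError, TypeError, AttributeError):
            confidence = None  # any derivation failure means "no info"
        out.append({**seg, "confidence": confidence})
    return out


# 5 s of audio, 63 frames, 4 speaker columns (63 / 5 = 12.6 fps).
probs = [[0.9, 0.1, 0.0, 0.0]] * 63
segs = [{"start_s": 1.0, "end_s": 2.0, "speaker": "Speaker_0"}]
result = attach_confidence(segs, probs, 5.0)
```

A segment whose label cannot be parsed, or whose time range falls outside the tensor, comes back with confidence=None rather than raising, which matches the "all failure paths return None" behaviour described above.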

Endpoint shape change: each entry in segments[] now has an optional
`confidence` field in [0, 1] (or None, which serializes as JSON null).
All other fields are unchanged, so existing callers that ignore the
field are not affected.
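
On the caller side, handling the optional field might look like the sketch below. The helper name display_label and the 0.5 default threshold are assumptions for illustration only; a missing or None confidence is treated as "no info" and rendered without a question mark.

```python
from typing import Optional


def display_label(segment: dict, threshold: float = 0.5) -> str:
    """Return the speaker label, with "?" appended for uncertain segments."""
    # .get() keeps this compatible with responses that predate the field.
    confidence: Optional[float] = segment.get("confidence")
    if confidence is not None and confidence < threshold:
        return segment["speaker"] + "?"
    return segment["speaker"]


label = display_label(
    {"start_s": 1.2, "end_s": 4.8, "speaker": "Speaker_0", "confidence": 0.31}
)
```

Callers that prefer to flag unknown segments as uncertain could instead substitute a default value for None before comparing.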

Verified with a 5 s test signal: the tensor has shape [1, 63, 4]
(63 frames / 5 s = 12.6 fps) and values in [0, 1]. The scores are
sigmoid outputs, independent per speaker, so overlap detection works.
Real speech should score much higher than the pure-tone test signal,
whose values were near zero.

Reapply patches on the Speech Models card after installing v0.13.0:2
to pick up the updated diarizer.py + main.py in the parakeet container.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -175,7 +175,10 @@ async def diarize_chunk(
 Response shape:
 {
   "duration": 300.0,
-  "segments": [{"start_s": 1.2, "end_s": 4.8, "speaker": "Speaker_0"}, ...],
+  "segments": [
+    {"start_s": 1.2, "end_s": 4.8, "speaker": "Speaker_0", "confidence": 0.78},
+    ...
+  ],
   "speakers_detected": ["Speaker_0", "Speaker_1", "Speaker_2"],
   "fingerprints": {
     "Speaker_0": [0.123, -0.045, ..., 0.211], # 192-dim TitaNet embedding
@@ -188,6 +191,13 @@ async def diarize_chunk(
   }
 }
 
+confidence per segment: mean probability that the assigned speaker was
+active across the segment's frames (Sortformer's raw per-frame per-
+speaker sigmoid outputs). Range [0, 1], higher = more confident.
+Clean speech typically >0.5; ambiguous regions (overlap, weak signal)
+fall lower. None on derivation failure. Recap Relay can threshold
+this to render uncertain segments as "Speaker_0?" in the UI.
+
 Speaker labels are LOCAL to this chunk. Run cosine-similarity clustering
 across the fingerprints from all chunks to merge `chunkA.Speaker_0` with
 `chunkB.Speaker_2` when they're the same voice. Recommended threshold:
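
The cross-chunk merge mentioned in the docstring could be sketched as a greedy cosine-similarity pass over the fingerprints. This is only one possible approach, not necessarily what the project does: the function names, the chunk-id keys, and the greedy "first representative wins" strategy are assumptions. The docstring's recommended threshold is cut off in this diff, so `threshold` is a required argument here and the 0.8 in the usage lines is an arbitrary illustrative value.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def merge_speakers(
    chunks: dict[str, dict[str, list[float]]],
    threshold: float,
) -> dict[tuple[str, str], int]:
    """Map (chunk_id, local_label) to a global speaker id.

    Each new fingerprint joins the most similar existing representative
    if the similarity reaches `threshold`; otherwise it starts a new id.
    """
    representatives: list[list[float]] = []
    mapping: dict[tuple[str, str], int] = {}
    for chunk_id, fingerprints in chunks.items():
        for label, emb in fingerprints.items():
            best_id, best_sim = None, threshold
            for gid, rep in enumerate(representatives):
                sim = cosine(emb, rep)
                if sim >= best_sim:
                    best_id, best_sim = gid, sim
            if best_id is None:
                representatives.append(emb)
                best_id = len(representatives) - 1
            mapping[(chunk_id, label)] = best_id
    return mapping


# Toy 2-dim vectors stand in for the 192-dim embeddings.
chunks = {
    "chunkA": {"Speaker_0": [1.0, 0.0], "Speaker_1": [0.0, 1.0]},
    "chunkB": {"Speaker_2": [0.9, 0.1]},
}
global_ids = merge_speakers(chunks, threshold=0.8)  # 0.8 is illustrative only
```

A greedy pass depends on chunk order and never revisits earlier assignments; a proper clustering step over all fingerprints would be more robust when many chunks are involved.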