Make diarization chunk length configurable (Auto + presets)

Chunk body length was hardcoded at 2.5 min (150s). Add a Settings control:
Auto / Standard 2.5min / Large group 60s / Fine 90s. Shorter chunks put fewer
simultaneous speakers in each window (Sortformer resolves ~4/chunk), which helps
on large calls, at some cost to speed and cross-chunk voice matching.

- ChunkMode (new, pure/testable): mode → body seconds; Auto picks 60s when >4
  participants were detected, else 150s; overlap + single-chunk threshold scale
  with the body length.
- AppSettings.chunkMode (+ typed `chunk`); SettingsView picker with explanation.
- TranscriptPipeline.process gains chunkSeconds; derives overlap/threshold from it.
- SessionController resolves the body from the setting + the session's detected
  participant count (visual_timeline participants) for both send + re-process.
- Participant roster now counts EVERY tile OCR'd, not just who spoke
  (TimelineBuilder.observedNames → VisualObserver → VisualCapture), so the Auto
  call-size signal is meaningful even though speaking-detection is sparse.

Tests: ChunkMode resolution, overlap scaling, short-body re-chunking. 69 pass.
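
The mode-to-seconds resolution described above can be sketched roughly as follows. This is a minimal illustration, not the committed ChunkMode: the case names and the `bodySeconds(participantCount:)` signature are assumptions, and only the 150s / 60s / 90s values and the ">4 participants" Auto rule come from the message. Overlap and single-chunk-threshold scaling are left out because their formulas are not stated here.

```swift
// Hypothetical sketch: case names and this method signature are assumed,
// not taken from the actual ChunkMode source.
enum ChunkMode: String, CaseIterable {
    case auto, standard, largeGroup, fine

    /// Chunk body length in seconds for this mode.
    /// `participantCount` is only consulted by `.auto`.
    func bodySeconds(participantCount: Int) -> Double {
        switch self {
        case .standard:   return 150   // 2.5 min
        case .largeGroup: return 60
        case .fine:       return 90
        case .auto:       return participantCount > 4 ? 60 : 150
        }
    }
}

// A 6-person call resolves to the short body under Auto.
print(ChunkMode.auto.bodySeconds(participantCount: 6))   // prints "60.0"
```

Keeping the resolver a pure function of mode and participant count is what makes it straightforward to unit-test without a session.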
Grant Gilliam
2026-06-09 10:15:16 -05:00
parent 3bb7f1ab32
commit a3e3406b28
9 changed files with 133 additions and 3 deletions
@@ -15,9 +15,15 @@ final class TimelineBuilder {
private let closeFrames: Int
private var aliases: [String: String] = [:] // normalized variant -> canonical
private var states: [String: NameState] = [:]
private var observed: Set<String> = [] // every tile name seen (speaking or not)
private var lastFrameT: Double = 0
private(set) var segments: [VisualTimeline.Segment] = []
/// Every distinct participant name the adapter has OCR'd, whether or not they were
/// ever detected speaking. This is the call-size signal: it drives "Auto" chunk sizing
/// and a complete participant roster, since speaking-detection is intentionally sparse.
var observedNames: [String] { observed.sorted() }
init(openFrames: Int = 2, closeFrames: Int = 2) {
self.openFrames = max(1, openFrames)
self.closeFrames = max(1, closeFrames)
@@ -34,6 +40,9 @@ final class TimelineBuilder {
func ingest(_ observations: [SpeakerObservation], at t: TimeInterval) {
lastFrameT = t
// Record every tile seen (speaking or not) for the participant roster / call size.
for obs in observations where !obs.name.isEmpty { observed.insert(canonical(obs.name)) }
// Best confidence per canonical name that is speaking this frame.
var speaking: [String: Double] = [:]
for obs in observations where obs.speaking && !obs.name.isEmpty {
+4 -1
@@ -75,7 +75,10 @@ final class VisualCapture {
}, to: durationSec)
let artifact = (vision + selfSegs).sorted { $0.start < $1.start }
let names = Set(artifact.map { $0.name })
// Roster = everyone OCR'd (speaking or not) plus the names that produced segments,
// so the participant count reflects true call size even when few people were
// detected speaking. Drives "Auto" chunk sizing downstream.
let names = Set(artifact.map { $0.name }).union(observer.participantNames())
let participants = names.sorted().map {
VisualTimeline.Participant(name: $0, isSelf: $0 == selfName ? true : nil, aliases: nil)
}
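
As a rough, self-contained illustration of the roster union in this hunk: the `Segment` struct, the `roster(segments:observed:)` helper, and the sample names below are hypothetical stand-ins, not types from this codebase. The point is only that names seen on screen but never detected speaking still count toward call size.

```swift
// Hypothetical stand-in for VisualTimeline.Segment (illustration only).
struct Segment {
    let name: String
    let start: Double
}

/// Union of names that produced speaking segments and every name observed via OCR.
func roster(segments: [Segment], observed: [String]) -> [String] {
    Set(segments.map { $0.name }).union(observed).sorted()
}

// Two people were detected speaking, but five tiles were seen on screen.
let segs = [Segment(name: "Ana", start: 0), Segment(name: "Ben", start: 4)]
let names = roster(segments: segs, observed: ["Ana", "Ben", "Cy", "Dee", "Eli"])
print(names.count)   // prints "5"
```

With segments alone the count would be 2; with the union it is 5, which is above the 4-participant threshold that the commit message gives for Auto choosing 60s chunks.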
@@ -114,6 +114,10 @@ final class VisualObserver: NSObject, SCStreamDelegate, SCStreamOutput {
queue.sync { builder.mergeSelfSpans(spans, selfName: selfName) }
}
/// Every distinct participant name OCR'd over the session (read on the builder's
/// queue; safe to call after `stop`).
func participantNames() -> [String] { queue.sync { builder.observedNames } }
// MARK: - SCStreamOutput (on `queue`)
func stream(_ stream: SCStream, didOutputSampleBuffer sampleBuffer: CMSampleBuffer,