Channel-verified self identity: the mic track is you

Grant's insight + proven on real session audio: we capture self (mic) and others
(system) as separate tracks, then throw the separation away by mixing to mono — so
the backend has to re-guess who's who. Analysis of a real call showed the channels
are cleanly separated (envelope corr 0.015, NO echo); Caitlyn's 'Go Bitcoin' was
11.8x louder in system than mic, yet the mono mix + noisy visual named it 'Grant'.

ChannelSelfVAD marks self-speech as windows where the mic is active AND louder than
system (mic > system x1.5). Benefits: (1) self is identified by CHANNEL, not by the
on-screen name — set one name in Settings, no per-platform matching; (2) a remote
speaker (or room echo) can never be mislabeled as self. Computed at finalize from
the two finished WAVs; the live capture path is untouched. Falls back to mic-VAD if
tracks can't be read. SessionController feeds these spans to the backend timeline.

Validated on the real session: 16 self spans; 'Go Bitcoin' (72-74s) correctly
EXCLUDED, Grant's 49.9-53.3s / 62.6-64s correctly INCLUDED. 33/33 XCTest (5 new).
This commit is contained in:
Grant Gilliam
2026-06-06 12:24:29 -05:00
parent 7f16b29f56
commit 2191486506
3 changed files with 146 additions and 2 deletions
@@ -277,16 +277,30 @@ final class SessionController: ObservableObject {
private func stopVisualAndTimeline(_ result: RecordingResult, folder: URL?)
async -> (timeline: [VisualTimeline.Segment], visualRan: Bool) {
let selfName = settings.selfName
let selfSpans = await channelSelfSpans(result: result, folder: folder)
if let vc = visualCapture, let folder {
visualCapture = nil
let timeline = await vc.finish(
selfSpans: result.selfSpans, selfName: selfName,
selfSpans: selfSpans, selfName: selfName,
sessionId: folder.lastPathComponent, t0Unix: result.t0Unix,
durationSec: result.duration, folder: folder)
return (timeline, true)
}
if let vc = visualCapture { await vc.cancel(); visualCapture = nil }
return (TranscriptPipeline.timeline(fromSelfSpans: result.selfSpans, selfName: selfName), false)
return (TranscriptPipeline.timeline(fromSelfSpans: selfSpans, selfName: selfName), false)
}
/// Self spans for the backend timeline, identified by CHANNEL: the mic track is
/// the local user, so self = mic active AND louder than system. This makes self
/// platform-independent (one name, no display-name matching) and stops a remote
/// speaker from being mislabeled as self. Falls back to the mic-VAD spans if the
/// tracks can't be read. Runs off the main actor (file I/O).
private func channelSelfSpans(result: RecordingResult, folder: URL?) async -> [VADSpan] {
guard let folder else { return result.selfSpans }
let mic = folder.appendingPathComponent("mic.wav")
let sys = folder.appendingPathComponent("system.wav")
let spans = await Task.detached { ChannelSelfVAD.selfSpans(micURL: mic, systemURL: sys) }.value
return spans ?? result.selfSpans
}
private func stop() {