Channel-verified self identity: the mic track is you

Grant's insight + proven on real session audio: we capture self (mic) and others (system) as separate tracks, then throw the separation away by mixing to mono — so the backend has to re-guess who's who. Analysis of a real call showed the channels are cleanly separated (envelope corr 0.015, NO echo); Caitlyn's 'Go Bitcoin' was 11.8x louder in system than mic, yet the mono mix + noisy visual named it 'Grant'. ChannelSelfVAD marks self-speech as windows where the mic is active AND louder than system (mic > system x1.5). Benefits: (1) self is identified by CHANNEL, not by the on-screen name — set one name in Settings, no per-platform matching; (2) a remote speaker (or room echo) can never be mislabeled as self. Computed at finalize from the two finished WAVs; the live capture path is untouched. Falls back to mic-VAD if tracks can't be read. SessionController feeds these spans to the backend timeline. Validated on the real session: 16 self spans; 'Go Bitcoin' (72-74s) correctly EXCLUDED, Grant's 49.9-53.3s / 62.6-64s correctly INCLUDED. 33/33 XCTest (5 new).
2026-06-06 12:24:29 -05:00
parent 7f16b29f56
commit 2191486506
3 changed files with 146 additions and 2 deletions
@@ -277,16 +277,30 @@ final class SessionController: ObservableObject {
    private func stopVisualAndTimeline(_ result: RecordingResult, folder: URL?)
        async -> (timeline: [VisualTimeline.Segment], visualRan: Bool) {
        let selfName = settings.selfName
+        let selfSpans = await channelSelfSpans(result: result, folder: folder)
        if let vc = visualCapture, let folder {
            visualCapture = nil
            let timeline = await vc.finish(
-                selfSpans: result.selfSpans, selfName: selfName,
+                selfSpans: selfSpans, selfName: selfName,
                sessionId: folder.lastPathComponent, t0Unix: result.t0Unix,
                durationSec: result.duration, folder: folder)
            return (timeline, true)
        }
        if let vc = visualCapture { await vc.cancel(); visualCapture = nil }
-        return (TranscriptPipeline.timeline(fromSelfSpans: result.selfSpans, selfName: selfName), false)
+        return (TranscriptPipeline.timeline(fromSelfSpans: selfSpans, selfName: selfName), false)
+    }
+
+    /// Self spans for the backend timeline, identified by CHANNEL: the mic track is
+    /// the local user, so self = mic active AND louder than system. This makes self
+    /// platform-independent (one name, no display-name matching) and stops a remote
+    /// speaker from being mislabeled as self. Falls back to the mic-VAD spans if the
+    /// tracks can't be read. Runs off the main actor (file I/O).
+    private func channelSelfSpans(result: RecordingResult, folder: URL?) async -> [VADSpan] {
+        guard let folder else { return result.selfSpans }
+        let mic = folder.appendingPathComponent("mic.wav")
+        let sys = folder.appendingPathComponent("system.wav")
+        let spans = await Task.detached { ChannelSelfVAD.selfSpans(micURL: mic, systemURL: sys) }.value
+        return spans ?? result.selfSpans
    }

    private func stop() {