Grant's insight + proven on real session audio: we capture self (mic) and others
(system) as separate tracks, then throw the separation away by mixing to mono — so
the backend has to re-guess who's who. Analysis of a real call showed the channels
are cleanly separated (envelope corr 0.015, NO echo); Caitlyn's 'Go Bitcoin' was
11.8x louder in system than mic, yet the mono mix + noisy visual named it 'Grant'.
ChannelSelfVAD marks self-speech as windows where the mic is active AND louder than
system (mic > system x1.5). Benefits: (1) self is identified by CHANNEL, not by the
on-screen name — set one name in Settings, no per-platform matching; (2) a remote
speaker (or room echo) can never be mislabeled as self. Computed at finalize from
the two finished WAVs; the live capture path is untouched. Falls back to mic-VAD if
tracks can't be read. SessionController feeds these spans to the backend timeline.
Validated on the real session: 16 self spans; 'Go Bitcoin' (72-74s) correctly
EXCLUDED, Grant's 49.9-53.3s / 62.6-64s correctly INCLUDED. 33/33 XCTest (5 new).
Real Meet capture revealed the visual pipeline was treating ALL on-screen text as
participant names: meeting URL, clock, 'Add others' button, lobby 'Your meeting's
ready' dialog, 'Joined as …@gmail.com', etc. 46 of 52 'visual segments' in a real
session were phantom speakers. (The backend was unaffected — it diarizes from audio
and ignores names that match no voice cluster — but the visual_timeline.json and the
segment count were junk.)
GridCallAnalyzer.isLikelyName now gates OCR strings to things shaped like a name:
2–30 chars, 1–3 Title-Cased alphabetic words, no digits/URL/email/glyph punctuation.
Errs toward dropping (a missed name just loses a hint; audio diarization still runs).
Unit-tested against the EXACT 19 OCR strings from the real session: keeps the 5
real names, drops all 14 chrome strings. 28/28 XCTest.
Cross-platform research (Grant) flagged that the colour-border cue differs by app;
checking the real brand colours against the detector found a concrete bug: the
global 0.5 saturation threshold MISSES Teams' violet ring (#6264A7 ≈ 0.41, light
variants ~0.27) entirely and Meet's lighter blue glow (#8ab4f8 ≈ 0.44). Those
adapters would have detected nothing.
- FrameSampler.saturatedPoints: add a tunable threshold + optional hue-band gate
(degrees) so a lowered threshold doesn't pick up warm video.
- GridCallAnalyzer.Config: colorSaturation / colorMinBrightness / colorHueRange,
plumbed to the colour-border path (defaults preserve prior behaviour).
- MeetAdapter sat→0.35 (catch the glow); TeamsAdapter sat→0.22 + hue 215–275°
(catch the faint violet, reject other colours); ZoomAdapter sat 0.45 + hue
40–150° (vivid green/yellow). Values are first-pass pending real-fixture
calibration; the hue gate is the main calibration lever.
Tests: Teams now detects the faint violet ring and rejects a green one; Meet/Zoom
vivid cases still pass. 27/27 XCTest.
Visual capture now runs alongside audio: on call start the session picks the
app's adapter, captures the call window on the SAME monotonic clock as the audio
(AudioRecorder.sharedT0Host), and on stop writes visual_timeline.json and hands
the backend the visual segments with mic-VAD self-spans merged. Any visual
failure (no adapter, no window, Screen Recording denied) leaves the session
recording audio-only — the proven path is never blocked or broken.
- CallDetector now emits DetectedCall{app, bundleID, windowID}: the exact
CGWindowID of the matched Meet browser window (native apps → nil → largest).
- VisualCapture wraps VisualObserver + AdapterRegistry, writes visual_timeline.json.
- AudioRecorder.sharedT0Host() exposes the shared t0 for frame alignment.
Hardened per a 3-lens adversarial review (concurrency / failure-isolation /
data-flow), all 6 confirmed findings fixed:
- P0 (critical): startVisual could adopt a stale capture into a DIFFERENT session
(cross-session SCStream leak + visual_timeline.json written to the wrong
folder). Now gated on session identity — generation + recorder ===, still
.recording — with fail-closed adoption; otherwise the stream is cancelled.
- P1: observer captured the browser's largest window, not the detected Meet
window. Now targets the exact CGWindowID (pickWindowIndex, unit-tested),
largest-area only as fallback.
- P2: a startVisual orphaned by a concurrent stop could leak a stream on quit.
inFlightVisual is registered before the await and drained in prepareForTermination.
- P3: trailing visual gap/segment ends could exceed duration_sec. Clamped in
VisualCapture (clampSegments/clampGaps, unit-tested).
- P4: capture pixel size used NSScreen.main scale; now uses the scale of the
display actually hosting the window (OCR clarity on secondary displays).
- VisualObserver.stop() bounds stopCapture() with a 3s timeout (mirrors audio) so
a wedged stream can't hang finalization.
25/25 XCTest pass. Live validation on real calls still pending.
Front-loads the remaining visual adapters per the Signal→Meet→Zoom priority.
All three reuse GridCallAnalyzer's coloured-border (saturated) detection path
and share the new bottom-left name anchor:
- GridCallAnalyzer: generalise nameAtBottom:Bool into a NameAnchor enum
(.bottomCenter for Signal's centered footer, .bottomLeft for Meet/Zoom/Teams
where the name hugs the tile's bottom-left corner, .center for completeness).
tileRect estimates the tile up-and-right of a bottom-left name.
- MeetAdapter (Google-blue ring, browser-hosted), ZoomAdapter (green/yellow
border), TeamsAdapter (violet ring): coloured-border on, white-border off,
bottom-left names. Geometry constants are first-pass pending real fixtures.
- AdapterRegistry.adapter(for:) maps CallDetector.DetectedApp → AppAdapter so
VisualObserver can be constructed when live visual capture is wired in;
unmapped apps degrade to audio-only.
Synthetic 4-tile tests: Meet picks each blue-bordered speaker with no
adjacent-tile bleed, Zoom picks the green-bordered speaker, and Signal's
white-only detector correctly ignores a coloured border. 18/18 XCTest pass.
Signal's active-speaker cue is a 3px #ffffff rounded border (saturation ≈ 0),
which the saturation-based highlight detector could never see. Per the
Signal-Desktop source review:
- FrameSampler.thinWhitePoints: grid-sample near-white pixels that sit on a
THIN structure (a non-white pixel within edgeGap on some axis) so a border/
ring counts but a solid white blob (face, bright video) does not.
- GridCallAnalyzer: combine coloured (saturated) + white (thin) highlight
pixels; exclude name-text regions so the white footer name can't be mistaken
for the border; estimate the tile UP from the name footer (nameAtBottom);
attribute each highlight pixel to exactly one tile by containment (nearest
centre as tiebreak) so a border can't bleed into an adjacent tile.
- SignalAdapter: white border on, coloured off, name-at-bottom geometry.
Synthetic 4-tile harness now isolates each speaker with no adjacent-tile bleed;
all 15 XCTest cases pass. Real-screenshot geometry calibration still pending.