The backend shipped dual-channel mode; wire the client to it. We already capture
mic (you) and system (others) separately, so send them as two files instead of the
mono mix — fixing the misattribution at the source.
- SparkControlClient: labelMergeDual(mic_file, system_file, self_name, self_vad);
multipart generalized to N files; shared POST/retry/decode extracted.
- SessionPackager.rebasedSelfVadData: chunk-local [{start,end}] for self_vad;
sliceAudio reused for both tracks.
- TranscriptPipeline.process: dual-channel chunking (slice mic+system, rebase
timeline + self_vad per chunk) when system audio is healthy; mono mixed-file
fallback (self folded into the timeline) otherwise.
- VisualCapture.finish: write the full visual_timeline.json (remote + self merged)
but return REMOTE (vision) segments only — self travels via the mic channel.
- TranscriptAssembler: rank mic_channel highest (the user's own track wins).
- VoiceprintStore: store the clean mic_channel self voiceprint.
- SessionController: pass mic/system URLs + remote timeline + channel self-spans +
self_name + systemHealthy; self_vad.json now reflects the channel-verified spans.
Validated END-TO-END against the live backend on the real misattributing session:
'Go Bitcoin' (remote) is now attributed to Unknown_0, NOT the user; the user's own
lines come back source=mic_channel; per-channel ASR recovered fuller remote text.
36/36 XCTest (4 new: self_vad rebase, mic_channel ranking + voiceprint storage).
Grant's insight + proven on real session audio: we capture self (mic) and others
(system) as separate tracks, then throw the separation away by mixing to mono — so
the backend has to re-guess who's who. Analysis of a real call showed the channels
are cleanly separated (envelope corr 0.015, NO echo); Caitlyn's 'Go Bitcoin' was
11.8x louder in system than mic, yet the mono mix + noisy visual named it 'Grant'.
ChannelSelfVAD marks self-speech as windows where the mic is active AND louder than
system (mic > system x1.5). Benefits: (1) self is identified by CHANNEL, not by the
on-screen name — set one name in Settings, no per-platform matching; (2) a remote
speaker (or room echo) can never be mislabeled as self. Computed at finalize from
the two finished WAVs; the live capture path is untouched. Falls back to mic-VAD if
tracks can't be read. SessionController feeds these spans to the backend timeline.
Validated on the real session: 16 self spans; 'Go Bitcoin' (72-74s) correctly
EXCLUDED, Grant's 49.9-53.3s / 62.6-64s correctly INCLUDED. 33/33 XCTest (5 new).
Real Meet capture revealed the visual pipeline was treating ALL on-screen text as
participant names: meeting URL, clock, 'Add others' button, lobby 'Your meeting's
ready' dialog, 'Joined as …@gmail.com', etc. 46 of 52 'visual segments' in a real
session were phantom speakers. (The backend was unaffected — it diarizes from audio
and ignores names that match no voice cluster — but the visual_timeline.json and the
segment count were junk.)
GridCallAnalyzer.isLikelyName now gates OCR strings to things shaped like a name:
2–30 chars, 1–3 Title-Cased alphabetic words, no digits/URL/email/glyph punctuation.
Errs toward dropping (a missed name just loses a hint; audio diarization still runs).
Unit-tested against the EXACT 19 OCR strings from the real session: keeps the 5
real names, drops all 14 chrome strings. 28/28 XCTest.
Signal 1:1 (and group) calls didn't auto-record. Root cause confirmed on-device:
Signal is Electron and holds the mic in a HELPER process
(org.whispersystems.signal-desktop.helper.Renderer, a child of the main app).
detectViaMicAttribution only matched PIDs listed in NSWorkspace.runningApplications
against the main bundle ID, so the helper's mic use was never attributed to Signal.
(Zoom worked = single native process; Meet worked = browser resolved.)
Fix: iterate the mic-using PIDs and resolve each to its owning app by walking the
parent-process chain (sysctl KERN_PROC_PID → ppid) until an NSRunningApplication is
found. Helper PIDs that return nil directly now resolve to the main app. Validated
against the live Signal helpers: pids 2383/2372 → org.whispersystems.signal-desktop.
Superset of the old behavior, so Zoom/Meet detection is preserved (browser case now
also more robust); our own recording is still skipped (selfPID).
Cross-platform research (Grant) flagged that the colour-border cue differs by app;
checking the real brand colours against the detector found a concrete bug: the
global 0.5 saturation threshold MISSES Teams' violet ring (#6264A7 ≈ 0.41, light
variants ~0.27) entirely and Meet's lighter blue glow (#8ab4f8 ≈ 0.44). Those
adapters would have detected nothing.
- FrameSampler.saturatedPoints: add a tunable threshold + optional hue-band gate
(degrees) so a lowered threshold doesn't pick up warm video.
- GridCallAnalyzer.Config: colorSaturation / colorMinBrightness / colorHueRange,
plumbed to the colour-border path (defaults preserve prior behaviour).
- MeetAdapter sat→0.35 (catch the glow); TeamsAdapter sat→0.22 + hue 215–275°
(catch the faint violet, reject other colours); ZoomAdapter sat 0.45 + hue
40–150° (vivid green/yellow). Values are first-pass pending real-fixture
calibration; the hue gate is the main calibration lever.
Tests: Teams now detects the faint violet ring and rejects a green one; Meet/Zoom
vivid cases still pass. 27/27 XCTest.
Visual capture falls back to audio-only silently, so the user couldn't tell if
it attached on a real call. SessionInfo now carries visualSegmentCount (nil =
audio-only; a count = visual ran, with that many vision-detected speaker
segments), shown in the menu as '… · N visual segments' or '… · audio-only'.
Makes the pending live-call validation unambiguous.
Visual capture now runs alongside audio: on call start the session picks the
app's adapter, captures the call window on the SAME monotonic clock as the audio
(AudioRecorder.sharedT0Host), and on stop writes visual_timeline.json and hands
the backend the visual segments with mic-VAD self-spans merged. Any visual
failure (no adapter, no window, Screen Recording denied) leaves the session
recording audio-only — the proven path is never blocked or broken.
- CallDetector now emits DetectedCall{app, bundleID, windowID}: the exact
CGWindowID of the matched Meet browser window (native apps → nil → largest).
- VisualCapture wraps VisualObserver + AdapterRegistry, writes visual_timeline.json.
- AudioRecorder.sharedT0Host() exposes the shared t0 for frame alignment.
Hardened per a 3-lens adversarial review (concurrency / failure-isolation /
data-flow), all 6 confirmed findings fixed:
- P0 (critical): startVisual could adopt a stale capture into a DIFFERENT session
(cross-session SCStream leak + visual_timeline.json written to the wrong
folder). Now gated on session identity — generation + recorder ===, still
.recording — with fail-closed adoption; otherwise the stream is cancelled.
- P1: observer captured the browser's largest window, not the detected Meet
window. Now targets the exact CGWindowID (pickWindowIndex, unit-tested),
largest-area only as fallback.
- P2: a startVisual orphaned by a concurrent stop could leak a stream on quit.
inFlightVisual is registered before the await and drained in prepareForTermination.
- P3: trailing visual gap/segment ends could exceed duration_sec. Clamped in
VisualCapture (clampSegments/clampGaps, unit-tested).
- P4: capture pixel size used NSScreen.main scale; now uses the scale of the
display actually hosting the window (OCR clarity on secondary displays).
- VisualObserver.stop() bounds stopCapture() with a 3s timeout (mirrors audio) so
a wedged stream can't hang finalization.
25/25 XCTest pass. Live validation on real calls still pending.
Front-loads the remaining visual adapters per the Signal→Meet→Zoom priority.
All three reuse GridCallAnalyzer's coloured-border (saturated) detection path
and share the new bottom-left name anchor:
- GridCallAnalyzer: generalise nameAtBottom:Bool into a NameAnchor enum
(.bottomCenter for Signal's centered footer, .bottomLeft for Meet/Zoom/Teams
where the name hugs the tile's bottom-left corner, .center for completeness).
tileRect estimates the tile up-and-right of a bottom-left name.
- MeetAdapter (Google-blue ring, browser-hosted), ZoomAdapter (green/yellow
border), TeamsAdapter (violet ring): coloured-border on, white-border off,
bottom-left names. Geometry constants are first-pass pending real fixtures.
- AdapterRegistry.adapter(for:) maps CallDetector.DetectedApp → AppAdapter so
VisualObserver can be constructed when live visual capture is wired in;
unmapped apps degrade to audio-only.
Synthetic 4-tile tests: Meet picks each blue-bordered speaker with no
adjacent-tile bleed, Zoom picks the green-bordered speaker, and Signal's
white-only detector correctly ignores a coloured border. 18/18 XCTest pass.
Signal's active-speaker cue is a 3px #ffffff rounded border (saturation ≈ 0),
which the saturation-based highlight detector could never see. Per the
Signal-Desktop source review:
- FrameSampler.thinWhitePoints: grid-sample near-white pixels that sit on a
THIN structure (a non-white pixel within edgeGap on some axis) so a border/
ring counts but a solid white blob (face, bright video) does not.
- GridCallAnalyzer: combine coloured (saturated) + white (thin) highlight
pixels; exclude name-text regions so the white footer name can't be mistaken
for the border; estimate the tile UP from the name footer (nameAtBottom);
attribute each highlight pixel to exactly one tile by containment (nearest
centre as tiebreak) so a border can't bleed into an adjacent tile.
- SignalAdapter: white border on, coloured off, name-at-bottom geometry.
Synthetic 4-tile harness now isolates each speaker with no adjacent-tile bleed;
all 15 XCTest cases pass. Real-screenshot geometry calibration still pending.
AudioRecorder captures system audio (ScreenCaptureKit) + mic (AVAudioEngine) on a
single serial ioQueue, one shared monotonic t0, time-driven writers (pad gaps /
trim overlaps) so tracks stay aligned, and an energy mic-VAD for 'self' spans.
AudioMixer sums the aligned tracks into mixed_mono_16k.wav. SessionController
drives a serialized start/stop state machine, writes the session folder +
self_vad.json, exposes live level meters, and finalizes on quit.
Hardening from review: ioQueue single-domain (no races), stop() never hangs
(mic-first teardown + bounded stopCapture), layout-agnostic mic deep-copy,
discard-only video output to keep SCStream alive, VAD lockstep on committed
frames, stable signing team in project.yml, single-instance enforcement.