Fragments reabsorbed by smoothFragments (e.g. "I" then "need to switch it
back") were left as separate transcript lines. Add SpeakerReconciler.mergeAdjacent
to join consecutive same-speaker segments within 2s, concatenating their text.
Wire it into SessionController.finishBackend AFTER reconcile/LLM naming. The
collapse needs no LLM, so finishBackend no longer early-returns when the gateway
has no chat model — it runs the collapse and re-persists speakers.json
unconditionally, gating only the reconcile and recap passes on the model.
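
A minimal sketch of the adjacent-segment collapse described above. The `Segment` shape and the `maxGap` parameter name are assumptions for illustration; the real SpeakerReconciler.mergeAdjacent and the speakers.json schema may differ.

```swift
import Foundation

// Hypothetical segment shape; the real speakers.json schema may differ.
struct Segment: Equatable {
    var speaker: String
    var start: Double   // seconds
    var end: Double     // seconds
    var text: String
}

/// Joins consecutive segments from the same speaker when the gap between
/// them is at most `maxGap` seconds, concatenating their text.
func mergeAdjacent(_ segments: [Segment], maxGap: Double = 2.0) -> [Segment] {
    var merged: [Segment] = []
    for seg in segments {
        if let last = merged.last,
           last.speaker == seg.speaker,
           seg.start - last.end <= maxGap {
            var joined = last
            joined.end = max(last.end, seg.end)
            joined.text = [last.text, seg.text]
                .filter { !$0.isEmpty }
                .joined(separator: " ")
            merged[merged.count - 1] = joined
        } else {
            merged.append(seg)
        }
    }
    return merged
}
```

Because the pass is pure and needs no model, it can run whether or not the gateway exposes a chat model.
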
Chunks were contiguous (start = prev end) with a naïve offset-concat stitch — no
overlap. That cut sentences at boundaries, denied the diarizer context at edges, and
let one voice split across chunks (the MH/Unknown_0 problem). Now each ~150s body is
sliced with a 15s margin on both sides ([bodyStart-15, bodyEnd+15]); the stitcher
keeps a segment only in the chunk that owns its MIDPOINT (body region) and drops it
from the neighbour's margin — so boundary-spanning speech is seen whole by the
backend and kept exactly once.
- SessionPackager.PlannedChunk gains bodyStart/bodyEnd; planChunks adds overlapSeconds.
- TranscriptAssembler.ChunkResult carries body bounds (defaults keep-all → no-overlap
behaviour preserved for existing callers); assemble dedups by midpoint-in-body.
- TranscriptPipeline passes body bounds through.
Complements (doesn't replace) the fragment-smoothing + reconciliation safety nets;
this is the upstream fix. Cost: ~+20% backend audio per interior chunk (30s of
margin on a 150s body). 63/63 XCTest (new: overlap window layout + boundary-segment
dedup).
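
The window layout and midpoint rule can be sketched as follows. This is an illustration under assumptions, not the shipped planner: `sliceStart`/`sliceEnd`, `bodySeconds`, and the `owns` helper are hypothetical names, and the half-open body interval is one way to make sure a midpoint on a boundary has exactly one owner.

```swift
import Foundation

// Hypothetical layout; only bodyStart/bodyEnd/overlapSeconds are named in the source.
struct PlannedChunk: Equatable {
    var sliceStart: Double, sliceEnd: Double   // audio actually sent to the backend
    var bodyStart: Double, bodyEnd: Double     // region this chunk owns
}

/// Splits [0, duration] into contiguous bodies and pads each slice with a margin.
func planChunks(duration: Double, bodySeconds: Double = 150,
                overlapSeconds: Double = 15) -> [PlannedChunk] {
    precondition(bodySeconds > 0 && overlapSeconds >= 0)
    var chunks: [PlannedChunk] = []
    var bodyStart = 0.0
    while bodyStart < duration {
        let bodyEnd = min(bodyStart + bodySeconds, duration)
        chunks.append(PlannedChunk(
            sliceStart: max(0, bodyStart - overlapSeconds),
            sliceEnd: min(duration, bodyEnd + overlapSeconds),
            bodyStart: bodyStart, bodyEnd: bodyEnd))
        bodyStart = bodyEnd
    }
    return chunks
}

/// A segment (in absolute session time) is kept only by the chunk whose
/// body contains its midpoint: [bodyStart, bodyEnd).
func owns(_ chunk: PlannedChunk, segStart: Double, segEnd: Double) -> Bool {
    let mid = (segStart + segEnd) / 2
    return mid >= chunk.bodyStart && mid < chunk.bodyEnd
}
```

For a 400s session this yields bodies 0-150, 150-300, and 300-400; the interior slice is 135-315, i.e. 180s of audio for a 150s body.
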
Investigating Grant's real 38-min group call: 'Marty' was a GARBAGE cluster: 192
segs, 0.37s mean duration, 186 of them ≤2 words, and 125 single words flanked on both
sides by the same other speaker. These were diarization micro-fragments split
mid-sentence, which the LLM then named 'Marty'. Same for 'Message'/'HI'.
- SpeakerReconciler.smoothFragments: dissolve non-self clusters whose MEDIAN segment
duration ≤ 1s (≥3 segs) — reassign each fragment to the temporally-nearest real
speaker. (Median, not max, so one stray long segment can't rescue a fragment
cluster — the bug in the first cut.) On the real call: 7 speakers (3 junk) → 4 real
(Marty/Message/HI absorbed into Grant/Jonathan/Me/MH). Runs before LLM naming.
- LLM naming guardrails: forbid assigning the self name or ANY already-taken name to
another voice (fixes 'Grant' = the user's name pinned on a remote speaker); prompt
demands self-intro / direct-address evidence (mention ≠ presence), 'precision over
coverage', one name per speaker.
- Open saved session now offers Open Editor vs Re-process, so newer logic can be
applied to past calls (+ always-visible progress from the prior fix).
NOTE: the self-name guardrail needs the app to KNOW the user's name — selfName is still
'Me', so set it in Settings (e.g. 'Grant') so the LLM can't reuse it. 62/62 XCTest.
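
A rough sketch of the median rule from smoothFragments above. The `Seg` type, the helper names, and measuring "temporally nearest" from the fragment's midpoint are assumptions; only the thresholds (median ≤ 1s, ≥3 segs, self excluded) come from the notes.

```swift
import Foundation

// Hypothetical segment shape for illustration.
struct Seg { var speaker: String; var start: Double; var end: Double }

func median(_ xs: [Double]) -> Double {
    let s = xs.sorted()
    guard !s.isEmpty else { return 0 }
    let m = s.count / 2
    return s.count % 2 == 0 ? (s[m - 1] + s[m]) / 2 : s[m]
}

/// Distance in seconds from time `t` to segment `s` (0 if `t` falls inside it).
func gap(_ t: Double, _ s: Seg) -> Double {
    if t < s.start { return s.start - t }
    if t > s.end { return t - s.end }
    return 0
}

/// Non-self clusters with at least `minCount` segments and median duration <= `maxMedian`.
func fragmentClusters(_ segs: [Seg], selfName: String,
                      maxMedian: Double = 1.0, minCount: Int = 3) -> Set<String> {
    var junk = Set<String>()
    for (name, group) in Dictionary(grouping: segs, by: { $0.speaker }) where name != selfName {
        if group.count >= minCount,
           median(group.map { $0.end - $0.start }) <= maxMedian {
            junk.insert(name)
        }
    }
    return junk
}

/// Reassigns every segment of a fragment cluster to the nearest real speaker.
func smoothFragments(_ segs: [Seg], selfName: String) -> [Seg] {
    let junk = fragmentClusters(segs, selfName: selfName)
    let real = segs.filter { !junk.contains($0.speaker) }
    guard !real.isEmpty else { return segs }
    return segs.map { seg -> Seg in
        guard junk.contains(seg.speaker) else { return seg }
        let mid = (seg.start + seg.end) / 2
        guard let nearest = real.min(by: { gap(mid, $0) < gap(mid, $1) }) else { return seg }
        var out = seg
        out.speaker = nearest.speaker
        return out
    }
}
```

Using the median means a single long segment in an otherwise fragmentary cluster does not keep the cluster alive, which is the failure the first cut had with max.
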
The status line only rendered inside the last-in-memory-session block, so 'Open
saved session' processed invisibly — looked like nothing happened. Now: the
transcript status (with a spinner) is always shown, the processing(0,0) reconcile
phase reads 'Working… (this can take a few minutes)', and invalid picks surface an
alert (not a recorded session / already processing / unreadable transcript) instead
of doing nothing.
Reconciliation (the marry-the-signals layer): after transcription, before the recap,
SpeakerReconciler (1) MERGES non-self clusters whose voiceprints are highly similar
(cosine >= 0.82) — fixes a person split across chunks (the real 1-on-1 failure: one
remote came back as 'MH' + 'Unknown_0'); and (2) NAMES remaining non-self clusters
from transcript CONTENT via the gateway LLM (people addressed by name / self-intros),
conservative + confidence-gated, keeping the placeholder when unrevealed. The
mic-channel self is protected and never reassigned. Voice does the segmentation; the
fingerprint-merge fixes splits; the LLM adds the content signal that the visual and
voiceprint signals lack.
- SpeakerReconciler: pure cosine merge (tested) + LLM content-naming pass; rewrites
speakers.json before recap. SessionController.finishBackend shares one model lookup
for reconcile + recap. Gated by settings.reconcileSpeakers (default on).
- Open saved session: menu 'Open saved session…' → folder picker. Edits it if already
transcribed, else reconstructs inputs from disk (visual_timeline vision segs +
channel self-spans) and runs transcribe → reconcile → recap, then opens the editor.
Lets you evaluate/correct ANY past call, not just the in-memory last one.
Note (from real Signal data): visual naming is unreliable on Signal (sparse, misread
initials, lowercase/center names), so reconciliation + the editor (which teaches
voiceprints on confirm) carry speaker naming there; the editor remains the human
arbiter. 59/59 XCTest.
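
The voiceprint merge in step (1) could look roughly like this. The 0.82 threshold and the self exclusion are from the notes; the greedy first-match strategy, the alphabetical survivor choice, and the `mergeMap` name are assumptions made for the sketch.

```swift
import Foundation

/// Cosine similarity of two equal-length vectors (0 if either is empty or all-zero).
func cosine(_ a: [Double], _ b: [Double]) -> Double {
    guard a.count == b.count, !a.isEmpty else { return 0 }
    var dot = 0.0, na = 0.0, nb = 0.0
    for i in a.indices {
        dot += a[i] * b[i]
        na += a[i] * a[i]
        nb += b[i] * b[i]
    }
    guard na > 0, nb > 0 else { return 0 }
    return dot / (na.squareRoot() * nb.squareRoot())
}

/// Maps each absorbed cluster name to the surviving cluster it merges into.
/// The self cluster is never a merge source or target.
func mergeMap(fingerprints: [String: [Double]], selfName: String,
              threshold: Double = 0.82) -> [String: String] {
    var survivors: [String] = []
    var map: [String: String] = [:]
    for name in fingerprints.keys.sorted() where name != selfName {
        guard let fp = fingerprints[name] else { continue }
        if let target = survivors.first(where: { other in
            guard let otherFp = fingerprints[other] else { return false }
            return cosine(otherFp, fp) >= threshold
        }) {
            map[name] = target
        } else {
            survivors.append(name)
        }
    }
    return map
}
```

With toy fingerprints where 'MH' and 'Unknown_0' point in nearly the same direction, the result maps 'Unknown_0' onto 'MH' and leaves the self cluster alone.
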
Takeaways categories are no longer hardcoded — they're editable templates. A
template = the always-on TLDR + an ordered list of sections, each with a title, a
type (attributed items / bulleted list / paragraph), and an instruction (the prompt
text for that category). The analyzer assembles the LLM prompt FROM the template
and parses generically, so adding/removing/renaming a category needs zero code and
the output always renders.
- RecapTemplate / TemplateSection / SectionKind + TopicGranularity; built-in
defaults (Internal Meeting, 1:1, Company/Sales Call), all editable.
- Generic extras: RecapExtras{tldr, primarySpeakers, sections:[RenderedSection]} +
RecapItem{text,who,when,note} replaces the fixed MeetingExtras. Analyzer builds
per-section sec_N fields + parses by kind; renderer + remap are generic.
- Topic granularity (coarse/auto/fine) answers 'should chunking be configurable' —
it scales the target topic count; raw window sizes stay as tuned defaults.
- AppSettings persists templates + defaultTemplateId (seeded once). Settings gets a
default-template picker + 'Manage…' → TemplatesView (CRUD, edit sections/
instructions, set default, **Preview prompt** for full transparency).
- Recap editor gains a template picker; Regenerate uses the chosen template. Auto
recap uses the default template.
54/54 XCTest (template prompt build, generic parse/remap/render updated).
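
To illustrate how a prompt can be assembled from a template, here is a small sketch. The type names follow the notes, but the field names, the enum cases, and the prompt wording are all assumptions; the real analyzer's prompt text is not shown in these notes.

```swift
import Foundation

// Hypothetical shapes: only the type names and the three section types come from the notes.
enum SectionKind: String, Codable { case attributedItems, bulletedList, paragraph }

struct TemplateSection: Codable {
    var title: String
    var kind: SectionKind
    var instruction: String   // prompt text for this category
}

struct RecapTemplate: Codable {
    var name: String
    var sections: [TemplateSection]
}

/// Builds the field list for the LLM prompt: the always-on TLDR plus one
/// sec_N field per section, with the expected JSON shape chosen by kind.
func buildPrompt(_ template: RecapTemplate) -> String {
    var lines = [
        "Return one JSON object with these fields:",
        "- tldr: a short summary of the call (string)",
    ]
    for (i, section) in template.sections.enumerated() {
        let shape: String
        switch section.kind {
        case .attributedItems: shape = "array of {text, who, when, note}"
        case .bulletedList: shape = "array of strings"
        case .paragraph: shape = "string"
        }
        lines.append("- sec_\(i): \(section.title) (\(shape)). \(section.instruction)")
    }
    return lines.joined(separator: "\n")
}
```

Because the parser only needs each section's kind to decode its sec_N field, adding or renaming a section changes the template data, not the code.
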
Adds a 'Regenerate recap' action so corrected speaker names flow into freshly
written summaries/extras (not just find-replaced). regenerate() commits the
corrections (rewrite speakers.json + reconcile voiceprints), re-runs RecapAnalyzer
on the corrected transcript via the gateway LLM, and rewrites recap.json +
transcript.md + recap.html. save() and regenerate() share commitCorrections();
both rebaseline the speaker set afterward so further edits map cleanly. Editor view
gains the button + progress spinner; RecapEditModel takes the gateway baseURL/skipTLS.
52/52 XCTest; builds clean.
Native editor to fix speaker-ID errors after transcription (modeled on recap-relay's
correction UX): rename a speaker in the legend, merge two speakers, or reassign an
individual transcript line. Saving rewrites speakers.json, re-renders transcript.md +
recap.html, and updates the voiceprint memory — so a correction compounds: naming an
"Unknown" speaker teaches that voice for future calls.
- SpeakerEditing (pure, tested): replaceSpeaker (rename = merge-onto-existing),
reassign, netNameMap (compose ops), and remap (apply a name map to a recap's
structured fields + whole-word free text, so summaries/extras update without re-LLM).
- RecapEditModel (@MainActor): loads speakers.json (+ optional recap.json +
cluster_fingerprints.json); on save writes the resolved speakers.json, re-renders,
and reconciles voiceprints — merge keeps the survivor's print; rename/name-an-Unknown
enrolls the cluster's fingerprint under the new name.
- TranscriptEditorView (SwiftUI) + EditorWindow (AppKit window for the LSUIElement app);
menu gains "Edit speakers".
- Pipeline now persists cluster_fingerprints.json (every cluster incl. Unknown) and
recap.json (RecapFile) so the editor can learn voices + re-render offline.
- RecapModels made Codable; TranscriptAssembler exposes allFingerprints;
VoiceprintStore gains enroll() + merge().
52/52 XCTest (6 new, incl. a full rename→artifacts→voiceprint round-trip on disk).
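
The whole-word free-text remap can be sketched with NSRegularExpression. The function name `remapFreeText` is hypothetical, and doing the replacement in a single pass is a design choice of this sketch (it keeps a swap such as A→B, B→A from undoing itself); the shipped SpeakerEditing.remap may work differently.

```swift
import Foundation

/// Replaces whole-word occurrences of each key in `names` with its value,
/// in one pass over the text so replacements never feed into each other.
func remapFreeText(_ text: String, names: [String: String]) -> String {
    // Longest keys first so a longer name wins over a shorter prefix.
    let keys = names.keys.filter { !$0.isEmpty }.sorted { $0.count > $1.count }
    guard !keys.isEmpty else { return text }
    let alternation = keys
        .map { NSRegularExpression.escapedPattern(for: $0) }
        .joined(separator: "|")
    guard let regex = try? NSRegularExpression(pattern: "\\b(?:\(alternation))\\b") else {
        return text
    }
    var out = ""
    var cursor = text.startIndex
    let fullRange = NSRange(text.startIndex..., in: text)
    for match in regex.matches(in: text, range: fullRange) {
        guard let r = Range(match.range, in: text) else { continue }
        out.append(contentsOf: text[cursor..<r.lowerBound])
        let old = String(text[r])
        out.append(names[old] ?? old)
        cursor = r.upperBound
    }
    out.append(contentsOf: text[cursor...])
    return out
}
```

The `\b` anchors are what keep 'Grant' from matching inside 'Grants' and 'Unknown_0' from matching inside 'Unknown_01'.
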
New 'Recap' phase — turns speakers.json into a human-readable recap, leveraging
recap-relay's proven logic/prompts but calling the Spark gateway's OpenAI-compatible
/v1/chat/completions directly (same host/TLS as label-merge; Qwen3-35B). We start
from already-named speakers (label-merge), so recap-relay's speaker clustering +
name-inference are skipped entirely.
- GatewayLLMClient: /v1/chat/completions (JSON mode), model discovery via
/api/endpoints, TLS-skip reuse, 503 retry, sequential.
- RecapAnalyzer: speakers.json → numbered [N] (MM:SS) Name: text transcript →
time-windowed analyze (single window for short calls, 18min/2min overlap for long)
→ stitch/dedup topic sections → meeting extras (TLDR/decisions/action_items/
open_questions/key_quotes). Defensive JSON parsing of LLM output.
- RecapRenderer: writes transcript.md + a self-contained dark-theme recap.html
(topic sections w/ collapsible transcripts, extras panels, speaker color chips,
full timestamped speaker-attributed transcript, print styles).
- SessionController.buildRecap: best-effort after speakers.json (gated by
settings.recapEnabled); surfaces recapURL → menu 'Open recap'. Skips silently if
the gateway has no LLM. Settings toggle added.
Validated END-TO-END on the real Meet session against the live gateway: dual-channel
transcription → 3 topic sections + accurate TLDR + key quotes; 'Go Bitcoin'
correctly attributed to the remote speaker. 46/46 XCTest (10 new).
The backend shipped dual-channel mode; wire the client to it. We already capture
mic (you) and system (others) separately, so send them as two files instead of the
mono mix — fixing the misattribution at the source.
- SparkControlClient: labelMergeDual(mic_file, system_file, self_name, self_vad);
multipart generalized to N files; shared POST/retry/decode extracted.
- SessionPackager.rebasedSelfVadData: chunk-local [{start,end}] for self_vad;
sliceAudio reused for both tracks.
- TranscriptPipeline.process: dual-channel chunking (slice mic+system, rebase
timeline + self_vad per chunk) when system audio is healthy; mono mixed-file
fallback (self folded into the timeline) otherwise.
- VisualCapture.finish: write the full visual_timeline.json (remote + self merged)
but return REMOTE (vision) segments only — self travels via the mic channel.
- TranscriptAssembler: rank mic_channel highest (the user's own track wins).
- VoiceprintStore: store the clean mic_channel self voiceprint.
- SessionController: pass mic/system URLs + remote timeline + channel self-spans +
self_name + systemHealthy; self_vad.json now reflects the channel-verified spans.
Validated END-TO-END against the live backend on the real misattributing session:
'Go Bitcoin' (remote) is now attributed to Unknown_0, NOT the user; the user's own
lines come back source=mic_channel; per-channel ASR recovered fuller remote text.
36/36 XCTest (4 new: self_vad rebase, mic_channel ranking + voiceprint storage).
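
A sketch of the chunk-local self_vad rebase: clip each session-time span to the chunk's slice, then shift it so the slice starts at 0. The `VadSpan` type and the parameter names are assumptions; only the `[{start,end}]` JSON shape comes from the notes.

```swift
import Foundation

// Hypothetical span type; encodes to {"start": ..., "end": ...}.
struct VadSpan: Codable, Equatable {
    var start: Double
    var end: Double
}

/// Clips session-time spans to [sliceStart, sliceEnd] and rebases them to
/// chunk-local time. Spans entirely outside the slice are dropped.
func rebasedSelfVad(_ spans: [VadSpan], sliceStart: Double, sliceEnd: Double) -> [VadSpan] {
    spans.compactMap { span -> VadSpan? in
        let start = max(span.start, sliceStart)
        let end = min(span.end, sliceEnd)
        guard end > start else { return nil }
        return VadSpan(start: start - sliceStart, end: end - sliceStart)
    }
}

/// JSON payload for the self_vad form field of one chunk.
func rebasedSelfVadData(_ spans: [VadSpan], sliceStart: Double, sliceEnd: Double) throws -> Data {
    try JSONEncoder().encode(rebasedSelfVad(spans, sliceStart: sliceStart, sliceEnd: sliceEnd))
}
```
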
Grant's insight, proven on real session audio: we capture self (mic) and others
(system) as separate tracks, then throw the separation away by mixing to mono — so
the backend has to re-guess who's who. Analysis of a real call showed the channels
are cleanly separated (envelope corr 0.015, NO echo); Caitlyn's 'Go Bitcoin' was
11.8x louder in system than mic, yet the mono mix + noisy visual named it 'Grant'.
ChannelSelfVAD marks self-speech as windows where the mic is active AND more than 1.5x
as loud as system (mic > 1.5 × system). Benefits: (1) self is identified by CHANNEL, not by the
on-screen name — set one name in Settings, no per-platform matching; (2) a remote
speaker (or room echo) can never be mislabeled as self. Computed at finalize from
the two finished WAVs; the live capture path is untouched. Falls back to mic-VAD if
tracks can't be read. SessionController feeds these spans to the backend timeline.
Validated on the real session: 16 self spans; 'Go Bitcoin' (72-74s) correctly
EXCLUDED, Grant's 49.9-53.3s / 62.6-64s correctly INCLUDED. 33/33 XCTest (5 new).
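
The channel rule can be sketched over per-window loudness values. The 1.5 ratio is from the notes; the per-window RMS inputs, the `activityFloor` value, and the function name are assumptions, and the real ChannelSelfVAD reads the two finished WAVs rather than precomputed arrays.

```swift
import Foundation

struct Span: Equatable {
    var start: Double
    var end: Double
}

/// Marks a window as self-speech when the mic is active (above `activityFloor`)
/// and louder than `ratio` times the system track, then merges consecutive
/// self windows into spans.
func selfSpans(micRMS: [Double], systemRMS: [Double], windowSeconds: Double,
               activityFloor: Double = 0.01, ratio: Double = 1.5) -> [Span] {
    var spans: [Span] = []
    var lastSelfIndex: Int? = nil
    let n = min(micRMS.count, systemRMS.count)
    for i in 0..<n {
        let isSelf = micRMS[i] > activityFloor && micRMS[i] > systemRMS[i] * ratio
        guard isSelf else { continue }
        let end = Double(i + 1) * windowSeconds
        if let last = lastSelfIndex, last == i - 1 {
            spans[spans.count - 1].end = end          // extend the running span
        } else {
            spans.append(Span(start: Double(i) * windowSeconds, end: end))
        }
        lastSelfIndex = i
    }
    return spans
}
```

A window where the remote voice dominates the system track (as with the 'Go Bitcoin' line) fails the ratio test even if some of it bleeds into the mic, so it is never marked as self.
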
Visual capture falls back to audio-only silently, so the user couldn't tell if
it attached on a real call. SessionInfo now carries visualSegmentCount (nil =
audio-only; a count = visual ran, with that many vision-detected speaker
segments), shown in the menu as '… · N visual segments' or '… · audio-only'.
Makes the pending live-call validation unambiguous.
Visual capture now runs alongside audio: on call start the session picks the
app's adapter, captures the call window on the SAME monotonic clock as the audio
(AudioRecorder.sharedT0Host), and on stop writes visual_timeline.json and hands
the backend the visual segments with mic-VAD self-spans merged. Any visual
failure (no adapter, no window, Screen Recording denied) leaves the session
recording audio-only — the proven path is never blocked or broken.
- CallDetector now emits DetectedCall{app, bundleID, windowID}: the exact
CGWindowID of the matched Meet browser window (nil for native apps, which fall back
to the largest window).
- VisualCapture wraps VisualObserver + AdapterRegistry, writes visual_timeline.json.
- AudioRecorder.sharedT0Host() exposes the shared t0 for frame alignment.
Hardened per a 3-lens adversarial review (concurrency / failure-isolation /
data-flow), all 6 confirmed findings fixed:
- P0 (critical): startVisual could adopt a stale capture into a DIFFERENT session
(cross-session SCStream leak + visual_timeline.json written to the wrong
folder). Adoption is now gated on session identity (same generation, same recorder
instance via ===, state still .recording) and fails closed: otherwise the stream
is cancelled.
- P1: observer captured the browser's largest window, not the detected Meet
window. Now targets the exact CGWindowID (pickWindowIndex, unit-tested),
largest-area only as fallback.
- P2: a startVisual orphaned by a concurrent stop could leak a stream on quit.
inFlightVisual is registered before the await and drained in prepareForTermination.
- P3: trailing visual gap/segment ends could exceed duration_sec. Clamped in
VisualCapture (clampSegments/clampGaps, unit-tested).
- P4: capture pixel size used NSScreen.main scale; now uses the scale of the
display actually hosting the window (OCR clarity on secondary displays).
- VisualObserver.stop() bounds stopCapture() with a 3s timeout (mirrors audio) so
a wedged stream can't hang finalization.
25/25 XCTest pass. Live validation on real calls still pending.
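
The P1 and P3 fixes are both small pure functions, which is what makes them unit-testable. A sketch under assumptions: the `WindowInfo` and `VisualSegment` shapes are hypothetical, and the real pickWindowIndex/clampSegments signatures may differ.

```swift
import Foundation

// Hypothetical window description; CGWindowID is a UInt32 on macOS.
struct WindowInfo {
    var id: UInt32
    var width: Double
    var height: Double
}

/// Prefers the exact detected window; falls back to the largest by area.
func pickWindowIndex(_ windows: [WindowInfo], target: UInt32?) -> Int? {
    if let target = target,
       let exact = windows.firstIndex(where: { $0.id == target }) {
        return exact
    }
    return windows.indices.max { a, b in
        windows[a].width * windows[a].height < windows[b].width * windows[b].height
    }
}

// Hypothetical visual segment shape.
struct VisualSegment: Equatable {
    var start: Double
    var end: Double
    var name: String
}

/// Clamps segments to [0, duration] and drops any that become empty.
func clampSegments(_ segs: [VisualSegment], duration: Double) -> [VisualSegment] {
    segs.compactMap { seg -> VisualSegment? in
        let start = max(0, seg.start)
        let end = min(duration, seg.end)
        guard end > start else { return nil }
        return VisualSegment(start: start, end: end, name: seg.name)
    }
}
```
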
AudioRecorder captures system audio (ScreenCaptureKit) + mic (AVAudioEngine) on a
single serial ioQueue, one shared monotonic t0, time-driven writers (pad gaps /
trim overlaps) so tracks stay aligned, and an energy mic-VAD for 'self' spans.
AudioMixer sums the aligned tracks into mixed_mono_16k.wav. SessionController
drives a serialized start/stop state machine, writes the session folder +
self_vad.json, exposes live level meters, and finalizes on quit.
Hardening from review: ioQueue single-domain (no races), stop() never hangs
(mic-first teardown + bounded stopCapture), layout-agnostic mic deep-copy,
discard-only video output to keep SCStream alive, VAD lockstep on committed
frames, stable signing team in project.yml, single-instance enforcement.