Document the dual-channel label-merge path (mic_file/system_file/self_name/self_vad) and the recap phase (transcript.md + recap.html via the backend LLM) across docs/01-03; correct docs/02 $2.10 to the UI actually shipped; mark docs/01 $7 open items as settled; remove the dead AUDIO_API.md references; note the manifest sha256 fields are not emitted; mark docs/04 as a complete/historical build log. Also drop the last stale "Phase 0" UI string in MenuBarView and retire the now-done doc-debt items in ROADMAP.
7.0 KiB
Build Plan — Ten31 Transcripts
Status: COMPLETE (historical). Phases 0–6 shipped and the app is in daily use; a recap phase (transcript + HTML recap via the backend LLM) was added after this plan was written. Kept as the original build log and as the map for the "Phase N" references in the code comments. Forward-looking work lives in
ROADMAP.md; current status inAGENTS.md.
Companion to docs 01–03. Phased plan for the Claude Code session, each phase with
a demoable milestone. Build in order; the risky/novel work (visual adapters) is
isolated for independent tuning. The SparkControl contract is now known
(03_DATA_CONTRACTS.md), so Phase 5 wires the real endpoints.
Ground rules
- Native Swift + SwiftUI, menu-bar app (
LSUIElement = true). macOS 13.0+ (ScreenCaptureKit system audio). - Never write video frames to disk. Process in-memory, release immediately;
assert this in
VisualObserverreview. - One shared monotonic
t0for audio + visual timestamps — wire first. - Backend deliverable =
mixed_mono_16k.wav. Keepmic.wav/system.wavlocally (mic = the user, and the VAD source for self-labeling). - SparkControl calls are SEQUENTIAL (one audio request in flight) over
self-signed TLS, no auth, primary endpoint
POST /api/audio/label-merge. - Every adapter must be testable offline against image fixtures.
Phase 0 — Scaffold, permissions, backend ping (milestone: launches + green)
- Menu-bar shell; permissions onboarding (Screen Recording, Microphone, Accessibility) with status + deep links.
- Settings: output folder, adapter toggles, SparkControl base URL + TLS-skip toggle.
GET /api/statushealth check shown in the menu.- Milestone: app sits in the menu bar; permissions green; backend reachable.
Phase 1 — Audio capture + mix (milestone: clean mixed-mono WAV)
AudioRecorder: SCK system audio →system.wav; AVAudioEngine mic →mic.wav; sharedt0; producemixed_mono_16k.wav.- Mic VAD → user "self" speaking spans (held for the timeline).
- Manual start/stop for now.
- Milestone: record any call manually → tracks line up, mixed-mono plays back clean, mic-VAD spans look right.
Phase 2 — Call detection (milestone: hands-free start/stop)
CallDetector: CoreAudio mic-running listener +NSWorkspaceapp/frontmost + Meet active-tab-URL check; debounce.- Wire detector →
SessionController→AudioRecorder. - Milestone: joining a call in any of the four apps auto-starts; leaving auto-stops; a quick mic test does not false-trigger.
Phase 3 — VisualObserver + Zoom adapter (milestone: visual_timeline.json)
VisualObserver: window-scopedSCStream@3 fps → frames → adapter → release. Background/occluded windows capture fine; detect minimized (and, for Meet, tab-switched) → pause analysis + record avisual_gap, never emit stale frames. Audio keeps recording through gaps.AppAdapterprotocol +ZoomAdapter(tile-border cue + OCR names; speaker & gallery layouts).TimelineBuilder: hysteresis, overlap-allowed, merge mic-VAD self spans, alias normalization →visual_timeline.json(+ the flattimelinearray form).- Offline fixtures: ~30 sample Zoom frames; unit-test/tune thresholds without a live call.
- Milestone: a real Zoom call yields a
visual_timeline.jsonwhose segments visibly match who spoke (incl. the user from mic VAD).
Phase 4 — Remaining adapters (milestone: all four produce timelines)
TeamsAdapter,SignalAdapter(a11y names first),MeetAdapter(hybrid: a11y names + Vision cue). Fixtures per app; per-adapter fps/threshold config.- Milestone: each app yields a sensible timeline; adapters independently toggleable.
Phase 5 — SparkControl hand-off via label-merge (milestone: end-to-end named transcript)
SessionPackager: write session folder; if call > ~3 min, build the chunk plan (~2–3 min windows) with timeline sliced + rebased to chunk-local seconds.SparkControlClient:- Self-signed TLS handling; multipart
file; sequential requests. - Per chunk →
POST /api/audio/label-mergewithfile, chunktimeline,known_voiceprints(from store),transcribe=true. - Handle
503 + Retry-After(wait, retry),413,{"detail":...}errors. - Offset returned
start_ms/end_msback to global seconds; concatenate.
- Self-signed TLS handling; multipart
- Assemble
speakers.json(named, global, speaker-attributed transcript). - Failed sessions stay on disk with a "Resend" affordance.
- Milestone: finish a call →
mixed_mono_16k.wav+ timeline delivered → aspeakers.jsonwith real names + accurate boundaries lands locally. Test a3-min call to exercise chunking + cross-chunk name unification.
Phase 6 — Voiceprint library (milestone: identity that compounds)
VoiceprintStore(voiceprints.json): load →known_voiceprints; on response, store/refreshfingerprintforvisual/high-confidence speakers; skipUnknown_N.- Verify the three resolution paths against the backend: (a) visual name wins;
(b) voiceprint recovery for a camera-off / OCR-missed speaker; (c)
Unknown_Nwhen neither matches. - Voiceprint manager in the UI (rename, delete, reset).
- Milestone: a regular contact gets named on a later call with their camera off, purely from a stored voiceprint.
Phase 7 — Polish (milestone: daily-driver quality)
- Recent-sessions list (open/resend/delete); alias-table editor.
- Optional on-screen "recording" indicator toggle.
- Adapter health: warn when detection confidence drops sharply (catches UI redesigns early); keep fixtures for quick re-tuning.
- Login-item / launch-at-login.
Risk register
- UI fragility (highest ongoing cost): apps redesign speaking cues / name placement. Isolate all pixel/color/label logic in adapters; Phase-7 confidence-drop warning + fixtures keep re-tuning fast.
- Active-speaker lag / crosstalk: UI cue trails speech and flickers. Mitigated by treating visual as a prior (the backend votes), hysteresis, and the voiceprint fallback.
- Meet-in-browser: canvas video + DOM names; the hybrid adapter needs the most iteration — budget extra time.
- 4-speaker-per-chunk cap (Sortformer): within a single chunk, >4 simultaneous speakers can't all be separated. Chunking + voiceprints + the visual timeline mitigate across chunks; for rare large calls, note it as a known limit.
- Sequential GPU constraint: never fire parallel audio requests; queue them.
- Permission re-prompts on rebuild: stable signing identity.
Definition of done (v1)
Join a call in any of the four apps → app auto-records dual-track audio, mixes to
mono, and builds a visual speaker timeline (incl. mic-VAD self spans) → on
hang-up the audio + timeline go to label-merge (chunked + sequential as needed)
→ a speakers.json with real names and accurate boundaries is produced, the
voiceprint library is updated, and no video is ever written to disk and
no transcription/diarization/merge logic lives in this app.