# Build Plan — Ten31 Transcripts > **Status: COMPLETE (historical).** Phases 0–6 shipped and the app is in daily > use; a recap phase (transcript + HTML recap via the backend LLM) was added after > this plan was written. Kept as the original build log and as the map for the > "Phase N" references in the code comments. Forward-looking work lives in > `ROADMAP.md`; current status in `AGENTS.md`. Companion to docs 01–03. Phased plan for the Claude Code session, each phase with a demoable milestone. Build in order; the risky/novel work (visual adapters) is isolated for independent tuning. The SparkControl contract is now known (`03_DATA_CONTRACTS.md`), so Phase 5 wires the real endpoints. --- ## Ground rules - **Native Swift + SwiftUI**, menu-bar app (`LSUIElement = true`). macOS 13.0+ (ScreenCaptureKit system audio). - **Never write video frames to disk.** Process in-memory, release immediately; assert this in `VisualObserver` review. - **One shared monotonic `t0`** for audio + visual timestamps — wire first. - **Backend deliverable = `mixed_mono_16k.wav`.** Keep `mic.wav`/`system.wav` locally (mic = the user, and the VAD source for self-labeling). - **SparkControl calls are SEQUENTIAL** (one audio request in flight) over self-signed TLS, **no auth**, primary endpoint `POST /api/audio/label-merge`. - Every adapter must be testable offline against image fixtures. --- ## Phase 0 — Scaffold, permissions, backend ping (milestone: launches + green) - Menu-bar shell; permissions onboarding (Screen Recording, Microphone, Accessibility) with status + deep links. - Settings: output folder, adapter toggles, **SparkControl base URL** + TLS-skip toggle. - `GET /api/status` health check shown in the menu. - **Milestone:** app sits in the menu bar; permissions green; backend reachable. ## Phase 1 — Audio capture + mix (milestone: clean mixed-mono WAV) - `AudioRecorder`: SCK system audio → `system.wav`; AVAudioEngine mic → `mic.wav`; shared `t0`; produce **`mixed_mono_16k.wav`**. - Mic VAD → user "self" speaking spans (held for the timeline). - Manual start/stop for now. - **Milestone:** record any call manually → tracks line up, mixed-mono plays back clean, mic-VAD spans look right. ## Phase 2 — Call detection (milestone: hands-free start/stop) - `CallDetector`: CoreAudio mic-running listener + `NSWorkspace` app/frontmost + Meet active-tab-URL check; debounce. - Wire detector → `SessionController` → `AudioRecorder`. - **Milestone:** joining a call in any of the four apps auto-starts; leaving auto-stops; a quick mic test does not false-trigger. ## Phase 3 — VisualObserver + Zoom adapter (milestone: visual_timeline.json) - `VisualObserver`: window-scoped `SCStream` @3 fps → frames → adapter → release. Background/occluded windows capture fine; detect **minimized** (and, for Meet, **tab-switched**) → pause analysis + record a `visual_gap`, never emit stale frames. Audio keeps recording through gaps. - `AppAdapter` protocol + `ZoomAdapter` (tile-border cue + OCR names; speaker & gallery layouts). - `TimelineBuilder`: hysteresis, overlap-allowed, merge mic-VAD self spans, alias normalization → `visual_timeline.json` (+ the flat `timeline` array form). - **Offline fixtures:** ~30 sample Zoom frames; unit-test/tune thresholds without a live call. - **Milestone:** a real Zoom call yields a `visual_timeline.json` whose segments visibly match who spoke (incl. the user from mic VAD). ## Phase 4 — Remaining adapters (milestone: all four produce timelines) - `TeamsAdapter`, `SignalAdapter` (a11y names first), `MeetAdapter` (hybrid: a11y names + Vision cue). Fixtures per app; per-adapter fps/threshold config. - **Milestone:** each app yields a sensible timeline; adapters independently toggleable. ## Phase 5 — SparkControl hand-off via label-merge (milestone: end-to-end named transcript) - `SessionPackager`: write session folder; if call > ~3 min, build the chunk plan (~2–3 min windows) with **timeline sliced + rebased to chunk-local seconds**. - `SparkControlClient`: - Self-signed TLS handling; multipart `file`; **sequential** requests. - Per chunk → `POST /api/audio/label-merge` with `file`, chunk `timeline`, `known_voiceprints` (from store), `transcribe=true`. - Handle `503 + Retry-After` (wait, retry), `413`, `{"detail":...}` errors. - Offset returned `start_ms/end_ms` back to global seconds; concatenate. - Assemble `speakers.json` (named, global, speaker-attributed transcript). - Failed sessions stay on disk with a "Resend" affordance. - **Milestone:** finish a call → `mixed_mono_16k.wav` + timeline delivered → a `speakers.json` with real names + accurate boundaries lands locally. Test a >3-min call to exercise chunking + cross-chunk name unification. ## Phase 6 — Voiceprint library (milestone: identity that compounds) - `VoiceprintStore` (`voiceprints.json`): load → `known_voiceprints`; on response, store/refresh `fingerprint` for `visual`/high-confidence speakers; skip `Unknown_N`. - Verify the three resolution paths against the backend: (a) **visual** name wins; (b) **voiceprint** recovery for a camera-off / OCR-missed speaker; (c) **`Unknown_N`** when neither matches. - Voiceprint manager in the UI (rename, delete, reset). - **Milestone:** a regular contact gets named on a later call **with their camera off**, purely from a stored voiceprint. ## Phase 7 — Polish (milestone: daily-driver quality) - Recent-sessions list (open/resend/delete); alias-table editor. - Optional on-screen "recording" indicator toggle. - Adapter health: warn when detection confidence drops sharply (catches UI redesigns early); keep fixtures for quick re-tuning. - Login-item / launch-at-login. --- ## Risk register 1. **UI fragility (highest ongoing cost):** apps redesign speaking cues / name placement. Isolate all pixel/color/label logic in adapters; Phase-7 confidence-drop warning + fixtures keep re-tuning fast. 2. **Active-speaker lag / crosstalk:** UI cue trails speech and flickers. Mitigated by treating visual as a *prior* (the backend votes), hysteresis, and the voiceprint fallback. 3. **Meet-in-browser:** canvas video + DOM names; the hybrid adapter needs the most iteration — budget extra time. 4. **4-speaker-per-chunk cap (Sortformer):** within a single chunk, >4 simultaneous speakers can't all be separated. Chunking + voiceprints + the visual timeline mitigate across chunks; for rare large calls, note it as a known limit. 5. **Sequential GPU constraint:** never fire parallel audio requests; queue them. 6. **Permission re-prompts on rebuild:** stable signing identity. ## Definition of done (v1) Join a call in any of the four apps → app auto-records dual-track audio, mixes to mono, and builds a visual speaker timeline (incl. mic-VAD self spans) → on hang-up the audio + timeline go to `label-merge` (chunked + sequential as needed) → a `speakers.json` with **real names and accurate boundaries** is produced, the **voiceprint library is updated**, and **no video is ever written to disk** and **no transcription/diarization/merge logic lives in this app.**