ten31-transcripts/docs/04_BUILD_PLAN.md

# Build Plan — Ten31 Transcripts

Companion to docs 01–03. Phased plan for the Claude Code session, each phase with
a demoable milestone. Build in order; the risky/novel work (visual adapters) is
isolated for independent tuning. The SparkControl contract is now known
(`03_DATA_CONTRACTS.md`), so Phase 5 wires the real endpoints.

---

## Ground rules
- **Native Swift + SwiftUI**, menu-bar app (`LSUIElement = true`). macOS 13.0+
  (ScreenCaptureKit system audio).
- **Never write video frames to disk.** Process in-memory, release immediately;
  assert this in `VisualObserver` review.
- **One shared monotonic `t0`** for audio + visual timestamps — wire first.
- **Backend deliverable = `mixed_mono_16k.wav`.** Keep `mic.wav`/`system.wav`
  locally (mic = the user, and the VAD source for self-labeling).
- **SparkControl calls are SEQUENTIAL** (one audio request in flight) over
  self-signed TLS, **no auth**, primary endpoint `POST /api/audio/label-merge`.
- Every adapter must be testable offline against image fixtures.
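
The shared-`t0` rule can be sketched as a small value type that both capture paths consult. This is a minimal illustration, not the app's actual design: `SessionClock` and its members are hypothetical names, and mapping AVAudioEngine and ScreenCaptureKit timestamps onto the same monotonic reading is left out.

```swift
import Dispatch

/// Hypothetical helper: converts raw monotonic readings into seconds since a shared t0.
struct SessionClock {
    /// Monotonic reading captured once at session start, in nanoseconds.
    let t0Nanos: UInt64

    /// Seconds elapsed between t0 and a later monotonic reading.
    /// Readings earlier than t0 clamp to 0 instead of underflowing.
    func seconds(atNanos now: UInt64) -> Double {
        guard now >= t0Nanos else { return 0 }
        return Double(now - t0Nanos) / 1_000_000_000
    }
}

// Capture t0 once, then hand the same clock to the audio and visual pipelines.
let sessionClock = SessionClock(t0Nanos: DispatchTime.now().uptimeNanoseconds)
```

Passing the reading in as a parameter, instead of calling the system clock inside `seconds(atNanos:)`, keeps the conversion testable offline.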

---

## Phase 0 — Scaffold, permissions, backend ping (milestone: launches + green)
- Menu-bar shell; permissions onboarding (Screen Recording, Microphone,
  Accessibility) with status + deep links.
- Settings: output folder, adapter toggles, **SparkControl base URL** + TLS-skip
  toggle.
- `GET /api/status` health check shown in the menu.
- **Milestone:** app sits in the menu bar; permissions green; backend reachable.

## Phase 1 — Audio capture + mix (milestone: clean mixed-mono WAV)
- `AudioRecorder`: SCK system audio → `system.wav`; AVAudioEngine mic → `mic.wav`;
  shared `t0`; produce **`mixed_mono_16k.wav`**.
- Mic VAD → user "self" speaking spans (held for the timeline).
- Manual start/stop for now.
- **Milestone:** record any call manually → tracks line up, mixed-mono plays back
  clean, mic-VAD spans look right.
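
A rough sketch of the two Phase 1 computations, assuming both tracks have already been converted to mono `Float` samples at the same rate (resampling to 16 kHz is not shown). The energy-threshold VAD is a stand-in chosen for brevity; the plan does not name a VAD method, and `mixToMono`, `selfSpans`, and `SelfSpan` are hypothetical names.

```swift
/// A stretch of time (seconds from t0) in which the mic track carries speech.
struct SelfSpan: Equatable { var start: Double; var end: Double }

/// Mix two equal-rate mono tracks by averaging, which cannot exceed the input range.
/// The shorter track is treated as silence past its end.
func mixToMono(_ mic: [Float], _ system: [Float]) -> [Float] {
    let count = max(mic.count, system.count)
    return (0..<count).map { i in
        let m: Float = i < mic.count ? mic[i] : 0
        let s: Float = i < system.count ? system[i] : 0
        return (m + s) * 0.5
    }
}

/// Energy-threshold VAD: a frame counts as speech when its RMS exceeds `threshold`.
/// Consecutive speech frames are merged into one span.
func selfSpans(samples: [Float], sampleRate: Double,
               frameSize: Int, threshold: Float) -> [SelfSpan] {
    var spans: [SelfSpan] = []
    var openStart: Double? = nil
    var index = 0
    while index < samples.count {
        let end = min(index + frameSize, samples.count)
        let frame = samples[index..<end]
        let meanSquare = frame.reduce(Float(0)) { $0 + $1 * $1 } / Float(frame.count)
        let time = Double(index) / sampleRate
        if meanSquare.squareRoot() > threshold {
            if openStart == nil { openStart = time }
        } else if let start = openStart {
            spans.append(SelfSpan(start: start, end: time))
            openStart = nil
        }
        index = end
    }
    // Close a span that is still open when the track ends.
    if let start = openStart {
        spans.append(SelfSpan(start: start, end: Double(samples.count) / sampleRate))
    }
    return spans
}
```

The frame size and threshold would need tuning against real recordings; the values used to exercise this sketch are arbitrary.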

## Phase 2 — Call detection (milestone: hands-free start/stop)
- `CallDetector`: CoreAudio mic-running listener + `NSWorkspace` app/frontmost +
  Meet active-tab-URL check; debounce.
- Wire detector → `SessionController` → `AudioRecorder`.
- **Milestone:** joining a call in any of the four apps (Zoom, Teams, Signal,
  Meet) auto-starts; leaving auto-stops; a quick mic test does not false-trigger.
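
The debounce step might look like the sketch below: the raw "call looks active" signal has to hold for a minimum time before the reported state flips, which is what keeps a quick mic test from triggering a recording. `Debouncer` and `holdSeconds` are hypothetical names, and the hold time is an assumption to be tuned.

```swift
/// Hypothetical debouncer: the raw signal must differ from the stable state
/// for at least `holdSeconds` before the stable state flips.
struct Debouncer {
    let holdSeconds: Double
    private(set) var stable = false
    private var differingSince: Double? = nil

    init(holdSeconds: Double) { self.holdSeconds = holdSeconds }

    /// Feed the raw signal with its timestamp; returns the debounced state.
    mutating func update(raw: Bool, at time: Double) -> Bool {
        guard raw != stable else {
            differingSince = nil      // raw agrees again, so drop the pending flip
            return stable
        }
        let since = differingSince ?? time
        differingSince = since
        if time - since >= holdSeconds {
            stable = raw
            differingSince = nil
        }
        return stable
    }
}
```

The same type works in both directions, so a brief signal dropout during a call does not stop the recording either.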

## Phase 3 — VisualObserver + Zoom adapter (milestone: visual_timeline.json)
- `VisualObserver`: window-scoped `SCStream` at 3 fps → frames → adapter → release.
  Background/occluded windows capture fine; detect **minimized** (and, for Meet,
  **tab-switched**) → pause analysis + record a `visual_gap`, never emit stale
  frames. Audio keeps recording through gaps.
- `AppAdapter` protocol + `ZoomAdapter` (tile-border cue + OCR names; speaker &
  gallery layouts).
- `TimelineBuilder`: hysteresis, overlap-allowed, merge mic-VAD self spans, alias
  normalization → `visual_timeline.json` (+ the flat `timeline` array form).
- **Offline fixtures:** ~30 sample Zoom frames; unit-test/tune thresholds without
  a live call.
- **Milestone:** a real Zoom call yields a `visual_timeline.json` whose segments
  visibly match who spoke (incl. the user from mic VAD).
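
One way to read the hysteresis step in `TimelineBuilder`: a speaker's segment opens only after several consecutive frames show the speaking cue and closes only after several consecutive frames miss it. The sketch below assumes that reading; `CueSegment`, `cueSegments`, `onFrames`, and `offFrames` are hypothetical names. Running it once per speaker is what allows overlapping segments.

```swift
struct CueSegment: Equatable {
    let speaker: String
    let start: Double   // seconds from t0
    let end: Double
}

/// Turn one speaker's per-frame cue flags into segments with hysteresis.
func cueSegments(for speaker: String, cues: [Bool], fps: Double,
                 onFrames: Int, offFrames: Int) -> [CueSegment] {
    var result: [CueSegment] = []
    var hits = 0
    var misses = 0
    var openStart: Double? = nil
    for (i, cue) in cues.enumerated() {
        if cue { hits += 1; misses = 0 } else { misses += 1; hits = 0 }
        if openStart == nil, hits >= onFrames {
            // Backdate the start to the first frame of the qualifying run.
            openStart = Double(i + 1 - onFrames) / fps
        } else if let start = openStart, misses >= offFrames {
            // End the segment at the first missed frame of the closing run.
            let end = Double(i + 1 - offFrames) / fps
            result.append(CueSegment(speaker: speaker, start: start, end: end))
            openStart = nil
        }
    }
    // A segment still open at the end of the frame list runs to the last frame.
    if let start = openStart {
        result.append(CueSegment(speaker: speaker, start: start,
                                 end: Double(cues.count) / fps))
    }
    return result
}
```

A single missed frame inside a run of hits does not split the segment, which is the flicker suppression the fixtures would be used to tune.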

## Phase 4 — Remaining adapters (milestone: all four produce timelines)
- `TeamsAdapter`, `SignalAdapter` (a11y names first), `MeetAdapter` (hybrid: a11y
  names + Vision cue). Fixtures per app; per-adapter fps/threshold config.
- **Milestone:** each app yields a sensible timeline; adapters independently
  toggleable.
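
The plan names an `AppAdapter` protocol but not its shape, so the following is only a guess at what a fixture-testable, independently toggleable adapter interface could look like. Every member name here is hypothetical, and `FrameStub` stands in for whatever in-memory frame type the real pipeline passes around.

```swift
/// Placeholder for an in-memory frame; the real type would likely wrap a pixel buffer.
struct FrameStub { let width: Int; let height: Int }

struct SpeakerObservation: Equatable { let name: String; let confidence: Double }

/// Per-adapter settings: on/off toggle, capture rate, and cue threshold.
struct AdapterConfig {
    var enabled = true
    var fps = 3.0
    var cueThreshold = 0.5
}

protocol AppAdapter {
    var appName: String { get }
    var config: AdapterConfig { get set }
    /// Reports who appears to be speaking in this frame.
    /// Implementations must not retain or persist the frame.
    func activeSpeakers(in frame: FrameStub) -> [SpeakerObservation]
}

/// Test double that replays canned observations, as an offline fixture might.
struct FixtureAdapter: AppAdapter {
    let appName: String
    var config = AdapterConfig()
    let canned: [SpeakerObservation]

    func activeSpeakers(in frame: FrameStub) -> [SpeakerObservation] {
        canned.filter { $0.confidence >= config.cueThreshold }
    }
}

/// Pick the enabled adapter for the detected app, if there is one.
func adapter(for app: String, in adapters: [any AppAdapter]) -> (any AppAdapter)? {
    adapters.first { $0.appName == app && $0.config.enabled }
}
```

Keeping the threshold and fps in a per-adapter config struct means a redesign in one app can be re-tuned without touching the others.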

## Phase 5 — SparkControl hand-off via label-merge (milestone: end-to-end named transcript)
- `SessionPackager`: write session folder; if call > ~3 min, build the chunk plan
  (~2–3 min windows) with **timeline sliced + rebased to chunk-local seconds**.
- `SparkControlClient`:
  - Self-signed TLS handling; multipart `file`; **sequential** requests.
  - Per chunk → `POST /api/audio/label-merge` with `file`, chunk `timeline`,
    `known_voiceprints` (from store), `transcribe=true`.
  - Handle `503 + Retry-After` (wait, retry), `413`, `{"detail":...}` errors.
  - Convert returned `start_ms`/`end_ms` to seconds, add each chunk's global
    start offset, then concatenate.
- Assemble `speakers.json` (named, global, speaker-attributed transcript).
- Failed sessions stay on disk with a "Resend" affordance.
- **Milestone:** finish a call → `mixed_mono_16k.wav` + timeline delivered → a
  `speakers.json` with real names + accurate boundaries lands locally. Test a call
  longer than 3 min to exercise chunking + cross-chunk name unification.
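
The chunk arithmetic in this phase can be sketched as three small functions: build the chunk plan, slice and rebase the timeline to chunk-local seconds, and map returned milliseconds back to global seconds. The 180 s threshold and 150 s window are illustrative picks from the "~3 min" and "~2–3 min" ranges above, and all type and function names are hypothetical.

```swift
struct TimelineEntry: Equatable { let speaker: String; let start: Double; let end: Double }
struct AudioChunk: Equatable { let start: Double; let end: Double }

/// Split a call into fixed windows; calls at or under `threshold` seconds stay whole.
func chunkPlan(duration: Double, threshold: Double = 180,
               window: Double = 150) -> [AudioChunk] {
    guard duration > threshold else { return [AudioChunk(start: 0, end: duration)] }
    var chunks: [AudioChunk] = []
    var start = 0.0
    while start < duration {
        chunks.append(AudioChunk(start: start, end: min(start + window, duration)))
        start += window
    }
    return chunks
}

/// Keep entries that overlap the chunk, clip them to its bounds,
/// and rebase them to chunk-local seconds.
func slice(_ timeline: [TimelineEntry], to chunk: AudioChunk) -> [TimelineEntry] {
    timeline.compactMap { entry -> TimelineEntry? in
        let clippedStart = max(entry.start, chunk.start)
        let clippedEnd = min(entry.end, chunk.end)
        guard clippedStart < clippedEnd else { return nil }
        return TimelineEntry(speaker: entry.speaker,
                             start: clippedStart - chunk.start,
                             end: clippedEnd - chunk.start)
    }
}

/// Map a chunk-local millisecond offset from a response back to global seconds.
func globalSeconds(ms: Int, in chunk: AudioChunk) -> Double {
    chunk.start + Double(ms) / 1000
}
```

This fixed-window plan can leave a very short final chunk; whether to merge it into the previous one is a tuning decision the sketch does not make.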

## Phase 6 — Voiceprint library (milestone: identity that compounds)
- `VoiceprintStore` (`voiceprints.json`): load → `known_voiceprints`; on response,
  store/refresh `fingerprint` for `visual`/high-confidence speakers; skip
  `Unknown_N`.
- Verify the three resolution paths against the backend: (a) **visual** name wins;
  (b) **voiceprint** recovery for a camera-off / OCR-missed speaker; (c)
  **`Unknown_N`** when neither matches.
- Voiceprint manager in the UI (rename, delete, reset).
- **Milestone:** a regular contact gets named on a later call **with their camera
  off**, purely from a stored voiceprint.
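
The store/refresh rule could be expressed as a pure function over the response's speakers, sketched below. The response shape is an assumption: `source`, `confidence`, the `[Float]` fingerprint type, and the 0.8 cutoff are all illustrative, so check them against `03_DATA_CONTRACTS.md` before relying on this.

```swift
/// Hypothetical view of one resolved speaker from a label-merge response.
struct ResolvedSpeaker {
    let name: String
    let source: String        // assumed values such as "visual" or "voiceprint"
    let confidence: Double
    let fingerprint: [Float]  // assumed representation
}

/// Store or refresh when the name came from the visual timeline or the match is
/// high-confidence; never store placeholder `Unknown_N` speakers.
func shouldStore(_ speaker: ResolvedSpeaker, minConfidence: Double = 0.8) -> Bool {
    if speaker.name.hasPrefix("Unknown_") { return false }
    return speaker.source == "visual" || speaker.confidence >= minConfidence
}

/// Returns a copy of the store with qualifying fingerprints added or refreshed.
func updatedStore(_ store: [String: [Float]],
                  with speakers: [ResolvedSpeaker]) -> [String: [Float]] {
    var next = store
    for speaker in speakers where shouldStore(speaker) {
        next[speaker.name] = speaker.fingerprint
    }
    return next
}
```

Returning a new dictionary instead of mutating in place makes the rule easy to unit-test before it is wired to `voiceprints.json`.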

## Phase 7 — Polish (milestone: daily-driver quality)
- Recent-sessions list (open/resend/delete); alias-table editor.
- Optional on-screen "recording" indicator toggle.
- Adapter health: warn when detection confidence drops sharply (catches UI
  redesigns early); keep fixtures for quick re-tuning.
- Login-item / launch-at-login.
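
The adapter-health warning might be implemented by comparing a short recent window of detection confidence against the longer history before it. The sketch below assumes that approach; `AdapterHealthMonitor`, `recentWindow`, and `dropRatio` are hypothetical, and the history is unbounded here for simplicity.

```swift
/// Hypothetical monitor that flags a sharp relative drop in detection confidence.
struct AdapterHealthMonitor {
    let recentWindow: Int
    let dropRatio: Double
    private var history: [Double] = []

    init(recentWindow: Int, dropRatio: Double) {
        self.recentWindow = recentWindow
        self.dropRatio = dropRatio
    }

    mutating func record(_ confidence: Double) { history.append(confidence) }

    /// True when the mean of the last `recentWindow` values falls below
    /// `dropRatio` times the mean of everything recorded before them.
    var isDegraded: Bool {
        guard recentWindow > 0, history.count > recentWindow else { return false }
        let recent = history.suffix(recentWindow)
        let baseline = history.dropLast(recentWindow)
        let recentMean = recent.reduce(0, +) / Double(recent.count)
        let baselineMean = baseline.reduce(0, +) / Double(baseline.count)
        return recentMean < baselineMean * dropRatio
    }
}
```

A relative threshold means an adapter that always runs at modest confidence does not warn constantly; only a change against its own baseline does.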

---

## Risk register
1. **UI fragility (highest ongoing cost):** apps redesign speaking cues / name
   placement. Isolate all pixel/color/label logic in adapters; Phase-7
   confidence-drop warning + fixtures keep re-tuning fast.
2. **Active-speaker lag / crosstalk:** UI cue trails speech and flickers. Mitigated
   by treating visual as a *prior* (the backend votes), hysteresis, and the
   voiceprint fallback.
3. **Meet-in-browser:** canvas video + DOM names; the hybrid adapter needs the most
   iteration — budget extra time.
4. **4-speaker-per-chunk cap (Sortformer):** a single chunk with more than four
   distinct speakers can't have them all separated. Chunking + voiceprints + the
   visual timeline mitigate across chunks; for rare large calls, note it as a
   known limit.
5. **Sequential GPU constraint:** never fire parallel audio requests; queue them.
6. **Permission re-prompts on rebuild:** macOS ties permission grants to the
   app's code signature, so sign every build with one stable signing identity.
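
For the sequential constraint in item 5, a sketch of a strictly one-at-a-time driver with `503 + Retry-After` handling follows. The networking is injected as closures so the logic stays testable offline. All names are hypothetical, only the delta-seconds form of `Retry-After` is parsed, and the 5 s default and 3-attempt cap are assumptions.

```swift
import Foundation

enum UploadOutcome: Equatable {
    case success
    case retry(afterSeconds: Int)
    case payloadTooLarge
    case failed(detail: String)
}

/// Map an HTTP status (plus optional header and error detail) to an outcome.
func classify(status: Int, retryAfter: String?, detail: String?,
              defaultRetry: Int = 5) -> UploadOutcome {
    switch status {
    case 200..<300:
        return .success
    case 503:
        // An HTTP-date or unparsable value falls back to the default wait.
        let parsed = retryAfter.flatMap { Int($0.trimmingCharacters(in: .whitespaces)) }
        return .retry(afterSeconds: max(0, parsed ?? defaultRetry))
    case 413:
        return .payloadTooLarge
    default:
        return .failed(detail: detail ?? "HTTP \(status)")
    }
}

/// Send chunks strictly one at a time, waiting and retrying on `.retry`.
/// `send` performs one request for a chunk index; `wait` sleeps for N seconds.
func sendAll(chunkCount: Int, maxAttempts: Int = 3,
             send: (Int) -> UploadOutcome,
             wait: (Int) -> Void) -> [UploadOutcome] {
    var results: [UploadOutcome] = []
    for index in 0..<chunkCount {
        var outcome = send(index)
        var attempts = 1
        while case .retry(let seconds) = outcome, attempts < maxAttempts {
            wait(seconds)
            outcome = send(index)
            attempts += 1
        }
        results.append(outcome)
    }
    return results
}
```

Because the loop does not start chunk `n + 1` until chunk `n` has a final outcome, at most one audio request is ever in flight.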

## Definition of done (v1)
Join a call in any of the four apps → the app auto-records dual-track audio, mixes
to mono, and builds a visual speaker timeline (incl. mic-VAD self spans) → on
hang-up the audio + timeline go to `label-merge` (chunked + sequential as needed)
→ a `speakers.json` with **real names and accurate boundaries** is produced and the
**voiceprint library is updated**. Throughout, **no video is ever written to
disk** and **no transcription/diarization/merge logic lives in this app.**