Phase 0: menu-bar scaffold, permissions, backend health check

Native SwiftUI menu-bar app (LSUIElement, macOS 13+), generated from project.yml via XcodeGen. Includes: - PermissionsManager (Microphone / Screen Recording / Accessibility) + UI - SparkControlHealth: GET /api/status over self-signed TLS (InsecureTrustDelegate) - AppSettings persistence (host, TLS-skip, output folder, adapter toggles) - Menu-bar panel + Settings, app sandbox & hardened runtime off (LAN tool)
2026-06-05 19:33:53 -05:00
commit b2ae3a62b9
19 changed files with 1448 additions and 0 deletions
@@ -0,0 +1,123 @@
+# Build Plan — Ten31 Transcripts
+
+Companion to docs 01–03. Phased plan for the Claude Code session, each phase with
+a demoable milestone. Build in order; the risky/novel work (visual adapters) is
+isolated for independent tuning. The SparkControl contract is now known
+(`03_DATA_CONTRACTS.md`), so Phase 5 wires the real endpoints.
+
+---
+
+## Ground rules
+- **Native Swift + SwiftUI**, menu-bar app (`LSUIElement = true`). macOS 13.0+
+  (ScreenCaptureKit system audio).
+- **Never write video frames to disk.** Process in-memory, release immediately;
+  assert this in `VisualObserver` review.
+- **One shared monotonic `t0`** for audio + visual timestamps — wire first.
+- **Backend deliverable = `mixed_mono_16k.wav`.** Keep `mic.wav`/`system.wav`
+  locally (mic = the user, and the VAD source for self-labeling).
+- **SparkControl calls are SEQUENTIAL** (one audio request in flight) over
+  self-signed TLS, **no auth**, primary endpoint `POST /api/audio/label-merge`.
+- Every adapter must be testable offline against image fixtures.
+
+---
+
+## Phase 0 — Scaffold, permissions, backend ping (milestone: launches + green)
+- Menu-bar shell; permissions onboarding (Screen Recording, Microphone,
+  Accessibility) with status + deep links.
+- Settings: output folder, adapter toggles, **SparkControl base URL** + TLS-skip
+  toggle.
+- `GET /api/status` health check shown in the menu.
+- **Milestone:** app sits in the menu bar; permissions green; backend reachable.
+
+## Phase 1 — Audio capture + mix (milestone: clean mixed-mono WAV)
+- `AudioRecorder`: SCK system audio → `system.wav`; AVAudioEngine mic → `mic.wav`;
+  shared `t0`; produce **`mixed_mono_16k.wav`**.
+- Mic VAD → user "self" speaking spans (held for the timeline).
+- Manual start/stop for now.
+- **Milestone:** record any call manually → tracks line up, mixed-mono plays back
+  clean, mic-VAD spans look right.
+
+## Phase 2 — Call detection (milestone: hands-free start/stop)
+- `CallDetector`: CoreAudio mic-running listener + `NSWorkspace` app/frontmost +
+  Meet active-tab-URL check; debounce.
+- Wire detector → `SessionController` → `AudioRecorder`.
+- **Milestone:** joining a call in any of the four apps auto-starts; leaving
+  auto-stops; a quick mic test does not false-trigger.
+
+## Phase 3 — VisualObserver + Zoom adapter (milestone: visual_timeline.json)
+- `VisualObserver`: window-scoped `SCStream` @3 fps → frames → adapter → release.
+  Background/occluded windows capture fine; detect **minimized** (and, for Meet,
+  **tab-switched**) → pause analysis + record a `visual_gap`, never emit stale
+  frames. Audio keeps recording through gaps.
+- `AppAdapter` protocol + `ZoomAdapter` (tile-border cue + OCR names; speaker &
+  gallery layouts).
+- `TimelineBuilder`: hysteresis, overlap-allowed, merge mic-VAD self spans, alias
+  normalization → `visual_timeline.json` (+ the flat `timeline` array form).
+- **Offline fixtures:** ~30 sample Zoom frames; unit-test/tune thresholds without
+  a live call.
+- **Milestone:** a real Zoom call yields a `visual_timeline.json` whose segments
+  visibly match who spoke (incl. the user from mic VAD).
+
+## Phase 4 — Remaining adapters (milestone: all four produce timelines)
+- `TeamsAdapter`, `SignalAdapter` (a11y names first), `MeetAdapter` (hybrid: a11y
+  names + Vision cue). Fixtures per app; per-adapter fps/threshold config.
+- **Milestone:** each app yields a sensible timeline; adapters independently
+  toggleable.
+
+## Phase 5 — SparkControl hand-off via label-merge (milestone: end-to-end named transcript)
+- `SessionPackager`: write session folder; if call > ~3 min, build the chunk plan
+  (~2–3 min windows) with **timeline sliced + rebased to chunk-local seconds**.
+- `SparkControlClient`:
+  - Self-signed TLS handling; multipart `file`; **sequential** requests.
+  - Per chunk → `POST /api/audio/label-merge` with `file`, chunk `timeline`,
+    `known_voiceprints` (from store), `transcribe=true`.
+  - Handle `503 + Retry-After` (wait, retry), `413`, `{"detail":...}` errors.
+  - Offset returned `start_ms/end_ms` back to global seconds; concatenate.
+- Assemble `speakers.json` (named, global, speaker-attributed transcript).
+- Failed sessions stay on disk with a "Resend" affordance.
+- **Milestone:** finish a call → `mixed_mono_16k.wav` + timeline delivered → a
+  `speakers.json` with real names + accurate boundaries lands locally. Test a
+  >3-min call to exercise chunking + cross-chunk name unification.
+
+## Phase 6 — Voiceprint library (milestone: identity that compounds)
+- `VoiceprintStore` (`voiceprints.json`): load → `known_voiceprints`; on response,
+  store/refresh `fingerprint` for `visual`/high-confidence speakers; skip
+  `Unknown_N`.
+- Verify the three resolution paths against the backend: (a) **visual** name wins;
+  (b) **voiceprint** recovery for a camera-off / OCR-missed speaker; (c)
+  **`Unknown_N`** when neither matches.
+- Voiceprint manager in the UI (rename, delete, reset).
+- **Milestone:** a regular contact gets named on a later call **with their camera
+  off**, purely from a stored voiceprint.
+
+## Phase 7 — Polish (milestone: daily-driver quality)
+- Recent-sessions list (open/resend/delete); alias-table editor.
+- Optional on-screen "recording" indicator toggle.
+- Adapter health: warn when detection confidence drops sharply (catches UI
+  redesigns early); keep fixtures for quick re-tuning.
+- Login-item / launch-at-login.
+
+---
+
+## Risk register
+1. **UI fragility (highest ongoing cost):** apps redesign speaking cues / name
+   placement. Isolate all pixel/color/label logic in adapters; Phase-7
+   confidence-drop warning + fixtures keep re-tuning fast.
+2. **Active-speaker lag / crosstalk:** UI cue trails speech and flickers. Mitigated
+   by treating visual as a *prior* (the backend votes), hysteresis, and the
+   voiceprint fallback.
+3. **Meet-in-browser:** canvas video + DOM names; the hybrid adapter needs the most
+   iteration — budget extra time.
+4. **4-speaker-per-chunk cap (Sortformer):** within a single chunk, >4 simultaneous
+   speakers can't all be separated. Chunking + voiceprints + the visual timeline
+   mitigate across chunks; for rare large calls, note it as a known limit.
+5. **Sequential GPU constraint:** never fire parallel audio requests; queue them.
+6. **Permission re-prompts on rebuild:** stable signing identity.
+
+## Definition of done (v1)
+Join a call in any of the four apps → app auto-records dual-track audio, mixes to
+mono, and builds a visual speaker timeline (incl. mic-VAD self spans) → on
+hang-up the audio + timeline go to `label-merge` (chunked + sequential as needed)
+→ a `speakers.json` with **real names and accurate boundaries** is produced, the
+**voiceprint library is updated**, and **no video is ever written to disk** and
+**no transcription/diarization/merge logic lives in this app.**