Files
ten31-transcripts/docs/04_BUILD_PLAN.md
T
Grant Gilliam b2ae3a62b9 Phase 0: menu-bar scaffold, permissions, backend health check
Native SwiftUI menu-bar app (LSUIElement, macOS 13+), generated from project.yml
via XcodeGen. Includes:
- PermissionsManager (Microphone / Screen Recording / Accessibility) + UI
- SparkControlHealth: GET /api/status over self-signed TLS (InsecureTrustDelegate)
- AppSettings persistence (host, TLS-skip, output folder, adapter toggles)
- Menu-bar panel + Settings, app sandbox & hardened runtime off (LAN tool)
2026-06-05 19:33:53 -05:00

6.7 KiB
Raw Blame History

Build Plan — Ten31 Transcripts

Companion to docs 0103. Phased plan for the Claude Code session, each phase with a demoable milestone. Build in order; the risky/novel work (visual adapters) is isolated for independent tuning. The SparkControl contract is now known (03_DATA_CONTRACTS.md), so Phase 5 wires the real endpoints.


Ground rules

  • Native Swift + SwiftUI, menu-bar app (LSUIElement = true). macOS 13.0+ (ScreenCaptureKit system audio).
  • Never write video frames to disk. Process in-memory, release immediately; assert this in VisualObserver review.
  • One shared monotonic t0 for audio + visual timestamps — wire first.
  • Backend deliverable = mixed_mono_16k.wav. Keep mic.wav/system.wav locally (mic = the user, and the VAD source for self-labeling).
  • SparkControl calls are SEQUENTIAL (one audio request in flight) over self-signed TLS, no auth, primary endpoint POST /api/audio/label-merge.
  • Every adapter must be testable offline against image fixtures.

Phase 0 — Scaffold, permissions, backend ping (milestone: launches + green)

  • Menu-bar shell; permissions onboarding (Screen Recording, Microphone, Accessibility) with status + deep links.
  • Settings: output folder, adapter toggles, SparkControl base URL + TLS-skip toggle.
  • GET /api/status health check shown in the menu.
  • Milestone: app sits in the menu bar; permissions green; backend reachable.

Phase 1 — Audio capture + mix (milestone: clean mixed-mono WAV)

  • AudioRecorder: SCK system audio → system.wav; AVAudioEngine mic → mic.wav; shared t0; produce mixed_mono_16k.wav.
  • Mic VAD → user "self" speaking spans (held for the timeline).
  • Manual start/stop for now.
  • Milestone: record any call manually → tracks line up, mixed-mono plays back clean, mic-VAD spans look right.

Phase 2 — Call detection (milestone: hands-free start/stop)

  • CallDetector: CoreAudio mic-running listener + NSWorkspace app/frontmost + Meet active-tab-URL check; debounce.
  • Wire detector → SessionControllerAudioRecorder.
  • Milestone: joining a call in any of the four apps auto-starts; leaving auto-stops; a quick mic test does not false-trigger.

Phase 3 — VisualObserver + Zoom adapter (milestone: visual_timeline.json)

  • VisualObserver: window-scoped SCStream @3 fps → frames → adapter → release. Background/occluded windows capture fine; detect minimized (and, for Meet, tab-switched) → pause analysis + record a visual_gap, never emit stale frames. Audio keeps recording through gaps.
  • AppAdapter protocol + ZoomAdapter (tile-border cue + OCR names; speaker & gallery layouts).
  • TimelineBuilder: hysteresis, overlap-allowed, merge mic-VAD self spans, alias normalization → visual_timeline.json (+ the flat timeline array form).
  • Offline fixtures: ~30 sample Zoom frames; unit-test/tune thresholds without a live call.
  • Milestone: a real Zoom call yields a visual_timeline.json whose segments visibly match who spoke (incl. the user from mic VAD).

Phase 4 — Remaining adapters (milestone: all four produce timelines)

  • TeamsAdapter, SignalAdapter (a11y names first), MeetAdapter (hybrid: a11y names + Vision cue). Fixtures per app; per-adapter fps/threshold config.
  • Milestone: each app yields a sensible timeline; adapters independently toggleable.

Phase 5 — SparkControl hand-off via label-merge (milestone: end-to-end named transcript)

  • SessionPackager: write session folder; if call > ~3 min, build the chunk plan (~23 min windows) with timeline sliced + rebased to chunk-local seconds.
  • SparkControlClient:
    • Self-signed TLS handling; multipart file; sequential requests.
    • Per chunk → POST /api/audio/label-merge with file, chunk timeline, known_voiceprints (from store), transcribe=true.
    • Handle 503 + Retry-After (wait, retry), 413, {"detail":...} errors.
    • Offset returned start_ms/end_ms back to global seconds; concatenate.
  • Assemble speakers.json (named, global, speaker-attributed transcript).
  • Failed sessions stay on disk with a "Resend" affordance.
  • Milestone: finish a call → mixed_mono_16k.wav + timeline delivered → a speakers.json with real names + accurate boundaries lands locally. Test a

    3-min call to exercise chunking + cross-chunk name unification.

Phase 6 — Voiceprint library (milestone: identity that compounds)

  • VoiceprintStore (voiceprints.json): load → known_voiceprints; on response, store/refresh fingerprint for visual/high-confidence speakers; skip Unknown_N.
  • Verify the three resolution paths against the backend: (a) visual name wins; (b) voiceprint recovery for a camera-off / OCR-missed speaker; (c) Unknown_N when neither matches.
  • Voiceprint manager in the UI (rename, delete, reset).
  • Milestone: a regular contact gets named on a later call with their camera off, purely from a stored voiceprint.

Phase 7 — Polish (milestone: daily-driver quality)

  • Recent-sessions list (open/resend/delete); alias-table editor.
  • Optional on-screen "recording" indicator toggle.
  • Adapter health: warn when detection confidence drops sharply (catches UI redesigns early); keep fixtures for quick re-tuning.
  • Login-item / launch-at-login.

Risk register

  1. UI fragility (highest ongoing cost): apps redesign speaking cues / name placement. Isolate all pixel/color/label logic in adapters; Phase-7 confidence-drop warning + fixtures keep re-tuning fast.
  2. Active-speaker lag / crosstalk: UI cue trails speech and flickers. Mitigated by treating visual as a prior (the backend votes), hysteresis, and the voiceprint fallback.
  3. Meet-in-browser: canvas video + DOM names; the hybrid adapter needs the most iteration — budget extra time.
  4. 4-speaker-per-chunk cap (Sortformer): within a single chunk, >4 simultaneous speakers can't all be separated. Chunking + voiceprints + the visual timeline mitigate across chunks; for rare large calls, note it as a known limit.
  5. Sequential GPU constraint: never fire parallel audio requests; queue them.
  6. Permission re-prompts on rebuild: stable signing identity.

Definition of done (v1)

Join a call in any of the four apps → app auto-records dual-track audio, mixes to mono, and builds a visual speaker timeline (incl. mic-VAD self spans) → on hang-up the audio + timeline go to label-merge (chunked + sequential as needed) → a speakers.json with real names and accurate boundaries is produced, the voiceprint library is updated, and no video is ever written to disk and no transcription/diarization/merge logic lives in this app.