Files
Grant Gilliam dda4322de7 Reconcile docs/ specs with the shipped app
Document the dual-channel label-merge path (mic_file/system_file/self_name/self_vad) and the recap phase (transcript.md + recap.html via the backend LLM) across docs/01-03; correct docs/02 $2.10 to the UI actually shipped; mark docs/01 $7 open items as settled; remove the dead AUDIO_API.md references; note the manifest sha256 fields are not emitted; mark docs/04 as a complete/historical build log. Also drop the last stale "Phase 0" UI string in MenuBarView and retire the now-done doc-debt items in ROADMAP.
2026-06-16 22:09:04 -05:00

7.0 KiB
Raw Permalink Blame History

Build Plan — Ten31 Transcripts

Status: COMPLETE (historical). Phases 06 shipped and the app is in daily use; a recap phase (transcript + HTML recap via the backend LLM) was added after this plan was written. Kept as the original build log and as the map for the "Phase N" references in the code comments. Forward-looking work lives in ROADMAP.md; current status in AGENTS.md.

Companion to docs 0103. Phased plan for the Claude Code session, each phase with a demoable milestone. Build in order; the risky/novel work (visual adapters) is isolated for independent tuning. The SparkControl contract is now known (03_DATA_CONTRACTS.md), so Phase 5 wires the real endpoints.


Ground rules

  • Native Swift + SwiftUI, menu-bar app (LSUIElement = true). macOS 13.0+ (ScreenCaptureKit system audio).
  • Never write video frames to disk. Process in-memory, release immediately; assert this in VisualObserver review.
  • One shared monotonic t0 for audio + visual timestamps — wire first.
  • Backend deliverable = mixed_mono_16k.wav. Keep mic.wav/system.wav locally (mic = the user, and the VAD source for self-labeling).
  • SparkControl calls are SEQUENTIAL (one audio request in flight) over self-signed TLS, no auth, primary endpoint POST /api/audio/label-merge.
  • Every adapter must be testable offline against image fixtures.

Phase 0 — Scaffold, permissions, backend ping (milestone: launches + green)

  • Menu-bar shell; permissions onboarding (Screen Recording, Microphone, Accessibility) with status + deep links.
  • Settings: output folder, adapter toggles, SparkControl base URL + TLS-skip toggle.
  • GET /api/status health check shown in the menu.
  • Milestone: app sits in the menu bar; permissions green; backend reachable.

Phase 1 — Audio capture + mix (milestone: clean mixed-mono WAV)

  • AudioRecorder: SCK system audio → system.wav; AVAudioEngine mic → mic.wav; shared t0; produce mixed_mono_16k.wav.
  • Mic VAD → user "self" speaking spans (held for the timeline).
  • Manual start/stop for now.
  • Milestone: record any call manually → tracks line up, mixed-mono plays back clean, mic-VAD spans look right.

Phase 2 — Call detection (milestone: hands-free start/stop)

  • CallDetector: CoreAudio mic-running listener + NSWorkspace app/frontmost + Meet active-tab-URL check; debounce.
  • Wire detector → SessionControllerAudioRecorder.
  • Milestone: joining a call in any of the four apps auto-starts; leaving auto-stops; a quick mic test does not false-trigger.

Phase 3 — VisualObserver + Zoom adapter (milestone: visual_timeline.json)

  • VisualObserver: window-scoped SCStream @3 fps → frames → adapter → release. Background/occluded windows capture fine; detect minimized (and, for Meet, tab-switched) → pause analysis + record a visual_gap, never emit stale frames. Audio keeps recording through gaps.
  • AppAdapter protocol + ZoomAdapter (tile-border cue + OCR names; speaker & gallery layouts).
  • TimelineBuilder: hysteresis, overlap-allowed, merge mic-VAD self spans, alias normalization → visual_timeline.json (+ the flat timeline array form).
  • Offline fixtures: ~30 sample Zoom frames; unit-test/tune thresholds without a live call.
  • Milestone: a real Zoom call yields a visual_timeline.json whose segments visibly match who spoke (incl. the user from mic VAD).

Phase 4 — Remaining adapters (milestone: all four produce timelines)

  • TeamsAdapter, SignalAdapter (a11y names first), MeetAdapter (hybrid: a11y names + Vision cue). Fixtures per app; per-adapter fps/threshold config.
  • Milestone: each app yields a sensible timeline; adapters independently toggleable.

Phase 5 — SparkControl hand-off via label-merge (milestone: end-to-end named transcript)

  • SessionPackager: write session folder; if call > ~3 min, build the chunk plan (~23 min windows) with timeline sliced + rebased to chunk-local seconds.
  • SparkControlClient:
    • Self-signed TLS handling; multipart file; sequential requests.
    • Per chunk → POST /api/audio/label-merge with file, chunk timeline, known_voiceprints (from store), transcribe=true.
    • Handle 503 + Retry-After (wait, retry), 413, {"detail":...} errors.
    • Offset returned start_ms/end_ms back to global seconds; concatenate.
  • Assemble speakers.json (named, global, speaker-attributed transcript).
  • Failed sessions stay on disk with a "Resend" affordance.
  • Milestone: finish a call → mixed_mono_16k.wav + timeline delivered → a speakers.json with real names + accurate boundaries lands locally. Test a

    3-min call to exercise chunking + cross-chunk name unification.

Phase 6 — Voiceprint library (milestone: identity that compounds)

  • VoiceprintStore (voiceprints.json): load → known_voiceprints; on response, store/refresh fingerprint for visual/high-confidence speakers; skip Unknown_N.
  • Verify the three resolution paths against the backend: (a) visual name wins; (b) voiceprint recovery for a camera-off / OCR-missed speaker; (c) Unknown_N when neither matches.
  • Voiceprint manager in the UI (rename, delete, reset).
  • Milestone: a regular contact gets named on a later call with their camera off, purely from a stored voiceprint.

Phase 7 — Polish (milestone: daily-driver quality)

  • Recent-sessions list (open/resend/delete); alias-table editor.
  • Optional on-screen "recording" indicator toggle.
  • Adapter health: warn when detection confidence drops sharply (catches UI redesigns early); keep fixtures for quick re-tuning.
  • Login-item / launch-at-login.

Risk register

  1. UI fragility (highest ongoing cost): apps redesign speaking cues / name placement. Isolate all pixel/color/label logic in adapters; Phase-7 confidence-drop warning + fixtures keep re-tuning fast.
  2. Active-speaker lag / crosstalk: UI cue trails speech and flickers. Mitigated by treating visual as a prior (the backend votes), hysteresis, and the voiceprint fallback.
  3. Meet-in-browser: canvas video + DOM names; the hybrid adapter needs the most iteration — budget extra time.
  4. 4-speaker-per-chunk cap (Sortformer): within a single chunk, >4 simultaneous speakers can't all be separated. Chunking + voiceprints + the visual timeline mitigate across chunks; for rare large calls, note it as a known limit.
  5. Sequential GPU constraint: never fire parallel audio requests; queue them.
  6. Permission re-prompts on rebuild: stable signing identity.

Definition of done (v1)

Join a call in any of the four apps → app auto-records dual-track audio, mixes to mono, and builds a visual speaker timeline (incl. mic-VAD self spans) → on hang-up the audio + timeline go to label-merge (chunked + sequential as needed) → a speakers.json with real names and accurate boundaries is produced, the voiceprint library is updated, and no video is ever written to disk and no transcription/diarization/merge logic lives in this app.