Files
ten31-transcripts/docs/04_BUILD_PLAN.md
T
Grant Gilliam b2ae3a62b9 Phase 0: menu-bar scaffold, permissions, backend health check
Native SwiftUI menu-bar app (LSUIElement, macOS 13+), generated from project.yml
via XcodeGen. Includes:
- PermissionsManager (Microphone / Screen Recording / Accessibility) + UI
- SparkControlHealth: GET /api/status over self-signed TLS (InsecureTrustDelegate)
- AppSettings persistence (host, TLS-skip, output folder, adapter toggles)
- Menu-bar panel + Settings, app sandbox & hardened runtime off (LAN tool)
2026-06-05 19:33:53 -05:00

124 lines
6.7 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Build Plan — Ten31 Transcripts
Companion to docs 0103. Phased plan for the Claude Code session, each phase with
a demoable milestone. Build in order; the risky/novel work (visual adapters) is
isolated for independent tuning. The SparkControl contract is now known
(`03_DATA_CONTRACTS.md`), so Phase 5 wires the real endpoints.
---
## Ground rules
- **Native Swift + SwiftUI**, menu-bar app (`LSUIElement = true`). macOS 13.0+
(ScreenCaptureKit system audio).
- **Never write video frames to disk.** Process in-memory, release immediately;
assert this in `VisualObserver` review.
- **One shared monotonic `t0`** for audio + visual timestamps — wire first.
- **Backend deliverable = `mixed_mono_16k.wav`.** Keep `mic.wav`/`system.wav`
locally (mic = the user, and the VAD source for self-labeling).
- **SparkControl calls are SEQUENTIAL** (one audio request in flight) over
self-signed TLS, **no auth**, primary endpoint `POST /api/audio/label-merge`.
- Every adapter must be testable offline against image fixtures.
---
## Phase 0 — Scaffold, permissions, backend ping (milestone: launches + green)
- Menu-bar shell; permissions onboarding (Screen Recording, Microphone,
Accessibility) with status + deep links.
- Settings: output folder, adapter toggles, **SparkControl base URL** + TLS-skip
toggle.
- `GET /api/status` health check shown in the menu.
- **Milestone:** app sits in the menu bar; permissions green; backend reachable.
## Phase 1 — Audio capture + mix (milestone: clean mixed-mono WAV)
- `AudioRecorder`: SCK system audio → `system.wav`; AVAudioEngine mic → `mic.wav`;
shared `t0`; produce **`mixed_mono_16k.wav`**.
- Mic VAD → user "self" speaking spans (held for the timeline).
- Manual start/stop for now.
- **Milestone:** record any call manually → tracks line up, mixed-mono plays back
clean, mic-VAD spans look right.
## Phase 2 — Call detection (milestone: hands-free start/stop)
- `CallDetector`: CoreAudio mic-running listener + `NSWorkspace` app/frontmost +
Meet active-tab-URL check; debounce.
- Wire detector → `SessionController``AudioRecorder`.
- **Milestone:** joining a call in any of the four apps auto-starts; leaving
auto-stops; a quick mic test does not false-trigger.
## Phase 3 — VisualObserver + Zoom adapter (milestone: visual_timeline.json)
- `VisualObserver`: window-scoped `SCStream` @3 fps → frames → adapter → release.
Background/occluded windows capture fine; detect **minimized** (and, for Meet,
**tab-switched**) → pause analysis + record a `visual_gap`, never emit stale
frames. Audio keeps recording through gaps.
- `AppAdapter` protocol + `ZoomAdapter` (tile-border cue + OCR names; speaker &
gallery layouts).
- `TimelineBuilder`: hysteresis, overlap-allowed, merge mic-VAD self spans, alias
normalization → `visual_timeline.json` (+ the flat `timeline` array form).
- **Offline fixtures:** ~30 sample Zoom frames; unit-test/tune thresholds without
a live call.
- **Milestone:** a real Zoom call yields a `visual_timeline.json` whose segments
visibly match who spoke (incl. the user from mic VAD).
## Phase 4 — Remaining adapters (milestone: all four produce timelines)
- `TeamsAdapter`, `SignalAdapter` (a11y names first), `MeetAdapter` (hybrid: a11y
names + Vision cue). Fixtures per app; per-adapter fps/threshold config.
- **Milestone:** each app yields a sensible timeline; adapters independently
toggleable.
## Phase 5 — SparkControl hand-off via label-merge (milestone: end-to-end named transcript)
- `SessionPackager`: write session folder; if call > ~3 min, build the chunk plan
(~23 min windows) with **timeline sliced + rebased to chunk-local seconds**.
- `SparkControlClient`:
- Self-signed TLS handling; multipart `file`; **sequential** requests.
- Per chunk → `POST /api/audio/label-merge` with `file`, chunk `timeline`,
`known_voiceprints` (from store), `transcribe=true`.
- Handle `503 + Retry-After` (wait, retry), `413`, `{"detail":...}` errors.
- Offset returned `start_ms/end_ms` back to global seconds; concatenate.
- Assemble `speakers.json` (named, global, speaker-attributed transcript).
- Failed sessions stay on disk with a "Resend" affordance.
- **Milestone:** finish a call → `mixed_mono_16k.wav` + timeline delivered → a
`speakers.json` with real names + accurate boundaries lands locally. Test a
>3-min call to exercise chunking + cross-chunk name unification.
## Phase 6 — Voiceprint library (milestone: identity that compounds)
- `VoiceprintStore` (`voiceprints.json`): load → `known_voiceprints`; on response,
store/refresh `fingerprint` for `visual`/high-confidence speakers; skip
`Unknown_N`.
- Verify the three resolution paths against the backend: (a) **visual** name wins;
(b) **voiceprint** recovery for a camera-off / OCR-missed speaker; (c)
**`Unknown_N`** when neither matches.
- Voiceprint manager in the UI (rename, delete, reset).
- **Milestone:** a regular contact gets named on a later call **with their camera
off**, purely from a stored voiceprint.
## Phase 7 — Polish (milestone: daily-driver quality)
- Recent-sessions list (open/resend/delete); alias-table editor.
- Optional on-screen "recording" indicator toggle.
- Adapter health: warn when detection confidence drops sharply (catches UI
redesigns early); keep fixtures for quick re-tuning.
- Login-item / launch-at-login.
---
## Risk register
1. **UI fragility (highest ongoing cost):** apps redesign speaking cues / name
placement. Isolate all pixel/color/label logic in adapters; Phase-7
confidence-drop warning + fixtures keep re-tuning fast.
2. **Active-speaker lag / crosstalk:** UI cue trails speech and flickers. Mitigated
by treating visual as a *prior* (the backend votes), hysteresis, and the
voiceprint fallback.
3. **Meet-in-browser:** canvas video + DOM names; the hybrid adapter needs the most
iteration — budget extra time.
4. **4-speaker-per-chunk cap (Sortformer):** within a single chunk, >4 simultaneous
speakers can't all be separated. Chunking + voiceprints + the visual timeline
mitigate across chunks; for rare large calls, note it as a known limit.
5. **Sequential GPU constraint:** never fire parallel audio requests; queue them.
6. **Permission re-prompts on rebuild:** stable signing identity.
## Definition of done (v1)
Join a call in any of the four apps → app auto-records dual-track audio, mixes to
mono, and builds a visual speaker timeline (incl. mic-VAD self spans) → on
hang-up the audio + timeline go to `label-merge` (chunked + sequential as needed)
→ a `speakers.json` with **real names and accurate boundaries** is produced, the
**voiceprint library is updated**, and **no video is ever written to disk** and
**no transcription/diarization/merge logic lives in this app.**