Native SwiftUI menu-bar app (LSUIElement, macOS 13+), generated from project.yml via XcodeGen. Includes: - PermissionsManager (Microphone / Screen Recording / Accessibility) + UI - SparkControlHealth: GET /api/status over self-signed TLS (InsecureTrustDelegate) - AppSettings persistence (host, TLS-skip, output folder, adapter toggles) - Menu-bar panel + Settings, app sandbox & hardened runtime off (LAN tool)
6.7 KiB
Build Plan — Ten31 Transcripts
Companion to docs 01–03. Phased plan for the Claude Code session, each phase with
a demoable milestone. Build in order; the risky/novel work (visual adapters) is
isolated for independent tuning. The SparkControl contract is now known
(03_DATA_CONTRACTS.md), so Phase 5 wires the real endpoints.
Ground rules
- Native Swift + SwiftUI, menu-bar app (
LSUIElement = true). macOS 13.0+ (ScreenCaptureKit system audio). - Never write video frames to disk. Process in-memory, release immediately;
assert this in
VisualObserverreview. - One shared monotonic
t0for audio + visual timestamps — wire first. - Backend deliverable =
mixed_mono_16k.wav. Keepmic.wav/system.wavlocally (mic = the user, and the VAD source for self-labeling). - SparkControl calls are SEQUENTIAL (one audio request in flight) over
self-signed TLS, no auth, primary endpoint
POST /api/audio/label-merge. - Every adapter must be testable offline against image fixtures.
Phase 0 — Scaffold, permissions, backend ping (milestone: launches + green)
- Menu-bar shell; permissions onboarding (Screen Recording, Microphone, Accessibility) with status + deep links.
- Settings: output folder, adapter toggles, SparkControl base URL + TLS-skip toggle.
GET /api/statushealth check shown in the menu.- Milestone: app sits in the menu bar; permissions green; backend reachable.
Phase 1 — Audio capture + mix (milestone: clean mixed-mono WAV)
AudioRecorder: SCK system audio →system.wav; AVAudioEngine mic →mic.wav; sharedt0; producemixed_mono_16k.wav.- Mic VAD → user "self" speaking spans (held for the timeline).
- Manual start/stop for now.
- Milestone: record any call manually → tracks line up, mixed-mono plays back clean, mic-VAD spans look right.
Phase 2 — Call detection (milestone: hands-free start/stop)
CallDetector: CoreAudio mic-running listener +NSWorkspaceapp/frontmost + Meet active-tab-URL check; debounce.- Wire detector →
SessionController→AudioRecorder. - Milestone: joining a call in any of the four apps auto-starts; leaving auto-stops; a quick mic test does not false-trigger.
Phase 3 — VisualObserver + Zoom adapter (milestone: visual_timeline.json)
VisualObserver: window-scopedSCStream@3 fps → frames → adapter → release. Background/occluded windows capture fine; detect minimized (and, for Meet, tab-switched) → pause analysis + record avisual_gap, never emit stale frames. Audio keeps recording through gaps.AppAdapterprotocol +ZoomAdapter(tile-border cue + OCR names; speaker & gallery layouts).TimelineBuilder: hysteresis, overlap-allowed, merge mic-VAD self spans, alias normalization →visual_timeline.json(+ the flattimelinearray form).- Offline fixtures: ~30 sample Zoom frames; unit-test/tune thresholds without a live call.
- Milestone: a real Zoom call yields a
visual_timeline.jsonwhose segments visibly match who spoke (incl. the user from mic VAD).
Phase 4 — Remaining adapters (milestone: all four produce timelines)
TeamsAdapter,SignalAdapter(a11y names first),MeetAdapter(hybrid: a11y names + Vision cue). Fixtures per app; per-adapter fps/threshold config.- Milestone: each app yields a sensible timeline; adapters independently toggleable.
Phase 5 — SparkControl hand-off via label-merge (milestone: end-to-end named transcript)
SessionPackager: write session folder; if call > ~3 min, build the chunk plan (~2–3 min windows) with timeline sliced + rebased to chunk-local seconds.SparkControlClient:- Self-signed TLS handling; multipart
file; sequential requests. - Per chunk →
POST /api/audio/label-mergewithfile, chunktimeline,known_voiceprints(from store),transcribe=true. - Handle
503 + Retry-After(wait, retry),413,{"detail":...}errors. - Offset returned
start_ms/end_msback to global seconds; concatenate.
- Self-signed TLS handling; multipart
- Assemble
speakers.json(named, global, speaker-attributed transcript). - Failed sessions stay on disk with a "Resend" affordance.
- Milestone: finish a call →
mixed_mono_16k.wav+ timeline delivered → aspeakers.jsonwith real names + accurate boundaries lands locally. Test a3-min call to exercise chunking + cross-chunk name unification.
Phase 6 — Voiceprint library (milestone: identity that compounds)
VoiceprintStore(voiceprints.json): load →known_voiceprints; on response, store/refreshfingerprintforvisual/high-confidence speakers; skipUnknown_N.- Verify the three resolution paths against the backend: (a) visual name wins;
(b) voiceprint recovery for a camera-off / OCR-missed speaker; (c)
Unknown_Nwhen neither matches. - Voiceprint manager in the UI (rename, delete, reset).
- Milestone: a regular contact gets named on a later call with their camera off, purely from a stored voiceprint.
Phase 7 — Polish (milestone: daily-driver quality)
- Recent-sessions list (open/resend/delete); alias-table editor.
- Optional on-screen "recording" indicator toggle.
- Adapter health: warn when detection confidence drops sharply (catches UI redesigns early); keep fixtures for quick re-tuning.
- Login-item / launch-at-login.
Risk register
- UI fragility (highest ongoing cost): apps redesign speaking cues / name placement. Isolate all pixel/color/label logic in adapters; Phase-7 confidence-drop warning + fixtures keep re-tuning fast.
- Active-speaker lag / crosstalk: UI cue trails speech and flickers. Mitigated by treating visual as a prior (the backend votes), hysteresis, and the voiceprint fallback.
- Meet-in-browser: canvas video + DOM names; the hybrid adapter needs the most iteration — budget extra time.
- 4-speaker-per-chunk cap (Sortformer): within a single chunk, >4 simultaneous speakers can't all be separated. Chunking + voiceprints + the visual timeline mitigate across chunks; for rare large calls, note it as a known limit.
- Sequential GPU constraint: never fire parallel audio requests; queue them.
- Permission re-prompts on rebuild: stable signing identity.
Definition of done (v1)
Join a call in any of the four apps → app auto-records dual-track audio, mixes to
mono, and builds a visual speaker timeline (incl. mic-VAD self spans) → on
hang-up the audio + timeline go to label-merge (chunked + sequential as needed)
→ a speakers.json with real names and accurate boundaries is produced, the
voiceprint library is updated, and no video is ever written to disk and
no transcription/diarization/merge logic lives in this app.