Files

T

Grant Gilliam b2ae3a62b9 Phase 0: menu-bar scaffold, permissions, backend health check

Native SwiftUI menu-bar app (LSUIElement, macOS 13+), generated from project.yml
via XcodeGen. Includes:
- PermissionsManager (Microphone / Screen Recording / Accessibility) + UI
- SparkControlHealth: GET /api/status over self-signed TLS (InsecureTrustDelegate)
- AppSettings persistence (host, TLS-skip, output folder, adapter toggles)
- Menu-bar panel + Settings, app sandbox & hardened runtime off (LAN tool)

2026-06-05 19:33:53 -05:00

6.7 KiB

Raw Blame History

Build Plan — Ten31 Transcripts

Companion to docs 01–03. Phased plan for the Claude Code session, each phase with a demoable milestone. Build in order; the risky/novel work (visual adapters) is isolated for independent tuning. The SparkControl contract is now known (03_DATA_CONTRACTS.md), so Phase 5 wires the real endpoints.

Ground rules

Native Swift + SwiftUI, menu-bar app (LSUIElement = true). macOS 13.0+ (ScreenCaptureKit system audio).
Never write video frames to disk. Process in-memory, release immediately; assert this in VisualObserver review.
One shared monotonic t0 for audio + visual timestamps — wire first.
Backend deliverable = mixed_mono_16k.wav. Keep mic.wav/system.wav locally (mic = the user, and the VAD source for self-labeling).
SparkControl calls are SEQUENTIAL (one audio request in flight) over self-signed TLS, no auth, primary endpoint POST /api/audio/label-merge.
Every adapter must be testable offline against image fixtures.

Phase 0 — Scaffold, permissions, backend ping (milestone: launches + green)

Menu-bar shell; permissions onboarding (Screen Recording, Microphone, Accessibility) with status + deep links.
Settings: output folder, adapter toggles, SparkControl base URL + TLS-skip toggle.
GET /api/status health check shown in the menu.
Milestone: app sits in the menu bar; permissions green; backend reachable.

Phase 1 — Audio capture + mix (milestone: clean mixed-mono WAV)

AudioRecorder: SCK system audio → system.wav; AVAudioEngine mic → mic.wav; shared t0; produce mixed_mono_16k.wav.
Mic VAD → user "self" speaking spans (held for the timeline).
Manual start/stop for now.
Milestone: record any call manually → tracks line up, mixed-mono plays back clean, mic-VAD spans look right.

Phase 2 — Call detection (milestone: hands-free start/stop)

CallDetector: CoreAudio mic-running listener + NSWorkspace app/frontmost + Meet active-tab-URL check; debounce.
Wire detector → SessionController → AudioRecorder.
Milestone: joining a call in any of the four apps auto-starts; leaving auto-stops; a quick mic test does not false-trigger.

Phase 3 — VisualObserver + Zoom adapter (milestone: visual_timeline.json)

VisualObserver: window-scoped SCStream @3 fps → frames → adapter → release. Background/occluded windows capture fine; detect minimized (and, for Meet, tab-switched) → pause analysis + record a visual_gap, never emit stale frames. Audio keeps recording through gaps.
AppAdapter protocol + ZoomAdapter (tile-border cue + OCR names; speaker & gallery layouts).
TimelineBuilder: hysteresis, overlap-allowed, merge mic-VAD self spans, alias normalization → visual_timeline.json (+ the flat timeline array form).
Offline fixtures: ~30 sample Zoom frames; unit-test/tune thresholds without a live call.
Milestone: a real Zoom call yields a visual_timeline.json whose segments visibly match who spoke (incl. the user from mic VAD).

Phase 4 — Remaining adapters (milestone: all four produce timelines)

TeamsAdapter, SignalAdapter (a11y names first), MeetAdapter (hybrid: a11y names + Vision cue). Fixtures per app; per-adapter fps/threshold config.
Milestone: each app yields a sensible timeline; adapters independently toggleable.

Phase 5 — SparkControl hand-off via label-merge (milestone: end-to-end named transcript)

SessionPackager: write session folder; if call > ~3 min, build the chunk plan (~2–3 min windows) with timeline sliced + rebased to chunk-local seconds.
SparkControlClient:
- Self-signed TLS handling; multipart file; sequential requests.
- Per chunk → POST /api/audio/label-merge with file, chunk timeline, known_voiceprints (from store), transcribe=true.
- Handle 503 + Retry-After (wait, retry), 413, {"detail":...} errors.
- Offset returned start_ms/end_ms back to global seconds; concatenate.
Assemble speakers.json (named, global, speaker-attributed transcript).
Failed sessions stay on disk with a "Resend" affordance.
Milestone: finish a call → mixed_mono_16k.wav + timeline delivered → a speakers.json with real names + accurate boundaries lands locally. Test a

3-min call to exercise chunking + cross-chunk name unification.

Phase 6 — Voiceprint library (milestone: identity that compounds)

VoiceprintStore (voiceprints.json): load → known_voiceprints; on response, store/refresh fingerprint for visual/high-confidence speakers; skip Unknown_N.
Verify the three resolution paths against the backend: (a) visual name wins; (b) voiceprint recovery for a camera-off / OCR-missed speaker; (c) Unknown_N when neither matches.
Voiceprint manager in the UI (rename, delete, reset).
Milestone: a regular contact gets named on a later call with their camera off, purely from a stored voiceprint.

Phase 7 — Polish (milestone: daily-driver quality)

Recent-sessions list (open/resend/delete); alias-table editor.
Optional on-screen "recording" indicator toggle.
Adapter health: warn when detection confidence drops sharply (catches UI redesigns early); keep fixtures for quick re-tuning.
Login-item / launch-at-login.

Risk register

UI fragility (highest ongoing cost): apps redesign speaking cues / name placement. Isolate all pixel/color/label logic in adapters; Phase-7 confidence-drop warning + fixtures keep re-tuning fast.
Active-speaker lag / crosstalk: UI cue trails speech and flickers. Mitigated by treating visual as a prior (the backend votes), hysteresis, and the voiceprint fallback.
Meet-in-browser: canvas video + DOM names; the hybrid adapter needs the most iteration — budget extra time.
4-speaker-per-chunk cap (Sortformer): within a single chunk, >4 simultaneous speakers can't all be separated. Chunking + voiceprints + the visual timeline mitigate across chunks; for rare large calls, note it as a known limit.
Sequential GPU constraint: never fire parallel audio requests; queue them.
Permission re-prompts on rebuild: stable signing identity.

Definition of done (v1)

Join a call in any of the four apps → app auto-records dual-track audio, mixes to mono, and builds a visual speaker timeline (incl. mic-VAD self spans) → on hang-up the audio + timeline go to label-merge (chunked + sequential as needed) → a speakers.json with real names and accurate boundaries is produced, the voiceprint library is updated, and no video is ever written to disk and no transcription/diarization/merge logic lives in this app.

6.7 KiB Raw Blame History Unescape Escape