Phase 0: menu-bar scaffold, permissions, backend health check

Native SwiftUI menu-bar app (LSUIElement, macOS 13+), generated from project.yml
via XcodeGen. Includes:
- PermissionsManager (Microphone / Screen Recording / Accessibility) + UI
- SparkControlHealth: GET /api/status over self-signed TLS (InsecureTrustDelegate)
- AppSettings persistence (host, TLS-skip, output folder, adapter toggles)
- Menu-bar panel + Settings, app sandbox & hardened runtime off (LAN tool)
Grant Gilliam
2026-06-05 19:33:53 -05:00
commit b2ae3a62b9
19 changed files with 1448 additions and 0 deletions
@@ -0,0 +1,123 @@
# Build Plan — Ten31 Transcripts
Companion to docs 01-03. Phased plan for the Claude Code session, each phase with
a demoable milestone. Build in order; the risky/novel work (visual adapters) is
isolated for independent tuning. The SparkControl contract is now known
(`03_DATA_CONTRACTS.md`), so Phase 5 wires the real endpoints.
---
## Ground rules
- **Native Swift + SwiftUI**, menu-bar app (`LSUIElement = true`). macOS 13.0+
(ScreenCaptureKit system audio).
- **Never write video frames to disk.** Process in-memory, release immediately;
assert this in `VisualObserver` review.
- **One shared monotonic `t0`** for audio + visual timestamps — wire first.
- **Backend deliverable = `mixed_mono_16k.wav`.** Keep `mic.wav`/`system.wav`
locally (mic = the user, and the VAD source for self-labeling).
- **SparkControl calls are SEQUENTIAL** (one audio request in flight) over
self-signed TLS, **no auth**, primary endpoint `POST /api/audio/label-merge`.
- Every adapter must be testable offline against image fixtures.
---
## Phase 0 — Scaffold, permissions, backend ping (milestone: launches + green)
- Menu-bar shell; permissions onboarding (Screen Recording, Microphone,
Accessibility) with status + deep links.
- Settings: output folder, adapter toggles, **SparkControl base URL** + TLS-skip
toggle.
- `GET /api/status` health check shown in the menu.
- **Milestone:** app sits in the menu bar; permissions green; backend reachable.
## Phase 1 — Audio capture + mix (milestone: clean mixed-mono WAV)
- `AudioRecorder`: SCK system audio → `system.wav`; AVAudioEngine mic → `mic.wav`;
shared `t0`; produce **`mixed_mono_16k.wav`**.
- Mic VAD → user "self" speaking spans (held for the timeline).
- Manual start/stop for now.
- **Milestone:** record any call manually → tracks line up, mixed-mono plays back
clean, mic-VAD spans look right.
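
The mix and mic-VAD steps above can be sketched on plain sample buffers. This is a minimal illustration, not the `AudioRecorder` implementation: it assumes both tracks are already resampled to 16 kHz mono and aligned to the shared `t0`, and the names `mixToMono`, `vadSpans`, and `Span`, the 20 ms frame size, and the RMS threshold are all hypothetical.

```swift
import Foundation

/// A speaking span in seconds, measured from the shared t0 (hypothetical type).
struct Span: Equatable {
    var start: Double
    var end: Double
}

/// Mix two aligned mono tracks by averaging, padding the shorter one with silence.
/// Assumes both tracks share a sample rate and start at t0.
func mixToMono(_ mic: [Float], _ system: [Float]) -> [Float] {
    let count = max(mic.count, system.count)
    var out = [Float](repeating: 0, count: count)
    for i in 0..<count {
        let m = i < mic.count ? mic[i] : 0
        let s = i < system.count ? system[i] : 0
        out[i] = min(1, max(-1, (m + s) * 0.5))   // clamp to the valid float range
    }
    return out
}

/// Energy-based VAD: a frame counts as speech when its RMS exceeds `threshold`.
/// Frame size and threshold are placeholder values to be tuned on real recordings.
func vadSpans(_ samples: [Float], sampleRate: Double = 16_000,
              frameSize: Int = 320, threshold: Float = 0.02) -> [Span] {
    var spans: [Span] = []
    var openStart: Double? = nil
    var index = 0
    while index < samples.count {
        let end = min(index + frameSize, samples.count)
        let frame = samples[index..<end]
        let meanSquare = frame.reduce(Float(0)) { $0 + $1 * $1 } / Float(frame.count)
        let isSpeech = meanSquare.squareRoot() > threshold
        let t = Double(index) / sampleRate
        if isSpeech, openStart == nil {
            openStart = t                              // speech begins
        } else if !isSpeech, let s = openStart {
            spans.append(Span(start: s, end: t))       // speech ends
            openStart = nil
        }
        index = end
    }
    if let s = openStart {                             // close a span still open at EOF
        spans.append(Span(start: s, end: Double(samples.count) / sampleRate))
    }
    return spans
}
```

Averaging rather than summing keeps the mix inside the valid range without a limiter, at the cost of roughly 6 dB of level; whether that trade-off suits the backend is something to check during the Phase 1 milestone.
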
## Phase 2 — Call detection (milestone: hands-free start/stop)
- `CallDetector`: CoreAudio mic-running listener + `NSWorkspace` app/frontmost +
Meet active-tab-URL check; debounce.
- Wire detector → `SessionController` → `AudioRecorder`.
- **Milestone:** joining a call in any of the four apps auto-starts; leaving
auto-stops; a quick mic test does not false-trigger.
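
One way to get the "quick mic test does not false-trigger" behaviour is a small debounce state machine fed by the combined detector signal. The sketch below is an assumption about how that could look: `CallDebouncer`, its `update(active:at:)` method, and the 5 s / 10 s hold times are hypothetical, and `active` stands in for whatever `CallDetector` derives from the mic-running, app, and tab-URL checks.

```swift
import Foundation

/// Debounces a noisy "call looks active" signal into start/stop events.
/// Hold times are placeholder values, not tuned numbers.
struct CallDebouncer {
    enum Event: Equatable { case started, stopped }

    let startAfter: TimeInterval   // signal must stay on this long before starting
    let stopAfter: TimeInterval    // signal must stay off this long before stopping
    private(set) var inCall = false
    private var pendingSince: TimeInterval? = nil

    init(startAfter: TimeInterval = 5, stopAfter: TimeInterval = 10) {
        self.startAfter = startAfter
        self.stopAfter = stopAfter
    }

    /// Feed one observation; returns an event only when the debounced state flips.
    mutating func update(active: Bool, at now: TimeInterval) -> Event? {
        if active == inCall {
            pendingSince = nil     // signal agrees with current state: cancel pending flip
            return nil
        }
        guard let since = pendingSince else {
            pendingSince = now     // first disagreeing observation: start the hold timer
            return nil
        }
        let needed = active ? startAfter : stopAfter
        guard now - since >= needed else { return nil }
        inCall = active
        pendingSince = nil
        return active ? .started : .stopped
    }
}
```

Passing the timestamp in, instead of reading a clock inside the type, keeps the logic testable offline with scripted sequences.
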
## Phase 3 — VisualObserver + Zoom adapter (milestone: visual_timeline.json)
- `VisualObserver`: window-scoped `SCStream` @3 fps → frames → adapter → release.
Background/occluded windows capture fine; detect **minimized** (and, for Meet,
**tab-switched**) → pause analysis + record a `visual_gap`, never emit stale
frames. Audio keeps recording through gaps.
- `AppAdapter` protocol + `ZoomAdapter` (tile-border cue + OCR names; speaker &
gallery layouts).
- `TimelineBuilder`: hysteresis, overlap-allowed, merge mic-VAD self spans, alias
normalization → `visual_timeline.json` (+ the flat `timeline` array form).
- **Offline fixtures:** ~30 sample Zoom frames; unit-test/tune thresholds without
a live call.
- **Milestone:** a real Zoom call yields a `visual_timeline.json` whose segments
visibly match who spoke (incl. the user from mic VAD).
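
The hysteresis step in `TimelineBuilder` can be illustrated on per-frame speaker sets. This is a sketch under assumptions, not the planned implementation: `buildSegments`, `Segment`, and the `onFrames`/`offFrames` thresholds are hypothetical, and the merge of mic-VAD self spans and alias normalization are left out.

```swift
import Foundation

/// One speaker segment in seconds from t0 (hypothetical shape).
struct Segment: Equatable {
    let speaker: String
    let start: Double
    let end: Double
}

/// Turns per-frame active-speaker sets into segments with hysteresis.
/// A speaker opens after `onFrames` consecutive hits and closes after
/// `offFrames` consecutive misses; overlapping speakers are allowed.
func buildSegments(frames: [Set<String>], fps: Double = 3,
                   onFrames: Int = 2, offFrames: Int = 3) -> [Segment] {
    struct Track {
        var onRun = 0
        var offRun = 0
        var start: Int? = nil
        var lastSeen = 0
    }
    var tracks: [String: Track] = [:]
    var segments: [Segment] = []

    for (i, active) in frames.enumerated() {
        for name in active where tracks[name] == nil { tracks[name] = Track() }
        for name in tracks.keys.sorted() {
            var t = tracks[name]!
            if active.contains(name) {
                t.onRun += 1
                t.offRun = 0
                t.lastSeen = i
                if t.start == nil, t.onRun >= onFrames {
                    t.start = i - onFrames + 1         // backdate to the first hit of the run
                }
            } else {
                t.offRun += 1
                t.onRun = 0
                if let s = t.start, t.offRun >= offFrames {
                    segments.append(Segment(speaker: name, start: Double(s) / fps,
                                            end: Double(t.lastSeen + 1) / fps))
                    t.start = nil
                }
            }
            tracks[name] = t
        }
    }
    // Close segments still open when the frames run out.
    for name in tracks.keys.sorted() {
        if let t = tracks[name], let s = t.start {
            segments.append(Segment(speaker: name, start: Double(s) / fps,
                                    end: Double(t.lastSeen + 1) / fps))
        }
    }
    return segments.sorted { ($0.start, $0.speaker) < ($1.start, $1.speaker) }
}
```

Because the function takes plain sets of names, the same fixture frames used to tune an adapter can drive threshold tuning here without a live call.
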
## Phase 4 — Remaining adapters (milestone: all four produce timelines)
- `TeamsAdapter`, `SignalAdapter` (a11y names first), `MeetAdapter` (hybrid: a11y
names + Vision cue). Fixtures per app; per-adapter fps/threshold config.
- **Milestone:** each app yields a sensible timeline; adapters independently
toggleable.
## Phase 5 — SparkControl hand-off via label-merge (milestone: end-to-end named transcript)
- `SessionPackager`: write session folder; if call > ~3 min, build the chunk plan
(~2-3 min windows) with **timeline sliced + rebased to chunk-local seconds**.
- `SparkControlClient`:
- Self-signed TLS handling; multipart `file`; **sequential** requests.
- Per chunk → `POST /api/audio/label-merge` with `file`, chunk `timeline`,
`known_voiceprints` (from store), `transcribe=true`.
- Handle `503 + Retry-After` (wait, retry), `413`, `{"detail":...}` errors.
- Offset returned `start_ms/end_ms` back to global seconds; concatenate.
- Assemble `speakers.json` (named, global, speaker-attributed transcript).
- Failed sessions stay on disk with a "Resend" affordance.
- **Milestone:** finish a call → `mixed_mono_16k.wav` + timeline delivered → a
`speakers.json` with real names + accurate boundaries lands locally. Test a
>3-min call to exercise chunking + cross-chunk name unification.
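
Slicing and rebasing the timeline per chunk, then mapping returned offsets back to global time, could look roughly like this. The types `TimelineEntry` and `Chunk`, the functions `planChunks` and `globalSeconds`, and the 150 s default window are assumptions for illustration; the real field names come from `03_DATA_CONTRACTS.md`.

```swift
import Foundation

/// One timeline entry in seconds (hypothetical shape).
struct TimelineEntry: Equatable {
    let speaker: String
    let start: Double
    let end: Double
}

/// A chunk of the session with its own chunk-local timeline.
struct Chunk: Equatable {
    let offset: Double              // global start of the chunk, in seconds
    let duration: Double
    let timeline: [TimelineEntry]   // rebased so 0 is the chunk start
}

/// Split a session into fixed windows and slice the timeline into each one.
/// Entries that straddle a boundary are clipped and appear in both chunks.
func planChunks(duration: Double, timeline: [TimelineEntry],
                window: Double = 150) -> [Chunk] {
    precondition(window > 0, "window must be positive")
    var chunks: [Chunk] = []
    var offset = 0.0
    while offset < duration {
        let end = min(offset + window, duration)
        let local = timeline.compactMap { e -> TimelineEntry? in
            let s = max(e.start, offset)
            let t = min(e.end, end)
            guard s < t else { return nil }            // entry does not touch this chunk
            return TimelineEntry(speaker: e.speaker, start: s - offset, end: t - offset)
        }
        chunks.append(Chunk(offset: offset, duration: end - offset, timeline: local))
        offset = end
    }
    return chunks
}

/// Map a chunk-local millisecond timestamp from a response back to global seconds.
func globalSeconds(localMs: Int, chunkOffset: Double) -> Double {
    return chunkOffset + Double(localMs) / 1000
}
```

A session shorter than one window comes back as a single chunk with offset 0, so the caller can send short and long calls through the same sequential loop.
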
## Phase 6 — Voiceprint library (milestone: identity that compounds)
- `VoiceprintStore` (`voiceprints.json`): load → `known_voiceprints`; on response,
store/refresh `fingerprint` for `visual`/high-confidence speakers; skip
`Unknown_N`.
- Verify the three resolution paths against the backend: (a) **visual** name wins;
(b) **voiceprint** recovery for a camera-off / OCR-missed speaker; (c)
**`Unknown_N`** when neither matches.
- Voiceprint manager in the UI (rename, delete, reset).
- **Milestone:** a regular contact gets named on a later call **with their camera
off**, purely from a stored voiceprint.
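
The store/refresh rule for `VoiceprintStore` can be written as a small pure function. Everything about the response shape here is an assumption: `ResolvedSpeaker`, its `source` and `confidence` fields, the `[Float]` fingerprint type, and the 0.8 cut-off are placeholders until checked against the backend contract.

```swift
import Foundation

/// A speaker as resolved by the backend (hypothetical shape, not the real contract).
struct ResolvedSpeaker {
    let name: String
    let source: String          // assumed values: "visual", "voiceprint", "unknown"
    let confidence: Double
    let fingerprint: [Float]?
}

/// Store only named speakers that were visually confirmed or scored highly.
/// `Unknown_N` placeholders are always skipped.
func shouldStore(_ s: ResolvedSpeaker, minConfidence: Double = 0.8) -> Bool {
    guard s.fingerprint != nil, !s.name.hasPrefix("Unknown_") else { return false }
    return s.source == "visual" || s.confidence >= minConfidence
}

/// Return a copy of the store with qualifying fingerprints added or refreshed.
func refresh(store: [String: [Float]],
             with speakers: [ResolvedSpeaker]) -> [String: [Float]] {
    var updated = store
    for s in speakers where shouldStore(s) {
        updated[s.name] = s.fingerprint
    }
    return updated
}
```

Keeping the rule separate from persistence makes the three resolution paths easy to unit-test before `voiceprints.json` is involved.
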
## Phase 7 — Polish (milestone: daily-driver quality)
- Recent-sessions list (open/resend/delete); alias-table editor.
- Optional on-screen "recording" indicator toggle.
- Adapter health: warn when detection confidence drops sharply (catches UI
redesigns early); keep fixtures for quick re-tuning.
- Login-item / launch-at-login.
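
The adapter-health warning could be driven by comparing each window of detection confidences with a slowly moving baseline. This is one possible approach, not a decided design: `ConfidenceMonitor`, the window length, the drop ratio, and the 0.9/0.1 drift weights are all hypothetical.

```swift
import Foundation

/// Flags a sharp drop in per-frame detection confidence for one adapter.
/// Window size and ratio are placeholder values to be tuned against fixtures.
struct ConfidenceMonitor {
    let window: Int          // number of frames per comparison window
    let dropRatio: Double    // warn when the window mean < dropRatio * baseline
    private var baseline: Double? = nil
    private var recent: [Double] = []

    init(window: Int = 30, dropRatio: Double = 0.6) {
        precondition(window > 0, "window must be positive")
        self.window = window
        self.dropRatio = dropRatio
    }

    /// Record one confidence value; returns true when the window that just
    /// completed fell sharply below the baseline.
    mutating func record(_ confidence: Double) -> Bool {
        recent.append(confidence)
        guard recent.count == window else { return false }
        let mean = recent.reduce(0, +) / Double(window)
        recent.removeAll(keepingCapacity: true)
        guard let base = baseline else {
            baseline = mean                      // first full window seeds the baseline
            return false
        }
        if mean < dropRatio * base { return true }   // keep the old baseline while degraded
        baseline = 0.9 * base + 0.1 * mean           // let the baseline drift slowly
        return false
    }
}
```

Holding the baseline fixed while a drop is in progress means a redesigned UI keeps warning until someone re-tunes the adapter, instead of quietly becoming the new normal.
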
---
## Risk register
1. **UI fragility (highest ongoing cost):** apps redesign speaking cues / name
placement. Isolate all pixel/color/label logic in adapters; Phase-7
confidence-drop warning + fixtures keep re-tuning fast.
2. **Active-speaker lag / crosstalk:** UI cue trails speech and flickers. Mitigated
by treating visual as a *prior* (the backend votes), hysteresis, and the
voiceprint fallback.
3. **Meet-in-browser:** canvas video + DOM names; the hybrid adapter needs the most
iteration — budget extra time.
4. **4-speaker-per-chunk cap (Sortformer):** within a single chunk, >4 simultaneous
speakers can't all be separated. Chunking + voiceprints + the visual timeline
mitigate across chunks; for rare large calls, note it as a known limit.
5. **Sequential GPU constraint:** never fire parallel audio requests; queue them.
6. **Permission re-prompts on rebuild:** mitigate with a stable signing identity.
## Definition of done (v1)
Join a call in any of the four apps → app auto-records dual-track audio, mixes to
mono, and builds a visual speaker timeline (incl. mic-VAD self spans) → on
hang-up the audio + timeline go to `label-merge` (chunked + sequential as needed)
→ a `speakers.json` with **real names and accurate boundaries** is produced, the
**voiceprint library is updated**, **no video is ever written to disk**, and
**no transcription/diarization/merge logic lives in this app.**