dda4322de7
Document the dual-channel label-merge path (mic_file/system_file/self_name/self_vad) and the recap phase (transcript.md + recap.html via the backend LLM) across docs/01-03; correct docs/02 $2.10 to the UI actually shipped; mark docs/01 $7 open items as settled; remove the dead AUDIO_API.md references; note the manifest sha256 fields are not emitted; mark docs/04 as a complete/historical build log. Also drop the last stale "Phase 0" UI string in MenuBarView and retire the now-done doc-debt items in ROADMAP.
130 lines
7.0 KiB
Markdown
130 lines
7.0 KiB
Markdown
# Build Plan — Ten31 Transcripts
|
||
|
||
> **Status: COMPLETE (historical).** Phases 0–6 shipped and the app is in daily
|
||
> use; a recap phase (transcript + HTML recap via the backend LLM) was added after
|
||
> this plan was written. Kept as the original build log and as the map for the
|
||
> "Phase N" references in the code comments. Forward-looking work lives in
|
||
> `ROADMAP.md`; current status in `AGENTS.md`.
|
||
|
||
Companion to docs 01–03. Phased plan for the Claude Code session, each phase with
|
||
a demoable milestone. Build in order; the risky/novel work (visual adapters) is
|
||
isolated for independent tuning. The SparkControl contract is now known
|
||
(`03_DATA_CONTRACTS.md`), so Phase 5 wires the real endpoints.
|
||
|
||
---
|
||
|
||
## Ground rules
|
||
- **Native Swift + SwiftUI**, menu-bar app (`LSUIElement = true`). macOS 13.0+
|
||
(ScreenCaptureKit system audio).
|
||
- **Never write video frames to disk.** Process in-memory, release immediately;
|
||
assert this in `VisualObserver` review.
|
||
- **One shared monotonic `t0`** for audio + visual timestamps — wire first.
|
||
- **Backend deliverable = `mixed_mono_16k.wav`.** Keep `mic.wav`/`system.wav`
|
||
locally (mic = the user, and the VAD source for self-labeling).
|
||
- **SparkControl calls are SEQUENTIAL** (one audio request in flight) over
|
||
self-signed TLS, **no auth**, primary endpoint `POST /api/audio/label-merge`.
|
||
- Every adapter must be testable offline against image fixtures.
|
||
|
||
---
|
||
|
||
## Phase 0 — Scaffold, permissions, backend ping (milestone: launches + green)
|
||
- Menu-bar shell; permissions onboarding (Screen Recording, Microphone,
|
||
Accessibility) with status + deep links.
|
||
- Settings: output folder, adapter toggles, **SparkControl base URL** + TLS-skip
|
||
toggle.
|
||
- `GET /api/status` health check shown in the menu.
|
||
- **Milestone:** app sits in the menu bar; permissions green; backend reachable.
|
||
|
||
## Phase 1 — Audio capture + mix (milestone: clean mixed-mono WAV)
|
||
- `AudioRecorder`: SCK system audio → `system.wav`; AVAudioEngine mic → `mic.wav`;
|
||
shared `t0`; produce **`mixed_mono_16k.wav`**.
|
||
- Mic VAD → user "self" speaking spans (held for the timeline).
|
||
- Manual start/stop for now.
|
||
- **Milestone:** record any call manually → tracks line up, mixed-mono plays back
|
||
clean, mic-VAD spans look right.
|
||
|
||
## Phase 2 — Call detection (milestone: hands-free start/stop)
|
||
- `CallDetector`: CoreAudio mic-running listener + `NSWorkspace` app/frontmost +
|
||
Meet active-tab-URL check; debounce.
|
||
- Wire detector → `SessionController` → `AudioRecorder`.
|
||
- **Milestone:** joining a call in any of the four apps auto-starts; leaving
|
||
auto-stops; a quick mic test does not false-trigger.
|
||
|
||
## Phase 3 — VisualObserver + Zoom adapter (milestone: visual_timeline.json)
|
||
- `VisualObserver`: window-scoped `SCStream` @3 fps → frames → adapter → release.
|
||
Background/occluded windows capture fine; detect **minimized** (and, for Meet,
|
||
**tab-switched**) → pause analysis + record a `visual_gap`, never emit stale
|
||
frames. Audio keeps recording through gaps.
|
||
- `AppAdapter` protocol + `ZoomAdapter` (tile-border cue + OCR names; speaker &
|
||
gallery layouts).
|
||
- `TimelineBuilder`: hysteresis, overlap-allowed, merge mic-VAD self spans, alias
|
||
normalization → `visual_timeline.json` (+ the flat `timeline` array form).
|
||
- **Offline fixtures:** ~30 sample Zoom frames; unit-test/tune thresholds without
|
||
a live call.
|
||
- **Milestone:** a real Zoom call yields a `visual_timeline.json` whose segments
|
||
visibly match who spoke (incl. the user from mic VAD).
|
||
|
||
## Phase 4 — Remaining adapters (milestone: all four produce timelines)
|
||
- `TeamsAdapter`, `SignalAdapter` (a11y names first), `MeetAdapter` (hybrid: a11y
|
||
names + Vision cue). Fixtures per app; per-adapter fps/threshold config.
|
||
- **Milestone:** each app yields a sensible timeline; adapters independently
|
||
toggleable.
|
||
|
||
## Phase 5 — SparkControl hand-off via label-merge (milestone: end-to-end named transcript)
|
||
- `SessionPackager`: write session folder; if call > ~3 min, build the chunk plan
|
||
(~2–3 min windows) with **timeline sliced + rebased to chunk-local seconds**.
|
||
- `SparkControlClient`:
|
||
- Self-signed TLS handling; multipart `file`; **sequential** requests.
|
||
- Per chunk → `POST /api/audio/label-merge` with `file`, chunk `timeline`,
|
||
`known_voiceprints` (from store), `transcribe=true`.
|
||
- Handle `503 + Retry-After` (wait, retry), `413`, `{"detail":...}` errors.
|
||
- Offset returned `start_ms/end_ms` back to global seconds; concatenate.
|
||
- Assemble `speakers.json` (named, global, speaker-attributed transcript).
|
||
- Failed sessions stay on disk with a "Resend" affordance.
|
||
- **Milestone:** finish a call → `mixed_mono_16k.wav` + timeline delivered → a
|
||
`speakers.json` with real names + accurate boundaries lands locally. Test a
|
||
>3-min call to exercise chunking + cross-chunk name unification.
|
||
|
||
## Phase 6 — Voiceprint library (milestone: identity that compounds)
|
||
- `VoiceprintStore` (`voiceprints.json`): load → `known_voiceprints`; on response,
|
||
store/refresh `fingerprint` for `visual`/high-confidence speakers; skip
|
||
`Unknown_N`.
|
||
- Verify the three resolution paths against the backend: (a) **visual** name wins;
|
||
(b) **voiceprint** recovery for a camera-off / OCR-missed speaker; (c)
|
||
**`Unknown_N`** when neither matches.
|
||
- Voiceprint manager in the UI (rename, delete, reset).
|
||
- **Milestone:** a regular contact gets named on a later call **with their camera
|
||
off**, purely from a stored voiceprint.
|
||
|
||
## Phase 7 — Polish (milestone: daily-driver quality)
|
||
- Recent-sessions list (open/resend/delete); alias-table editor.
|
||
- Optional on-screen "recording" indicator toggle.
|
||
- Adapter health: warn when detection confidence drops sharply (catches UI
|
||
redesigns early); keep fixtures for quick re-tuning.
|
||
- Login-item / launch-at-login.
|
||
|
||
---
|
||
|
||
## Risk register
|
||
1. **UI fragility (highest ongoing cost):** apps redesign speaking cues / name
|
||
placement. Isolate all pixel/color/label logic in adapters; Phase-7
|
||
confidence-drop warning + fixtures keep re-tuning fast.
|
||
2. **Active-speaker lag / crosstalk:** UI cue trails speech and flickers. Mitigated
|
||
by treating visual as a *prior* (the backend votes), hysteresis, and the
|
||
voiceprint fallback.
|
||
3. **Meet-in-browser:** canvas video + DOM names; the hybrid adapter needs the most
|
||
iteration — budget extra time.
|
||
4. **4-speaker-per-chunk cap (Sortformer):** within a single chunk, >4 simultaneous
|
||
speakers can't all be separated. Chunking + voiceprints + the visual timeline
|
||
mitigate across chunks; for rare large calls, note it as a known limit.
|
||
5. **Sequential GPU constraint:** never fire parallel audio requests; queue them.
|
||
6. **Permission re-prompts on rebuild:** stable signing identity.
|
||
|
||
## Definition of done (v1)
|
||
Join a call in any of the four apps → app auto-records dual-track audio, mixes to
|
||
mono, and builds a visual speaker timeline (incl. mic-VAD self spans) → on
|
||
hang-up the audio + timeline go to `label-merge` (chunked + sequential as needed)
|
||
→ a `speakers.json` with **real names and accurate boundaries** is produced, the
|
||
**voiceprint library is updated**, and **no video is ever written to disk** and
|
||
**no transcription/diarization/merge logic lives in this app.**
|