dda4322de7
Document the dual-channel label-merge path (mic_file/system_file/self_name/self_vad) and the recap phase (transcript.md + recap.html via the backend LLM) across docs/01-03; correct docs/02 $2.10 to the UI actually shipped; mark docs/01 $7 open items as settled; remove the dead AUDIO_API.md references; note the manifest sha256 fields are not emitted; mark docs/04 as a complete/historical build log. Also drop the last stale "Phase 0" UI string in MenuBarView and retire the now-done doc-debt items in ROADMAP.
266 lines
16 KiB
Markdown
266 lines
16 KiB
Markdown
# Architecture — Ten31 Transcripts
|
||
|
||
Companion to `01_PROJECT_BRIEF.md`. Module layout, data flow, the per-app adapter
|
||
pattern, the macOS APIs, and the SparkControl integration (now fully specified).
|
||
|
||
---
|
||
|
||
## 1. High-level data flow
|
||
|
||
```
|
||
┌─────────────────────────────────────────┐
|
||
│ CallDetector │
|
||
│ CoreAudio "mic running somewhere" │
|
||
│ + known-app / Meet-tab heuristic │
|
||
└───────────────┬───────────────────────────┘
|
||
│ callStarted(app, window)
|
||
▼
|
||
┌──────────────────────────────────────────────────────────┐
|
||
│ SessionController │
|
||
│ owns one Session; shared t0; start/stop; on end package │
|
||
└───────┬───────────────────────────┬──────────────────────┘
|
||
│ │
|
||
▼ ▼
|
||
┌────────────────────────┐ ┌───────────────────────────────────┐
|
||
│ AudioRecorder │ │ VisualObserver │
|
||
│ SCStream system audio │ │ SCStream window frames @2–4fps │
|
||
│ AVAudioEngine mic │ │ │ (frames released, never saved)│
|
||
│ → mic.wav, system.wav │ │ ▼ │
|
||
│ → mixed_mono_16k.wav │ │ AppAdapter (per app) │
|
||
│ + mic VAD → self spans │ │ OCR names + active-speaker cue │
|
||
└────────────┬───────────┘ │ → SpeakerObservation │
|
||
│ └──────────────┬────────────────────┘
|
||
│ ▼
|
||
│ ┌───────────────────────────┐
|
||
│ self spans ───▶│ TimelineBuilder │
|
||
│ │ debounce/coalesce + merge │
|
||
│ │ mic-VAD self spans │
|
||
│ │ → visual_timeline.json │
|
||
│ └──────────────┬────────────┘
|
||
│ │
|
||
▼ ▼
|
||
┌──────────────────────────────────────────────────────────┐
|
||
│ SessionPackager │
|
||
│ mixed_mono_16k.wav + visual_timeline.json + manifest │
|
||
│ + chunk plan (if call > ~3 min) │
|
||
└───────────────────────────┬──────────────────────────────┘
|
||
▼
|
||
┌──────────────────────────────────────┐
|
||
│ SparkControlClient │
|
||
│ per chunk, SEQUENTIAL: │
|
||
│ POST /api/audio/label-merge │
|
||
│ file=chunk.wav │
|
||
│ timeline=<chunk-local segments> │
|
||
│ known_voiceprints=<from store> │
|
||
│ transcribe=true │
|
||
│ → named segments + per-speaker prints │
|
||
└───────────────┬───────────┬────────────┘
|
||
│ │
|
||
offset+stitch│ │ fingerprints (keyed by name)
|
||
▼ ▼
|
||
┌────────────────┐ ┌────────────────────┐
|
||
│ speakers.json │ │ VoiceprintStore │
|
||
│ (named, global) │ │ persist/replay │
|
||
└────────────────┘ └────────────────────┘
|
||
```
|
||
|
||
(After `speakers.json`, a recap phase renders `transcript.md` + `recap.html` via
|
||
the backend LLM — see §2.11.)
|
||
|
||
## 2. Modules
|
||
|
||
### 2.1 `CallDetector`
|
||
Fire `callStarted(app:window:)` / `callEnded()`.
|
||
- **Mic active system-wide:** CoreAudio `kAudioDevicePropertyDeviceIsRunningSomewhere`
|
||
on the default input device (listener, not poll).
|
||
- **App present/active:** `NSWorkspace` running/frontmost vs the bundle-ID table.
|
||
- **Meet (browser):** when a browser is frontmost + mic live, read the active-tab
|
||
URL (AppleScript/Accessibility); confirm `meet.google.com`.
|
||
- **Heuristic:** `mic_running` AND (`known_native_app_active` OR `browser+meet_tab`).
|
||
Debounce ~2 s open; end when mic quiet > N s and the app/tab leaves foreground
|
||
or quits.
|
||
- Output: app id + the call window (`SCWindow`) for the `VisualObserver`.
|
||
|
||
### 2.2 `AudioRecorder`
|
||
- **System audio:** `SCStream` with `capturesAudio = true` (mixer-level; works with
|
||
headphones; no BlackHole). macOS 13+.
|
||
- **Mic:** `AVAudioEngine` input tap.
|
||
- Outputs: `mic.wav`, `system.wav`, and the backend deliverable
|
||
**`mixed_mono_16k.wav`** (mic + system summed → mono → 16 kHz PCM WAV).
|
||
- **Shared `t0`** (`CACurrentMediaTime`) stamped once; every audio sample and
|
||
visual observation is relative to it. Non-negotiable — the merge depends on it.
|
||
- **Mic VAD:** run lightweight energy/VAD on the mic track to emit "the user is
|
||
speaking" spans. These feed `TimelineBuilder` as pre-seeded **self** segments
|
||
(high confidence) so the backend names the user even when their own tile isn't
|
||
read — and so the user's voiceprint enrolls on call one.
|
||
|
||
### 2.3 `VisualObserver`
|
||
- `SCStream` scoped (via `SCContentFilter`) to the **specific call window**.
|
||
- Throttle to adapter fps (default 3). Hand each frame to the adapter; **release
|
||
immediately — never persist a frame.**
|
||
- **Window visibility / focus is NOT required.** SCK captures a window's own
|
||
rendered content even when it's in the background, occluded by other apps, or
|
||
on another Space. The user can work in other apps during the call and visual
|
||
capture continues normally. (This is a key reason for window capture over
|
||
display capture — also more private.)
|
||
- **Capture liveness — the one real failure mode.** Two states stop fresh frames:
|
||
1. **Minimized to the Dock** — macOS may freeze the window's backing buffer, so
|
||
SCK delivers stale/duplicate frames. Detect minimization
|
||
(`SCWindow.isOnScreen == false` / window state) and **pause visual analysis +
|
||
flag a `visual_gap` for that span** rather than emitting bogus observations.
|
||
2. **Browser tab switched away (Meet only)** — see §2.4 Meet note.
|
||
In both cases **audio keeps recording**, and the backend voiceprint fallback
|
||
still names previously-heard speakers — so a gap only costs naming precision for
|
||
*new, never-seen* speakers during that exact window. Record gaps in
|
||
`visual_timeline.json` (a `visual_gaps: [{start,end,reason}]` array) so the
|
||
cause is auditable; `TimelineBuilder` must not interpolate across a gap.
|
||
|
||
### 2.4 `AppAdapter` (protocol) + four implementations
|
||
```swift
|
||
protocol AppAdapter {
|
||
static var bundleIDs: [String] { get }
|
||
var preferredFPS: Int { get }
|
||
func analyze(frame: CVPixelBuffer, at t: TimeInterval) -> [SpeakerObservation]
|
||
func namesFromAccessibility() -> [String]? // optional
|
||
}
|
||
struct SpeakerObservation {
|
||
let name: String // OCR'd / a11y name; "" if unknown
|
||
let speaking: Bool // active-speaker cue detected
|
||
let bbox: CGRect
|
||
let confidence: Double // 0..1
|
||
let t: TimeInterval // relative to session t0
|
||
}
|
||
```
|
||
Per-adapter cues:
|
||
- **Zoom** (`us.zoom.xos`): colored tile border = active speaker; OCR the tile
|
||
name label; handle speaker + gallery layouts.
|
||
- **Teams** (`com.microsoft.teams2`): colored ring/border; labeled; like Zoom.
|
||
- **Signal** (`org.whispersystems.signal-desktop`): ring around avatar/initials;
|
||
try `namesFromAccessibility()` first (Electron a11y tree), OCR fallback.
|
||
- **Meet** (browser): **hybrid** — names via Accessibility/AppleScript (DOM text),
|
||
speaking cue via Vision (canvas/WebGL animated bars / tile highlight), fused by
|
||
tile position. Most likely to need iteration.
|
||
- **Tab-switch caveat (Meet-specific):** if Meet is a browser *tab* and the user
|
||
switches to a different tab **in the same window**, the browser stops rendering
|
||
the Meet tab → SCK captures a frozen last-frame (a `visual_gap`). Switching to a
|
||
different *app* is fine; switching tabs is not. Mitigations, in order: (1)
|
||
detect the active-tab URL leaving `meet.google.com` and flag a `visual_gap`
|
||
(don't emit stale observations); (2) prefer capturing Meet in a **dedicated
|
||
browser window / PWA / standalone window** so tab-switching can't blank it
|
||
— surface this as a one-time setup tip in the UI; (3) names still come from the
|
||
a11y/DOM tree where available, and audio + voiceprint fallback carry identity
|
||
through the gap regardless.
|
||
|
||
Each adapter is **testable offline** against PNG/JPEG frame fixtures.
|
||
|
||
### 2.5 `TimelineBuilder`
|
||
Turn noisy per-frame observations into clean `(start, end, name, confidence)`
|
||
segments.
|
||
- Group by name; open a segment after K consecutive speaking frames (e.g. 2),
|
||
close after M quiet frames (e.g. 2) — hysteresis rides out UI-cue lag/flicker.
|
||
- **Allow overlaps** (crosstalk). Do not force one speaker per instant.
|
||
- Merge in the mic-VAD **self** spans (the user) with high confidence.
|
||
- Normalize OCR name variants ("Sarah J" → "Sarah Jones") via a per-session
|
||
alias table.
|
||
- Emit `visual_timeline.json` (schema in `03_DATA_CONTRACTS.md`). The flat
|
||
`segments` array maps directly onto the `timeline` field `label-merge` wants.
|
||
|
||
### 2.6 `SessionPackager`
|
||
Write the session folder and, if the call is longer than ~3 min, produce a
|
||
**chunk plan**: ~2–3 min windows on `mixed_mono_16k.wav`, each with its
|
||
**timeline slice rebased to chunk-local seconds**.
|
||
```
|
||
~/Ten31Transcripts/sessions/2026-06-05T14-03_zoom/
|
||
mic.wav system.wav mixed_mono_16k.wav
|
||
visual_timeline.json
|
||
manifest.json
|
||
(chunks/ produced transiently if chunking)
|
||
speakers.json # written after backend hand-off
|
||
```
|
||
|
||
### 2.7 `SparkControlClient`
|
||
Deliver to SparkControl. **Primary path = `POST /api/audio/label-merge`**. Sends
|
||
**dual-channel** (`mic_file` + `system_file` + `self_name` + `self_vad`) when the
|
||
system track is healthy, else the **mono** `file`; always with `timeline`,
|
||
`known_voiceprints`, `transcribe=true`.
|
||
- **Sequential only** — one audio request in flight (parallel ⇒ `503 + Retry-After`).
|
||
- **Self-signed TLS** — skip verification (`URLSession` delegate trusting the
|
||
Start9 cert) or trust the Root CA. **No auth on the LAN.**
|
||
- **Per chunk:** call `label-merge` with that chunk's audio + rebased timeline +
|
||
the **accumulated** voiceprints; offset returned timestamps back to global and
|
||
append. Names unify across chunks because the same names/voiceprints are passed
|
||
each time; new voiceprints accumulate into the store.
|
||
- Retry on `503` after `Retry-After`; on hard failure keep the session folder and
|
||
surface "Resend" in the UI.
|
||
- Limits to respect: **200 MB/request** (`413`), transcription timeout ~300 s,
|
||
diarization ~600 s. Chunking keeps requests well under these.
|
||
- See `03_DATA_CONTRACTS.md §4` for exact fields and a real response.
|
||
|
||
### 2.8 result assembly → `speakers.json`
|
||
Concatenate the per-chunk `label-merge` results into one global, named,
|
||
speaker-attributed transcript (timestamps offset to session time). This is the
|
||
seam to the user's existing summarizer. The app does not analyze past this.
|
||
|
||
### 2.9 `VoiceprintStore`
|
||
Local persistence of named voiceprints — the compounding-identity layer.
|
||
- File: `~/Ten31Transcripts/voiceprints.json` →
|
||
`{ "<name>": { "vector": [192 floats], "updated": <iso>, "calls": <int> } }`.
|
||
- **On send:** load all entries → pass as `known_voiceprints` to `label-merge`.
|
||
- **On response:** for each speaker resolved by **visual** (or a high-similarity
|
||
**voiceprint** match), store/refresh that name's vector. **Never** store
|
||
`Unknown_N`.
|
||
- **Update policy (open, start simple):** overwrite with the latest
|
||
high-confidence vector, or keep a running mean per name. v1 = store latest with
|
||
`overlap_confidence ≥ ~0.8`; refine to averaging later (`01 §7.4`).
|
||
- Editable/clearable from the menu-bar UI (rename, delete a person, reset).
|
||
|
||
### 2.10 `MenuBarUI` (SwiftUI, `LSUIElement`)
|
||
Status (idle / detected / recording / finishing), manual start/stop with live
|
||
mic/system level meters, and the **last session** — reveal in Finder, resend
|
||
("Send to backend"), open recap, and edit speakers — plus "Open saved session…"
|
||
to reprocess an existing folder. Also a **backend host + health check**
|
||
(`GET /api/status`), adapter toggles, output folder, and a permissions checklist
|
||
(Microphone, Screen Recording, Accessibility). (No multi-session list or
|
||
voiceprint-manager UI yet — those are in `ROADMAP.md`.)
|
||
|
||
### 2.11 Recap (`RecapAnalyzer`, `RecapRenderer`)
|
||
After `speakers.json`, the recap phase turns the named transcript into the
|
||
human-readable deliverables. `RecapAnalyzer` calls the backend LLM
|
||
(`POST /v1/chat/completions`, Qwen3) for topics + meeting extras; `RecapRenderer`
|
||
writes `transcript.md` (one line per diarized utterance) and `recap.html` (+ a
|
||
`recap.json` sidecar). The in-app speaker editor (`SpeakerEditing` /
|
||
`RecapEditModel`) rewrites names across all outputs after the fact. All
|
||
language-model work stays on the backend; the app orchestrates and renders.
|
||
|
||
## 3. macOS frameworks & permissions
|
||
|
||
| Need | Framework | Permission |
|
||
|---|---|---|
|
||
| System audio + window frames | ScreenCaptureKit | Screen Recording |
|
||
| Microphone | AVFoundation / CoreAudio | Microphone |
|
||
| Meet/Signal names, tab URL | Accessibility (AXUIElement) / AppleScript | Accessibility + Automation |
|
||
| OCR + cue analysis | Vision (`VNRecognizeTextRequest`) | none |
|
||
| App/tab detection | AppKit (`NSWorkspace`) | none |
|
||
|
||
Stable signing identity avoids permission re-prompts on rebuild.
|
||
|
||
## 4. Performance
|
||
Window-scoped capture + 3 fps + Vision-on-Neural-Engine is light; audio is cheap;
|
||
frames are released immediately so memory stays flat. The app idles near-zero
|
||
until a call starts. Backend requests are sequential and chunked, so they never
|
||
saturate the GPU.
|
||
|
||
## 5. The merge — now done by the backend
|
||
The app no longer implements the overlap vote. `label-merge` resolves each
|
||
anonymous cluster in order:
|
||
1. **visual** — timeline name with the most temporal overlap (`source: "visual"`,
|
||
`overlap_confidence`);
|
||
2. **voiceprint** — closest `known_voiceprints` match above `voiceprint_threshold`
|
||
(`source: "voiceprint"`, `match_similarity`);
|
||
3. **`Unknown_N`** (`source: "unmatched"`) — never guessed/mislabeled.
|
||
The app's contribution is a good timeline (incl. mic-VAD self spans) and an
|
||
ever-growing voiceprint library. `min_overlap` and `voiceprint_threshold` are
|
||
tunable request params if precision needs adjusting.
|
||
```
|
||
```
|