
# Architecture — Ten31 Transcripts

Companion to `01_PROJECT_BRIEF.md`. Module layout, data flow, the per-app adapter
pattern, the macOS APIs, and the SparkControl integration (now fully specified).

---

## 1. High-level data flow

```
                    ┌─────────────────────────────────────────┐
                    │              CallDetector                 │
                    │  CoreAudio "mic running somewhere"        │
                    │  + known-app / Meet-tab heuristic         │
                    └───────────────┬───────────────────────────┘
                                    │ callStarted(app, window)
                                    ▼
        ┌──────────────────────────────────────────────────────────┐
        │                    SessionController                       │
        │   owns one Session; shared t0; start/stop; on end package  │
        └───────┬───────────────────────────┬──────────────────────┘
                │                            │
                ▼                            ▼
   ┌────────────────────────┐   ┌───────────────────────────────────┐
   │     AudioRecorder       │   │         VisualObserver            │
   │  SCStream system audio  │   │  SCStream window frames @2–4fps    │
   │  AVAudioEngine mic       │   │     │ (frames released, never saved)│
   │  → mic.wav, system.wav   │   │     ▼                              │
   │  → mixed_mono_16k.wav    │   │  AppAdapter (per app)              │
   │  + mic VAD → self spans   │   │   OCR names + active-speaker cue   │
   └────────────┬───────────┘   │   → SpeakerObservation              │
                │               └──────────────┬────────────────────┘
                │                              ▼
                │                  ┌───────────────────────────┐
                │   self spans ───▶│      TimelineBuilder        │
                │                  │  debounce/coalesce + merge   │
                │                  │  mic-VAD self spans          │
                │                  │  → visual_timeline.json      │
                │                  └──────────────┬────────────┘
                │                                 │
                ▼                                 ▼
        ┌──────────────────────────────────────────────────────────┐
        │                       SessionPackager                      │
        │   mixed_mono_16k.wav + visual_timeline.json + manifest      │
        │   + chunk plan (if call > ~3 min)                           │
        └───────────────────────────┬──────────────────────────────┘
                                     ▼
                   ┌──────────────────────────────────────┐
                   │          SparkControlClient            │
                   │  per chunk, SEQUENTIAL:                 │
                   │   POST /api/audio/label-merge           │
                   │     file=chunk.wav                      │
                   │     timeline=<chunk-local segments>     │
                   │     known_voiceprints=<from store>      │
                   │     transcribe=true                     │
                   │  → named segments + per-speaker prints   │
                   └───────────────┬───────────┬────────────┘
                                   │           │
                       offset+stitch│           │ fingerprints (keyed by name)
                                   ▼           ▼
                       ┌────────────────┐  ┌────────────────────┐
                       │  speakers.json  │  │   VoiceprintStore   │
                       │ (named, global) │  │  persist/replay     │
                       └────────────────┘  └────────────────────┘
```

(After `speakers.json`, a recap phase renders `transcript.md` + `recap.html` via
the backend LLM — see §2.11.)

## 2. Modules

### 2.1 `CallDetector`
Fires `callStarted(app:window:)` when a call begins and `callEnded()` when it ends.
- **Mic active system-wide:** CoreAudio `kAudioDevicePropertyDeviceIsRunningSomewhere`
  on the default input device (listener, not poll).
- **App present/active:** `NSWorkspace` running/frontmost vs the bundle-ID table.
- **Meet (browser):** when a browser is frontmost + mic live, read the active-tab
  URL (AppleScript/Accessibility); confirm `meet.google.com`.
- **Heuristic:** `mic_running` AND (`known_native_app_active` OR `browser+meet_tab`).
  Debounce the start by ~2 s; end the call when the mic has been quiet for more
  than N s and the app/tab has left the foreground or quit.
- Output: app id + the call window (`SCWindow`) for the `VisualObserver`.
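
The heuristic above can be sketched as a small debounced state machine. This is a minimal illustration, not the shipped detector: the type names, the 10 s close delay standing in for "N s", and the idea of feeding it pre-computed booleans are all assumptions. The real signals would come from the CoreAudio listener, `NSWorkspace`, and the tab-URL check.

```swift
import Foundation

/// Snapshot of the detector's inputs (field names are illustrative).
struct DetectorSignals {
    var micRunning: Bool            // CoreAudio "running somewhere" on the default input
    var knownNativeAppActive: Bool  // NSWorkspace match against the bundle-ID table
    var browserWithMeetTab: Bool    // frontmost browser whose active tab is meet.google.com
}

enum CallEvent: Equatable { case started, ended }

/// Debounced call state: the open condition must hold for `openDelay` seconds,
/// and the close condition for `closeDelay` seconds, before the state flips.
struct CallStateMachine {
    let openDelay: TimeInterval
    let closeDelay: TimeInterval    // assumed stand-in for the document's "N s"
    private(set) var inCall = false
    private var pendingSince: TimeInterval?

    init(openDelay: TimeInterval = 2, closeDelay: TimeInterval = 10) {
        self.openDelay = openDelay
        self.closeDelay = closeDelay
    }

    mutating func update(_ s: DetectorSignals, at now: TimeInterval) -> CallEvent? {
        let appPresent = s.knownNativeAppActive || s.browserWithMeetTab
        // While idle we wait for "mic AND app"; while in a call we wait for
        // "mic quiet AND app gone".
        let wantsFlip = inCall ? (!s.micRunning && !appPresent)
                               : (s.micRunning && appPresent)
        guard wantsFlip else { pendingSince = nil; return nil }
        guard let since = pendingSince else { pendingSince = now; return nil }
        guard now - since >= (inCall ? closeDelay : openDelay) else { return nil }
        inCall.toggle()
        pendingSince = nil
        return inCall ? .started : .ended
    }
}
```

Keeping the state machine free of framework calls means the debounce timing can be unit-tested with synthetic timestamps.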

### 2.2 `AudioRecorder`
- **System audio:** `SCStream` with `SCStreamConfiguration.capturesAudio = true`
  (mixer-level; works with headphones; no BlackHole). Requires macOS 13+.
- **Mic:** `AVAudioEngine` input tap.
- Outputs: `mic.wav`, `system.wav`, and the backend deliverable
  **`mixed_mono_16k.wav`** (mic + system summed → mono → 16 kHz PCM WAV).
- **Shared `t0`** (`CACurrentMediaTime`) stamped once; every audio sample and
  visual observation is relative to it. Non-negotiable — the merge depends on it.
- **Mic VAD:** run lightweight energy/VAD on the mic track to emit "the user is
  speaking" spans. These feed `TimelineBuilder` as pre-seeded **self** segments
  (high confidence) so the backend names the user even when their own tile isn't
  read — and so the user's voiceprint enrolls on call one.

### 2.3 `VisualObserver`
- `SCStream` scoped (via `SCContentFilter`) to the **specific call window**.
- Throttle to adapter fps (default 3). Hand each frame to the adapter; **release
  immediately — never persist a frame.**
- **Window visibility / focus is NOT required.** SCK captures a window's own
  rendered content even when it's in the background, occluded by other apps, or
  on another Space. The user can work in other apps during the call and visual
  capture continues normally. (This is a key reason for window capture over
  display capture — also more private.)
- **Capture liveness — the one real failure mode.** Two states stop fresh frames:
  1. **Minimized to the Dock** — macOS may freeze the window's backing buffer, so
     SCK delivers stale/duplicate frames. Detect minimization
     (`SCWindow.isOnScreen == false` / window state) and **pause visual analysis +
     flag a `visual_gap` for that span** rather than emitting bogus observations.
  2. **Browser tab switched away (Meet only):** see the Meet note in §2.4.

  In both cases **audio keeps recording**, and the backend voiceprint fallback
  still names previously-heard speakers — so a gap only costs naming precision for
  *new, never-seen* speakers during that exact window. Record gaps in
  `visual_timeline.json` (a `visual_gaps: [{start,end,reason}]` array) so the
  cause is auditable; `TimelineBuilder` must not interpolate across a gap.
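
One way to keep the gap bookkeeping honest is a small tracker that the observer consults on every frame tick. The sketch below is an assumption about shape only: `GapTracker`, `shouldAnalyze`, and the reason strings are hypothetical names, and only the `{start, end, reason}` fields come from the description above.

```swift
import Foundation

struct VisualGap: Codable, Equatable {
    var start: TimeInterval
    var end: TimeInterval
    var reason: String   // e.g. "minimized" or "tab_switched" (illustrative values)
}

struct GapTracker {
    private(set) var gaps: [VisualGap] = []
    private var pending: (start: TimeInterval, reason: String)?

    /// Call once per frame tick. `staleReason` is nil while capture is live.
    /// Returns true when the frame may be handed to the adapter.
    mutating func shouldAnalyze(at t: TimeInterval, staleReason: String?) -> Bool {
        if let gap = pending, gap.reason != staleReason {
            // Capture recovered, or the cause changed: close the open gap.
            gaps.append(VisualGap(start: gap.start, end: t, reason: gap.reason))
            pending = nil
        }
        if pending == nil, let reason = staleReason {
            pending = (start: t, reason: reason)
        }
        return staleReason == nil
    }

    /// Close any gap still open when the session ends.
    mutating func finish(at t: TimeInterval) {
        if let gap = pending {
            gaps.append(VisualGap(start: gap.start, end: t, reason: gap.reason))
            pending = nil
        }
    }
}

/// Encodes the gaps under the `visual_gaps` key for `visual_timeline.json`.
func encodeGaps(_ gaps: [VisualGap]) throws -> Data {
    try JSONEncoder().encode(["visual_gaps": gaps])
}
```

Since no observation is emitted while `shouldAnalyze` returns false, the timeline builder never sees frames from inside a gap.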

### 2.4 `AppAdapter` (protocol) + four implementations
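```swift
import CoreGraphics   // CGRect
import CoreVideo      // CVPixelBuffer
import Foundation     // TimeInterval

protocol AppAdapter {
    /// Bundle IDs this adapter handles.
    static var bundleIDs: [String] { get }
    /// Frame rate the `VisualObserver` throttles to for this app.
    var preferredFPS: Int { get }
    /// Analyze one frame; `t` is seconds relative to the session t0.
    func analyze(frame: CVPixelBuffer, at t: TimeInterval) -> [SpeakerObservation]
    /// Optional capability: return nil when the app exposes no usable a11y names.
    func namesFromAccessibility() -> [String]?
}
struct SpeakerObservation {
    let name: String          // OCR'd / a11y name; "" if unknown
    let speaking: Bool        // active-speaker cue detected
    let bbox: CGRect
    let confidence: Double     // 0..1
    let t: TimeInterval        // relative to session t0
}
```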
Per-adapter cues:
- **Zoom** (`us.zoom.xos`): colored tile border = active speaker; OCR the tile
  name label; handle speaker + gallery layouts.
- **Teams** (`com.microsoft.teams2`): colored ring/border; labeled; like Zoom.
- **Signal** (`org.whispersystems.signal-desktop`): ring around avatar/initials;
  try `namesFromAccessibility()` first (Electron a11y tree), OCR fallback.
- **Meet** (browser): **hybrid** — names via Accessibility/AppleScript (DOM text),
  speaking cue via Vision (canvas/WebGL animated bars / tile highlight), fused by
  tile position. Most likely to need iteration.
  - **Tab-switch caveat (Meet-specific):** if Meet is a browser *tab* and the user
    switches to a different tab **in the same window**, the browser stops rendering
    the Meet tab → SCK captures a frozen last-frame (a `visual_gap`). Switching to a
    different *app* is fine; switching tabs is not. Mitigations, in order: (1)
    detect the active-tab URL leaving `meet.google.com` and flag a `visual_gap`
    (don't emit stale observations); (2) prefer capturing Meet in a **dedicated
    browser window / PWA / standalone window** so tab-switching can't blank it
    — surface this as a one-time setup tip in the UI; (3) names still come from the
    a11y/DOM tree where available, and audio + voiceprint fallback carry identity
    through the gap regardless.

Each adapter is **testable offline** against PNG/JPEG frame fixtures.
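
The Meet adapter's "fused by tile position" step can be illustrated with plain geometry. This is a hedged sketch: `NamedTile` and `SpeakingCue` are hypothetical intermediate types (names from the a11y/DOM side, cues from the Vision side), and the center-point containment rule is one plausible matching strategy, not a confirmed design.

```swift
import Foundation

struct SpeakerObservation {   // mirrors the struct defined above
    let name: String
    let speaking: Bool
    let bbox: CGRect
    let confidence: Double
    let t: TimeInterval
}

struct NamedTile { let name: String; let frame: CGRect }          // hypothetical: a11y/DOM
struct SpeakingCue { let bbox: CGRect; let confidence: Double }   // hypothetical: Vision

/// Attach each speaking cue to the tile that contains the cue's center point.
func fuse(tiles: [NamedTile], cues: [SpeakingCue], at t: TimeInterval) -> [SpeakerObservation] {
    return tiles.map { tile -> SpeakerObservation in
        let hit = cues
            .filter { tile.frame.contains(CGPoint(x: $0.bbox.midX, y: $0.bbox.midY)) }
            .max(by: { $0.confidence < $1.confidence })
        // Tiles without a cue are reported as not speaking with confidence 0.
        return SpeakerObservation(name: tile.name, speaking: hit != nil,
                                  bbox: tile.frame, confidence: hit?.confidence ?? 0, t: t)
    }
}
```

Because the function takes rectangles rather than pixels, it can be exercised with hand-written fixtures alongside the PNG/JPEG frame tests.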

### 2.5 `TimelineBuilder`
Turn noisy per-frame observations into clean `(start, end, name, confidence)`
segments.
- Group by name; open a segment after K consecutive speaking frames (e.g. 2),
  close after M quiet frames (e.g. 2) — hysteresis rides out UI-cue lag/flicker.
- **Allow overlaps** (crosstalk). Do not force one speaker per instant.
- Merge in the mic-VAD **self** spans (the user) with high confidence.
- Normalize OCR name variants ("Sarah J" → "Sarah Jones") via a per-session
  alias table.
- Emit `visual_timeline.json` (schema in `03_DATA_CONTRACTS.md`). The flat
  `segments` array maps directly onto the `timeline` field `label-merge` wants.
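
The open-after-K / close-after-M hysteresis can be sketched for a single name as follows. The function and type names are illustrative, confidence aggregation and alias normalization are left out, and the choice to end a segment at the last speaking frame (rather than the first quiet one) is an assumption.

```swift
import Foundation

struct Segment: Equatable {
    var start: TimeInterval
    var end: TimeInterval
    var name: String
}

/// `frames` holds one sample per analyzed frame for one name, in time order.
func segments(for name: String, frames: [(t: TimeInterval, speaking: Bool)],
              openAfter k: Int = 2, closeAfter m: Int = 2) -> [Segment] {
    var out: [Segment] = []
    var speakingRun = 0, quietRun = 0
    var runStart: TimeInterval = 0       // first frame of the current speaking run
    var lastSpeaking: TimeInterval = 0   // most recent speaking frame
    var current: Segment?

    for f in frames {
        if f.speaking {
            if speakingRun == 0 { runStart = f.t }
            speakingRun += 1
            quietRun = 0
            lastSpeaking = f.t
            if current == nil, speakingRun >= k {
                current = Segment(start: runStart, end: f.t, name: name)
            }
        } else {
            quietRun += 1
            speakingRun = 0
            if var seg = current, quietRun >= m {
                seg.end = lastSpeaking
                out.append(seg)
                current = nil
            }
        }
    }
    if var seg = current {               // session ended while the segment was open
        seg.end = lastSpeaking
        out.append(seg)
    }
    return out
}
```

Running this per name and concatenating the results keeps overlapping speakers intact, since no step forces a single speaker per instant.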

### 2.6 `SessionPackager`
Write the session folder and, if the call is longer than ~3 min, produce a
**chunk plan**: ~2–3 min windows on `mixed_mono_16k.wav`, each with its
**timeline slice rebased to chunk-local seconds**.
```
~/Ten31Transcripts/sessions/2026-06-05T14-03_zoom/
  mic.wav  system.wav  mixed_mono_16k.wav
  visual_timeline.json
  manifest.json
  (chunks/ produced transiently if chunking)
  speakers.json            # written after backend hand-off
```
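
A chunk plan with rebased timeline slices could be computed along these lines. This is a sketch under assumptions: fixed-length windows with hard cuts (no overlap, no silence-aware boundaries), a 150 s default window inside the stated ~2–3 min range, and illustrative type names. Segments that straddle a boundary are clipped into both chunks.

```swift
import Foundation

struct TimelineSegment: Equatable {
    var start: TimeInterval
    var end: TimeInterval
    var name: String
}

struct Chunk: Equatable {
    var offset: TimeInterval            // chunk start in session time
    var duration: TimeInterval
    var timeline: [TimelineSegment]     // rebased to chunk-local seconds
}

func chunkPlan(duration: TimeInterval, timeline: [TimelineSegment],
               window: TimeInterval = 150) -> [Chunk] {
    guard duration > 0, window > 0 else { return [] }
    var chunks: [Chunk] = []
    var offset: TimeInterval = 0
    while offset < duration {
        let end = min(offset + window, duration)
        let slice = timeline.compactMap { seg -> TimelineSegment? in
            // Clip the segment to this window, then shift it to chunk-local time.
            let s = max(seg.start, offset)
            let e = min(seg.end, end)
            guard e > s else { return nil }
            return TimelineSegment(start: s - offset, end: e - offset, name: seg.name)
        }
        chunks.append(Chunk(offset: offset, duration: end - offset, timeline: slice))
        offset = end
    }
    return chunks
}
```

Storing `offset` on each chunk is what later lets the returned timestamps be shifted back to session time.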

### 2.7 `SparkControlClient`
Deliver to SparkControl. **Primary path = `POST /api/audio/label-merge`**. The
client sends **dual-channel** audio (`mic_file` + `system_file` + `self_name` +
`self_vad`) when the system track is healthy, and otherwise the **mono** `file`;
every request also carries `timeline`, `known_voiceprints`, and `transcribe=true`.
- **Sequential only** — one audio request in flight (parallel ⇒ `503 + Retry-After`).
- **Self-signed TLS** — skip verification (`URLSession` delegate trusting the
  Start9 cert) or trust the Root CA. **No auth on the LAN.**
- **Per chunk:** call `label-merge` with that chunk's audio + rebased timeline +
  the **accumulated** voiceprints; offset returned timestamps back to global and
  append. Names unify across chunks because the same names/voiceprints are passed
  each time; new voiceprints accumulate into the store.
- Retry on `503` after `Retry-After`; on hard failure keep the session folder and
  surface "Resend" in the UI.
- Limits to respect: **200 MB/request** (`413`), transcription timeout ~300 s,
  diarization ~600 s. Chunking keeps requests well under these.
- See `03_DATA_CONTRACTS.md §4` for exact fields and a real response.
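
The sequential-send and `503` retry rules can be sketched without any networking by injecting the transport. Everything named here (`SendResult`, `deliver`, `maxRetries`, the `backoff` closure) is hypothetical; a real client would presumably wrap `URLSession` and map HTTP statuses onto these cases.

```swift
import Foundation

enum SendResult<Body> {
    case ok(Body)
    case busy(retryAfter: TimeInterval)   // HTTP 503 with a Retry-After value
    case failed(status: Int)              // any other non-success status
}

enum DeliveryError: Error, Equatable {
    case hardFailure(status: Int)
    case tooManyRetries
}

/// Sends chunks strictly one at a time. `send` performs one request and
/// `backoff` waits; both are injected so the loop runs without a network.
func deliver<Chunk, Body>(
    _ chunks: [Chunk],
    maxRetries: Int = 3,                  // assumed cap; the document does not give one
    send: (Chunk) -> SendResult<Body>,
    backoff: (TimeInterval) -> Void = { Thread.sleep(forTimeInterval: $0) }
) throws -> [Body] {
    var results: [Body] = []
    for chunk in chunks {                 // the next request starts only after this one resolves
        var attempts = 0
        sending: while true {
            switch send(chunk) {
            case .ok(let body):
                results.append(body)
                break sending
            case .busy(let retryAfter):
                attempts += 1
                guard attempts <= maxRetries else { throw DeliveryError.tooManyRetries }
                backoff(retryAfter)
            case .failed(let status):
                throw DeliveryError.hardFailure(status: status)
            }
        }
    }
    return results
}
```

On a thrown error the caller keeps the session folder untouched, which is what makes the "Resend" action possible.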

### 2.8 Result assembly → `speakers.json`
Concatenate the per-chunk `label-merge` results into one global, named,
speaker-attributed transcript (timestamps offset to session time). This is the
seam to the user's existing summarizer. The app does not analyze past this.
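
The offset-and-stitch step is simple enough to show in a few lines. The `NamedSegment` fields below are placeholders for illustration; the real response shape is defined in `03_DATA_CONTRACTS.md`.

```swift
import Foundation

struct NamedSegment: Equatable {   // hypothetical projection of a label-merge segment
    var start: TimeInterval
    var end: TimeInterval
    var speaker: String
    var text: String
}

/// `results` pairs each chunk's session-time offset with its chunk-local segments.
func stitch(_ results: [(offset: TimeInterval, segments: [NamedSegment])]) -> [NamedSegment] {
    return results
        .flatMap { chunk in
            chunk.segments.map { seg -> NamedSegment in
                var global = seg
                global.start += chunk.offset   // chunk-local -> session time
                global.end += chunk.offset
                return global
            }
        }
        .sorted { $0.start < $1.start }
}
```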

### 2.9 `VoiceprintStore`
Local persistence of named voiceprints — the compounding-identity layer.
- File: `~/Ten31Transcripts/voiceprints.json` →
  `{ "<name>": { "vector": [192 floats], "updated": <iso>, "calls": <int> } }`.
- **On send:** load all entries → pass as `known_voiceprints` to `label-merge`.
- **On response:** for each speaker resolved by **visual** (or a high-similarity
  **voiceprint** match), store/refresh that name's vector. **Never** store
  `Unknown_N`.
- **Update policy (open, start simple):** either overwrite with the latest
  high-confidence vector or keep a running mean per name. v1 stores the latest
  vector when `overlap_confidence ≥ ~0.8`; averaging is a later refinement
  (`01_PROJECT_BRIEF.md §7.4`).
- Editable/clearable from the menu-bar UI (rename, delete a person, reset).
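
The v1 "store latest" policy might be expressed as below. `ResolvedSpeaker` is a hypothetical projection of a `label-merge` speaker entry, and applying the same 0.8 floor to voiceprint matches is an assumption, since the text only says "high-similarity" for that case.

```swift
import Foundation

struct Voiceprint: Codable, Equatable {
    var vector: [Float]
    var updated: Date
    var calls: Int
}

struct ResolvedSpeaker {        // hypothetical: one speaker from a label-merge response
    var name: String
    var source: String          // "visual", "voiceprint", or "unmatched"
    var confidence: Double      // overlap_confidence or match_similarity
    var vector: [Float]
}

/// v1 policy: overwrite with the latest vector from a confident resolution.
func apply(_ speakers: [ResolvedSpeaker], to store: inout [String: Voiceprint],
           minConfidence: Double = 0.8, now: Date = Date()) {
    for s in speakers {
        // Never persist unresolved clusters or low-confidence resolutions.
        guard s.source != "unmatched", !s.name.hasPrefix("Unknown_"),
              s.confidence >= minConfidence else { continue }
        let calls = (store[s.name]?.calls ?? 0) + 1
        store[s.name] = Voiceprint(vector: s.vector, updated: now, calls: calls)
    }
}

/// Serializes the store in the `{ "<name>": { vector, updated, calls } }` shape.
func encode(_ store: [String: Voiceprint]) throws -> Data {
    let encoder = JSONEncoder()
    encoder.dateEncodingStrategy = .iso8601
    return try encoder.encode(store)
}
```

Switching to a running mean later would only change the line that builds the new `Voiceprint`; the guard stays the same.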

### 2.10 `MenuBarUI` (SwiftUI, `LSUIElement`)
Status (idle / detected / recording / finishing), manual start/stop with live
mic/system level meters, and the **last session** — reveal in Finder, resend
("Send to backend"), open recap, and edit speakers — plus "Open saved session…"
to reprocess an existing folder. Also a **backend host + health check**
(`GET /api/status`), adapter toggles, output folder, and a permissions checklist
(Microphone, Screen Recording, Accessibility). (No multi-session list or
voiceprint-manager UI yet — those are in `ROADMAP.md`.)

### 2.11 Recap (`RecapAnalyzer`, `RecapRenderer`)
After `speakers.json`, the recap phase turns the named transcript into the
human-readable deliverables. `RecapAnalyzer` calls the backend LLM
(`POST /v1/chat/completions`, Qwen3) for topics + meeting extras; `RecapRenderer`
writes `transcript.md` (one line per diarized utterance) and `recap.html` (+ a
`recap.json` sidecar). The in-app speaker editor (`SpeakerEditing` /
`RecapEditModel`) rewrites names across all outputs after the fact. All
language-model work stays on the backend; the app orchestrates and renders.
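
As a rough illustration of "one line per diarized utterance", a renderer could look like this. The `[mm:ss] **Name:** text` line format is purely an assumption for the example; the actual `transcript.md` layout is not specified here.

```swift
import Foundation

struct Utterance {              // hypothetical: one entry from speakers.json
    var start: TimeInterval
    var speaker: String
    var text: String
}

/// Formats seconds as zero-padded mm:ss.
func timestamp(_ t: TimeInterval) -> String {
    let total = Int(t)
    return String(format: "%02d:%02d", total / 60, total % 60)
}

/// One Markdown line per utterance, in the order given.
func renderTranscript(_ utterances: [Utterance]) -> String {
    return utterances
        .map { "[\(timestamp($0.start))] **\($0.speaker):** \($0.text)" }
        .joined(separator: "\n")
}
```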

## 3. macOS frameworks & permissions

| Need | Framework | Permission |
|---|---|---|
| System audio + window frames | ScreenCaptureKit | Screen Recording |
| Microphone | AVFoundation / CoreAudio | Microphone |
| Meet/Signal names, tab URL | Accessibility (AXUIElement) / AppleScript | Accessibility + Automation |
| OCR + cue analysis | Vision (`VNRecognizeTextRequest`) | none |
| App detection (running/frontmost) | AppKit (`NSWorkspace`) | none |

Stable signing identity avoids permission re-prompts on rebuild.

## 4. Performance
Window-scoped capture at 3 fps with Vision on the Neural Engine is light, audio
capture is cheap, and frames are released immediately, so memory stays flat. The
app idles at near-zero load until a call starts. Backend requests are sequential
and chunked, so they never saturate the backend GPU.

## 5. The merge — now done by the backend
The app no longer implements the overlap vote. `label-merge` resolves each
anonymous cluster in order:
1. **visual** — timeline name with the most temporal overlap (`source: "visual"`,
   `overlap_confidence`);
2. **voiceprint** — closest `known_voiceprints` match above `voiceprint_threshold`
   (`source: "voiceprint"`, `match_similarity`);
3. **`Unknown_N`** (`source: "unmatched"`): a cluster is never guessed or mislabeled.

The app's contribution is a good timeline (incl. mic-VAD self spans) and an
ever-growing voiceprint library. `min_overlap` and `voiceprint_threshold` are
tunable request params if precision needs adjusting.
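
For intuition only, the resolution order can be mimicked in a few lines. This is not the backend's code: treating `min_overlap` as the fraction of the cluster's speech covered by the winning name, and using cosine similarity for voiceprints, are both assumptions made for the sketch.

```swift
import Foundation

typealias Interval = (start: Double, end: Double)

/// Length of the intersection of two intervals, or 0 if they are disjoint.
func overlap(_ a: Interval, _ b: Interval) -> Double {
    return max(0, min(a.end, b.end) - max(a.start, b.start))
}

/// Cosine similarity; returns 0 when either vector has zero length.
func cosine(_ a: [Double], _ b: [Double]) -> Double {
    let dot = zip(a, b).reduce(0.0) { $0 + $1.0 * $1.1 }
    let na = a.reduce(0.0) { $0 + $1 * $1 }.squareRoot()
    let nb = b.reduce(0.0) { $0 + $1 * $1 }.squareRoot()
    return na > 0 && nb > 0 ? dot / (na * nb) : 0
}

struct Resolution: Equatable { var name: String; var source: String }

func resolve(cluster: [Interval], embedding: [Double],
             timeline: [(name: String, span: Interval)],
             known: [String: [Double]],
             minOverlap: Double, voiceprintThreshold: Double,
             unknownIndex: Int) -> Resolution {
    // 1. visual: the timeline name with the most temporal overlap.
    var totals: [String: Double] = [:]
    for seg in timeline {
        for c in cluster { totals[seg.name, default: 0] += overlap(c, seg.span) }
    }
    let spoken = cluster.reduce(0.0) { $0 + ($1.end - $1.start) }
    if let best = totals.max(by: { $0.value < $1.value }),
       spoken > 0, best.value > 0, best.value / spoken >= minOverlap {
        return Resolution(name: best.key, source: "visual")
    }
    // 2. voiceprint: the closest known vector above the threshold.
    let scored = known.map { (name: $0.key, sim: cosine(embedding, $0.value)) }
    if let best = scored.max(by: { $0.sim < $1.sim }), best.sim >= voiceprintThreshold {
        return Resolution(name: best.name, source: "voiceprint")
    }
    // 3. unmatched: never guess a name.
    return Resolution(name: "Unknown_\(unknownIndex)", source: "unmatched")
}
```

Raising `minOverlap` or `voiceprintThreshold` in this model pushes more clusters down to `Unknown_N`, which matches the precision-over-recall stance described above.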