Files
ten31-transcripts/docs/02_ARCHITECTURE.md
T
Grant Gilliam b2ae3a62b9 Phase 0: menu-bar scaffold, permissions, backend health check
Native SwiftUI menu-bar app (LSUIElement, macOS 13+), generated from project.yml
via XcodeGen. Includes:
- PermissionsManager (Microphone / Screen Recording / Accessibility) + UI
- SparkControlHealth: GET /api/status over self-signed TLS (InsecureTrustDelegate)
- AppSettings persistence (host, TLS-skip, output folder, adapter toggles)
- Menu-bar panel + Settings, app sandbox & hardened runtime off (LAN tool)
2026-06-05 19:33:53 -05:00

15 KiB
Raw Blame History

Architecture — Ten31 Transcripts

Companion to 01_PROJECT_BRIEF.md. Module layout, data flow, the per-app adapter pattern, the macOS APIs, and the SparkControl integration (now fully specified).


1. High-level data flow

                    ┌─────────────────────────────────────────┐
                    │              CallDetector                 │
                    │  CoreAudio "mic running somewhere"        │
                    │  + known-app / Meet-tab heuristic         │
                    └───────────────┬───────────────────────────┘
                                    │ callStarted(app, window)
                                    ▼
        ┌──────────────────────────────────────────────────────────┐
        │                    SessionController                       │
        │   owns one Session; shared t0; start/stop; on end package  │
        └───────┬───────────────────────────┬──────────────────────┘
                │                            │
                ▼                            ▼
   ┌────────────────────────┐   ┌───────────────────────────────────┐
   │     AudioRecorder       │   │         VisualObserver            │
   │  SCStream system audio  │   │  SCStream window frames @24fps    │
   │  AVAudioEngine mic       │   │     │ (frames released, never saved)│
   │  → mic.wav, system.wav   │   │     ▼                              │
   │  → mixed_mono_16k.wav    │   │  AppAdapter (per app)              │
   │  + mic VAD → self spans   │   │   OCR names + active-speaker cue   │
   └────────────┬───────────┘   │   → SpeakerObservation              │
                │               └──────────────┬────────────────────┘
                │                              ▼
                │                  ┌───────────────────────────┐
                │   self spans ───▶│      TimelineBuilder        │
                │                  │  debounce/coalesce + merge   │
                │                  │  mic-VAD self spans          │
                │                  │  → visual_timeline.json      │
                │                  └──────────────┬────────────┘
                │                                 │
                ▼                                 ▼
        ┌──────────────────────────────────────────────────────────┐
        │                       SessionPackager                      │
        │   mixed_mono_16k.wav + visual_timeline.json + manifest      │
        │   + chunk plan (if call > ~3 min)                           │
        └───────────────────────────┬──────────────────────────────┘
                                     ▼
                   ┌──────────────────────────────────────┐
                   │          SparkControlClient            │
                   │  per chunk, SEQUENTIAL:                 │
                   │   POST /api/audio/label-merge           │
                   │     file=chunk.wav                      │
                   │     timeline=<chunk-local segments>     │
                   │     known_voiceprints=<from store>      │
                   │     transcribe=true                     │
                   │  → named segments + per-speaker prints   │
                   └───────────────┬───────────┬────────────┘
                                   │           │
                       offset+stitch│           │ fingerprints (keyed by name)
                                   ▼           ▼
                       ┌────────────────┐  ┌────────────────────┐
                       │  speakers.json  │  │   VoiceprintStore   │
                       │ (named, global) │  │  persist/replay     │
                       └────────────────┘  └────────────────────┘

2. Modules

2.1 CallDetector

Fire callStarted(app:window:) / callEnded().

  • Mic active system-wide: CoreAudio kAudioDevicePropertyDeviceIsRunningSomewhere on the default input device (listener, not poll).
  • App present/active: NSWorkspace running/frontmost vs the bundle-ID table.
  • Meet (browser): when a browser is frontmost + mic live, read the active-tab URL (AppleScript/Accessibility); confirm meet.google.com.
  • Heuristic: mic_running AND (known_native_app_active OR browser+meet_tab). Debounce ~2 s open; end when mic quiet > N s and the app/tab leaves foreground or quits.
  • Output: app id + the call window (SCWindow) for the VisualObserver.

2.2 AudioRecorder

  • System audio: SCStream with capturesAudio = true (mixer-level; works with headphones; no BlackHole). macOS 13+.
  • Mic: AVAudioEngine input tap.
  • Outputs: mic.wav, system.wav, and the backend deliverable mixed_mono_16k.wav (mic + system summed → mono → 16 kHz PCM WAV).
  • Shared t0 (CACurrentMediaTime) stamped once; every audio sample and visual observation is relative to it. Non-negotiable — the merge depends on it.
  • Mic VAD: run lightweight energy/VAD on the mic track to emit "the user is speaking" spans. These feed TimelineBuilder as pre-seeded self segments (high confidence) so the backend names the user even when their own tile isn't read — and so the user's voiceprint enrolls on call one.

2.3 VisualObserver

  • SCStream scoped (via SCContentFilter) to the specific call window.
  • Throttle to adapter fps (default 3). Hand each frame to the adapter; release immediately — never persist a frame.
  • Window visibility / focus is NOT required. SCK captures a window's own rendered content even when it's in the background, occluded by other apps, or on another Space. The user can work in other apps during the call and visual capture continues normally. (This is a key reason for window capture over display capture — also more private.)
  • Capture liveness — the one real failure mode. Two states stop fresh frames:
    1. Minimized to the Dock — macOS may freeze the window's backing buffer, so SCK delivers stale/duplicate frames. Detect minimization (SCWindow.isOnScreen == false / window state) and pause visual analysis + flag a visual_gap for that span rather than emitting bogus observations.
    2. Browser tab switched away (Meet only) — see §2.4 Meet note. In both cases audio keeps recording, and the backend voiceprint fallback still names previously-heard speakers — so a gap only costs naming precision for new, never-seen speakers during that exact window. Record gaps in visual_timeline.json (a visual_gaps: [{start,end,reason}] array) so the cause is auditable; TimelineBuilder must not interpolate across a gap.

2.4 AppAdapter (protocol) + four implementations

protocol AppAdapter {
    static var bundleIDs: [String] { get }
    var preferredFPS: Int { get }
    func analyze(frame: CVPixelBuffer, at t: TimeInterval) -> [SpeakerObservation]
    func namesFromAccessibility() -> [String]?   // optional
}
struct SpeakerObservation {
    let name: String          // OCR'd / a11y name; "" if unknown
    let speaking: Bool        // active-speaker cue detected
    let bbox: CGRect
    let confidence: Double     // 0..1
    let t: TimeInterval        // relative to session t0
}

Per-adapter cues:

  • Zoom (us.zoom.xos): colored tile border = active speaker; OCR the tile name label; handle speaker + gallery layouts.
  • Teams (com.microsoft.teams2): colored ring/border; labeled; like Zoom.
  • Signal (org.whispersystems.signal-desktop): ring around avatar/initials; try namesFromAccessibility() first (Electron a11y tree), OCR fallback.
  • Meet (browser): hybrid — names via Accessibility/AppleScript (DOM text), speaking cue via Vision (canvas/WebGL animated bars / tile highlight), fused by tile position. Most likely to need iteration.
    • Tab-switch caveat (Meet-specific): if Meet is a browser tab and the user switches to a different tab in the same window, the browser stops rendering the Meet tab → SCK captures a frozen last-frame (a visual_gap). Switching to a different app is fine; switching tabs is not. Mitigations, in order: (1) detect the active-tab URL leaving meet.google.com and flag a visual_gap (don't emit stale observations); (2) prefer capturing Meet in a dedicated browser window / PWA / standalone window so tab-switching can't blank it — surface this as a one-time setup tip in the UI; (3) names still come from the a11y/DOM tree where available, and audio + voiceprint fallback carry identity through the gap regardless.

Each adapter is testable offline against PNG/JPEG frame fixtures.

2.5 TimelineBuilder

Turn noisy per-frame observations into clean (start, end, name, confidence) segments.

  • Group by name; open a segment after K consecutive speaking frames (e.g. 2), close after M quiet frames (e.g. 2) — hysteresis rides out UI-cue lag/flicker.
  • Allow overlaps (crosstalk). Do not force one speaker per instant.
  • Merge in the mic-VAD self spans (the user) with high confidence.
  • Normalize OCR name variants ("Sarah J" → "Sarah Jones") via a per-session alias table.
  • Emit visual_timeline.json (schema in 03_DATA_CONTRACTS.md). The flat segments array maps directly onto the timeline field label-merge wants.

2.6 SessionPackager

Write the session folder and, if the call is longer than ~3 min, produce a chunk plan: ~23 min windows on mixed_mono_16k.wav, each with its timeline slice rebased to chunk-local seconds.

~/Ten31Transcripts/sessions/2026-06-05T14-03_zoom/
  mic.wav  system.wav  mixed_mono_16k.wav
  visual_timeline.json
  manifest.json
  (chunks/ produced transiently if chunking)
  speakers.json            # written after backend hand-off

2.7 SparkControlClient

Deliver to SparkControl. Primary path = POST /api/audio/label-merge with file, timeline, known_voiceprints, transcribe=true.

  • Sequential only — one audio request in flight (parallel ⇒ 503 + Retry-After).
  • Self-signed TLS — skip verification (URLSession delegate trusting the Start9 cert) or trust the Root CA. No auth on the LAN.
  • Per chunk: call label-merge with that chunk's audio + rebased timeline + the accumulated voiceprints; offset returned timestamps back to global and append. Names unify across chunks because the same names/voiceprints are passed each time; new voiceprints accumulate into the store.
  • Retry on 503 after Retry-After; on hard failure keep the session folder and surface "Resend" in the UI.
  • Limits to respect: 200 MB/request (413), transcription timeout ~300 s, diarization ~600 s. Chunking keeps requests well under these.
  • See 03_DATA_CONTRACTS.md §4 for exact fields and a real response.

2.8 result assembly → speakers.json

Concatenate the per-chunk label-merge results into one global, named, speaker-attributed transcript (timestamps offset to session time). This is the seam to the user's existing summarizer. The app does not analyze past this.

2.9 VoiceprintStore

Local persistence of named voiceprints — the compounding-identity layer.

  • File: ~/Ten31Transcripts/voiceprints.json{ "<name>": { "vector": [192 floats], "updated": <iso>, "calls": <int> } }.
  • On send: load all entries → pass as known_voiceprints to label-merge.
  • On response: for each speaker resolved by visual (or a high-similarity voiceprint match), store/refresh that name's vector. Never store Unknown_N.
  • Update policy (open, start simple): overwrite with the latest high-confidence vector, or keep a running mean per name. v1 = store latest with overlap_confidence ≥ ~0.8; refine to averaging later (01 §7.4).
  • Editable/clearable from the menu-bar UI (rename, delete a person, reset).

2.10 MenuBarUI (SwiftUI, LSUIElement)

Status (idle / detected / recording / uploading), manual start/stop, recent sessions (open folder, resend, delete), adapter toggles, backend host + a health check (GET /api/status), output folder, voiceprint manager, and a permissions checklist (Screen Recording, Microphone, Accessibility).

3. macOS frameworks & permissions

Need Framework Permission
System audio + window frames ScreenCaptureKit Screen Recording
Microphone AVFoundation / CoreAudio Microphone
Meet/Signal names, tab URL Accessibility (AXUIElement) / AppleScript Accessibility + Automation
OCR + cue analysis Vision (VNRecognizeTextRequest) none
App/tab detection AppKit (NSWorkspace) none

Stable signing identity avoids permission re-prompts on rebuild.

4. Performance

Window-scoped capture + 3 fps + Vision-on-Neural-Engine is light; audio is cheap; frames are released immediately so memory stays flat. The app idles near-zero until a call starts. Backend requests are sequential and chunked, so they never saturate the GPU.

5. The merge — now done by the backend

The app no longer implements the overlap vote. label-merge resolves each anonymous cluster in order:

  1. visual — timeline name with the most temporal overlap (source: "visual", overlap_confidence);
  2. voiceprint — closest known_voiceprints match above voiceprint_threshold (source: "voiceprint", match_similarity);
  3. Unknown_N (source: "unmatched") — never guessed/mislabeled. The app's contribution is a good timeline (incl. mic-VAD self spans) and an ever-growing voiceprint library. min_overlap and voiceprint_threshold are tunable request params if precision needs adjusting.