Grant Gilliam dda4322de7 Reconcile docs/ specs with the shipped app
Document the dual-channel label-merge path (mic_file/system_file/self_name/self_vad) and the recap phase (transcript.md + recap.html via the backend LLM) across docs/01-03; correct docs/02 §2.10 to the UI actually shipped; mark docs/01 §7 open items as settled; remove the dead AUDIO_API.md references; note the manifest sha256 fields are not emitted; mark docs/04 as a complete/historical build log. Also drop the last stale "Phase 0" UI string in MenuBarView and retire the now-done doc-debt items in ROADMAP.
2026-06-16 22:09:04 -05:00

Project Brief — Ten31 Transcripts

Local macOS app that auto-detects conference calls, records local audio, and produces a visually derived, timestamped speaker timeline. It then hands the audio (dual-channel, or a mixed-mono fallback) plus that timeline to the operator's SparkControl backend, which diarizes, names the speakers (majority-overlap vote against the timeline), and returns named transcript segments. A growing voiceprint library recovers speakers even when the visual cue is missing.

Master context document. Read this first, then 02_ARCHITECTURE.md and 03_DATA_CONTRACTS.md. The SparkControl API is fully specified in 03_DATA_CONTRACTS.md.


1. What we are building

A lightweight, always-running menu-bar app on macOS that:

  1. Detects when the user joins a call in Google Meet, Zoom, Microsoft Teams, or Signal.
  2. Records two local audio tracks — system audio (everyone else) and the user's microphone (the user). It sends the backend dual-channel (mic_file + system_file) when the system track is healthy, falling back to a mixed-mono 16 kHz WAV otherwise.
  3. Watches the call window at ~24 fps and, per app, reads participant names and the active-speaker cue, producing a (start, end, name, confidence) visual timeline — its best guess at who was talking when.
  4. Discards every video frame after extraction. No video is ever written to disk. Only audio + the derived timeline persist locally.
  5. On call end, sends the audio + the visual timeline (+ the known voiceprint library) to POST /api/audio/label-merge on SparkControl, which returns named, speaker-attributed transcript segments and a voiceprint per speaker.
  6. Persists the returned voiceprints keyed by name, so the next call can pass them as known_voiceprints and recover a speaker by voice when the visual cue is absent (camera off, a bad OCR frame).
  7. Renders the result locally — a readable transcript.md plus an HTML recap.html (topics + meeting extras, generated via the backend's LLM endpoint), with an in-app editor for fixing speaker names after the fact.
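
As an illustration of step 3, the sketch below collapses per-frame observations into (start, end, name, confidence) spans. It is written in Python for brevity (the app itself is Swift), and the names build_timeline, frame_period, and max_gap, along with their default values, are assumptions made for this example rather than anything taken from the codebase.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class Span:
    start: float       # seconds from call start
    end: float         # seconds from call start
    name: str
    confidence: float  # mean per-frame confidence over the span


def build_timeline(
    frames: Iterable[Tuple[float, Optional[str], float]],
    frame_period: float = 0.5,  # hypothetical spacing between captured frames
    max_gap: float = 1.0,       # hypothetical tolerance for bridging missed frames
) -> List[Span]:
    """Collapse (timestamp, active_speaker_or_None, confidence) frames into spans."""
    spans: List[Span] = []
    counts: List[int] = []  # frames folded into each span, for the running mean
    for t, name, conf in frames:
        if name is None:
            continue  # no active-speaker cue was visible in this frame
        last = spans[-1] if spans else None
        if last is not None and last.name == name and t - last.end <= max_gap:
            n = counts[-1]
            last.confidence = (last.confidence * n + conf) / (n + 1)
            last.end = t + frame_period
            counts[-1] = n + 1
        else:
            spans.append(Span(t, t + frame_period, name, conf))
            counts.append(1)
    return spans
```

Only the resulting spans would need to persist; the frames themselves can be dropped as soon as they have been folded in, which matches step 4.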

The app's job ends at producing the named transcript and recap from SparkControl's segments. All transcription, diarization, name-merge, and LLM analysis happen on the backend. Do not build transcription, diarization, or the merge vote in this app.

2. Why the visual timeline still matters (the core idea)

Audio diarization (NVIDIA Sortformer on the backend) is excellent at segmentation — precise speaker boundaries — but its clusters are anonymous (Speaker_0, Speaker_1…). It cannot name anyone.

The screen already knows the names. Each app visually marks the active speaker (colored tile border, animated audio bars, a ring around an avatar) next to that person's name/initials.

So responsibilities split cleanly:

  • Audio (backend) owns segmentation — the exact when.
  • Visual capture (this app) owns identity — the who.
  • label-merge (backend) fuses them: it diarizes, then assigns each cluster the timeline name with the most temporal overlap. The visual track needn't be perfect — it only needs to win the per-cluster vote.
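
To make the per-cluster vote concrete, here is a minimal Python sketch of a majority-overlap assignment. It only illustrates the idea: the real vote runs on the backend and is not to be rebuilt in the app, and the function names and tuple shapes here are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Interval = Tuple[float, float]  # (start, end) in seconds


def overlap(a: Interval, b: Interval) -> float:
    """Length of the intersection of two intervals, or 0 if they are disjoint."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def vote_name(
    cluster_segments: List[Interval],
    timeline: List[Tuple[float, float, str]],
) -> Optional[str]:
    """Return the timeline name with the most total overlap, or None if no overlap."""
    totals: Dict[str, float] = defaultdict(float)
    for seg in cluster_segments:
        for start, end, name in timeline:
            totals[name] += overlap(seg, (start, end))
    if not totals or max(totals.values()) == 0.0:
        return None
    return max(totals, key=lambda n: totals[n])
```

For example, a cluster spanning 0 to 10 s against a timeline that credits "Ana" for 0 to 6 s and "Ben" for 6 to 8 s resolves to "Ana", even though the visual track missed the last two seconds.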

New compounding layer — the voiceprint library. Every named cluster comes back from the backend with a 192-dim TitaNet voiceprint. The app persists these keyed by name and replays them as known_voiceprints next time. Resolution order per cluster becomes: visual overlap → voiceprint match → Unknown_N (never mislabeled). So the screen capture enrolls a voice library for free, and over a few calls the system can name regulars even with cameras off.
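
The resolution order can be sketched as follows. This is an illustrative Python outline of backend-side behavior, not the backend's actual code: cosine similarity as the match metric and the 0.7 threshold are assumptions, and the test vectors are 3-dimensional stand-ins for the 192-dim voiceprints.

```python
import math
from typing import Dict, Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def resolve_name(
    visual_name: Optional[str],
    voiceprint: Sequence[float],
    known_voiceprints: Dict[str, Sequence[float]],
    unknown_index: int,
    threshold: float = 0.7,  # hypothetical match threshold
) -> str:
    """Visual overlap first, then voiceprint match, else Unknown_N."""
    if visual_name:
        return visual_name
    best_name, best_score = None, threshold
    for name, ref in known_voiceprints.items():
        score = cosine(voiceprint, ref)
        if score >= best_score:
            best_name, best_score = name, score
    if best_name is not None:
        return best_name
    return f"Unknown_{unknown_index}"  # never guess a name below threshold
```

Falling through to Unknown_N rather than taking the nearest voiceprint regardless of score is what keeps a weak match from mislabeling a speaker.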

3. Hard scope boundaries

In scope (this app):

  • Call detection for Meet / Zoom / Teams / Signal.
  • Dual-track local audio capture; dual-channel send (mic + system) with a mix-to-mono fallback for the backend.
  • Low-fps window capture → OCR (names) + active-speaker cue detection.
  • Per-app "adapter" modules encapsulating each app's UI quirks.
  • Building the visual timeline; mic-VAD self-labeling (the mic track is the user, so hot-mic spans pre-seed the user's name into the timeline).
  • Chunking long calls (~23 min) and calling label-merge sequentially.
  • A local voiceprint store (persist + replay named voiceprints).
  • Storing the backend's named segments and rendering them — transcript.md plus an HTML recap.html (recap analysis via the backend LLM) — with an in-app speaker-name editor.
  • A minimal menu-bar UI: status, manual start/stop, the last session (reveal, resend, open recap, edit speakers), adapter toggles, backend host/health, output folder.
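
As a rough illustration of mic-VAD self-labeling, the sketch below marks hot-mic spans with a simple RMS energy threshold and tags them with the user's name. The energy-based detector, the function name, and every default value are assumptions for this example; the shipped app may use a different VAD entirely.

```python
import math
from typing import List, Sequence, Tuple


def hot_mic_spans(
    samples: Sequence[float],   # mono mic samples in [-1.0, 1.0]
    sample_rate: int,
    self_name: str,
    frame_ms: int = 30,         # hypothetical analysis window
    threshold: float = 0.02,    # hypothetical RMS level that counts as speech
    min_len: float = 0.2,       # drop spans shorter than this many seconds
) -> List[Tuple[float, float, str]]:
    """Return (start, end, self_name) spans where the mic energy is above threshold."""
    frame = max(1, int(sample_rate * frame_ms / 1000))
    spans: List[Tuple[float, float, str]] = []
    start = None
    for i in range(0, len(samples), frame):
        chunk = samples[i:i + frame]
        rms = math.sqrt(sum(x * x for x in chunk) / len(chunk))
        t = i / sample_rate
        if rms >= threshold and start is None:
            start = t  # speech begins
        elif rms < threshold and start is not None:
            if t - start >= min_len:
                spans.append((start, t, self_name))
            start = None  # speech ends
    if start is not None:  # the mic was still hot when the audio ended
        end = len(samples) / sample_rate
        if end - start >= min_len:
            spans.append((start, end, self_name))
    return spans
```

The returned spans have the same shape as visual timeline entries, so they can be merged into the timeline before it is sent.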

Out of scope (owned by the backend):

  • Transcription, diarization, the name-merge vote, and LLM summarization — these run on the backend; the app only orchestrates the recap call and renders the result.

Explicitly not doing: saving video; cloud anything. Everything stays on the operator's LAN.

4. Key decisions (now resolved against the real contract)

| Decision | Choice | Why |
|---|---|---|
| Language / framework | Native Swift + SwiftUI menu-bar app (LSUIElement) | System audio, window capture, Vision all native; one codebase. |
| Audio capture | ScreenCaptureKit (system audio) + AVFoundation (mic) | No virtual audio device; works with headphones; macOS 13+. |
| Backend audio format | Dual-channel (mic + system) when the system track is healthy, else mixed-mono 16 kHz WAV | Separate tracks let the backend attribute the user's mic channel directly; the diarizer can still split the mono fallback. |
| Call detection | CoreAudio "mic running somewhere" + known-app / Meet-tab heuristic | Clean live-mic signal + app disambiguation. |
| Speaker naming | Backend, via POST /api/audio/label-merge | One call does diarize + overlap-vote naming + transcription. No client merge. |
| Identity recovery | Local voiceprint library replayed as known_voiceprints | Recovers camera-off / OCR-missed speakers by voice; compounds over calls. |
| Self-identity | mic-VAD → pre-seed user's name in timeline | The mic track is the user; gives the backend a strong prior + enrolls the user's voiceprint immediately. |
| Requests | Sequential, one audio request in flight | Parallel audio requests trip a backend GPU race (503 + Retry-After). |
| Long calls | Chunk ~23 min, sequential, stitch via names+voiceprints | Diarizer caps at 4 speakers/chunk; voiceprints + names unify across chunks. |
| Transport / TLS | multipart/form-data, file field file (mono) or mic_file + system_file (dual-channel); self-signed Start9 cert (trust the Root CA, the supported default; host-scoped skip-verify is an off-by-default escape hatch); no auth on LAN | Matches every other SparkControl endpoint. |
| Timing | Batch after call (sync endpoints, no polling) | Endpoints are synchronous; no job/poll machinery needed. |
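
The "Requests" and "Long calls" rows combine into a simple loop: plan the chunks, then send them one at a time, waiting out any 503. The Python sketch below shows one way this could look. The send callable, its (status, retry_after, body) return shape, and the retry limit are assumptions for illustration; the real transport and error format are defined in 03_DATA_CONTRACTS.md.

```python
import time
from typing import Any, Callable, List, Optional, Tuple

Chunk = Tuple[float, float]  # (start_s, end_s) within the call
SendResult = Tuple[int, Optional[float], Any]  # (status, retry_after_s, body)


def plan_chunks(duration_s: float, chunk_s: float = 23 * 60) -> List[Chunk]:
    """Split a call into consecutive chunks of at most chunk_s seconds."""
    chunks: List[Chunk] = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        chunks.append((start, end))
        start = end
    return chunks


def send_sequentially(
    chunks: List[Chunk],
    send: Callable[[Chunk], SendResult],    # hypothetical transport function
    max_retries: int = 3,                    # hypothetical retry budget per chunk
    sleep: Callable[[float], None] = time.sleep,
) -> List[Any]:
    """Send chunks strictly one at a time, honoring Retry-After on 503."""
    results: List[Any] = []
    for chunk in chunks:
        for attempt in range(max_retries + 1):
            status, retry_after, body = send(chunk)
            if status == 503 and attempt < max_retries:
                sleep(retry_after or 1.0)  # back off, then retry the same chunk
                continue
            if status != 200:
                raise RuntimeError(f"chunk {chunk} failed with HTTP {status}")
            results.append(body)
            break
    return results
```

Because the loop never starts the next chunk until the current one has returned, there is at most one audio request in flight at any time.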

On forking Hyprnote

Unchanged recommendation: the audio capture is the trivial part (~200 lines of native Swift) and the rest (Vision screen-reading) is native too. Build native; use Hyprnote's capture/detection only as reference. Fork remains an override.

5. Target apps & identifiers

| App | Join via | Bundle ID(s) | Speaking cue / names |
|---|---|---|---|
| Zoom | Native | us.zoom.xos | Colored tile border; name label in tile. |
| Microsoft Teams | Native (new) | com.microsoft.teams2 (new), com.microsoft.teams (classic) | Colored ring/border; labeled. |
| Signal | Native (Electron) | org.whispersystems.signal-desktop | Ring around avatar/initials; try Accessibility names first. |
| Google Meet | Browser tab | com.google.Chrome, com.apple.Safari, company.thebrowser.Browser (Arc)… | Canvas video → Vision for the cue; DOM names → Accessibility/AppleScript; confirm via active-tab URL meet.google.com. |

Four required adapters; adding a 5th must be one new file conforming to the AppAdapter protocol.
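
The one-file-per-adapter rule can be pictured with a small registry. The sketch below is in Python for brevity and is only an analogy: the real AppAdapter is a Swift protocol whose members are not listed in this brief, so the matches method, the bundle_ids attribute, and the adapter_for helper are all hypothetical. Only two of the four adapters are shown.

```python
from typing import List, Optional, Protocol, Tuple
from urllib.parse import urlparse


class AppAdapter(Protocol):
    """Hypothetical minimal adapter surface: which app windows it claims."""
    bundle_ids: Tuple[str, ...]

    def matches(self, bundle_id: str, active_url: Optional[str]) -> bool: ...


class ZoomAdapter:
    bundle_ids = ("us.zoom.xos",)

    def matches(self, bundle_id: str, active_url: Optional[str]) -> bool:
        return bundle_id in self.bundle_ids


class MeetAdapter:
    bundle_ids = ("com.google.Chrome", "com.apple.Safari", "company.thebrowser.Browser")

    def matches(self, bundle_id: str, active_url: Optional[str]) -> bool:
        # A browser only counts as Meet when the active tab is on meet.google.com.
        if bundle_id not in self.bundle_ids or not active_url:
            return False
        return urlparse(active_url).hostname == "meet.google.com"


# Adding a fifth app would mean one new class (one new file) plus one entry here.
ADAPTERS: List[AppAdapter] = [ZoomAdapter(), MeetAdapter()]


def adapter_for(bundle_id: str, active_url: Optional[str] = None) -> Optional[AppAdapter]:
    """Return the first adapter that claims this app, or None."""
    for adapter in ADAPTERS:
        if adapter.matches(bundle_id, active_url):
            return adapter
    return None
```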

6. The backend (do not rebuild) — now concrete

SparkControl, on the operator's Start9 LAN, fronting two DGX Sparks:

  • STT: NVIDIA Parakeet TDT 0.6B — POST /v1/audio/transcriptions (OpenAI-compatible).
  • Diarization: NVIDIA Sortformer 4spk-v1 — POST /api/audio/diarize-chunk (anonymous clusters + voiceprints) and POST /api/audio/transcribe-with-speakers.
  • Embeddings: NVIDIA TitaNet (192-dim voiceprints).
  • ★ Primary endpoint for this app: POST /api/audio/label-merge — diarize + name from the visual timeline (+ voiceprint fallback), optionally transcribe, in one synchronous call.
  • LLM (recap): Qwen3 via OpenAI-compatible POST /v1/chat/completions — generates the readable recap (topics + meeting extras) from the transcript.
  • Health/discovery: GET /api/status, GET /api/endpoints, GET /v1/models.

Full request/response shapes, curl examples, limits, and error formats are in 03_DATA_CONTRACTS.md.
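
As a small illustration of the audio-format decision, the sketch below picks which multipart audio fields to send. The field names file, mic_file, system_file, and self_name come from this brief; everything else (the function name, passing paths as strings, sending self_name only on the dual-channel path) is an assumption, and the remaining label-merge fields are deliberately left out because their encodings are defined in 03_DATA_CONTRACTS.md.

```python
from typing import Dict


def audio_form_fields(
    mic_path: str,
    system_path: str,
    mixed_path: str,
    system_healthy: bool,
    self_name: str,
) -> Dict[str, str]:
    """Choose dual-channel fields when the system track is healthy, else mono."""
    if system_healthy:
        # Dual-channel: the mic channel is the user, so the user's name goes along.
        return {
            "mic_file": mic_path,
            "system_file": system_path,
            "self_name": self_name,  # assumed to accompany the dual-channel path
        }
    # Fallback: a single mixed-mono 16 kHz WAV in the plain file field.
    return {"file": mixed_path}
```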

7. Settled decisions (were open at brief time)

  1. Base URL. A private LAN host, preferably a .local mDNS name rather than a raw IP (the name survives IP changes), configured in Settings or via the SPARK_BACKEND_URL env var and never committed. A neutral placeholder ships as the default and stays editable in Settings. Service discovery is available at GET /api/endpoints.
  2. Send trigger. Auto-send on call end is a setting (autoSendOnStop), off by default — the user reviews the session and sends manually unless they opt in.
  3. Retention. The session folder is kept after a successful hand-off (output location is configurable); nothing is pruned automatically.
  4. Voiceprint update policy. Store/refresh the latest high-confidence vector per name (02_ARCHITECTURE.md §2.9); a per-name running average is a possible later refinement.
  5. Signing. A stable identity via Config/Signing.xcconfig (gitignored) keeps macOS from re-prompting for permissions on each rebuild.
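
The voiceprint update policy in item 4 amounts to "latest high-confidence vector wins". A minimal Python sketch of that policy follows. The confidence value attached to each returned voiceprint, the 0.8 cutoff, and the function name are assumptions for illustration; 02_ARCHITECTURE.md §2.9 remains the authoritative description.

```python
from typing import Dict, List, Sequence, Tuple


def update_voiceprint_store(
    store: Dict[str, List[float]],
    returned: Dict[str, Tuple[Sequence[float], float]],  # name -> (vector, confidence)
    min_confidence: float = 0.8,  # hypothetical cutoff for "high-confidence"
) -> Dict[str, List[float]]:
    """Replace each named speaker's stored vector with the latest confident one."""
    for name, (vector, confidence) in returned.items():
        if name.startswith("Unknown_"):
            continue  # anonymous clusters are never enrolled under a name
        if confidence >= min_confidence:
            store[name] = list(vector)  # latest vector wins; no averaging yet
    return store
```

A running average, mentioned above as a later refinement, would replace the plain assignment with a weighted blend of the old and new vectors.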