ten31-transcripts/docs/01_PROJECT_BRIEF.md
Grant Gilliam 11eb82178f Add agent instructions; extract signing/backend secrets from source
- Add AGENTS.md (canonical) + CLAUDE.md symlink + ROADMAP.md
- Move Apple Team ID from project.yml into a gitignored
  Config/Signing.xcconfig via configFiles; commit the .example template
- Replace hardcoded backend host in AppSettings with a neutral
  placeholder + SPARK_BACKEND_URL env-var fallback
- Scrub the Team ID, .local host, and raw LAN IP from README/docs
- Ignore Config/Signing.xcconfig and .env
2026-06-13 12:23:54 -05:00


Project Brief — Ten31 Transcripts

Local macOS app that auto-detects conference calls, records local audio, and produces a visual-derived, timestamped speaker timeline — then hands the mixed audio + that timeline to the operator's SparkControl backend, which diarizes, names the speakers (majority-overlap vote against the timeline), and returns named transcript segments. A growing voiceprint library recovers speakers even when the visual cue is missing.

Master context document. Read this first, then 02_ARCHITECTURE.md, 03_DATA_CONTRACTS.md, 04_BUILD_PLAN.md. The SparkControl API is now fully specified — see 03_DATA_CONTRACTS.md (and the source AUDIO_API.md).


1. What we are building

A lightweight, always-running menu-bar app on macOS that:

  1. Detects when the user joins a call in Google Meet, Zoom, Microsoft Teams, or Signal.
  2. Records two local audio tracks — system audio (everyone else) and the user's microphone (the user) — and mixes them to one 16 kHz mono WAV for the backend.
  3. Watches the call window at ~2–4 fps and, per app, reads participant names and the active-speaker cue, producing a (start, end, name, confidence) visual timeline: its best guess at who was talking when.
  4. Discards every video frame after extraction. No video is ever written to disk. Only audio + the derived timeline persist locally.
  5. On call end, POSTs the mixed audio + the visual timeline (+ the known voiceprint library) to POST /api/audio/label-merge on SparkControl, which returns named, speaker-attributed transcript segments and a voiceprint per speaker.
  6. Persists the returned voiceprints keyed by name, so the next call can pass them as known_voiceprints and recover a speaker by voice when the visual cue is absent (camera off, a bad OCR frame).
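Step 3 can be pictured as collapsing per-frame readings into spans. The following is a minimal sketch, not the app's actual code: the `FrameObservation` and `TimelineSpan` types, the `buildTimeline` function, and the `maxGap` parameter are all hypothetical names introduced for illustration. Only the (start, end, name, confidence) tuple shape comes from the brief.

```swift
/// One span of the visual timeline: who appeared to be speaking, and when.
/// Field names mirror the (start, end, name, confidence) tuple in the brief.
struct TimelineSpan: Equatable {
    var start: Double       // seconds from the start of the recording
    var end: Double
    var name: String
    var confidence: Double  // mean confidence of the frames merged into the span
}

/// A single per-frame reading from an adapter (hypothetical shape).
struct FrameObservation {
    let time: Double        // capture time in seconds
    let name: String?       // nil when no active-speaker cue was found
    let confidence: Double
}

/// Collapses consecutive frames that name the same speaker into spans.
/// A span is closed when the name changes, the cue disappears, or the gap
/// between the span's end and the next frame exceeds `maxGap`.
func buildTimeline(from frames: [FrameObservation],
                   frameInterval: Double,
                   maxGap: Double) -> [TimelineSpan] {
    var spans: [TimelineSpan] = []
    var current: TimelineSpan?
    var framesInCurrent = 0

    for frame in frames.sorted(by: { $0.time < $1.time }) {
        if var span = current, let name = frame.name,
           name == span.name, frame.time - span.end <= maxGap {
            // Same speaker, no significant gap: extend the open span and
            // update its running mean confidence.
            let n = Double(framesInCurrent)
            span.confidence = (span.confidence * n + frame.confidence) / (n + 1)
            span.end = frame.time + frameInterval
            framesInCurrent += 1
            current = span
            continue
        }
        // Otherwise close the open span (if any) and maybe start a new one.
        if let span = current { spans.append(span) }
        if let name = frame.name {
            current = TimelineSpan(start: frame.time,
                                   end: frame.time + frameInterval,
                                   name: name,
                                   confidence: frame.confidence)
            framesInCurrent = 1
        } else {
            current = nil
            framesInCurrent = 0
        }
    }
    if let span = current { spans.append(span) }
    return spans
}
```

Because frames are discarded after extraction, a structure like this (plus the audio) would be the only thing that outlives each capture tick.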

The app's job ends at receiving and storing the named segments from SparkControl. All transcription, diarization, and the name-merge happen on the backend. Do not build transcription, diarization, or the merge vote in this app.

2. Why the visual timeline still matters (the core idea)

Audio diarization (NVIDIA Sortformer on the backend) is excellent at segmentation — precise speaker boundaries — but its clusters are anonymous (Speaker_0, Speaker_1…). It cannot name anyone.

The screen already knows the names. Each app visually marks the active speaker (colored tile border, animated audio bars, a ring around an avatar) next to that person's name/initials.

So responsibilities split cleanly:

  • Audio (backend) owns segmentation — the exact when.
  • Visual capture (this app) owns identity — the who.
  • label-merge (backend) fuses them: it diarizes, then assigns each cluster the timeline name with the most temporal overlap. The visual track needn't be perfect — it only needs to win the per-cluster vote.
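To see why an imperfect visual track can still win, here is a small illustration of a per-cluster overlap vote. This is only a sketch of the idea as the brief describes it, not SparkControl's implementation, and it is not something this app should build. The `Interval` and `NamedSpan` types and the `votedName` function are hypothetical.

```swift
/// A diarized segment belonging to one anonymous cluster (hypothetical type).
struct Interval { let start: Double; let end: Double }

/// One entry of the visual timeline (hypothetical type).
struct NamedSpan { let start: Double; let end: Double; let name: String }

/// Length of the intersection of [aStart, aEnd] and [bStart, bEnd], or 0.
func overlap(_ aStart: Double, _ aEnd: Double,
             _ bStart: Double, _ bEnd: Double) -> Double {
    return max(0, min(aEnd, bEnd) - max(aStart, bStart))
}

/// Returns the timeline name with the largest total temporal overlap against
/// all segments of one cluster, or nil when nothing overlaps.
/// Ties are broken arbitrarily in this sketch.
func votedName(for clusterSegments: [Interval],
               timeline: [NamedSpan]) -> String? {
    var totals: [String: Double] = [:]
    for segment in clusterSegments {
        for span in timeline {
            let shared = overlap(segment.start, segment.end, span.start, span.end)
            if shared > 0 { totals[span.name, default: 0] += shared }
        }
    }
    return totals.max(by: { $0.value < $1.value })?.key
}
```

In this sketch, a cluster that speaks during 0–10 s and 20–30 s still resolves to "Alice" even if the timeline wrongly credits "Bob" for two of those seconds, because Alice accumulates far more overlap.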

New compounding layer — the voiceprint library. Every named cluster comes back from the backend with a 192-dim TitaNet voiceprint. The app persists these keyed by name and replays them as known_voiceprints next time. Resolution order per cluster becomes: visual overlap → voiceprint match → Unknown_N (never mislabeled). So the screen capture enrolls a voice library for free, and over a few calls the system can name regulars even with cameras off.
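On the app side, the library is just a name-keyed map that is persisted and replayed. A minimal sketch follows, assuming a JSON file on disk and a per-voiceprint confidence value. The `VoiceprintStore` type, its method names, and the 0.8 default threshold are assumptions for illustration; the real response fields are defined in 03_DATA_CONTRACTS.md.

```swift
import Foundation

/// Name-keyed voiceprint library (hypothetical type). Vectors are the
/// 192-dim TitaNet embeddings the brief says the backend returns.
struct VoiceprintStore: Codable, Equatable {
    static let dimension = 192

    private(set) var prints: [String: [Float]] = [:]

    /// Stores or refreshes a voiceprint. Rejects vectors of the wrong size,
    /// low-confidence results, and anonymous "Unknown_N" labels so that the
    /// library only ever holds real names.
    /// `minConfidence` is an assumed threshold, not a documented value.
    @discardableResult
    mutating func update(name: String,
                         vector: [Float],
                         confidence: Double,
                         minConfidence: Double = 0.8) -> Bool {
        guard vector.count == Self.dimension,
              confidence >= minConfidence,
              !name.hasPrefix("Unknown_") else { return false }
        prints[name] = vector
        return true
    }

    /// Serializes the library for writing to disk.
    func save() throws -> Data {
        return try JSONEncoder().encode(self)
    }

    /// Restores a library previously produced by `save()`.
    static func load(from data: Data) throws -> VoiceprintStore {
        return try JSONDecoder().decode(VoiceprintStore.self, from: data)
    }
}
```

The `prints` dictionary is what would be replayed as known_voiceprints on the next call. Refusing "Unknown_N" entries keeps the "never mislabeled" guarantee from leaking placeholder names into the library.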

3. Hard scope boundaries

In scope (this app):

  • Call detection for Meet / Zoom / Teams / Signal.
  • Dual-track local audio capture + mix-to-mono for the backend.
  • Low-fps window capture → OCR (names) + active-speaker cue detection.
  • Per-app "adapter" modules encapsulating each app's UI quirks.
  • Building the visual timeline; mic-VAD self-labeling (the mic track is the user, so hot-mic spans pre-seed the user's name into the timeline).
  • Chunking long calls (~23 min) and calling label-merge sequentially.
  • A local voiceprint store (persist + replay named voiceprints).
  • Storing the backend's named transcript segments locally.
  • A minimal menu-bar UI: status, manual start/stop, recent sessions, adapter toggles, backend host/health, output folder.
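The chunking item above amounts to computing back-to-back time windows over the recording. A minimal sketch, assuming non-overlapping chunks and a caller-supplied chunk length; the `Chunk` type and `chunkBoundaries` function are hypothetical names:

```swift
/// A contiguous slice of the call audio, in seconds from the start of the call.
struct Chunk: Equatable {
    let index: Int
    let start: Double
    let end: Double
}

/// Splits a recording into consecutive, non-overlapping chunks of at most
/// `chunkLength` seconds. The final chunk may be shorter.
func chunkBoundaries(totalDuration: Double, chunkLength: Double) -> [Chunk] {
    precondition(chunkLength > 0, "chunkLength must be positive")
    guard totalDuration > 0 else { return [] }

    var chunks: [Chunk] = []
    var start = 0.0
    while start < totalDuration {
        let end = min(start + chunkLength, totalDuration)
        chunks.append(Chunk(index: chunks.count, start: start, end: end))
        start = end
    }
    return chunks
}
```

Taking the brief's ~23 min figure at face value, `chunkBoundaries(totalDuration: 3600, chunkLength: 23 * 60)` yields three chunks for a one-hour call, the last one 14 minutes long. Each chunk would then be sent to label-merge in order, one at a time.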

Out of scope (owned by the backend):

  • Transcription, diarization, the name-merge vote, summarization/analysis.

Explicitly not doing: saving video; cloud anything. Everything stays on the operator's LAN.

4. Key decisions (now resolved against the real contract)

| Decision | Choice | Why |
| --- | --- | --- |
| Language / framework | Native Swift + SwiftUI menu-bar app (LSUIElement) | System audio, window capture, Vision all native; one codebase. |
| Audio capture | ScreenCaptureKit (system audio) + AVFoundation (mic) | No virtual audio device; works with headphones; macOS 13+. |
| Backend audio format | Mixed-mono 16 kHz WAV | Diarizer separates speakers from one mixed stream; 16 kHz is ideal. |
| Call detection | CoreAudio "mic running somewhere" + known-app / Meet-tab heuristic | Clean live-mic signal + app disambiguation. |
| Speaker naming | Backend, via POST /api/audio/label-merge | One call does diarize + overlap-vote naming + transcription. No client merge. |
| Identity recovery | Local voiceprint library replayed as known_voiceprints | Recovers camera-off / OCR-missed speakers by voice; compounds over calls. |
| Self-identity | mic-VAD → pre-seed user's name in timeline | The mic track is the user; gives the backend a strong prior + enrolls the user's voiceprint immediately. |
| Requests | Sequential, one audio request in flight | Parallel audio requests trip a backend GPU race (503 + Retry-After). |
| Long calls | Chunk ~23 min, sequential, stitch via names+voiceprints | Diarizer caps at 4 speakers/chunk; voiceprints + names unify across chunks. |
| Transport / TLS | multipart/form-data, file field file; self-signed Start9 cert (skip verify or trust the Root CA); no auth on LAN | Matches every other SparkControl endpoint. |
| Timing | Batch after call (sync endpoints, no polling) | Endpoints are synchronous; no job/poll machinery needed. |
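The "Requests" decision implies the client needs some policy for a 503 with Retry-After. The sketch below shows one way to classify a response, assuming only the delay-in-seconds form of the header is handled (the HTTP-date form falls back to a default). The `UploadOutcome` type, the `classify` function, and the 5-second default are hypothetical.

```swift
import Foundation

/// What the uploader should do after one label-merge response (hypothetical type).
enum UploadOutcome: Equatable {
    case success
    case retry(after: Double)   // seconds to wait before resending the same chunk
    case fail(status: Int)
}

/// Maps an HTTP status and optional Retry-After header to an outcome.
/// Only a numeric Retry-After (delay in seconds) is parsed; anything else,
/// including the HTTP-date form, uses `defaultDelay`, which is an assumed value.
func classify(status: Int,
              retryAfterHeader: String?,
              defaultDelay: Double = 5) -> UploadOutcome {
    switch status {
    case 200..<300:
        return .success
    case 503:
        if let raw = retryAfterHeader?.trimmingCharacters(in: .whitespaces),
           let seconds = Double(raw), seconds.isFinite, seconds >= 0 {
            return .retry(after: seconds)
        }
        return .retry(after: defaultDelay)
    default:
        return .fail(status: status)
    }
}
```

Keeping this a pure function makes the retry policy easy to test. The surrounding upload loop would simply await each response before sending the next chunk, which is what keeps a single audio request in flight.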

On forking Hyprnote

Unchanged recommendation: the audio capture is the trivial part (~200 lines of native Swift), and the rest (Vision screen-reading) is native too. Build native and use Hyprnote's capture/detection code only as a reference. Forking remains available as an override.

5. Target apps & identifiers

| App | Join via | Bundle ID(s) | Speaking cue / names |
| --- | --- | --- | --- |
| Zoom | Native | us.zoom.xos | Colored tile border; name label in tile. |
| Microsoft Teams | Native (new) | com.microsoft.teams2 (new), com.microsoft.teams (classic) | Colored ring/border; labeled. |
| Signal | Native (Electron) | org.whispersystems.signal-desktop | Ring around avatar/initials; try Accessibility names first. |
| Google Meet | Browser tab | com.google.Chrome, com.apple.Safari, company.thebrowser.Browser (Arc)… | Canvas video → Vision for the cue; DOM names → Accessibility/AppleScript; confirm via active-tab URL meet.google.com. |

Four required adapters; adding a 5th must be one new file conforming to the AppAdapter protocol.
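The "one new file" requirement suggests a small protocol plus a lookup by bundle ID. The sketch below is only a guess at that shape: the brief names the AppAdapter protocol but does not define its members, so `displayName`, `bundleIDs`, and `AdapterRegistry` are hypothetical, and the real protocol would also need the capture and cue-detection hooks. Google Meet is omitted because it additionally requires the active-tab URL check.

```swift
/// Hypothetical minimal surface of the AppAdapter protocol.
protocol AppAdapter {
    var displayName: String { get }
    var bundleIDs: [String] { get }
}

// Bundle IDs below are the ones listed in the table above.
struct ZoomAdapter: AppAdapter {
    let displayName = "Zoom"
    let bundleIDs = ["us.zoom.xos"]
}

struct TeamsAdapter: AppAdapter {
    let displayName = "Microsoft Teams"
    let bundleIDs = ["com.microsoft.teams2", "com.microsoft.teams"]
}

struct SignalAdapter: AppAdapter {
    let displayName = "Signal"
    let bundleIDs = ["org.whispersystems.signal-desktop"]
}

/// Looks up the adapter responsible for a running app (hypothetical type).
struct AdapterRegistry {
    let adapters: [AppAdapter]

    func adapter(forBundleID id: String) -> AppAdapter? {
        return adapters.first { $0.bundleIDs.contains(id) }
    }
}
```

With this layout, supporting a fifth app would mean adding one struct in its own file and appending it to the registry's list, without touching the existing adapters.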

6. The backend (do not rebuild) — now concrete

SparkControl, on the operator's Start9 LAN, fronting two DGX Sparks:

  • STT: NVIDIA Parakeet TDT 0.6B — POST /v1/audio/transcriptions (OpenAI-compatible).
  • Diarization: NVIDIA Sortformer 4spk-v1 — POST /api/audio/diarize-chunk (anonymous clusters + voiceprints) and POST /api/audio/transcribe-with-speakers.
  • Embeddings: NVIDIA TitaNet (192-dim voiceprints).
  • ★ Primary endpoint for this app: POST /api/audio/label-merge — diarize + name from the visual timeline (+ voiceprint fallback), optionally transcribe, in one synchronous call.
  • Health/discovery: GET /api/status, GET /api/endpoints, GET /v1/models.

Full request/response shapes, curl examples, limits, and error formats are in 03_DATA_CONTRACTS.md.
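For orientation, a multipart/form-data body with the audio in a field named file can be assembled by hand as sketched below. Only the `file` field name comes from this brief; the `multipartBody` function, its parameters, and any text-field names passed in are placeholders, and the authoritative field list is in 03_DATA_CONTRACTS.md.

```swift
import Foundation

/// Builds a multipart/form-data body: zero or more text fields followed by
/// the WAV payload in a part named "file".
/// The caller must send the header
/// `Content-Type: multipart/form-data; boundary=<boundary>` with the request.
func multipartBody(boundary: String,
                   wav: Data,
                   filename: String,
                   textFields: [(name: String, value: String)]) -> Data {
    var body = Data()
    func append(_ text: String) { body.append(Data(text.utf8)) }

    // Plain text parts (for example the serialized timeline).
    for field in textFields {
        append("--\(boundary)\r\n")
        append("Content-Disposition: form-data; name=\"\(field.name)\"\r\n\r\n")
        append("\(field.value)\r\n")
    }

    // The audio part. The field name "file" is the one the brief specifies.
    append("--\(boundary)\r\n")
    append("Content-Disposition: form-data; name=\"file\"; filename=\"\(filename)\"\r\n")
    append("Content-Type: audio/wav\r\n\r\n")
    body.append(wav)

    // Closing delimiter.
    append("\r\n--\(boundary)--\r\n")
    return body
}
```

A random boundary such as `UUID().uuidString` is a common choice, since it must not occur inside any part's content.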

7. Remaining open items (small)

  1. Base URL: RESOLVED. The backend lives on a private LAN host, addressed by a .local mDNS name (preferred over a raw IP, since it survives IP changes). The URL is configured in Settings or via the SPARK_BACKEND_URL env var and is never committed. Ship a neutral placeholder as the default and keep it editable in Settings. Service discovery is available at GET /api/endpoints.
  2. Send trigger — assume auto-POST on call end; expose a "hold for review" toggle if the user wants to eyeball the timeline first.
  3. Retention — keep the session folder after a successful hand-off, or prune audio and keep only speakers.json + voiceprints? Default: keep everything, user-configurable.
  4. Voiceprint update policy — overwrite vs running-average a person's stored voiceprint across calls (see 02_ARCHITECTURE.md §2.9). Start simple (store/refresh latest high-confidence), refine later.
  5. Signing — stable identity so macOS doesn't re-prompt for permissions on each rebuild.
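Item 1 above describes three possible sources for the base URL. One way to resolve them is sketched here, assuming the Settings value takes precedence over the env var, which in turn takes precedence over the placeholder; that ordering, the `resolveBackendURL` function, and the placeholder string are assumptions rather than the app's actual behavior.

```swift
import Foundation

/// Hypothetical neutral placeholder; the real default shipped with the app
/// is not stated in this brief.
let placeholderBackendURL = "https://backend.example.invalid"

/// Picks the first usable base URL from: the Settings value, the
/// SPARK_BACKEND_URL environment variable, then the placeholder.
/// Blank or malformed candidates are skipped.
func resolveBackendURL(settingsValue: String?,
                       environment: [String: String]) -> URL? {
    let candidates: [String?] = [
        settingsValue,
        environment["SPARK_BACKEND_URL"],
        placeholderBackendURL,
    ]
    for candidate in candidates {
        guard let raw = candidate?.trimmingCharacters(in: .whitespacesAndNewlines),
              !raw.isEmpty,
              let url = URL(string: raw),
              url.scheme != nil, url.host != nil else { continue }
        return url
    }
    return nil
}
```

In the app this would be called with `ProcessInfo.processInfo.environment`. Passing the environment in as a dictionary keeps the function testable and avoids reading process state in more than one place.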