ten31-transcripts/docs/01_PROJECT_BRIEF.md
Grant Gilliam dda4322de7 Reconcile docs/ specs with the shipped app
Document the dual-channel label-merge path (mic_file/system_file/self_name/self_vad) and the recap phase (transcript.md + recap.html via the backend LLM) across docs/01-03; correct docs/02 $2.10 to the UI actually shipped; mark docs/01 $7 open items as settled; remove the dead AUDIO_API.md references; note the manifest sha256 fields are not emitted; mark docs/04 as a complete/historical build log. Also drop the last stale "Phase 0" UI string in MenuBarView and retire the now-done doc-debt items in ROADMAP.
2026-06-16 22:09:04 -05:00

# Project Brief — Ten31 Transcripts
> Local macOS app that auto-detects conference calls, records local audio, and
> produces a **visual-derived, timestamped speaker timeline** — then hands the
> mixed audio + that timeline to the operator's **SparkControl** backend, which
> diarizes, names the speakers (majority-overlap vote against the timeline), and
> returns named transcript segments. A growing **voiceprint library** recovers
> speakers even when the visual cue is missing.

Master context document. Read this first, then `02_ARCHITECTURE.md` and
`03_DATA_CONTRACTS.md`. The SparkControl API is fully specified in
`03_DATA_CONTRACTS.md`.

---

## 1. What we are building
A lightweight, always-running **menu-bar app on macOS** that:
1. **Detects** when the user joins a call in Google Meet, Zoom, Microsoft Teams,
or Signal.
2. **Records two local audio tracks**: system audio (everyone else) and the
user's microphone (the user). It sends the backend **dual-channel** audio
(`mic_file` + `system_file`) when the system track is healthy, and falls back to
a **mixed-mono 16 kHz WAV** otherwise.
3. **Watches the call window** at ~24 fps and, per app, reads participant
**names** and the **active-speaker cue**, producing a
`(start, end, name, confidence)` **visual timeline** — its best guess at who
was talking when.
4. **Discards every video frame after extraction.** No video is ever written to
disk. Only audio + the derived timeline persist locally.
5. On call end, **POSTs the audio + the visual timeline (+ the known voiceprint
library) to `POST /api/audio/label-merge`** on SparkControl, which returns
**named, speaker-attributed transcript segments** and a **voiceprint per
speaker**.
6. **Persists the returned voiceprints** keyed by name, so the next call can pass
them as `known_voiceprints` and recover a speaker by voice when the visual cue
is absent (camera off, a bad OCR frame).
7. **Renders the result locally** — a readable `transcript.md` plus an HTML
`recap.html` (topics + meeting extras, generated via the backend's LLM
endpoint), with an in-app editor for fixing speaker names after the fact.

The app's job ends at producing the named transcript and recap from SparkControl's
segments. **All transcription, diarization, name-merge, and LLM analysis happen on
the backend.** Do not build transcription, diarization, or the merge vote in this
app.
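
The `(start, end, name, confidence)` timeline from step 3 can be pictured as a list of spans built from per-frame readings. What follows is a minimal Swift sketch under stated assumptions, not the app's actual code: `FrameObservation`, `TimelineSpan`, and `buildTimeline` are hypothetical names, and the rule that a frame with no visible cue closes the current span is an assumption.

```swift
import Foundation

/// One entry of the visual timeline: who appeared to be speaking, and when.
struct TimelineSpan: Equatable {
    var start: Double       // seconds from recording start
    var end: Double
    var name: String
    var confidence: Double  // mean per-frame confidence over the span
}

/// A single sampled frame's reading (hypothetical type): the name shown next
/// to the active-speaker cue, or nil when no cue was visible.
struct FrameObservation {
    var time: Double
    var speaker: String?
    var confidence: Double
}

/// Collapse consecutive frames that agree on the speaker into spans.
/// `frameInterval` is the sampling period; it closes the last frame of a run.
func buildTimeline(from frames: [FrameObservation], frameInterval: Double) -> [TimelineSpan] {
    var spans: [TimelineSpan] = []
    var runCount = 0  // frames folded into the span currently being extended
    for frame in frames {
        guard let name = frame.speaker else { runCount = 0; continue }
        if runCount > 0, var last = spans.last, last.name == name {
            // Extend the current run and update the running mean confidence.
            last.confidence = (last.confidence * Double(runCount) + frame.confidence)
                / Double(runCount + 1)
            last.end = frame.time + frameInterval
            spans[spans.count - 1] = last
            runCount += 1
        } else {
            spans.append(TimelineSpan(start: frame.time, end: frame.time + frameInterval,
                                      name: name, confidence: frame.confidence))
            runCount = 1
        }
    }
    return spans
}
```

Because the frames themselves are never stored, only the resulting spans would need to persist, which is consistent with step 4.
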
## 2. Why the visual timeline still matters (the core idea)
Audio diarization (NVIDIA Sortformer on the backend) is excellent at
**segmentation** — precise speaker boundaries — but its clusters are **anonymous**
(`Speaker_0`, `Speaker_1`…). It cannot name anyone.

The screen *already knows the names*. Each app visually marks the active speaker
(colored tile border, animated audio bars, a ring around an avatar) next to that
person's name/initials.

So responsibilities split cleanly:
- **Audio (backend)** owns *segmentation* — the exact *when*.
- **Visual capture (this app)** owns *identity* — the *who*.
- **`label-merge` (backend)** fuses them: it diarizes, then assigns each cluster
the timeline name with the most temporal overlap. The visual track needn't be
perfect — it only needs to win the per-cluster vote.
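
To make the per-cluster vote concrete, here is a small Swift sketch of a majority-overlap assignment. It only illustrates the idea: the real vote runs on the backend and must not be built into this app, and `Interval`, `NamedSpan`, `overlap`, and `voteName` are hypothetical names rather than anything from the SparkControl contract.

```swift
import Foundation

struct Interval { var start: Double; var end: Double }

/// A span from the visual timeline: a name plus the time range it covers.
struct NamedSpan { var name: String; var interval: Interval }

/// Seconds shared by two intervals (0 when they do not intersect).
func overlap(_ a: Interval, _ b: Interval) -> Double {
    max(0, min(a.end, b.end) - max(a.start, b.start))
}

/// For one anonymous cluster (all segments the diarizer assigned to it),
/// return the timeline name with the most total overlap, or nil if no
/// timeline span overlaps the cluster at all.
func voteName(clusterSegments: [Interval], timeline: [NamedSpan]) -> String? {
    var seconds: [String: Double] = [:]
    for segment in clusterSegments {
        for span in timeline {
            let shared = overlap(segment, span.interval)
            if shared > 0 { seconds[span.name, default: 0] += shared }
        }
    }
    return seconds.max { $0.value < $1.value }?.key
}
```

A few mislabeled seconds in the timeline do not change the outcome as long as the correct name still accumulates the largest total, which is why the visual track only has to win the vote.
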

**New compounding layer: the voiceprint library.** Every named cluster comes
back from the backend with a 192-dim TitaNet voiceprint. The app persists these
keyed by name and replays them as `known_voiceprints` on the next call. The
resolution order per cluster becomes **visual overlap → voiceprint match →
`Unknown_N`**, so a cluster that matches neither stays unnamed rather than being
mislabeled. The screen capture thus *enrolls a voice library for free*, and over a
few calls the system can name regulars even with cameras off.
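
The resolution order can be sketched as a three-step fallback. This is an illustration only, since the matching happens on the backend: cosine similarity as the comparison, the `0.7` threshold, and the names `cosineSimilarity` and `resolveName` are all assumptions and not part of the documented contract.

```swift
import Foundation

/// Cosine similarity between two equal-length embedding vectors.
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    precondition(a.count == b.count, "vectors must have the same dimension")
    var dot: Float = 0, normA: Float = 0, normB: Float = 0
    for i in a.indices {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    let denominator = normA.squareRoot() * normB.squareRoot()
    return denominator > 0 ? dot / denominator : 0
}

/// Resolve one cluster: visual overlap first, then voiceprint match, then
/// `Unknown_N`. `threshold` is an assumed cut-off, not a backend value.
func resolveName(visualName: String?,
                 clusterVoiceprint: [Float],
                 knownVoiceprints: [String: [Float]],
                 unknownIndex: Int,
                 threshold: Float = 0.7) -> String {
    // 1. The visual timeline won the overlap vote for this cluster.
    if let visualName = visualName { return visualName }
    // 2. Otherwise pick the closest stored voiceprint, if it is close enough.
    let best = knownVoiceprints
        .map { (name: $0.key, score: cosineSimilarity(clusterVoiceprint, $0.value)) }
        .max { $0.score < $1.score }
    if let best = best, best.score >= threshold { return best.name }
    // 3. Never guess: leave the cluster explicitly unnamed.
    return "Unknown_\(unknownIndex)"
}
```

The final branch is what keeps a weak match from turning into a wrong name: anything below the threshold stays `Unknown_N`.
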
## 3. Hard scope boundaries
**In scope (this app):**
- Call detection for Meet / Zoom / Teams / Signal.
- Dual-track local audio capture; **dual-channel send** (mic + system) with a
mix-to-mono fallback for the backend.
- Low-fps window capture → OCR (names) + active-speaker cue detection.
- Per-app "adapter" modules encapsulating each app's UI quirks.
- Building the visual timeline; **mic-VAD self-labeling** (the mic track is the
user, so hot-mic spans pre-seed the user's name into the timeline).
- Chunking long calls (~23 min) and calling `label-merge` **sequentially**.
- A local **voiceprint store** (persist + replay named voiceprints).
- Storing the backend's named segments and **rendering** them — `transcript.md`
plus an HTML `recap.html` (recap analysis via the backend LLM) — with an in-app
speaker-name editor.
- A minimal menu-bar UI: status, manual start/stop, the last session (reveal,
resend, open recap, edit speakers), adapter toggles, backend host/health,
output folder.

**Out of scope (owned by the backend):**
- Transcription, diarization, the name-merge vote, and LLM summarization — these
run on the backend; the app only orchestrates the recap call and renders the
result.

**Explicitly not doing:** saving video, or using any cloud service. Everything
stays on the operator's LAN.
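
Of the in-scope items, mic-VAD self-labeling is the easiest to show in a few lines. The sketch below uses a simple RMS energy threshold over fixed windows; the app's real detector may work differently, and `SelfSpan`, `selfSpans`, the 0.5 s window, and the `0.02` threshold are assumptions chosen for illustration.

```swift
import Foundation

struct SelfSpan: Equatable { var start: Double; var end: Double; var name: String }

/// Energy-based voice activity detection over mono mic samples in [-1, 1].
/// Windows whose RMS exceeds `rmsThreshold` count as "hot"; adjacent hot
/// windows merge into one span labeled with the user's own name.
func selfSpans(micSamples: [Float], sampleRate: Double, selfName: String,
               windowSeconds: Double = 0.5, rmsThreshold: Float = 0.02) -> [SelfSpan] {
    let window = max(1, Int(sampleRate * windowSeconds))
    var spans: [SelfSpan] = []
    var index = 0
    while index < micSamples.count {
        let end = min(index + window, micSamples.count)
        let slice = micSamples[index..<end]
        let meanSquare = slice.reduce(Float(0)) { $0 + $1 * $1 } / Float(slice.count)
        if meanSquare.squareRoot() > rmsThreshold {
            let startTime = Double(index) / sampleRate
            let endTime = Double(end) / sampleRate
            if let last = spans.last, last.end == startTime {
                spans[spans.count - 1].end = endTime  // extend the running span
            } else {
                spans.append(SelfSpan(start: startTime, end: endTime, name: selfName))
            }
        }
        index = end
    }
    return spans
}
```

The resulting spans can be merged into the visual timeline before sending, so the user's own name is present even when their tile never shows a speaking cue.
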
## 4. Key decisions (now resolved against the real contract)
| Decision | Choice | Why |
|---|---|---|
| Language / framework | Native Swift + SwiftUI menu-bar app (`LSUIElement`) | System audio, window capture, Vision all native; one codebase. |
| Audio capture | ScreenCaptureKit (system audio) + AVFoundation (mic) | No virtual audio device; works with headphones; macOS 13+. |
| Backend audio format | **Dual-channel (mic + system)** when the system track is healthy, else **mixed-mono 16 kHz WAV** | Separate tracks let the backend attribute the user's mic channel directly; the diarizer can still split the mono fallback. |
| Call detection | CoreAudio "mic running somewhere" + known-app / Meet-tab heuristic | Clean live-mic signal + app disambiguation. |
| Speaker naming | **Backend, via `POST /api/audio/label-merge`** | One call does diarize + overlap-vote naming + transcription. No client merge. |
| Identity recovery | **Local voiceprint library** replayed as `known_voiceprints` | Recovers camera-off / OCR-missed speakers by voice; compounds over calls. |
| Self-identity | mic-VAD → pre-seed user's name in timeline | The mic track is the user; gives the backend a strong prior + enrolls the user's voiceprint immediately. |
| Requests | **Sequential, one audio request in flight** | Parallel audio requests trip a backend GPU race (`503 + Retry-After`). |
| Long calls | Chunk ~23 min, sequential, stitch via names+voiceprints | Diarizer caps at **4 speakers/chunk**; voiceprints + names unify across chunks. |
| Transport / TLS | `multipart/form-data`, file field `file` (mono) or `mic_file` + `system_file` (dual-channel); self-signed Start9 cert (trust the Root CA — supported default; host-scoped skip-verify is an off-by-default escape hatch); **no auth on LAN** | Matches every other SparkControl endpoint. |
| Timing | Batch after call (sync endpoints, no polling) | Endpoints are synchronous; no job/poll machinery needed. |
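
The "Requests" and "Long calls" rows combine into a simple loop: plan fixed-length chunks, then send them strictly one at a time while carrying the voiceprint library forward. The Swift sketch below assumes as much; `Chunk`, `planChunks`, and `sendSequentially` are hypothetical names, and the `send` closure stands in for the synchronous `label-merge` call.

```swift
import Foundation

struct Chunk: Equatable { var index: Int; var start: Double; var end: Double }

/// Split a call of `duration` seconds into consecutive chunks of at most
/// `chunkSeconds` each. The last chunk may be shorter.
func planChunks(duration: Double, chunkSeconds: Double) -> [Chunk] {
    precondition(chunkSeconds > 0, "chunk length must be positive")
    var chunks: [Chunk] = []
    var start = 0.0
    while start < duration {
        let end = min(start + chunkSeconds, duration)
        chunks.append(Chunk(index: chunks.count, start: start, end: end))
        start = end
    }
    return chunks
}

/// Send chunks one after another, never in parallel. `send` receives the
/// voiceprints known so far and returns new or refreshed ones, so later
/// chunks can recognize speakers named in earlier chunks.
func sendSequentially(_ chunks: [Chunk],
                      known: [String: [Float]],
                      send: (Chunk, [String: [Float]]) throws -> [String: [Float]]
) rethrows -> [String: [Float]] {
    var library = known
    for chunk in chunks {
        let returned = try send(chunk, library)  // blocks until this chunk is done
        library.merge(returned) { _, new in new }
    }
    return library
}
```

With the table's ~23 min figure, `planChunks(duration: 3600, chunkSeconds: 23 * 60)` yields three chunks for a one-hour call. Keeping the loop serial is what avoids the backend GPU race noted in the "Requests" row.
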
### On forking Hyprnote
Unchanged recommendation: the audio capture is the trivial part (~200 lines of
native Swift) and the rest (Vision screen-reading) is native too. Build native;
use Hyprnote's capture/detection only as reference. Forking remains possible only
as an explicit override of this recommendation.
## 5. Target apps & identifiers
| App | Join via | Bundle ID(s) | Speaking cue / names |
|---|---|---|---|
| Zoom | Native | `us.zoom.xos` | Colored tile border; name label in tile. |
| Microsoft Teams | Native (new) | `com.microsoft.teams2` (new), `com.microsoft.teams` (classic) | Colored ring/border; labeled. |
| Signal | Native (Electron) | `org.whispersystems.signal-desktop` | Ring around avatar/initials; try Accessibility names first. |
| Google Meet | Browser tab | `com.google.Chrome`, `com.apple.Safari`, `company.thebrowser.Browser` (Arc)… | Canvas video → Vision for the cue; DOM names → Accessibility/AppleScript; confirm via active-tab URL `meet.google.com`. |

Four adapters are required; adding a fifth must take only one new file that
conforms to the `AppAdapter` protocol.
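
One way to keep the "one new file" property is a small protocol plus a registry keyed by bundle ID. The shape below is a guess for illustration: only the `AppAdapter` name and the bundle IDs come from this brief, while `AdapterReading`, `read(frame:)`, `adapterTypes`, and `adapterType(forBundleID:)` are hypothetical.

```swift
import Foundation

/// What an adapter reads from one captured frame (hypothetical shape).
struct AdapterReading {
    var participants: [String]
    var activeSpeaker: String?
    var confidence: Double
}

/// Hypothetical requirements; the real protocol is defined in the codebase.
protocol AppAdapter {
    static var bundleIDs: [String] { get }
    /// `frame` stands in for whatever image type the capture layer provides.
    func read(frame: Data) -> AdapterReading
}

struct ZoomAdapter: AppAdapter {
    static let bundleIDs = ["us.zoom.xos"]
    func read(frame: Data) -> AdapterReading {
        // Real logic would find the colored tile border and OCR the name label.
        AdapterReading(participants: [], activeSpeaker: nil, confidence: 0)
    }
}

struct TeamsAdapter: AppAdapter {
    static let bundleIDs = ["com.microsoft.teams2", "com.microsoft.teams"]
    func read(frame: Data) -> AdapterReading {
        // Real logic would look for the colored ring/border around a tile.
        AdapterReading(participants: [], activeSpeaker: nil, confidence: 0)
    }
}

/// Registry: supporting another app means adding one more type to this list.
let adapterTypes: [AppAdapter.Type] = [ZoomAdapter.self, TeamsAdapter.self]

/// Look up the adapter type that claims a given bundle ID, if any.
func adapterType(forBundleID id: String) -> AppAdapter.Type? {
    adapterTypes.first { $0.bundleIDs.contains(id) }
}
```

Each app's UI quirks stay inside its own type, so call detection only needs the frontmost app's bundle ID to choose an adapter.
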
## 6. The backend (do not rebuild) — now concrete
SparkControl, on the operator's Start9 LAN, fronting two DGX Sparks:
- **STT:** NVIDIA Parakeet TDT 0.6B — `POST /v1/audio/transcriptions` (OpenAI-compatible).
- **Diarization:** NVIDIA Sortformer 4spk-v1 — `POST /api/audio/diarize-chunk`
(anonymous clusters + voiceprints) and `POST /api/audio/transcribe-with-speakers`.
- **Embeddings:** NVIDIA TitaNet (192-dim voiceprints).
- **★ Primary endpoint for this app:** `POST /api/audio/label-merge` — diarize +
name from the visual timeline (+ voiceprint fallback), optionally transcribe,
in one synchronous call.
- **LLM (recap):** Qwen3 via OpenAI-compatible `POST /v1/chat/completions`
generates the readable recap (topics + meeting extras) from the transcript.
- Health/discovery: `GET /api/status`, `GET /api/endpoints`, `GET /v1/models`.

Full request/response shapes, curl examples, limits, and error formats are in
`03_DATA_CONTRACTS.md`.
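
As a rough picture of what a `label-merge` upload involves on the client side, the Swift sketch below assembles a `multipart/form-data` body and picks the audio parts. The file field names `mic_file`, `system_file`, and `file` come from this brief; `FilePart`, `multipartBody`, `audioParts`, and the file names are hypothetical, and the text fields are left generic because their exact names are defined in `03_DATA_CONTRACTS.md`.

```swift
import Foundation

struct FilePart { var field: String; var filename: String; var mimeType: String; var data: Data }

/// Build a multipart/form-data body from plain text fields and file parts.
func multipartBody(boundary: String,
                   fields: [(name: String, value: String)],
                   files: [FilePart]) -> Data {
    var body = Data()
    func append(_ string: String) { body.append(Data(string.utf8)) }
    for field in fields {
        append("--\(boundary)\r\n")
        append("Content-Disposition: form-data; name=\"\(field.name)\"\r\n\r\n")
        append("\(field.value)\r\n")
    }
    for file in files {
        append("--\(boundary)\r\n")
        append("Content-Disposition: form-data; name=\"\(file.field)\"; filename=\"\(file.filename)\"\r\n")
        append("Content-Type: \(file.mimeType)\r\n\r\n")
        body.append(file.data)
        append("\r\n")
    }
    append("--\(boundary)--\r\n")  // closing delimiter
    return body
}

/// Dual-channel when a healthy system track exists, otherwise the mono mix.
/// Pass `system: nil` to signal that the system track is unusable.
func audioParts(mic: Data, system: Data?, mixedMono: Data) -> [FilePart] {
    if let system = system {
        return [
            FilePart(field: "mic_file", filename: "mic.wav", mimeType: "audio/wav", data: mic),
            FilePart(field: "system_file", filename: "system.wav", mimeType: "audio/wav", data: system),
        ]
    }
    return [FilePart(field: "file", filename: "mixed.wav", mimeType: "audio/wav", data: mixedMono)]
}
```

The request would also need a `Content-Type: multipart/form-data; boundary=...` header carrying the same boundary string used in the body.
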
## 7. Settled decisions (were open at brief time)
1. **Base URL.** A private LAN host — a `.local` mDNS name (preferred over a raw
IP, since it survives IP changes) — configured in Settings or via the
`SPARK_BACKEND_URL` env var, never committed. A neutral placeholder ships as the
default and stays editable in Settings. Service discovery is available at `GET /api/endpoints`.
2. **Send trigger.** Auto-send on call end is a setting (`autoSendOnStop`), **off
by default** — the user reviews the session and sends manually unless they opt in.
3. **Retention.** The session folder is kept after a successful hand-off (output
location is configurable); nothing is pruned automatically.
4. **Voiceprint update policy.** Store/refresh the latest high-confidence vector
per name (`02_ARCHITECTURE.md §2.9`); a per-name running average is a possible
later refinement.
5. **Signing.** A stable identity via `Config/Signing.xcconfig` (gitignored) keeps
macOS from re-prompting for permissions on each rebuild.
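
The voiceprint update policy in item 4 amounts to "overwrite with the newest vector, but only if it is confident enough". A minimal Swift sketch of that rule follows; `StoredVoiceprint`, `VoiceprintStore`, and the `0.8` cut-off are assumptions, and the actual policy is the one described in `02_ARCHITECTURE.md §2.9`.

```swift
import Foundation

struct StoredVoiceprint: Codable, Equatable {
    var vector: [Float]
    var confidence: Double
    var updatedAt: Date
}

struct VoiceprintStore {
    private(set) var entries: [String: StoredVoiceprint] = [:]
    /// Assumed cut-off for "high-confidence"; not a value from the brief.
    var minimumConfidence: Double = 0.8

    /// Keep the latest high-confidence vector per name; weaker readings are
    /// ignored so they cannot replace a good stored voiceprint.
    mutating func update(name: String, vector: [Float], confidence: Double,
                         at date: Date = Date()) {
        guard confidence >= minimumConfidence else { return }
        entries[name] = StoredVoiceprint(vector: vector, confidence: confidence,
                                         updatedAt: date)
    }

    /// The name-to-vector map to replay as `known_voiceprints`.
    var knownVoiceprints: [String: [Float]] {
        entries.mapValues { $0.vector }
    }
}
```

Moving to the per-name running average mentioned as a later refinement would only change the body of `update`, since callers see nothing but the name-to-vector map.
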