dda4322de7
Document the dual-channel label-merge path (mic_file/system_file/self_name/self_vad) and the recap phase (transcript.md + recap.html via the backend LLM) across docs/01-03; correct docs/02 $2.10 to the UI actually shipped; mark docs/01 $7 open items as settled; remove the dead AUDIO_API.md references; note the manifest sha256 fields are not emitted; mark docs/04 as a complete/historical build log. Also drop the last stale "Phase 0" UI string in MenuBarView and retire the now-done doc-debt items in ROADMAP.
164 lines
9.8 KiB
Markdown
164 lines
9.8 KiB
Markdown
# Project Brief — Ten31 Transcripts
|
||
|
||
> Local macOS app that auto-detects conference calls, records local audio, and
|
||
> produces a **visual-derived, timestamped speaker timeline** — then hands the
|
||
> mixed audio + that timeline to the operator's **SparkControl** backend, which
|
||
> diarizes, names the speakers (majority-overlap vote against the timeline), and
|
||
> returns named transcript segments. A growing **voiceprint library** recovers
|
||
> speakers even when the visual cue is missing.
|
||
|
||
Master context document. Read this first, then `02_ARCHITECTURE.md` and
|
||
`03_DATA_CONTRACTS.md`. The SparkControl API is fully specified in
|
||
`03_DATA_CONTRACTS.md`.
|
||
|
||
---
|
||
|
||
## 1. What we are building
|
||
|
||
A lightweight, always-running **menu-bar app on macOS** that:
|
||
|
||
1. **Detects** when the user joins a call in Google Meet, Zoom, Microsoft Teams,
|
||
or Signal.
|
||
2. **Records two local audio tracks** — system audio (everyone else) and the
|
||
user's microphone (the user). It sends the backend **dual-channel**
|
||
(`mic_file` + `system_file`) when the system track is healthy, falling back to
|
||
a **mixed-mono 16 kHz WAV** otherwise.
|
||
3. **Watches the call window** at ~2–4 fps and, per app, reads participant
|
||
**names** and the **active-speaker cue**, producing a
|
||
`(start, end, name, confidence)` **visual timeline** — its best guess at who
|
||
was talking when.
|
||
4. **Discards every video frame after extraction.** No video is ever written to
|
||
disk. Only audio + the derived timeline persist locally.
|
||
5. On call end, **POSTs the audio + the visual timeline (+ the known voiceprint
|
||
library) to `POST /api/audio/label-merge`** on SparkControl, which returns
|
||
**named, speaker-attributed transcript segments** and a **voiceprint per
|
||
speaker**.
|
||
6. **Persists the returned voiceprints** keyed by name, so the next call can pass
|
||
them as `known_voiceprints` and recover a speaker by voice when the visual cue
|
||
is absent (camera off, a bad OCR frame).
|
||
7. **Renders the result locally** — a readable `transcript.md` plus an HTML
|
||
`recap.html` (topics + meeting extras, generated via the backend's LLM
|
||
endpoint), with an in-app editor for fixing speaker names after the fact.
|
||
|
||
The app's job ends at producing the named transcript and recap from SparkControl's
|
||
segments. **All transcription, diarization, name-merge, and LLM analysis happen on
|
||
the backend.** Do not build transcription, diarization, or the merge vote in this
|
||
app.
|
||
|
||
## 2. Why the visual timeline still matters (the core idea)
|
||
|
||
Audio diarization (NVIDIA Sortformer on the backend) is excellent at
|
||
**segmentation** — precise speaker boundaries — but its clusters are **anonymous**
|
||
(`Speaker_0`, `Speaker_1`…). It cannot name anyone.
|
||
|
||
The screen *already knows the names*. Each app visually marks the active speaker
|
||
(colored tile border, animated audio bars, a ring around an avatar) next to that
|
||
person's name/initials.
|
||
|
||
So responsibilities split cleanly:
|
||
- **Audio (backend)** owns *segmentation* — the exact *when*.
|
||
- **Visual capture (this app)** owns *identity* — the *who*.
|
||
- **`label-merge` (backend)** fuses them: it diarizes, then assigns each cluster
|
||
the timeline name with the most temporal overlap. The visual track needn't be
|
||
perfect — it only needs to win the per-cluster vote.
|
||
|
||
**New compounding layer — the voiceprint library.** Every named cluster comes
|
||
back from the backend with a 192-dim TitaNet voiceprint. The app persists these
|
||
keyed by name and replays them as `known_voiceprints` next time. Resolution order
|
||
per cluster becomes: **visual overlap → voiceprint match → `Unknown_N`** (never
|
||
mislabeled). So the screen capture *enrolls a voice library for free*, and over a
|
||
few calls the system can name regulars even with cameras off.
|
||
|
||
## 3. Hard scope boundaries
|
||
|
||
**In scope (this app):**
|
||
- Call detection for Meet / Zoom / Teams / Signal.
|
||
- Dual-track local audio capture; **dual-channel send** (mic + system) with a
|
||
mix-to-mono fallback for the backend.
|
||
- Low-fps window capture → OCR (names) + active-speaker cue detection.
|
||
- Per-app "adapter" modules encapsulating each app's UI quirks.
|
||
- Building the visual timeline; **mic-VAD self-labeling** (the mic track is the
|
||
user, so hot-mic spans pre-seed the user's name into the timeline).
|
||
- Chunking long calls (~2–3 min) and calling `label-merge` **sequentially**.
|
||
- A local **voiceprint store** (persist + replay named voiceprints).
|
||
- Storing the backend's named segments and **rendering** them — `transcript.md`
|
||
plus an HTML `recap.html` (recap analysis via the backend LLM) — with an in-app
|
||
speaker-name editor.
|
||
- A minimal menu-bar UI: status, manual start/stop, the last session (reveal,
|
||
resend, open recap, edit speakers), adapter toggles, backend host/health,
|
||
output folder.
|
||
|
||
**Out of scope (owned by the backend):**
|
||
- Transcription, diarization, the name-merge vote, and LLM summarization — these
|
||
run on the backend; the app only orchestrates the recap call and renders the
|
||
result.
|
||
|
||
**Explicitly not doing:** saving video; cloud anything. Everything stays on the
|
||
operator's LAN.
|
||
|
||
## 4. Key decisions (now resolved against the real contract)
|
||
|
||
| Decision | Choice | Why |
|
||
|---|---|---|
|
||
| Language / framework | Native Swift + SwiftUI menu-bar app (`LSUIElement`) | System audio, window capture, Vision all native; one codebase. |
|
||
| Audio capture | ScreenCaptureKit (system audio) + AVFoundation (mic) | No virtual audio device; works with headphones; macOS 13+. |
|
||
| Backend audio format | **Dual-channel (mic + system)** when the system track is healthy, else **mixed-mono 16 kHz WAV** | Separate tracks let the backend attribute the user's mic channel directly; the diarizer can still split the mono fallback. |
|
||
| Call detection | CoreAudio "mic running somewhere" + known-app / Meet-tab heuristic | Clean live-mic signal + app disambiguation. |
|
||
| Speaker naming | **Backend, via `POST /api/audio/label-merge`** | One call does diarize + overlap-vote naming + transcription. No client merge. |
|
||
| Identity recovery | **Local voiceprint library** replayed as `known_voiceprints` | Recovers camera-off / OCR-missed speakers by voice; compounds over calls. |
|
||
| Self-identity | mic-VAD → pre-seed user's name in timeline | The mic track is the user; gives the backend a strong prior + enrolls the user's voiceprint immediately. |
|
||
| Requests | **Sequential, one audio request in flight** | Parallel audio requests trip a backend GPU race (`503 + Retry-After`). |
|
||
| Long calls | Chunk ~2–3 min, sequential, stitch via names+voiceprints | Diarizer caps at **4 speakers/chunk**; voiceprints + names unify across chunks. |
|
||
| Transport / TLS | `multipart/form-data`, file field `file` (mono) or `mic_file` + `system_file` (dual-channel); self-signed Start9 cert (trust the Root CA — supported default; host-scoped skip-verify is an off-by-default escape hatch); **no auth on LAN** | Matches every other SparkControl endpoint. |
|
||
| Timing | Batch after call (sync endpoints, no polling) | Endpoints are synchronous; no job/poll machinery needed. |
|
||
|
||
### On forking Hyprnote
|
||
Unchanged recommendation: the audio capture is the trivial part (~200 lines of
|
||
native Swift) and the rest (Vision screen-reading) is native too. Build native;
|
||
use Hyprnote's capture/detection only as reference. Fork remains an override.
|
||
|
||
## 5. Target apps & identifiers
|
||
|
||
| App | Join via | Bundle ID(s) | Speaking cue / names |
|
||
|---|---|---|---|
|
||
| Zoom | Native | `us.zoom.xos` | Colored tile border; name label in tile. |
|
||
| Microsoft Teams | Native (new) | `com.microsoft.teams2` (new), `com.microsoft.teams` (classic) | Colored ring/border; labeled. |
|
||
| Signal | Native (Electron) | `org.whispersystems.signal-desktop` | Ring around avatar/initials; try Accessibility names first. |
|
||
| Google Meet | Browser tab | `com.google.Chrome`, `com.apple.Safari`, `company.thebrowser.Browser` (Arc)… | Canvas video → Vision for the cue; DOM names → Accessibility/AppleScript; confirm via active-tab URL `meet.google.com`. |
|
||
|
||
Four required adapters; adding a 5th must be one new file conforming to the
|
||
`AppAdapter` protocol.
|
||
|
||
## 6. The backend (do not rebuild) — now concrete
|
||
|
||
SparkControl, on the operator's Start9 LAN, fronting two DGX Sparks:
|
||
- **STT:** NVIDIA Parakeet TDT 0.6B — `POST /v1/audio/transcriptions` (OpenAI-compatible).
|
||
- **Diarization:** NVIDIA Sortformer 4spk-v1 — `POST /api/audio/diarize-chunk`
|
||
(anonymous clusters + voiceprints) and `POST /api/audio/transcribe-with-speakers`.
|
||
- **Embeddings:** NVIDIA TitaNet (192-dim voiceprints).
|
||
- **★ Primary endpoint for this app:** `POST /api/audio/label-merge` — diarize +
|
||
name from the visual timeline (+ voiceprint fallback), optionally transcribe,
|
||
in one synchronous call.
|
||
- **LLM (recap):** Qwen3 via OpenAI-compatible `POST /v1/chat/completions` —
|
||
generates the readable recap (topics + meeting extras) from the transcript.
|
||
- Health/discovery: `GET /api/status`, `GET /api/endpoints`, `GET /v1/models`.
|
||
|
||
Full request/response shapes, curl examples, limits, and error formats are in
|
||
`03_DATA_CONTRACTS.md`.
|
||
|
||
## 7. Settled decisions (were open at brief time)
|
||
|
||
1. **Base URL.** A private LAN host — a `.local` mDNS name (preferred over a raw
|
||
IP, since it survives IP changes) — configured in Settings or via the
|
||
`SPARK_BACKEND_URL` env var, never committed. A neutral placeholder ships as the
|
||
default and stays editable in Settings. Service-discovery at `GET /api/endpoints`.
|
||
2. **Send trigger.** Auto-send on call end is a setting (`autoSendOnStop`), **off
|
||
by default** — the user reviews the session and sends manually unless they opt in.
|
||
3. **Retention.** The session folder is kept after a successful hand-off (output
|
||
location is configurable); nothing is pruned automatically.
|
||
4. **Voiceprint update policy.** Store/refresh the latest high-confidence vector
|
||
per name (`02_ARCHITECTURE.md §2.9`); a per-name running average is a possible
|
||
later refinement.
|
||
5. **Signing.** A stable identity via `Config/Signing.xcconfig` (gitignored) keeps
|
||
macOS from re-prompting for permissions on each rebuild.
|