Reconcile docs/ specs with the shipped app

Document the dual-channel label-merge path (mic_file/system_file/self_name/self_vad) and the recap phase (transcript.md + recap.html via the backend LLM) across docs/01-03; correct docs/02 §2.10 to the UI actually shipped; mark docs/01 §7 open items as settled; remove the dead AUDIO_API.md references; note the manifest sha256 fields are not emitted; mark docs/04 as a complete/historical build log. Also drop the last stale "Phase 0" UI string in MenuBarView and retire the now-done doc-debt items in ROADMAP.
2026-06-16 22:09:04 -05:00
parent 85ea8fde45
commit dda4322de7
6 changed files with 106 additions and 56 deletions
@@ -7,9 +7,9 @@
 > returns named transcript segments. A growing **voiceprint library** recovers
 > speakers even when the visual cue is missing.

-Master context document. Read this first, then `02_ARCHITECTURE.md`,
-`03_DATA_CONTRACTS.md`, `04_BUILD_PLAN.md`. The SparkControl API is now fully
-specified — see `03_DATA_CONTRACTS.md` (and the source `AUDIO_API.md`).
+Master context document. Read this first, then `02_ARCHITECTURE.md` and
+`03_DATA_CONTRACTS.md`. The SparkControl API is fully specified in
+`03_DATA_CONTRACTS.md`.

 ---

@@ -20,25 +20,30 @@ A lightweight, always-running **menu-bar app on macOS** that:
 1. **Detects** when the user joins a call in Google Meet, Zoom, Microsoft Teams,
   or Signal.
 2. **Records two local audio tracks** — system audio (everyone else) and the
-   user's microphone (the user) — and **mixes them to one 16 kHz mono WAV** for
-   the backend.
+   user's microphone (the user). It sends the backend **dual-channel**
+   (`mic_file` + `system_file`) when the system track is healthy, falling back to
+   a **mixed-mono 16 kHz WAV** otherwise.
 3. **Watches the call window** at ~2–4 fps and, per app, reads participant
   **names** and the **active-speaker cue**, producing a
   `(start, end, name, confidence)` **visual timeline** — its best guess at who
   was talking when.
 4. **Discards every video frame after extraction.** No video is ever written to
   disk. Only audio + the derived timeline persist locally.
-5. On call end, **POSTs the mixed audio + the visual timeline (+ the known
-   voiceprint library) to `POST /api/audio/label-merge`** on SparkControl, which
-   returns **named, speaker-attributed transcript segments** and a **voiceprint
-   per speaker**.
+5. On call end, **POSTs the audio + the visual timeline (+ the known voiceprint
+   library) to `POST /api/audio/label-merge`** on SparkControl, which returns
+   **named, speaker-attributed transcript segments** and a **voiceprint per
+   speaker**.
 6. **Persists the returned voiceprints** keyed by name, so the next call can pass
   them as `known_voiceprints` and recover a speaker by voice when the visual cue
   is absent (camera off, a bad OCR frame).
+7. **Renders the result locally** — a readable `transcript.md` plus an HTML
+   `recap.html` (topics + meeting extras, generated via the backend's LLM
+   endpoint), with an in-app editor for fixing speaker names after the fact.

-The app's job ends at receiving and storing the named segments from SparkControl.
-**All transcription, diarization, and the name-merge happen on the backend.** Do
-not build transcription, diarization, or the merge vote in this app.
+The app's job ends at producing the named transcript and recap from SparkControl's
+segments. **All transcription, diarization, name-merge, and LLM analysis happen on
+the backend.** Do not build transcription, diarization, or the merge vote in this
+app.

 ## 2. Why the visual timeline still matters (the core idea)

@@ -68,19 +73,25 @@ few calls the system can name regulars even with cameras off.

 **In scope (this app):**
 - Call detection for Meet / Zoom / Teams / Signal.
-- Dual-track local audio capture + mix-to-mono for the backend.
+- Dual-track local audio capture; **dual-channel send** (mic + system) with a
+  mix-to-mono fallback for the backend.
 - Low-fps window capture → OCR (names) + active-speaker cue detection.
 - Per-app "adapter" modules encapsulating each app's UI quirks.
 - Building the visual timeline; **mic-VAD self-labeling** (the mic track is the
  user, so hot-mic spans pre-seed the user's name into the timeline).
 - Chunking long calls (~2–3 min) and calling `label-merge` **sequentially**.
 - A local **voiceprint store** (persist + replay named voiceprints).
-- Storing the backend's named transcript segments locally.
-- A minimal menu-bar UI: status, manual start/stop, recent sessions, adapter
-  toggles, backend host/health, output folder.
+- Storing the backend's named segments and **rendering** them — `transcript.md`
+  plus an HTML `recap.html` (recap analysis via the backend LLM) — with an in-app
+  speaker-name editor.
+- A minimal menu-bar UI: status, manual start/stop, the last session (reveal,
+  resend, open recap, edit speakers), adapter toggles, backend host/health,
+  output folder.

 **Out of scope (owned by the backend):**
-- Transcription, diarization, the name-merge vote, summarization/analysis.
+- Transcription, diarization, the name-merge vote, and LLM summarization — these
+  run on the backend; the app only orchestrates the recap call and renders the
+  result.

 **Explicitly not doing:** saving video; cloud anything. Everything stays on the
 operator's LAN.
@@ -91,14 +102,14 @@ operator's LAN.
 |---|---|---|
 | Language / framework | Native Swift + SwiftUI menu-bar app (`LSUIElement`) | System audio, window capture, Vision all native; one codebase. |
 | Audio capture | ScreenCaptureKit (system audio) + AVFoundation (mic) | No virtual audio device; works with headphones; macOS 13+. |
-| Backend audio format | **Mixed-mono 16 kHz WAV** | Diarizer separates speakers from one mixed stream; 16 kHz is ideal. |
+| Backend audio format | **Dual-channel (mic + system)** when the system track is healthy, else **mixed-mono 16 kHz WAV** | Separate tracks let the backend attribute the user's mic channel directly; the diarizer can still split the mono fallback. |
 | Call detection | CoreAudio "mic running somewhere" + known-app / Meet-tab heuristic | Clean live-mic signal + app disambiguation. |
 | Speaker naming | **Backend, via `POST /api/audio/label-merge`** | One call does diarize + overlap-vote naming + transcription. No client merge. |
 | Identity recovery | **Local voiceprint library** replayed as `known_voiceprints` | Recovers camera-off / OCR-missed speakers by voice; compounds over calls. |
 | Self-identity | mic-VAD → pre-seed user's name in timeline | The mic track is the user; gives the backend a strong prior + enrolls the user's voiceprint immediately. |
 | Requests | **Sequential, one audio request in flight** | Parallel audio requests trip a backend GPU race (`503 + Retry-After`). |
 | Long calls | Chunk ~2–3 min, sequential, stitch via names+voiceprints | Diarizer caps at **4 speakers/chunk**; voiceprints + names unify across chunks. |
-| Transport / TLS | `multipart/form-data`, file field `file`; self-signed Start9 cert (skip verify or trust the Root CA); **no auth on LAN** | Matches every other SparkControl endpoint. |
+| Transport / TLS | `multipart/form-data`, file field `file` (mono) or `mic_file` + `system_file` (dual-channel); self-signed Start9 cert (trust the Root CA — supported default; host-scoped skip-verify is an off-by-default escape hatch); **no auth on LAN** | Matches every other SparkControl endpoint. |
 | Timing | Batch after call (sync endpoints, no polling) | Endpoints are synchronous; no job/poll machinery needed. |

 ### On forking Hyprnote
@@ -128,25 +139,25 @@ SparkControl, on the operator's Start9 LAN, fronting two DGX Sparks:
 - **★ Primary endpoint for this app:** `POST /api/audio/label-merge` — diarize +
  name from the visual timeline (+ voiceprint fallback), optionally transcribe,
  in one synchronous call.
+- **LLM (recap):** Qwen3 via OpenAI-compatible `POST /v1/chat/completions` —
+  generates the readable recap (topics + meeting extras) from the transcript.
 - Health/discovery: `GET /api/status`, `GET /api/endpoints`, `GET /v1/models`.

 Full request/response shapes, curl examples, limits, and error formats are in
 `03_DATA_CONTRACTS.md`.

-## 7. Remaining open items (small)
+## 7. Settled decisions (were open at brief time)

-1. **Base URL — RESOLVED.** A private LAN host — a `.local` mDNS name (preferred
-   over a raw IP, since it survives IP changes) — configured in Settings or via the
-   `SPARK_BACKEND_URL` env var, and never committed. Ship a neutral placeholder as
-   the default; keep it editable in settings. Service-discovery at
-   `GET /api/endpoints`.
-2. **Send trigger** — assume auto-POST on call end; expose a "hold for review"
-   toggle if the user wants to eyeball the timeline first.
-3. **Retention** — keep the session folder after a successful hand-off, or prune
-   audio and keep only `speakers.json` + voiceprints? Default: keep everything,
-   user-configurable.
-4. **Voiceprint update policy** — overwrite vs running-average a person's stored
-   voiceprint across calls (see `02_ARCHITECTURE.md §2.9`). Start simple
-   (store/refresh latest high-confidence), refine later.
-5. **Signing** — stable identity so macOS doesn't re-prompt for permissions on
-   each rebuild.
+1. **Base URL.** A private LAN host — a `.local` mDNS name (preferred over a raw
+   IP, since it survives IP changes) — configured in Settings or via the
+   `SPARK_BACKEND_URL` env var, never committed. A neutral placeholder ships as the
+   default and stays editable in Settings. Service-discovery at `GET /api/endpoints`.
+2. **Send trigger.** Auto-send on call end is a setting (`autoSendOnStop`), **off
+   by default** — the user reviews the session and sends manually unless they opt in.
+3. **Retention.** The session folder is kept after a successful hand-off (output
+   location is configurable); nothing is pruned automatically.
+4. **Voiceprint update policy.** Store/refresh the latest high-confidence vector
+   per name (`02_ARCHITECTURE.md §2.9`); a per-name running average is a possible
+   later refinement.
+5. **Signing.** A stable identity via `Config/Signing.xcconfig` (gitignored) keeps
+   macOS from re-prompting for permissions on each rebuild.
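
The send-path choice documented in this change (dual-channel when the system track is healthy, mixed-mono otherwise) could be sketched roughly as below. This is a minimal illustration, not the app's actual code: only the form field names `file`, `mic_file`, and `system_file` come from the docs. The types `UploadPart` and `SystemTrackHealth`, the function `uploadParts`, and the 90% coverage threshold used to decide "healthy" are all hypothetical, since the diff does not say how track health is measured.

```swift
import Foundation

/// One file part of the multipart/form-data upload (hypothetical type).
struct UploadPart: Equatable {
    /// Form field name from the spec: "file", "mic_file", or "system_file".
    let fieldName: String
    let fileURL: URL
}

/// Hypothetical health summary for the system-audio track.
/// ASSUMPTION: "healthy" means the track covers at least 90% of the call;
/// the real criterion is not stated in the docs.
struct SystemTrackHealth {
    let capturedSeconds: Double
    let expectedSeconds: Double

    var isHealthy: Bool {
        expectedSeconds > 0 && capturedSeconds / expectedSeconds >= 0.9
    }
}

/// Picks the upload parts for `POST /api/audio/label-merge`:
/// dual-channel when the system track is healthy, else the mixed-mono fallback.
func uploadParts(mic: URL,
                 system: URL,
                 mixedMono: URL,
                 systemHealth: SystemTrackHealth) -> [UploadPart] {
    if systemHealth.isHealthy {
        return [
            UploadPart(fieldName: "mic_file", fileURL: mic),
            UploadPart(fieldName: "system_file", fileURL: system),
        ]
    }
    // Fallback: one mixed-mono 16 kHz WAV in the single `file` field.
    return [UploadPart(fieldName: "file", fileURL: mixedMono)]
}
```

Keeping the decision in one pure function means the fallback rule can be unit-tested without touching audio capture or the network; the caller would hand the returned parts to whatever multipart encoder the app uses.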