Reconcile docs/ specs with the shipped app

Document the dual-channel label-merge path (mic_file/system_file/self_name/self_vad) and the recap phase (transcript.md + recap.html via the backend LLM) across docs/01-03. Also:

- correct docs/02 §2.10 to match the UI that actually shipped
- mark the docs/01 §7 open items as settled
- remove the dead AUDIO_API.md references
- note that the manifest sha256 fields are not emitted
- mark docs/04 as a complete, historical build log
- drop the last stale "Phase 0" UI string in MenuBarView
- retire the now-done doc-debt items in ROADMAP
2026-06-16 22:09:04 -05:00
parent 85ea8fde45
commit dda4322de7
6 changed files with 106 additions and 56 deletions
@@ -1,7 +1,7 @@
 # Data Contracts — Ten31 Transcripts

 Companion to docs 01/02. Defines the files the app produces/stores and the **real
-SparkControl contract** (source of truth: `AUDIO_API.md`). The `label-merge`
+SparkControl contract** (verified against the live backend). The `label-merge`
 endpoint is the app's primary integration point.

 ---
@@ -69,8 +69,10 @@ When chunking, **slice to the chunk window and rebase to chunk-local seconds**
  "app_version": "0.1.0"
 }
 ```
-(`mixed_mono_16k.wav` is the one the backend gets; the separate tracks are kept
-locally — the mic track is the user's known identity / VAD source.)
+(On the dual-channel path the backend receives `mic.wav` and `system.wav`
+directly; on the mono fallback it receives `mixed_mono_16k.wav`. The mic track is
+the user's known identity / VAD source. **Note:** the per-file `sha256` fields
+above are part of the intended contract but are **not currently emitted** by the pipeline.)

 ---

@@ -83,15 +85,17 @@ locally — the mic track is the user's known identity / VAD source.)
  endpoints in §4–§5 hang off this base. **Make it a setting** so the host can
  change, and ship a neutral placeholder (`https://your-spark-backend.local`) as
  the default.
- **TLS:** Start9 self-signed Root CA. Either skip verification (`URLSession`
-  delegate trusting the cert; curl `-k`; `rejectUnauthorized:false`) **or** install
-  the Start9 Root CA into the trust store.
+- **TLS:** Start9 self-signed Root CA. Supported path: install the Start9 Root CA
+  into the System keychain so default trust evaluation succeeds. Skipping
+  verification is an **off-by-default** escape hatch (`InsecureTrustDelegate`)
+  that applies only to the configured backend host.
 - **Auth:** **none on the LAN.** No token/key today.
 - **Limits:** **200 MB/request** (`413` over); timeouts ~300 s (transcription),
  ~600 s (diarization). **Send audio requests SEQUENTIALLY** — concurrent audio
  trips a GPU FFT race → `503 + Retry-After`.
- **Transport:** `multipart/form-data`, audio file field name **`file`** (bytes,
-  not base64/path).
+- **Transport:** `multipart/form-data`. Audio file field is **`file`** on the mono
+  path, or **`mic_file`** + **`system_file`** on the dual-channel path (bytes, not
+  base64/path).
 - **All endpoints are synchronous** (no job IDs / polling).
 - **Errors:** JSON `{"detail": "..."}`; `400` malformed, `413` too large, `503 +
  Retry-After` transient (retry after the interval).
@@ -105,11 +109,16 @@ Diarize + name clusters from the visual timeline (majority temporal overlap),
 with voiceprint fallback, optionally transcribed. Synchronous. **Stateless** —
 the app owns the timeline and the voiceprint library.

-**Multipart fields:**
+**Multipart fields.** Two audio shapes are accepted: **mono** (`file`) or
+**dual-channel** (`mic_file` + `system_file`, preferred when the system track is healthy):
 | field | required | notes |
 |---|---|---|
-| `file` | **yes** | mixed-mono WAV (the chunk, when chunking) |
-| `timeline` | **yes** | flat JSON array `[{"start","end","name","confidence"}]`, chunk-local seconds (§1.1) |
+| `file` | **yes** (mono path) | mixed-mono WAV (the chunk, when chunking) |
+| `mic_file` | **yes** (dual path) | the user's mic track (chunk), attributed to `self_name` |
+| `system_file` | **yes** (dual path) | the remote/system track (chunk) |
+| `self_name` | **yes** (dual path) | the user's name; the mic channel is attributed to them |
+| `self_vad` | no | chunk-local windows where the mic is genuinely the user (mic active and louder than system) |
+| `timeline` | **yes** | flat JSON array `[{"start","end","name","confidence"}]`, chunk-local seconds (§1.1); on the dual path it names only the remote speakers |
 | `known_voiceprints` | no | JSON `{"<name>":[192 floats], ...}` from `VoiceprintStore` |
 | `transcribe` | no | `"true"` to also return per-segment text (default false) |
 | `min_overlap` | no | min fraction of a cluster's time overlapping the winning name (default `0.0`) |
@@ -213,3 +222,11 @@ Loaded → `known_voiceprints` on every `label-merge` call. Updated from respons
 `fingerprints` for `visual`/high-confidence `voiceprint` speakers only. Never
 stores `Unknown_N`. Update policy (`02 §2.9`): start = store latest with
 `overlap_confidence ≥ ~0.8`; consider per-name running mean later.
+
+## 8. Recap outputs (`transcript.md`, `recap.{html,json}`)
+After `speakers.json` is assembled, the recap phase renders the human-readable
+deliverables: `transcript.md` (one line per diarized utterance) and `recap.html`,
+an HTML page backed by a structured `recap.json`. The recap's topic/summary
+content is generated by the **backend LLM** (`POST /v1/chat/completions`, Qwen3);
+the app owns the rendering and the in-app **speaker-name editor**, which can rewrite
+names across `speakers.json`, the transcript, and the recap after the fact.
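A client-side sketch can make the two `label-merge` audio shapes concrete. It is written in Python, not the app's Swift, and it is not the app's implementation. Several details are assumptions:

- All function names are hypothetical.
- The label-merge URL is passed in rather than spelled out, because the path is defined in the endpoint sections and is not reproduced here.
- `self_vad` is assumed to be JSON-encoded `[start, end]` pairs, like `timeline`.
- `self_vad_windows` is one illustrative reading of "mic active and louder than system" over per-frame RMS values.
- TLS assumes the Start9 Root CA is trusted by Python's default SSL context.

```python
import json
import time
import urllib.error
import urllib.request
import uuid


def label_merge_fields(timeline, *, mono_wav=None, mic_wav=None, system_wav=None,
                       self_name=None, self_vad=None, transcribe=False):
    """Build (name, filename, content_type, data) parts for one label-merge call."""
    parts = []
    if mic_wav is not None and system_wav is not None:
        # Dual-channel path: the mic channel is attributed to self_name, and the
        # timeline should name only the remote speakers.
        if not self_name:
            raise ValueError("dual-channel path requires self_name")
        parts.append(("mic_file", "mic.wav", "audio/wav", mic_wav))
        parts.append(("system_file", "system.wav", "audio/wav", system_wav))
        parts.append(("self_name", None, None, self_name.encode("utf-8")))
        if self_vad is not None:
            # Assumption: self_vad is JSON-encoded like timeline, as [start, end] pairs.
            parts.append(("self_vad", None, None, json.dumps(self_vad).encode("utf-8")))
    elif mono_wav is not None:
        # Mono fallback: a single mixed-mono chunk in the `file` field.
        parts.append(("file", "mixed_mono_16k.wav", "audio/wav", mono_wav))
    else:
        raise ValueError("need mono_wav, or both mic_wav and system_wav")
    parts.append(("timeline", None, None, json.dumps(timeline).encode("utf-8")))
    if transcribe:
        parts.append(("transcribe", None, None, b"true"))
    return parts


def encode_multipart(parts, boundary=None):
    """Serialize parts as a multipart/form-data body; returns (body, content_type)."""
    boundary = boundary or uuid.uuid4().hex
    out = bytearray()
    for name, filename, content_type, data in parts:
        out += f"--{boundary}\r\n".encode("ascii")
        disposition = f'Content-Disposition: form-data; name="{name}"'
        if filename is not None:
            disposition += f'; filename="{filename}"'
        out += (disposition + "\r\n").encode("utf-8")
        if content_type is not None:
            out += f"Content-Type: {content_type}\r\n".encode("ascii")
        # Raw bytes go straight into the body (not base64, not a path).
        out += b"\r\n" + data + b"\r\n"
    out += f"--{boundary}--\r\n".encode("ascii")
    return bytes(out), f"multipart/form-data; boundary={boundary}"


def self_vad_windows(mic_rms, system_rms, frame_s, threshold):
    """Hypothetical: merge frames where the mic is active and louder than system."""
    n = min(len(mic_rms), len(system_rms))
    windows, start = [], None
    for i in range(n):
        active = mic_rms[i] > threshold and mic_rms[i] > system_rms[i]
        if active and start is None:
            start = i * frame_s
        elif not active and start is not None:
            windows.append([start, i * frame_s])
            start = None
    if start is not None:
        windows.append([start, n * frame_s])  # close a window still open at chunk end
    return windows


def post_label_merge(url, body, content_type, max_attempts=3, timeout=600):
    """POST one chunk. Call strictly one at a time: concurrent audio trips a GPU FFT race."""
    for attempt in range(max_attempts):
        request = urllib.request.Request(
            url, data=body, method="POST", headers={"Content-Type": content_type})
        try:
            with urllib.request.urlopen(request, timeout=timeout) as response:
                return json.loads(response.read())
        except urllib.error.HTTPError as err:
            # Only 503 is transient; 400/413 and the final attempt propagate.
            if err.code != 503 or attempt == max_attempts - 1:
                raise
            # Assumption: Retry-After carries delay-seconds; fall back to 5 s otherwise.
            retry_after = err.headers.get("Retry-After", "")
            time.sleep(int(retry_after) if retry_after.isdigit() else 5)
    raise RuntimeError("max_attempts must be at least 1")
```

Calling `post_label_merge` in a plain loop over chunks keeps audio requests sequential, as the Limits bullet requires. The 503 retry covers only the backend's transient signal; it is not a substitute for avoiding parallel sends.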