ten31-transcripts/docs/03_DATA_CONTRACTS.md
Grant Gilliam dda4322de7 Reconcile docs/ specs with the shipped app
Document the dual-channel label-merge path (mic_file/system_file/self_name/self_vad) and the recap phase (transcript.md + recap.html via the backend LLM) across docs/01-03; correct docs/02 $2.10 to the UI actually shipped; mark docs/01 $7 open items as settled; remove the dead AUDIO_API.md references; note the manifest sha256 fields are not emitted; mark docs/04 as a complete/historical build log. Also drop the last stale "Phase 0" UI string in MenuBarView and retire the now-done doc-debt items in ROADMAP.
2026-06-16 22:09:04 -05:00

Data Contracts — Ten31 Transcripts

Companion to docs 01/02. Defines the files the app produces/stores and the real SparkControl contract (verified against the live backend). The label-merge endpoint is the app's primary integration point.


1. visual_timeline.json (the app's primary output)

Rich, app-native record of who-was-speaking-when, inferred from the screen. Times are seconds relative to session t0 (the shared audio clock).

{
  "schema_version": "1.1",
  "session_id": "2026-06-05T14-03_zoom",
  "app": "zoom",                       // zoom | teams | signal | meet
  "adapter_version": "zoom-0.3.1",
  "t0_unix": 1749135780.123,
  "duration_sec": 2841.7,
  "fps_sampled": 3,
  "self_name": "Grant",                // user's name; segments from mic VAD
  "participants": [
    { "name": "Grant", "is_self": true },
    { "name": "Sarah Jones", "aliases": ["Sarah J"] },
    { "name": "Dmitri" }
  ],
  "segments": [
    { "start": 0.0,  "end": 4.5, "name": "Grant",       "confidence": 0.97, "source": "mic_vad" },
    { "start": 4.8,  "end": 9.3, "name": "Sarah Jones", "confidence": 0.82, "source": "vision" }
  ],
  "visual_gaps": [
    { "start": 120.4, "end": 138.9, "reason": "minimized" }   // reason: minimized | tab_switched
  ]
}
  • Segments may overlap (crosstalk) — do not flatten.
  • source ∈ {vision, accessibility, fused, mic_vad}.
  • The user (the is_self participant, whose segments come from mic VAD) is pre-seeded so the backend names them and enrolls their voiceprint on the first call.

1.1 Mapping to the label-merge timeline field

label-merge wants a flat JSON array of {start, end, name, confidence} (seconds). Build it by taking segments and dropping source:

[{"start":0.0,"end":4.5,"name":"Grant","confidence":0.97},
 {"start":4.8,"end":9.3,"name":"Sarah Jones","confidence":0.82}]

When chunking, slice to the chunk window and rebase to chunk-local seconds (subtract chunk start) before sending.
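The mapping and rebasing described above can be sketched as follows. This is a minimal illustration, not the app's actual code: the function name `timeline_for_chunk` and the rounding to milliseconds are assumptions; only the field names come from this document.

```python
from typing import Dict, List


def timeline_for_chunk(segments: List[Dict], chunk_start: float,
                       chunk_end: float) -> List[Dict]:
    """Flatten visual_timeline segments into a label-merge timeline for one chunk.

    Hypothetical helper: clips each segment to [chunk_start, chunk_end),
    rebases it to chunk-local seconds, and drops the `source` field.
    Overlapping segments (crosstalk) are kept as separate entries.
    """
    timeline = []
    for seg in segments:
        # Clip to the chunk window, still in session seconds.
        start = max(seg["start"], chunk_start)
        end = min(seg["end"], chunk_end)
        if end <= start:
            continue  # segment lies entirely outside this chunk
        timeline.append({
            # Rebase to chunk-local seconds (rounding is an assumption).
            "start": round(start - chunk_start, 3),
            "end": round(end - chunk_start, 3),
            "name": seg["name"],
            "confidence": seg["confidence"],
        })
    return timeline
```

A segment that straddles a chunk boundary appears, clipped, in both chunks' timelines.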

2. manifest.json

{
  "session_id": "2026-06-05T14-03_zoom",
  "app": "zoom",
  "t0_unix": 1749135780.123,
  "duration_sec": 2841.7,
  "audio": {
    "mixed":  { "file": "mixed_mono_16k.wav", "sr": 16000, "channels": 1, "sha256": "..." },
    "mic":    { "file": "mic.wav",    "sr": 16000, "channels": 1, "sha256": "..." },
    "system": { "file": "system.wav", "sr": 16000, "channels": 1, "sha256": "..." }
  },
  "chunking": { "enabled": true, "chunk_sec": 150, "chunks": 19 },
  "visual_timeline": "visual_timeline.json",
  "backend_result": "speakers.json",
  "adapter_versions": { "zoom": "0.3.1" },
  "app_version": "0.1.0"
}

(On the dual-channel path the backend gets mic.wav + system.wav directly; on the mono fallback it gets mixed_mono_16k.wav. The mic track is the user's known identity / VAD source. Note: the per-file sha256 fields above are part of the intended contract but are not currently emitted by the pipeline.)
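The `chunking` values in the manifest are consistent with fixed-length windows: 2841.7 s at 150 s per chunk gives 19 chunks, the last one shorter. A minimal sketch, assuming non-overlapping windows (the manifest does not say whether chunks overlap) and a hypothetical helper name `chunk_windows`:

```python
import math
from typing import List, Tuple


def chunk_windows(duration_sec: float, chunk_sec: float) -> List[Tuple[float, float]]:
    """Return (start, end) windows in session seconds.

    Assumption: chunks are fixed-length and do not overlap; the final
    chunk is truncated to the session duration.
    """
    count = math.ceil(duration_sec / chunk_sec)
    return [
        (float(i * chunk_sec), min(float((i + 1) * chunk_sec), duration_sec))
        for i in range(count)
    ]
```

Each window's start is the offset to subtract when rebasing the timeline and to add back when assembling the final output.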


3. SparkControl — connection (real)

  • Base URL (confirmed): a private LAN host — a .local mDNS name (preferred over a raw IP; it survives IP changes) — configured in Settings or via the SPARK_BACKEND_URL env var, and never committed. Service-discovery JSON is at GET /api/endpoints (returns current vLLM / Parakeet / Kokoro URLs). All audio endpoints in §4–§5 hang off this base. Make it a setting so the host can change, and ship a neutral placeholder (https://your-spark-backend.local) as the default.
  • TLS: Start9 self-signed Root CA. Supported path: install the Start9 Root CA into the System keychain (default trust then succeeds). Skip-verification is an off-by-default, host-scoped escape hatch (InsecureTrustDelegate, scoped to the configured backend host), not the default.
  • Auth: none on the LAN. No token/key today.
  • Limits: 200 MB/request (413 over); timeouts ~300 s (transcription), ~600 s (diarization). Send audio requests SEQUENTIALLY — concurrent audio trips a GPU FFT race → 503 + Retry-After.
  • Transport: multipart/form-data. Audio file field is file on the mono path, or mic_file + system_file on the dual-channel path (bytes, not base64/path).
  • All endpoints are synchronous (no job IDs / polling).
  • Errors: JSON {"detail": "..."}; 400 malformed, 413 too large, 503 + Retry-After transient (retry after the interval).
  • Health/discovery: GET /api/status, GET /api/endpoints, GET /v1/models.
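The 503 + Retry-After rule above can be sketched as a small retry wrapper. This is an illustration under stated assumptions: `send` is a hypothetical callable that performs one request and returns `(status, headers, body)`, the Retry-After value is assumed to be a number of seconds (the HTTP-date form is not handled), and header keys are assumed to be already normalised by the caller.

```python
import time
from typing import Callable, Dict, Tuple

Response = Tuple[int, Dict[str, str], bytes]


def send_with_retry(send: Callable[[], Response], max_attempts: int = 5,
                    default_wait: float = 5.0,
                    sleep: Callable[[float], None] = time.sleep) -> Response:
    """Call `send` until it returns something other than 503, or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    for attempt in range(1, max_attempts + 1):
        status, headers, body = send()
        if status != 503 or attempt == max_attempts:
            return status, headers, body
        # Assumption: Retry-After carries delta-seconds; fall back otherwise.
        try:
            wait = float(headers.get("Retry-After", default_wait))
        except ValueError:
            wait = default_wait
        sleep(wait)
    raise AssertionError("unreachable")
```

Calling this from a plain `for` loop over chunks, one request at a time, also satisfies the sequential-send requirement.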

4. ★ POST /api/audio/label-merge — the app's primary call

Diarize + name clusters from the visual timeline (majority temporal overlap), with voiceprint fallback, optionally transcribed. Synchronous. Stateless — the app owns the timeline and the voiceprint library.

Multipart fields — two audio shapes: mono (file) or dual-channel (mic_file + system_file, preferred when the system track is healthy):

| field | required | notes |
|---|---|---|
| file | mono path | mixed-mono WAV (the chunk, when chunking) |
| mic_file | dual path | the user's mic track (chunk), attributed to self_name |
| system_file | dual path | the remote/system track (chunk) |
| self_name | dual path | the user's name; the mic channel is attributed to them |
| self_vad | no | chunk-local windows where the mic is genuinely the user (active + louder than system) |
| timeline | yes | flat JSON array [{"start","end","name","confidence"}], chunk-local seconds (§1.1); on the dual path it names only the remote speakers |
| known_voiceprints | no | JSON {"<name>":[192 floats], ...} from VoiceprintStore |
| transcribe | no | "true" to also return per-segment text (default false) |
| min_overlap | no | min fraction of a cluster's time overlapping the winning name (default 0.0) |
| voiceprint_threshold | no | cosine sim to accept a voiceprint match (default 0.5) |

curl -k -X POST https://<host>/api/audio/label-merge \
  -F "file=@chunk_003.wav" \
  -F 'timeline=[{"start":0,"end":4.5,"name":"Grant","confidence":0.97},{"start":4.8,"end":9.3,"name":"Sarah Jones","confidence":0.82}]' \
  -F 'known_voiceprints={"Grant":[/*192*/],"Sarah Jones":[/*192*/]}' \
  -F "transcribe=true"
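For a client that is not curl, the non-file form fields can be assembled as plain strings before handing them to whatever multipart encoder the client uses. The sketch below is illustrative only: `label_merge_fields` is a hypothetical helper, and `self_vad` is deliberately left out because its wire encoding is not specified in this document.

```python
import json
from typing import Dict, List, Optional


def label_merge_fields(timeline: List[Dict],
                       known_voiceprints: Optional[Dict[str, List[float]]] = None,
                       transcribe: bool = False,
                       self_name: Optional[str] = None) -> Dict[str, str]:
    """Build the text (non-audio) multipart fields for label-merge."""
    fields = {"timeline": json.dumps(timeline)}  # required on both paths
    if known_voiceprints:
        fields["known_voiceprints"] = json.dumps(known_voiceprints)
    if transcribe:
        fields["transcribe"] = "true"  # backend default is false
    if self_name is not None:
        fields["self_name"] = self_name  # dual-channel path only
    return fields
```

The audio parts (`file`, or `mic_file` plus `system_file`) are attached separately as raw bytes.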

Response (transcribe=true):

{
  "duration": 9.259,
  "speakers": [
    { "cluster": "Speaker_0", "name": "Grant",       "source": "visual",
      "overlap_confidence": 0.9866, "fingerprint": [/*192 floats*/] },
    { "cluster": "Speaker_1", "name": "Sarah Jones", "source": "voiceprint",
      "match_similarity": 0.71,     "fingerprint": [/*192 floats*/] }
  ],
  "segments": [
    { "start_ms": 0,    "end_ms": 4480, "speaker": "Grant",
      "text": "Good morning everyone. I think the energy thesis is strong this quarter." },
    { "start_ms": 4800, "end_ms": 9040, "speaker": "Sarah Jones",
      "text": "I agree, but I am worried about the lockup terms and the fee load this time." }
  ],
  "fingerprints": { "Grant": [/*192*/], "Sarah Jones": [/*192*/] },
  "models": { "diarization": "nvidia/diar_sortformer_4spk-v1",
              "embedding": "nvidia/speakerverification_en_titanet_large" }
}
  • transcribe=false → segments are {start_s, end_s, speaker, confidence} (no text).
  • Resolution order per cluster: visual → voiceprint (with match_similarity) → Unknown_N (source:"unmatched", never mislabeled).
  • Persist fingerprints keyed by name into VoiceprintStore (skip Unknown_N).
  • Diarizer caps at 4 speakers/chunk and takes no num_speakers hint — chunk for larger/longer calls; names + voiceprints unify speakers across chunks.

5. Other SparkControl endpoints (reference / fallback)

5.1 POST /v1/audio/transcriptions (OpenAI-compatible, sync)

Plain STT. Fields: file (req), model (default parakeet-tdt-0.6b-v3), response_format (json | verbose_json | text), language, temperature, prompt. verbose_json returns word- and segment-level timestamps + full text. Use only if you want a transcript without speaker labels.

5.2 POST /api/audio/diarize-chunk (sync)

Field: file. Returns anonymous clusters {start_s, end_s, speaker, confidence}, speakers_detected, and a 192-dim fingerprints map per local speaker. Use if you ever want to do the merge client-side instead of via label-merge (not the default path).

5.3 POST /api/audio/transcribe-with-speakers (sync)

Field: file. ASR + diarization merged into anonymous speaker-attributed blocks {start_ms, end_ms, speaker, text}. (Anonymous — no naming. label-merge is the named version.)


6. speakers.json — final stored output (target end state)

Per-chunk label-merge results concatenated, timestamps offset back to session (global) seconds, names unified across chunks. This is the hand-off to the user's downstream summarizer; the app stops here.

{
  "session_id": "2026-06-05T14-03_zoom",
  "app": "zoom",
  "duration_sec": 2841.7,
  "speakers": [
    { "name": "Grant",       "source": "visual",     "overlap_confidence": 0.99 },
    { "name": "Sarah Jones", "source": "voiceprint", "match_similarity": 0.71 },
    { "name": "Unknown_0",   "source": "unmatched" }
  ],
  "segments": [
    { "start": 0.0,   "end": 4.48,  "speaker": "Grant",       "text": "..." },
    { "start": 4.80,  "end": 9.04,  "speaker": "Sarah Jones", "text": "..." },
    { "start": 152.3, "end": 158.1, "speaker": "Unknown_0",   "text": "..." }
  ],
  "models": { "diarization": "nvidia/diar_sortformer_4spk-v1",
              "embedding": "nvidia/speakerverification_en_titanet_large",
              "transcription": "parakeet-tdt-0.6b-v3" }
}

(Convert backend start_ms/end_ms → seconds; add the chunk offset.)
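The conversion can be sketched as below. The helper name `to_session_segments` and the millisecond rounding are assumptions; the function accepts both segment shapes described in §4 (`start_ms`/`end_ms` with `transcribe=true`, `start_s`/`end_s` without). Unifying names across chunks is outside the scope of this sketch.

```python
from typing import Dict, List


def to_session_segments(chunk_response: Dict, chunk_offset_sec: float) -> List[Dict]:
    """Convert one chunk's label-merge segments to session (global) seconds."""
    out = []
    for seg in chunk_response["segments"]:
        if "start_ms" in seg:
            # transcribe=true shape: milliseconds, chunk-local
            start, end = seg["start_ms"] / 1000.0, seg["end_ms"] / 1000.0
        else:
            # transcribe=false shape: seconds, chunk-local
            start, end = seg["start_s"], seg["end_s"]
        item = {
            "start": round(start + chunk_offset_sec, 3),
            "end": round(end + chunk_offset_sec, 3),
            "speaker": seg["speaker"],
        }
        if "text" in seg:
            item["text"] = seg["text"]
        out.append(item)
    return out
```

For example, a segment at 2300-8100 ms in a chunk that starts at 150 s lands at 152.3-158.1 s in speakers.json.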

7. voiceprints.json — local voiceprint library

{
  "Grant":       { "vector": [/*192 floats*/], "updated": "2026-06-05T14:51Z", "calls": 12 },
  "Sarah Jones": { "vector": [/*192 floats*/], "updated": "2026-06-02T09:10Z", "calls": 3 }
}

Loaded → known_voiceprints on every label-merge call. Updated from response fingerprints for visual/high-confidence voiceprint speakers only. Never stores Unknown_N. Initial update policy (02 §2.9): store the latest fingerprint when overlap_confidence ≥ ~0.8; a per-name running mean may be considered later.
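A minimal sketch of the load and update steps, under stated assumptions: the helper names are hypothetical, only `visual` speakers are stored (the threshold for voiceprint-sourced updates is not given here), and incrementing `calls` on each update is a guess at that field's meaning.

```python
from typing import Dict


def known_voiceprints(store: Dict[str, Dict]) -> Dict[str, list]:
    """Flatten voiceprints.json into the {"<name>": [floats]} form label-merge expects."""
    return {name: entry["vector"] for name, entry in store.items()}


def update_store(store: Dict[str, Dict], response: Dict, now_iso: str,
                 min_overlap_confidence: float = 0.8) -> Dict[str, Dict]:
    """Apply the store-latest policy to one label-merge response (mutates store)."""
    for spk in response["speakers"]:
        name = spk["name"]
        if name.startswith("Unknown_"):
            continue  # never store unmatched clusters
        if spk.get("source") != "visual":
            continue  # assumption: only visually named speakers update the store
        if spk.get("overlap_confidence", 0.0) < min_overlap_confidence:
            continue
        previous_calls = store.get(name, {}).get("calls", 0)
        store[name] = {
            "vector": spk["fingerprint"],
            "updated": now_iso,
            "calls": previous_calls + 1,  # assumption about what `calls` counts
        }
    return store
```

If `calls` is meant to count whole sessions, the update would run once per session rather than once per chunk.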

8. Recap outputs (transcript.md, recap.{html,json})

After speakers.json is assembled, the recap phase renders the human-readable deliverables: a transcript.md (one line per diarized utterance) and an HTML recap.html, backed by a structured recap.json. The recap's topic/summary content is generated by the backend LLM (POST /v1/chat/completions, Qwen3); the app owns the rendering and the in-app speaker-name editor, which can rewrite names across speakers.json, the transcript, and the recap after the fact.
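The after-the-fact rename can be illustrated on the speakers.json structure from §6. This is a sketch, not the editor's actual logic: `rename_speaker` is a hypothetical helper, it returns a modified copy, and it does not handle merging into a name that already exists or regenerating transcript.md and the recap.

```python
import copy
from typing import Dict


def rename_speaker(speakers_doc: Dict, old: str, new: str) -> Dict:
    """Return a copy of a speakers.json document with one speaker renamed."""
    doc = copy.deepcopy(speakers_doc)  # leave the caller's document untouched
    for spk in doc.get("speakers", []):
        if spk["name"] == old:
            spk["name"] = new
    for seg in doc.get("segments", []):
        if seg["speaker"] == old:
            seg["speaker"] = new
    return doc
```

Re-rendering the transcript and recap from the renamed document would keep all three outputs consistent.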