
Data Contracts — Ten31 Transcripts

Companion to docs 01/02. Defines the files the app produces/stores and the real SparkControl contract (source of truth: AUDIO_API.md). The label-merge endpoint is the app's primary integration point.


1. visual_timeline.json (the app's primary output)

Rich, app-native record of who-was-speaking-when, inferred from the screen. Times are seconds relative to session t0 (the shared audio clock).

{
  "schema_version": "1.1",
  "session_id": "2026-06-05T14-03_zoom",
  "app": "zoom",                       // zoom | teams | signal | meet
  "adapter_version": "zoom-0.3.1",
  "t0_unix": 1749135780.123,
  "duration_sec": 2841.7,
  "fps_sampled": 3,
  "self_name": "Grant",                // user's name; segments from mic VAD
  "participants": [
    { "name": "Grant", "is_self": true },
    { "name": "Sarah Jones", "aliases": ["Sarah J"] },
    { "name": "Dmitri" }
  ],
  "segments": [
    { "start": 0.0,  "end": 4.5, "name": "Grant",       "confidence": 0.97, "source": "mic_vad" },
    { "start": 4.8,  "end": 9.3, "name": "Sarah Jones", "confidence": 0.82, "source": "vision" }
  ],
  "visual_gaps": [
    { "start": 120.4, "end": 138.9, "reason": "minimized" }   // reason: minimized | tab_switched
  ]
}
  • Segments may overlap (crosstalk) — do not flatten.
  • source ∈ {vision, accessibility, fused, mic_vad}.
  • The user's own segments (from mic VAD, flagged is_self in participants) are pre-seeded, so the backend can name the user and enroll their voiceprint from the very first call.
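
As an illustration of the rules above, here is a minimal Python sketch that checks segments against the allowed source values and reports overlapping (crosstalk) segments without merging them. The function names validate_segments and crosstalk_pairs are hypothetical and not part of the app.

```python
from itertools import combinations

# Allowed values for a segment's "source", as listed in this section.
ALLOWED_SOURCES = {"vision", "accessibility", "fused", "mic_vad"}


def validate_segments(segments):
    """Return a list of problems; an empty list means the segments look well-formed."""
    problems = []
    for i, seg in enumerate(segments):
        if seg["end"] <= seg["start"]:
            problems.append(f"segment {i}: end <= start")
        if seg["source"] not in ALLOWED_SOURCES:
            problems.append(f"segment {i}: unknown source {seg['source']!r}")
    return problems


def crosstalk_pairs(segments):
    """Return index pairs of segments whose time ranges overlap.

    Overlaps are only reported; the segments themselves are left untouched.
    """
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(segments), 2)
        if a["start"] < b["end"] and b["start"] < a["end"]
    ]
```

A consumer could log the crosstalk pairs for debugging while still passing every segment through unchanged.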

1.1 Mapping to the label-merge timeline field

label-merge wants a flat JSON array of {start, end, name, confidence} (seconds). Build it by taking segments and dropping source:

[{"start":0.0,"end":4.5,"name":"Grant","confidence":0.97},
 {"start":4.8,"end":9.3,"name":"Sarah Jones","confidence":0.82}]

When chunking, slice to the chunk window and rebase to chunk-local seconds (subtract chunk start) before sending.
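
The mapping and rebasing described above could be sketched as follows. This is a Python illustration rather than the app's implementation: chunk_timeline is a hypothetical helper, and clipping segments that straddle a chunk boundary is an assumption about what "slice" means here.

```python
def chunk_timeline(segments, chunk_start, chunk_end):
    """Build the flat label-merge timeline for one chunk window.

    Drops "source", clips each segment to [chunk_start, chunk_end] and rebases
    times to chunk-local seconds. Clipping at the boundary is an assumption.
    """
    timeline = []
    for seg in segments:
        start = max(seg["start"], chunk_start)
        end = min(seg["end"], chunk_end)
        if end <= start:
            continue  # segment lies entirely outside this chunk
        timeline.append({
            "start": round(start - chunk_start, 3),
            "end": round(end - chunk_start, 3),
            "name": seg["name"],
            "confidence": seg["confidence"],
        })
    return timeline
```

For an unchunked session, calling it with chunk_start=0 and chunk_end=duration_sec yields the flat array shown above.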

2. manifest.json

{
  "session_id": "2026-06-05T14-03_zoom",
  "app": "zoom",
  "t0_unix": 1749135780.123,
  "duration_sec": 2841.7,
  "audio": {
    "mixed":  { "file": "mixed_mono_16k.wav", "sr": 16000, "channels": 1, "sha256": "..." },
    "mic":    { "file": "mic.wav",    "sr": 16000, "channels": 1, "sha256": "..." },
    "system": { "file": "system.wav", "sr": 16000, "channels": 1, "sha256": "..." }
  },
  "chunking": { "enabled": true, "chunk_sec": 150, "chunks": 19 },
  "visual_timeline": "visual_timeline.json",
  "backend_result": "speakers.json",
  "adapter_versions": { "zoom": "0.3.1" },
  "app_version": "0.1.0"
}

(mixed_mono_16k.wav is the one the backend gets; the separate tracks are kept locally — the mic track is the user's known identity / VAD source.)
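
To show how the chunking and sha256 fields relate to the session, here is a small Python sketch. It assumes fixed-size, non-overlapping chunk windows (the manifest does not say whether chunks overlap); chunk_windows and sha256_of_file are hypothetical helper names.

```python
import hashlib
import math


def chunk_windows(duration_sec, chunk_sec):
    """Return (start, end) windows in session seconds.

    Assumes fixed-size, non-overlapping chunks; the last one may be shorter.
    """
    count = math.ceil(duration_sec / chunk_sec)
    return [
        (i * chunk_sec, min((i + 1) * chunk_sec, duration_sec))
        for i in range(count)
    ]


def sha256_of_file(path, block_size=1 << 20):
    """Hex SHA-256 of a file, read in blocks so large WAVs are not loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()
```

With the example values (duration_sec 2841.7, chunk_sec 150) this produces 19 windows, matching "chunks": 19, with the final window ending at 2841.7.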


3. SparkControl — connection (real)

  • Base URL (confirmed): a private LAN host, addressed by its .local mDNS name rather than a raw IP so that it survives IP changes. It is configured in Settings or via the SPARK_BACKEND_URL env var and is never committed; the shipped default is a neutral placeholder (https://your-spark-backend.local). Service-discovery JSON is at GET /api/endpoints (returns the current vLLM / Parakeet / Kokoro URLs). All audio endpoints in §4–§5 hang off this base.
  • TLS: Start9 self-signed Root CA. Either skip verification (URLSession delegate trusting the cert; curl -k; rejectUnauthorized:false) or install the Start9 Root CA into the trust store.
  • Auth: none on the LAN. No token/key today.
  • Limits: 200 MB/request (413 over); timeouts ~300 s (transcription), ~600 s (diarization). Send audio requests SEQUENTIALLY — concurrent audio trips a GPU FFT race → 503 + Retry-After.
  • Transport: multipart/form-data, audio file field name file (bytes, not base64/path).
  • All endpoints are synchronous (no job IDs / polling).
  • Errors: JSON {"detail": "..."}; 400 malformed, 413 too large, 503 + Retry-After transient (retry after the interval).
  • Health/discovery: GET /api/status, GET /api/endpoints, GET /v1/models.
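
The sequential-send and 503 handling rules above might look like the following Python sketch. It is transport-agnostic: send is a hypothetical zero-argument callable that performs one HTTP request and returns (status_code, headers_dict, body). Treating Retry-After as a number of seconds and the cap of five attempts are assumptions.

```python
import time


def send_with_retry(send, max_attempts=5, sleep=time.sleep):
    """Call send() until it returns something other than 503, or attempts run out.

    `send` is a hypothetical callable returning (status_code, headers_dict, body).
    """
    for attempt in range(1, max_attempts + 1):
        status, headers, body = send()
        if status != 503 or attempt == max_attempts:
            return status, headers, body
        # Assumption: Retry-After is given in seconds; fall back to 1 s otherwise.
        try:
            delay = float(headers.get("Retry-After", 1))
        except ValueError:
            delay = 1.0
        sleep(delay)


def run_sequentially(requests, **kwargs):
    """Send audio requests one at a time, never concurrently."""
    return [send_with_retry(send, **kwargs) for send in requests]
```

Passing a custom sleep makes the retry path easy to exercise in tests without waiting.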

4. ★ POST /api/audio/label-merge — the app's primary call

Diarize + name clusters from the visual timeline (majority temporal overlap), with voiceprint fallback, optionally transcribed. Synchronous. Stateless — the app owns the timeline and the voiceprint library.

Multipart fields:

| field | required | notes |
|---|---|---|
| file | yes | mixed-mono WAV (the chunk, when chunking) |
| timeline | yes | flat JSON array [{"start","end","name","confidence"}], chunk-local seconds (§1.1) |
| known_voiceprints | no | JSON {"<name>":[192 floats], ...} from VoiceprintStore |
| transcribe | no | "true" to also return per-segment text (default false) |
| min_overlap | no | min fraction of a cluster's time overlapping the winning name (default 0.0) |
| voiceprint_threshold | no | cosine sim to accept a voiceprint match (default 0.5) |

curl -k -X POST https://<host>/api/audio/label-merge \
  -F "file=@chunk_003.wav" \
  -F 'timeline=[{"start":0,"end":4.5,"name":"Grant","confidence":0.97},{"start":4.8,"end":9.3,"name":"Sarah Jones","confidence":0.82}]' \
  -F 'known_voiceprints={"Grant":[/*192*/],"Sarah Jones":[/*192*/]}' \
  -F "transcribe=true"
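
For clients that are not shelling out to curl, the non-file form fields can be assembled as in this Python sketch. label_merge_fields is a hypothetical helper; it only builds the text fields and leaves attaching the file part to whatever HTTP client is in use. Omitting optional fields so the backend applies its defaults is an assumption.

```python
import json

VOICEPRINT_DIM = 192  # vector length used throughout this document


def label_merge_fields(timeline, known_voiceprints=None, transcribe=False,
                       min_overlap=None, voiceprint_threshold=None):
    """Return the text form fields for label-merge; the "file" part is added separately."""
    fields = {"timeline": json.dumps(timeline)}
    if known_voiceprints:
        for name, vector in known_voiceprints.items():
            if len(vector) != VOICEPRINT_DIM:
                raise ValueError(f"voiceprint for {name!r} is not {VOICEPRINT_DIM}-dim")
        fields["known_voiceprints"] = json.dumps(known_voiceprints)
    if transcribe:
        fields["transcribe"] = "true"
    # Optional tuning fields are omitted unless set, so backend defaults apply.
    if min_overlap is not None:
        fields["min_overlap"] = str(min_overlap)
    if voiceprint_threshold is not None:
        fields["voiceprint_threshold"] = str(voiceprint_threshold)
    return fields
```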

Response (transcribe=true):

{
  "duration": 9.259,
  "speakers": [
    { "cluster": "Speaker_0", "name": "Grant",       "source": "visual",
      "overlap_confidence": 0.9866, "fingerprint": [/*192 floats*/] },
    { "cluster": "Speaker_1", "name": "Sarah Jones", "source": "voiceprint",
      "match_similarity": 0.71,     "fingerprint": [/*192 floats*/] }
  ],
  "segments": [
    { "start_ms": 0,    "end_ms": 4480, "speaker": "Grant",
      "text": "Good morning everyone. I think the energy thesis is strong this quarter." },
    { "start_ms": 4800, "end_ms": 9040, "speaker": "Sarah Jones",
      "text": "I agree, but I am worried about the lockup terms and the fee load this time." }
  ],
  "fingerprints": { "Grant": [/*192*/], "Sarah Jones": [/*192*/] },
  "models": { "diarization": "nvidia/diar_sortformer_4spk-v1",
              "embedding": "nvidia/speakerverification_en_titanet_large" }
}
  • transcribe=false → segments are {start_s, end_s, speaker, confidence} (no text).
  • Resolution order per cluster: visual → voiceprint (with match_similarity) → Unknown_N (source:"unmatched", never mislabeled).
  • Persist fingerprints keyed by name into VoiceprintStore (skip Unknown_N).
  • Diarizer caps at 4 speakers/chunk and takes no num_speakers hint — chunk for larger/longer calls; names + voiceprints unify speakers across chunks.
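
Because the segment shape differs between transcribe=true (start_ms/end_ms plus text) and transcribe=false (start_s/end_s plus confidence), a client may want to normalize both to seconds before further processing. The following Python sketch assumes exactly the two shapes shown above; segments_in_seconds is a hypothetical name.

```python
def segments_in_seconds(response):
    """Normalize label-merge segments to {"start", "end", "speaker", ...} in seconds.

    Handles the transcribe=true shape (start_ms/end_ms, text) and the
    transcribe=false shape (start_s/end_s, confidence).
    """
    normalized = []
    for seg in response["segments"]:
        if "start_ms" in seg:
            start, end = seg["start_ms"] / 1000.0, seg["end_ms"] / 1000.0
        else:
            start, end = seg["start_s"], seg["end_s"]
        item = {"start": start, "end": end, "speaker": seg["speaker"]}
        # Carry over whichever optional fields this response shape provides.
        for key in ("text", "confidence"):
            if key in seg:
                item[key] = seg[key]
        normalized.append(item)
    return normalized
```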

5. Other SparkControl endpoints (reference / fallback)

5.1 POST /v1/audio/transcriptions (OpenAI-compatible, sync)

Plain STT. Fields: file (req), model (default parakeet-tdt-0.6b-v3), response_format (json | verbose_json | text), language, temperature, prompt. verbose_json returns word- and segment-level timestamps plus the full text. Use this only if you want a transcript without speaker labels.

5.2 POST /api/audio/diarize-chunk (sync)

Field: file. Returns anonymous clusters {start_s, end_s, speaker, confidence}, speakers_detected, and a 192-dim fingerprints map per local speaker. Use if you ever want to do the merge client-side instead of via label-merge (not the default path).

5.3 POST /api/audio/transcribe-with-speakers (sync)

Field: file. ASR + diarization merged into anonymous speaker-attributed blocks {start_ms, end_ms, speaker, text}. (Anonymous — no naming. label-merge is the named version.)


6. speakers.json — final stored output (target end state)

Per-chunk label-merge results concatenated, timestamps offset back to session (global) seconds, names unified across chunks. This is the hand-off to the user's downstream summarizer; the app stops here.

{
  "session_id": "2026-06-05T14-03_zoom",
  "app": "zoom",
  "duration_sec": 2841.7,
  "speakers": [
    { "name": "Grant",       "source": "visual",     "overlap_confidence": 0.99 },
    { "name": "Sarah Jones", "source": "voiceprint", "match_similarity": 0.71 },
    { "name": "Unknown_0",   "source": "unmatched" }
  ],
  "segments": [
    { "start": 0.0,   "end": 4.48,  "speaker": "Grant",       "text": "..." },
    { "start": 4.80,  "end": 9.04,  "speaker": "Sarah Jones", "text": "..." },
    { "start": 152.3, "end": 158.1, "speaker": "Unknown_0",   "text": "..." }
  ],
  "models": { "diarization": "nvidia/diar_sortformer_4spk-v1",
              "embedding": "nvidia/speakerverification_en_titanet_large",
              "transcription": "parakeet-tdt-0.6b-v3" }
}

(Convert backend start_ms/end_ms → seconds; add the chunk offset.)
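
A minimal Python sketch of that merge step, assuming transcribe=true responses and a known session-time start for each chunk. merge_chunks is a hypothetical helper. It keeps the first speaker entry seen per name and leaves Unknown_N labels as the backend returned them; reconciling unknowns across chunks is not covered here.

```python
def merge_chunks(chunk_results):
    """Merge per-chunk label-merge responses into session-level speakers and segments.

    chunk_results: list of (chunk_start_sec, response) pairs in session order.
    """
    speakers = {}
    segments = []
    for chunk_start, response in chunk_results:
        for spk in response["speakers"]:
            # Drop chunk-local fields; keep the first entry seen for each name.
            entry = {k: v for k, v in spk.items() if k not in ("cluster", "fingerprint")}
            speakers.setdefault(spk["name"], entry)
        for seg in response["segments"]:
            segments.append({
                # ms -> seconds, then shift by the chunk's offset in the session.
                "start": round(chunk_start + seg["start_ms"] / 1000.0, 3),
                "end": round(chunk_start + seg["end_ms"] / 1000.0, 3),
                "speaker": seg["speaker"],
                "text": seg.get("text", ""),
            })
    segments.sort(key=lambda s: s["start"])
    return {"speakers": list(speakers.values()), "segments": segments}
```

The result supplies the speakers and segments keys; session_id, app, duration_sec and models would be filled in from the manifest and the backend responses.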

7. voiceprints.json — local voiceprint library

{
  "Grant":       { "vector": [/*192 floats*/], "updated": "2026-06-05T14:51Z", "calls": 12 },
  "Sarah Jones": { "vector": [/*192 floats*/], "updated": "2026-06-02T09:10Z", "calls": 3 }
}

The library is loaded and sent as known_voiceprints on every label-merge call. It is updated from response fingerprints only for speakers resolved visually or by a high-confidence voiceprint match, and it never stores Unknown_N. Update policy (02 §2.9): to start, store the latest fingerprint when overlap_confidence ≥ ~0.8; consider a per-name running mean later.
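
The "store latest" policy could be sketched in Python as below. update_voiceprints is a hypothetical helper; the 0.8 cutoff for voiceprint matches (min_match_sim) is an assumption, since only the overlap_confidence threshold is stated here. The calls counter is bumped once per invocation, so this sketch assumes it runs once per session rather than once per chunk.

```python
from datetime import datetime, timezone


def update_voiceprints(store, response, min_overlap_conf=0.8, min_match_sim=0.8):
    """Apply the store-latest policy to an in-memory voiceprints.json dict.

    min_match_sim is an assumed cutoff for "high-confidence" voiceprint matches.
    """
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%MZ")
    for spk in response["speakers"]:
        if spk["source"] == "visual":
            accepted = spk.get("overlap_confidence", 0.0) >= min_overlap_conf
        elif spk["source"] == "voiceprint":
            accepted = spk.get("match_similarity", 0.0) >= min_match_sim
        else:
            accepted = False  # "unmatched" (Unknown_N) is never stored
        if not accepted:
            continue
        name = spk["name"]
        calls = store.get(name, {}).get("calls", 0) + 1
        store[name] = {"vector": spk["fingerprint"], "updated": now, "calls": calls}
    return store
```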