Keysat 8d839e3714 v0.13.0:4 - redaction gateway, embeddings proxy, expanded audio API
- Add redaction gateway (redaction_gateway.py, redaction/ scrub + tests)
- Add embeddings proxy and spark_embed service (Dockerfile + main.py)
- Expand audio_proxy with speaker-aware handling; deep_health/health/server updates
- Package: configureSparks action + sparkConfig model updates, manifest/main wiring
- Docs: AUDIO_API, EMBEDDINGS, REDACTION_GATEWAY; HANDOFF and runbook/known-issues refresh
2026-06-11 17:45:57 -05:00


Spark Control — Audio API reference (transcription + diarization)

For external clients (e.g. the Ten31 Transcripts capture app) integrating with the transcription and diarization endpoints. All examples are real responses from the live deployment.


1. Connection / auth

  • Base URL: https://<spark-control-host> (the operator's Start9 LAN address, e.g. https://<spark-control-host>:62419). A .local form also exists (survives IP changes); the operator can provide it.
  • TLS: Start9's self-signed Root CA. On the LAN, set verify=False / rejectUnauthorized:false (curl -k), or install the Start9 Root CA into your trust store. Same story as every other Spark Control endpoint.
  • Auth: none on the LAN today — the endpoints sit behind StartOS access control + TLS on a trusted network. No bearer token / API key. (If you need per-client auth later, the operator can add it; it's not there now.)
  • Limits:
    • Max upload: 200 MB per request (413 if exceeded). For long calls, chunk (see §4).
    • Request timeout: transcription ~300 s, diarization ~600 s per request.
    • Send requests sequentially, not in parallel. Concurrent audio requests can trip a GPU FFT race on the backend (returns 503 + Retry-After). One in flight at a time is the safe pattern.
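The sequential, retry-on-503 pattern above can be sketched roughly as follows. This is a minimal illustration, not the operator's client code: `send_with_retry`, the `send` callable, and the `(status, headers, body)` tuple shape are hypothetical names chosen for the example, and Retry-After is assumed to carry a number of seconds.

```python
import time
from typing import Callable, Mapping, Tuple

# Hypothetical shape of one HTTP exchange: (status, headers, body).
Response = Tuple[int, Mapping[str, str], bytes]


def send_with_retry(send: Callable[[], Response],
                    max_attempts: int = 5,
                    default_wait_s: float = 5.0,
                    sleep: Callable[[float], None] = time.sleep) -> Response:
    """Call send() until it stops returning 503, honoring Retry-After."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        status, headers, body = send()
        if status != 503 or attempt == max_attempts:
            break
        # Assumption: Retry-After is a number of seconds. Anything else
        # (for example an HTTP-date) falls back to default_wait_s.
        try:
            wait_s = float(headers.get("Retry-After", ""))
        except ValueError:
            wait_s = default_wait_s
        sleep(max(wait_s, 0.0))
    return status, headers, body
```

Calling `send_with_retry` once per file inside an ordinary `for` loop keeps exactly one audio request in flight at a time.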

2. Audio format

  • The backend (NVIDIA Parakeet / NeMo) decodes common formats (WAV, FLAC, MP3, m4a) and internally resamples to 16 kHz mono. So 16 kHz mono WAV is ideal; anything decodable also works (stereo is downmixed).
  • Single mixed-mono file is what the endpoints expect. Diarization (Sortformer) separates speakers from one mixed stream, so for diarization you want everyone in one file — mix your system-audio track + mic track to one mono WAV before sending.
  • Your two-track capture is an asset: your mic track is, by definition, you — a known identity. Two clean options:
    1. Mix both tracks → diarize the mix (simplest; pairs perfectly with your visual-timeline name-merge — see §4 note).
    2. Diarize only the system track (the other participants) and label your own mic track as the user directly (no diarization needed for your own voice).
  • Upload mechanism: multipart/form-data, file field name file (OpenAI-compatible). Not base64, not a path/URL; send the bytes.
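For option 1 (mixing the mic and system tracks to one mono WAV), a minimal standard-library sketch might look like this. It assumes both inputs are already mono, 16-bit PCM, and at the same sample rate; `mix_pcm16` and `mix_wav_files` are hypothetical helper names, and averaging is just one simple way to mix without clipping.

```python
import struct
import wave


def mix_pcm16(a: bytes, b: bytes) -> bytes:
    """Average two little-endian 16-bit mono PCM buffers sample by sample."""
    xs = struct.unpack(f"<{len(a) // 2}h", a[: len(a) // 2 * 2])
    ys = struct.unpack(f"<{len(b) // 2}h", b[: len(b) // 2 * 2])
    n = max(len(xs), len(ys))
    xs += (0,) * (n - len(xs))  # pad the shorter track with silence
    ys += (0,) * (n - len(ys))
    # Averaging (rather than summing) keeps every sample inside int16 range.
    mixed = [(x + y) // 2 for x, y in zip(xs, ys)]
    return struct.pack(f"<{n}h", *mixed)


def mix_wav_files(mic_path: str, system_path: str, out_path: str) -> None:
    """Mix two mono 16-bit WAVs with equal sample rates into one mono WAV."""
    with wave.open(mic_path, "rb") as m, wave.open(system_path, "rb") as s:
        for w in (m, s):
            if w.getnchannels() != 1 or w.getsampwidth() != 2:
                raise ValueError("expected mono 16-bit PCM input")
        if m.getframerate() != s.getframerate():
            raise ValueError("sample rates differ; resample first")
        rate = m.getframerate()
        mixed = mix_pcm16(m.readframes(m.getnframes()),
                          s.readframes(s.getnframes()))
    with wave.open(out_path, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)
        out.setframerate(rate)
        out.writeframes(mixed)
```

Averaging halves the level of each track; if one side ends up too quiet, a weighted mix is an easy variation.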

3. Transcription endpoint

POST /v1/audio/transcriptions — OpenAI-compatible. Synchronous (returns the result; no job/polling).

Multipart fields:

field               | required | notes
--------------------|----------|------
file                | yes      | the audio bytes
model               | no       | default parakeet-tdt-0.6b-v3 (one STT model server-side; you don't need to pick)
response_format     | no       | json (default, just text) · verbose_json (timestamps) · text
language            | no       | default auto/en
temperature, prompt | no       | passthrough

curl -k -X POST https://<host>/v1/audio/transcriptions \
  -F "file=@call.wav" -F "response_format=verbose_json"

Real verbose_json response — includes word-level AND segment-level timestamps (seconds):

{
  "task": "transcribe",
  "language": "en",
  "duration": 9.259,
  "text": "Good morning everyone. I think the energy thesis is strong this quarter. I agree, but I am worried about the lockup terms and the fee load this time.",
  "segments": [
    { "start": 0.0,  "end": 1.28, "text": "Good morning everyone." },
    { "start": 1.44, "end": 4.48, "text": "I think the energy thesis is strong this quarter." }
  ],
  "words": [
    { "start": 0.0,  "end": 0.32, "text": "Good" },
    { "start": 0.32, "end": 0.72, "text": "morning" },
    { "start": 0.8,  "end": 1.28, "text": "everyone." }
  ]
}

(json → {"text": "..."}; text → plain text body.)
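The same request can be made without curl. The sketch below uses only the Python standard library and mirrors `curl -k` by disabling certificate verification; `encode_multipart` and `transcribe` are hypothetical helper names, and the 300 s timeout simply echoes the transcription limit listed in §1.

```python
import json
import ssl
import urllib.request
import uuid


def encode_multipart(fields: dict, file_field: str, filename: str,
                     data: bytes) -> tuple:
    """Build a multipart/form-data body; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; '
            f'name="{name}"\r\n\r\n{value}\r\n'.encode())
    parts.append(
        f'--{boundary}\r\nContent-Disposition: form-data; '
        f'name="{file_field}"; filename="{filename}"\r\n'
        f'Content-Type: application/octet-stream\r\n\r\n'.encode()
        + data + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def transcribe(base_url: str, wav_bytes: bytes) -> dict:
    """POST audio to /v1/audio/transcriptions and return the parsed JSON."""
    body, content_type = encode_multipart(
        {"response_format": "verbose_json"}, "file", "call.wav", wav_bytes)
    req = urllib.request.Request(
        base_url.rstrip("/") + "/v1/audio/transcriptions",
        data=body, headers={"Content-Type": content_type}, method="POST")
    # Equivalent of curl -k: skip verification of the self-signed cert.
    # Installing the Start9 Root CA and verifying normally is the safer route.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(req, context=ctx, timeout=300) as resp:
        return json.load(resp)
```

The multipart encoder is reusable for the diarization endpoints as well, since they take the same file field.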


4. Diarization

Two endpoints, both synchronous, both returning anonymous clusters (Speaker_0, Speaker_1, …) with timestamps. NVIDIA Sortformer owns segmentation; TitaNet produces a voiceprint per speaker.

POST /api/audio/diarize-chunk — segmentation + voice fingerprints

Multipart: file (required). Designed to be called per chunk for long calls; returns a 192-dim fingerprint per local speaker so you can re-cluster the same person across chunks.

curl -k -X POST https://<host>/api/audio/diarize-chunk -F "file=@call.wav"

Real response:

{
  "duration": 9.259,
  "segments": [
    { "start_s": 0.0,  "end_s": 1.52, "speaker": "Speaker_0", "confidence": 0.931 },
    { "start_s": 1.6,  "end_s": 4.56, "speaker": "Speaker_0", "confidence": 0.9662 },
    { "start_s": 4.88, "end_s": 9.04, "speaker": "Speaker_1", "confidence": 0.9681 }
  ],
  "speakers_detected": ["Speaker_0", "Speaker_1"],
  "fingerprints": {
    "Speaker_0": [0.0028, 0.0173, -0.0114, "...192 floats"],
    "Speaker_1": [0.0020, 0.0056, -0.0045, "...192 floats"]
  },
  "models": {
    "diarization": "nvidia/diar_sortformer_4spk-v1",
    "embedding": "nvidia/speakerverification_en_titanet_large"
  }
}
  • confidence ∈ [0,1] = mean probability the assigned speaker was active over the segment (threshold it to render uncertain segments as "Speaker_0?").
  • speaker labels are local to this chunk — use the fingerprints + cosine similarity (NeMo default distance threshold ~0.7) to merge chunkA.Speaker_0 with chunkB.Speaker_2 when they're the same voice.
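One way to use the fingerprints for cross-chunk stitching is a greedy match by cosine similarity, sketched below. This is an illustrative approach, not the backend's or NeMo's clustering: `stitch_speakers`, `display_label`, and the `Global_N` labels are hypothetical, and the threshold is left as a required argument because the text above quotes ~0.7 as a distance threshold, so the right similarity cutoff should be tuned on real calls.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def stitch_speakers(chunks: List[Dict[str, Sequence[float]]],
                    min_similarity: float) -> List[Dict[str, str]]:
    """Map each chunk's local labels to call-wide labels (Global_0, ...).

    chunks[i] is the "fingerprints" object from the i-th diarize-chunk
    response. Greedy: each local speaker joins the best-matching known
    voice if similarity >= min_similarity, otherwise starts a new one.
    """
    known: Dict[str, Sequence[float]] = {}  # global label -> first fingerprint
    mappings: List[Dict[str, str]] = []
    for fingerprints in chunks:
        mapping: Dict[str, str] = {}
        for local, vec in fingerprints.items():
            best, best_sim = None, min_similarity
            for label, ref in known.items():
                sim = cosine_similarity(vec, ref)
                # Two local speakers in one chunk never share a global label.
                if sim >= best_sim and label not in mapping.values():
                    best, best_sim = label, sim
            if best is None:
                best = f"Global_{len(known)}"
                known[best] = vec
            mapping[local] = best
        mappings.append(mapping)
    return mappings


def display_label(segment: dict, min_confidence: float) -> str:
    """Append "?" to the speaker label of a low-confidence segment."""
    uncertain = segment["confidence"] < min_confidence
    return segment["speaker"] + ("?" if uncertain else "")
```

Keeping only the first fingerprint per global speaker is the simplest policy; averaging fingerprints as more chunks arrive is a possible refinement.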

POST /api/audio/transcribe-with-speakers — ASR + diarization merged

Multipart: file (required). Runs transcription + diarization and merges by timestamp into speaker-attributed text blocks (start_ms/end_ms).

{
  "duration": 9.259, "language": "en",
  "speakers_detected": ["Speaker_0", "Speaker_1"],
  "segments": [
    { "start_ms": 0,    "end_ms": 4480, "speaker": "Speaker_0",
      "text": "Good morning everyone. I think the energy thesis is strong this quarter." },
    { "start_ms": 4800, "end_ms": 9040, "speaker": "Speaker_1",
      "text": "I agree, but I am worried about the lockup terms and the fee load this time." }
  ],
  "models": { "transcription": "parakeet", "diarization": "nvidia/diar_sortformer_4spk-v1" }
}
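As a small illustration of consuming this response, the sketch below renders the merged blocks as a readable transcript. `mmss` and `format_transcript` are hypothetical helper names, and the line layout is only one possible presentation.

```python
def mmss(ms: int) -> str:
    """Format milliseconds as MM:SS (whole seconds, rounded down)."""
    minutes, seconds = divmod(ms // 1000, 60)
    return f"{minutes:02d}:{seconds:02d}"


def format_transcript(response: dict) -> str:
    """Render one line per speaker-attributed block."""
    return "\n".join(
        f"[{mmss(s['start_ms'])}-{mmss(s['end_ms'])}] {s['speaker']}: {s['text']}"
        for s in response["segments"])
```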

Your key question — can it accept a prior (named timeline) to label clusters?

Yes — that's exactly what /api/audio/label-merge does (§4.5 below). You POST the audio plus your visual (start, end, name) timeline; the backend diarizes, runs the majority-temporal-overlap vote, and returns named segments — no client-side merge needed. (The two endpoints above still return anonymous clusters if you'd rather do the merge yourself.) Note the diarizer is fixed at ≤4 speakers per chunk (NVIDIA Sortformer, not pyannote) and takes no num_speakers hint — for >4-person calls, chunk the audio; your visual timeline actually helps disambiguate across chunks.

The fingerprints compound. Every named cluster comes back with its 192-dim voiceprint. Persist those (keyed by name) and pass them back as known_voiceprints on the next call — label-merge will recover a speaker by voice even when the visual cue is missing (camera off, a bad OCR frame). Your visual capture enrolls a voice library for free.
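Since the backend persists nothing, the voiceprint library lives on the client. A minimal sketch of that bookkeeping, assuming a single JSON file on disk: the function names and file layout are hypothetical, and both skipping Unknown_N entries and letting the newest voiceprint overwrite the old one are design choices made for the example, not documented behavior.

```python
import json
import os
from typing import Dict, List

Library = Dict[str, List[float]]


def load_library(path: str) -> Library:
    """Load the name -> voiceprint map, or an empty one on first run."""
    if not os.path.exists(path):
        return {}
    with open(path, "r", encoding="utf-8") as fh:
        return json.load(fh)


def update_library(path: str, response: dict) -> Library:
    """Merge a label-merge response's named fingerprints into the library."""
    library = load_library(path)
    for name, vec in response.get("fingerprints", {}).items():
        # Assumption: placeholder names are not worth enrolling.
        if name.startswith("Unknown_"):
            continue
        library[name] = list(vec)  # newest voiceprint replaces the old one
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(library, fh)
    return library


def known_voiceprints_field(library: Library) -> str:
    """Serialize the library as the known_voiceprints form field value."""
    return json.dumps(library)
```

The string returned by `known_voiceprints_field` is what goes into the known_voiceprints form field on the next call.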


4.5 POST /api/audio/label-merge — named segments from a visual timeline

Diarize + name the clusters from your screen-derived timeline (majority temporal overlap), with an optional voiceprint fallback for anyone the visual track missed. Synchronous. Stateless — you own the timeline and the voiceprint library; the backend just diarizes + merges and persists nothing.

multipart/form-data fields:

field                | required | notes
---------------------|----------|------
file                 | yes      | mixed-mono audio
timeline             | yes      | JSON array: [{"start":0.0,"end":4.5,"name":"Alice","confidence":0.9}, ...] (seconds)
known_voiceprints    | no       | JSON object {"Alice":[192 floats], "Bob":[...]}: named voiceprints from past calls, used to label clusters with no visual overlap
transcribe           | no       | "true" to also return per-segment text (default false)
min_overlap          | no       | min fraction of a cluster's time that must overlap the winning name (default 0.0 = any overlap wins)
voiceprint_threshold | no       | cosine similarity to accept a voiceprint match (default 0.5)

curl -k -X POST https://<host>/api/audio/label-merge \
  -F "file=@call.wav" \
  -F 'timeline=[{"start":0,"end":4.5,"name":"Alice"},{"start":4.8,"end":9.3,"name":"Bob"}]' \
  -F "transcribe=true"

Real response (the 2-speaker test clip; visual named both, with transcript):

{
  "duration": 9.259,
  "speakers": [
    { "cluster": "Speaker_0", "name": "Alice", "source": "visual", "overlap_confidence": 0.9866,
      "fingerprint": [0.0028, 0.0173, "...192 floats"] },
    { "cluster": "Speaker_1", "name": "Bob",   "source": "visual", "overlap_confidence": 1.0,
      "fingerprint": [0.0020, 0.0056, "...192 floats"] }
  ],
  "segments": [
    { "start_ms": 0,    "end_ms": 4480, "speaker": "Alice", "text": "Good morning everyone. I think the energy thesis is strong this quarter." },
    { "start_ms": 4800, "end_ms": 9040, "speaker": "Bob",   "text": "I agree, but I am worried about the lockup terms and the fee load this time." }
  ],
  "fingerprints": { "Alice": ["...192 floats"], "Bob": ["...192 floats"] },
  "models": { "diarization": "nvidia/diar_sortformer_4spk-v1", "embedding": "nvidia/speakerverification_en_titanet_large" }
}

Name resolution per cluster, in order:

  1. The visual-timeline name with the most temporal overlap (source: "visual").
  2. If none, the closest known_voiceprints match above voiceprint_threshold (source: "voiceprint", with match_similarity).
  3. Otherwise Unknown_N (source: "unmatched"), so a cluster is never mislabeled.

The fingerprints map (keyed by the resolved name) is what you persist to grow your voiceprint library for the next call. When transcribe=false, segments are {start_s, end_s, speaker, confidence} instead of text blocks.

Verified live — visual match (both speakers named), voiceprint recovery (a camera-off speaker matched by voice), and unmatched (→ Unknown_0) all confirmed.
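For clients that prefer to do the merge themselves on top of /api/audio/diarize-chunk, the majority-temporal-overlap vote can be sketched as below. This is an illustration of the technique, not the backend's code: `name_clusters` is a hypothetical name, and the overlap fraction is computed as seconds shared with the winning name divided by the cluster's total speaking time. Run on the sample segments from §4 with the timeline from the curl example, it gives 0.9866 for Speaker_0 and 1.0 for Speaker_1, which agrees with the overlap_confidence values shown above.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def name_clusters(segments: List[dict], timeline: List[dict],
                  min_overlap: float = 0.0
                  ) -> Dict[str, Tuple[Optional[str], float]]:
    """Return {cluster: (name or None, overlap fraction)}.

    segments: diarize-chunk segments ({start_s, end_s, speaker}).
    timeline: visual entries ({start, end, name}), all in seconds.
    """
    total: Dict[str, float] = defaultdict(float)
    shared: Dict[str, Dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for seg in segments:
        total[seg["speaker"]] += seg["end_s"] - seg["start_s"]
        for entry in timeline:
            overlap = (min(seg["end_s"], entry["end"])
                       - max(seg["start_s"], entry["start"]))
            if overlap > 0:
                shared[seg["speaker"]][entry["name"]] += overlap
    result: Dict[str, Tuple[Optional[str], float]] = {}
    for cluster, duration in total.items():
        votes = shared.get(cluster, {})
        if not votes or duration <= 0:
            result[cluster] = (None, 0.0)  # no visual overlap at all
            continue
        name, seconds = max(votes.items(), key=lambda kv: kv[1])
        fraction = seconds / duration
        # Below min_overlap the cluster stays unnamed rather than guessed.
        result[cluster] = (name if fraction >= min_overlap else None, fraction)
    return result
```

Clusters that come back with `None` are the ones to try against the voiceprint library next, following the resolution order above.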

If you capture two sample-aligned tracks — mic_file (the local user) + system_file (everyone else, from screen capture) — send them instead of file. This is strictly better than mixing to mono: the diarizer over-segments a mono mix (a stereo clip of two clean voices comes back as 3 speakers), whereas the two channels let each model get the easiest possible mono input.

Extra form fields for dual mode:

field                  | required   | notes
-----------------------|------------|------
mic_file + system_file | yes (dual) | the two aligned mono-16k tracks
self_name              | no         | the local user's name (mic channel). Default "Me".
self_vad               | no         | JSON [{"start","end"}]: windows where the mic is active and louder than system. If omitted, computed server-side per-window.

How it works: the mic track → your words, gated to windows where the mic is genuinely you speaking (the mic also picks up the remote audio as quiet bleed, so this loudness gate is essential — without it the bleed gets transcribed as you). The system track → diarized (it only has to separate the remote people) and named via the timeline + voiceprints. Your clean voiceprint is enrolled from the mic track and injected into the library, so a system cluster that's you dialed in from a second device (dual-login) resolves to you, not a stranger. You also free a Sortformer speaker slot (you no longer consume one).

curl -k -X POST https://<host>/api/audio/label-merge \
  -F "mic_file=@mic.wav" -F "system_file=@system.wav" \
  -F "self_name=Alice" -F 'timeline=[...]' -F "transcribe=true" \
  -F 'known_voiceprints={"Alice":[...],"Bob":[...]}'   # include your own

Response is the same shape with "mode":"dual_channel"; speakers includes a {"name":self_name,"source":"mic_channel"} entry, and fingerprints[self_name] is your clean mic-enrolled voiceprint to store.
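If you would rather compute self_vad on the client than rely on the server-side default, the loudness gate described above can be approximated with a per-window RMS comparison. The sketch below is an assumption-heavy illustration: `self_vad_windows` is a hypothetical name, the window length and RMS floor are illustrative values rather than the backend's, and the inputs are assumed to be little-endian 16-bit mono PCM from sample-aligned tracks.

```python
import math
import struct
from typing import Dict, List, Sequence


def rms(samples: Sequence[int]) -> float:
    """Root-mean-square level of a run of samples (0.0 for an empty run)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def _pcm16(data: bytes) -> tuple:
    n = len(data) // 2
    return struct.unpack(f"<{n}h", data[: n * 2])


def self_vad_windows(mic_pcm: bytes, system_pcm: bytes, sample_rate: int,
                     window_s: float = 0.5, min_rms: float = 500.0
                     ) -> List[Dict[str, float]]:
    """Windows (seconds) where the mic is active and louder than system."""
    mic, system = _pcm16(mic_pcm), _pcm16(system_pcm)
    step = max(1, int(window_s * sample_rate))
    windows: List[Dict[str, float]] = []
    for i in range(0, len(mic), step):
        mic_level = rms(mic[i:i + step])
        system_level = rms(system[i:i + step])
        if mic_level >= min_rms and mic_level > system_level:
            start = i / sample_rate
            end = min(i + step, len(mic)) / sample_rate
            if windows and windows[-1]["end"] == start:
                windows[-1]["end"] = end  # extend the adjacent window
            else:
                windows.append({"start": start, "end": end})
    return windows
```

Passing `json.dumps(self_vad_windows(...))` as the self_vad form field would then replace the server-side computation.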

Validated on a real misattributing call: dual-channel fixed both mono-mix misattributions (a remote "Go Bitcoin" no longer credited to the user; a local "There" recovered from Unknown), and correctly split overlapping speech — two people saying "Hello" at once that the coarse ground truth itself had conflated.

One known limit: if loud remote bleed masks a quiet local word, the mic-track ASR can miss it entirely (we can't attribute a word that was never transcribed). A cleaner mic (headphones, so there's no speaker bleed) avoids it; channel-subtraction echo-cancellation is a possible future enhancement since the tracks are sample-aligned.


5. Anything else

  • No OpenAPI/Swagger yet. This doc + the curl examples are the contract.
  • Health / discovery:
    • GET /api/status — per-service health (parakeet, etc.).
    • GET /api/endpoints — service-discovery JSON (base URLs + ready flags).
    • GET /v1/models — lists the STT model + diarizer.
  • Errors: JSON body, conventional status codes — 400 malformed, 413 too large, 503 + Retry-After if the backend briefly wedges (retry after the interval; transcription auto-recovers). Most error bodies are {"detail": "..."}.
  • Long calls: chunk into ~23 min pieces, send sequentially, diarize each with /api/audio/diarize-chunk, and stitch speakers across chunks via the fingerprints. (The operator's other apps use exactly this pattern.)
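For the long-call pattern, splitting a WAV into fixed-length pieces needs nothing beyond the standard library. The sketch below is one possible approach, with `split_wav` as a hypothetical name; the chunk length is deliberately a required argument so the caller picks it.

```python
import io
import wave
from typing import Iterator


def split_wav(path: str, chunk_seconds: float) -> Iterator[bytes]:
    """Yield complete in-memory WAV files of at most chunk_seconds each."""
    with wave.open(path, "rb") as src:
        params = src.getparams()
        frames_per_chunk = max(1, int(chunk_seconds * src.getframerate()))
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            buf = io.BytesIO()
            with wave.open(buf, "wb") as out:
                out.setnchannels(params.nchannels)
                out.setsampwidth(params.sampwidth)
                out.setframerate(params.framerate)
                out.writeframes(frames)
            yield buf.getvalue()
```

Each yielded chunk can be uploaded as the file field, one at a time. Timestamps in each response are presumably relative to the start of the uploaded chunk, so add the chunk's offset (chunk index × chunk_seconds) before stitching the results together.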

Backend: NVIDIA Parakeet TDT 0.6B (STT) + Sortformer 4spk-v1 (diarization) + TitaNet (voice fingerprints) on DGX Spark, fronted by Spark Control. All on the operator's LAN — nothing leaves the box.