Keysat 8d839e3714 v0.13.0:4 - redaction gateway, embeddings proxy, expanded audio API

- Add redaction gateway (redaction_gateway.py, redaction/ scrub + tests)
- Add embeddings proxy and spark_embed service (Dockerfile + main.py)
- Expand audio_proxy with speaker-aware handling; deep_health/health/server updates
- Package: configureSparks action + sparkConfig model updates, manifest/main wiring
- Docs: AUDIO_API, EMBEDDINGS, REDACTION_GATEWAY; HANDOFF and runbook/known-issues refresh

2026-06-11 17:45:57 -05:00

Spark Control — Audio API reference (transcription + diarization)

For external clients (e.g. the Ten31 Transcripts capture app) integrating with the transcription and diarization endpoints. All examples are real responses from the live deployment.

1. Connection / auth

Base URL: https://<spark-control-host> (the operator's Start9 LAN address, e.g. https://<spark-control-host>:62419). A .local form also exists (survives IP changes); the operator can provide it.
TLS: Start9's self-signed Root CA. On the LAN, set verify=False / rejectUnauthorized:false (curl -k), or install the Start9 Root CA into your trust store. Same story as every other Spark Control endpoint.
Auth: none on the LAN today — the endpoints sit behind StartOS access control + TLS on a trusted network. No bearer token / API key. (If you need per-client auth later, the operator can add it; it's not there now.)
Limits:
- Max upload: 200 MB per request (413 if exceeded). For long calls, chunk the audio (see §4 and the long-calls note in §5).
- Request timeout: transcription ~300 s, diarization ~600 s per request.
- Send requests sequentially, not in parallel. Concurrent audio requests can trip a GPU FFT race on the backend (returns 503 + Retry-After). One in flight at a time is the safe pattern.
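The one-in-flight rule and the 503 + Retry-After behaviour can be sketched as a small client loop. This is a minimal sketch, not the operator's client: `send` is a hypothetical callable that wraps whatever HTTP library you use and returns `(status, headers, body)`, and `Retry-After` is assumed to carry a number of seconds (the HTTP-date form is not handled).

```python
import time
from typing import Callable, Iterable, List, Tuple

# (status_code, headers, body); keeps the sketch independent of any HTTP library.
Response = Tuple[int, dict, bytes]


def send_sequentially(
    items: Iterable[bytes],
    send: Callable[[bytes], Response],   # hypothetical wrapper around the real POST
    max_retries: int = 3,                # assumption: not a documented limit
    default_wait_s: float = 5.0,         # assumption: used if Retry-After is absent
    sleep: Callable[[float], None] = time.sleep,
) -> List[Response]:
    """Send items strictly one at a time, retrying each on 503."""
    results = []
    for item in items:
        for attempt in range(max_retries + 1):
            status, headers, body = send(item)
            if status != 503 or attempt == max_retries:
                break
            try:
                wait = float(headers.get("Retry-After", default_wait_s))
            except ValueError:
                wait = default_wait_s
            sleep(wait)  # wait out the interval before the next attempt
        results.append((status, headers, body))
    return results
```

Because the loop never starts the next item until the current one has finished or exhausted its retries, there is never more than one audio request in flight.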

2. Audio format

The backend (NVIDIA Parakeet / NeMo) decodes common formats (WAV, FLAC, MP3, m4a) and internally resamples to 16 kHz mono. So 16 kHz mono WAV is ideal; anything decodable also works (stereo is downmixed).
Single mixed-mono file is what the endpoints expect. Diarization (Sortformer) separates speakers from one mixed stream, so for diarization you want everyone in one file — mix your system-audio track + mic track to one mono WAV before sending.
Your two-track capture is an asset: your mic track is, by definition, you — a known identity. Two clean options:
1. Mix both tracks → diarize the mix (simplest; pairs perfectly with your visual-timeline name-merge — see §4 note).
2. Diarize only the system track (the other participants) and label your own mic track as the user directly (no diarization needed for your own voice).
Upload mechanism: multipart/form-data, file field name file (OpenAI-compatible). Not base64, not a path/URL: send the bytes.
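If you take option 1 (mix both tracks), the mix can be done with the Python standard library alone. A minimal sketch, assuming both tracks are already 16-bit little-endian mono PCM at the same sample rate and sample-aligned; the function names are illustrative, not part of the API.

```python
import io
import struct
import wave


def mix_pcm16_mono(mic: bytes, system: bytes) -> bytes:
    """Average two 16-bit little-endian mono PCM buffers sample by sample."""
    n_mic, n_sys = len(mic) // 2, len(system) // 2
    a = struct.unpack(f"<{n_mic}h", mic[: n_mic * 2])
    b = struct.unpack(f"<{n_sys}h", system[: n_sys * 2])
    n = max(n_mic, n_sys)
    mixed = []
    for i in range(n):
        x = a[i] if i < n_mic else 0   # pad the shorter track with silence
        y = b[i] if i < n_sys else 0
        mixed.append((x + y) // 2)     # averaging keeps the result inside int16
    return struct.pack(f"<{n}h", *mixed)


def pcm16_to_wav_bytes(pcm: bytes, sample_rate: int = 16000) -> bytes:
    """Wrap raw mono PCM in a WAV container, ready to send as the file field."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(sample_rate)
        w.writeframes(pcm)
    return buf.getvalue()
```

Averaging halves each track's level; that is a simple way to avoid clipping, not a tuned mixdown.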

3. Transcription endpoint

POST /v1/audio/transcriptions — OpenAI-compatible. Synchronous (returns the result; no job/polling).

Multipart fields:

field	required	notes
`file`	yes	the audio bytes
`model`	no	default `parakeet-tdt-0.6b-v3` (one STT model server-side; you don't need to pick)
`response_format`	no	`json` (default, just text) · `verbose_json` (timestamps) · `text`
`language`	no	default auto/en
`temperature`, `prompt`	no	passthrough

curl -k -X POST https://<host>/v1/audio/transcriptions \
  -F "file=@call.wav" -F "response_format=verbose_json"

Real verbose_json response — includes word-level AND segment-level timestamps (seconds):

{
  "task": "transcribe",
  "language": "en",
  "duration": 9.259,
  "text": "Good morning everyone. I think the energy thesis is strong this quarter. I agree, but I am worried about the lockup terms and the fee load this time.",
  "segments": [
    { "start": 0.0,  "end": 1.28, "text": "Good morning everyone." },
    { "start": 1.44, "end": 4.48, "text": "I think the energy thesis is strong this quarter." }
  ],
  "words": [
    { "start": 0.0,  "end": 0.32, "text": "Good" },
    { "start": 0.32, "end": 0.72, "text": "morning" },
    { "start": 0.8,  "end": 1.28, "text": "everyone." }
  ]
}

(json → {"text": "..."}; text → plain text body.)
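As an illustration of consuming the verbose_json shape above, the segment timestamps can be rendered as a timestamped transcript. A small sketch; `format_segments` and `ts` are illustrative helper names, and only the `segments`, `start`, `end`, and `text` keys shown in the sample response are used.

```python
import json


def ts(seconds: float) -> str:
    """Render seconds as mm:ss.mmm."""
    minutes, sec = divmod(seconds, 60)
    return f"{int(minutes):02d}:{sec:06.3f}"


def format_segments(verbose_json: str) -> list[str]:
    """Turn a verbose_json response body into one line per segment."""
    data = json.loads(verbose_json)
    return [
        f"[{ts(seg['start'])} - {ts(seg['end'])}] {seg['text']}"
        for seg in data.get("segments", [])
    ]
```

The same loop works over `words` if you need word-level granularity.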

4. Diarization

Two endpoints, both synchronous, both returning anonymous clusters (Speaker_0, Speaker_1, …) with timestamps. NVIDIA Sortformer owns segmentation; TitaNet produces a voiceprint per speaker.

`POST /api/audio/diarize-chunk` — segmentation + voice fingerprints

Multipart: file (required). Designed to be called per chunk for long calls; returns a 192-dim fingerprint per local speaker so you can re-cluster the same person across chunks.

curl -k -X POST https://<host>/api/audio/diarize-chunk -F "file=@call.wav"

Real response:

{
  "duration": 9.259,
  "segments": [
    { "start_s": 0.0,  "end_s": 1.52, "speaker": "Speaker_0", "confidence": 0.931 },
    { "start_s": 1.6,  "end_s": 4.56, "speaker": "Speaker_0", "confidence": 0.9662 },
    { "start_s": 4.88, "end_s": 9.04, "speaker": "Speaker_1", "confidence": 0.9681 }
  ],
  "speakers_detected": ["Speaker_0", "Speaker_1"],
  "fingerprints": {
    "Speaker_0": [0.0028, 0.0173, -0.0114, "...192 floats"],
    "Speaker_1": [0.0020, 0.0056, -0.0045, "...192 floats"]
  },
  "models": {
    "diarization": "nvidia/diar_sortformer_4spk-v1",
    "embedding": "nvidia/speakerverification_en_titanet_large"
  }
}

confidence ∈ [0,1] = mean probability the assigned speaker was active over the segment (threshold it to render uncertain segments as "Speaker_0?").
speaker labels are local to this chunk — use the fingerprints + cosine similarity (NeMo default distance threshold ~0.7) to merge chunkA.Speaker_0 with chunkB.Speaker_2 when they're the same voice.
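The cross-chunk merge can be sketched as a greedy cosine-similarity match against a running registry of voiceprints. This is a sketch under stated assumptions, not the backend's clustering: the `Global_N` IDs and function names are made up for illustration, and the 0.7 cutoff is treated here as a minimum similarity, which is one reading of the threshold quoted above. Tune it on your own audio.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def assign_global_speakers(chunk_fingerprints, registry, threshold=0.7):
    """Map one chunk's local labels to global IDs; extends registry in place.

    chunk_fingerprints: the "fingerprints" object from one diarize-chunk response.
    registry: {global_id: fingerprint} accumulated over earlier chunks.
    """
    mapping = {}
    for local, fp in chunk_fingerprints.items():
        best_id, best_sim = None, threshold
        for gid, ref in registry.items():
            if gid in mapping.values():
                continue  # two speakers in the same chunk are different people
            sim = cosine_similarity(fp, ref)
            if sim >= best_sim:
                best_id, best_sim = gid, sim
        if best_id is None:
            best_id = f"Global_{len(registry)}"  # unseen voice: enroll it
            registry[best_id] = list(fp)
        mapping[local] = best_id
    return mapping
```

Call it once per chunk, in order, passing the same `registry` each time; the returned mapping tells you which global speaker each local `Speaker_N` corresponds to.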

`POST /api/audio/transcribe-with-speakers` — ASR + diarization merged

Multipart: file (required). Runs transcription + diarization and merges by timestamp into speaker-attributed text blocks (start_ms/end_ms).

{
  "duration": 9.259, "language": "en",
  "speakers_detected": ["Speaker_0", "Speaker_1"],
  "segments": [
    { "start_ms": 0,    "end_ms": 4480, "speaker": "Speaker_0",
      "text": "Good morning everyone. I think the energy thesis is strong this quarter." },
    { "start_ms": 4800, "end_ms": 9040, "speaker": "Speaker_1",
      "text": "I agree, but I am worried about the lockup terms and the fee load this time." }
  ],
  "models": { "transcription": "parakeet", "diarization": "nvidia/diar_sortformer_4spk-v1" }
}

Your key question — can it accept a prior (named timeline) to label clusters?

Yes — that's exactly what /api/audio/label-merge does (§4.5 below). You POST the audio plus your visual (start, end, name) timeline; the backend diarizes, runs the majority-temporal-overlap vote, and returns named segments — no client-side merge needed. (The two endpoints above still return anonymous clusters if you'd rather do the merge yourself.) Note the diarizer is fixed at ≤4 speakers per chunk (NVIDIA Sortformer, not pyannote) and takes no num_speakers hint — for >4-person calls, chunk the audio; your visual timeline actually helps disambiguate across chunks.

The fingerprints compound. Every named cluster comes back with its 192-dim voiceprint. Persist those (keyed by name) and pass them back as known_voiceprints on the next call — label-merge will recover a speaker by voice even when the visual cue is missing (camera off, a bad OCR frame). Your visual capture enrolls a voice library for free.
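Persisting the voiceprints between calls could look like the sketch below. It assumes a plain in-memory dict as the library and a "latest call wins" policy, both illustrative choices; skipping `Unknown_` names is a precaution in case unmatched clusters show up in the map, which this doc does not state either way.

```python
import json


def update_library(library: dict, response: dict) -> dict:
    """Merge a label-merge response's name-keyed fingerprints into the library."""
    for name, fingerprint in response.get("fingerprints", {}).items():
        if name.startswith("Unknown_"):
            continue  # precaution: do not enroll unnamed clusters
        library[name] = fingerprint  # assumption: newest voiceprint replaces the old
    return library


def known_voiceprints_field(library: dict) -> str:
    """Serialize the library as the known_voiceprints form field (a JSON object)."""
    return json.dumps(library)
```

Store the library wherever suits your app (a JSON file is enough), since the backend persists nothing.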

4.5 `POST /api/audio/label-merge` — named segments from a visual timeline

Diarize + name the clusters from your screen-derived timeline (majority temporal overlap), with an optional voiceprint fallback for anyone the visual track missed. Synchronous. Stateless — you own the timeline and the voiceprint library; the backend just diarizes + merges and persists nothing.

multipart/form-data fields:

field	required	notes
`file`	yes	mixed-mono audio
`timeline`	yes	JSON array: `[{"start":0.0,"end":4.5,"name":"Alice","confidence":0.9}, ...]` (seconds)
`known_voiceprints`	no	JSON object `{"Alice":[192 floats], "Bob":[...]}` — named voiceprints from past calls, used to label clusters with no visual overlap
`transcribe`	no	`"true"` to also return per-segment text (default false)
`min_overlap`	no	min fraction of a cluster's time that must overlap the winning name (default `0.0` = any overlap wins)
`voiceprint_threshold`	no	cosine similarity to accept a voiceprint match (default `0.5`)

curl -k -X POST https://<host>/api/audio/label-merge \
  -F "file=@call.wav" \
  -F 'timeline=[{"start":0,"end":4.5,"name":"Alice"},{"start":4.8,"end":9.3,"name":"Bob"}]' \
  -F "transcribe=true"
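The same request from Python mostly comes down to JSON-encoding the structured fields into strings. A minimal sketch of building the non-file form fields from the table above; `label_merge_fields` is an illustrative helper, not part of any client library.

```python
import json


def label_merge_fields(
    timeline,
    known_voiceprints=None,
    transcribe=False,
    min_overlap=None,
    voiceprint_threshold=None,
):
    """Build the text form fields for POST /api/audio/label-merge."""
    fields = {"timeline": json.dumps(timeline)}  # JSON array, times in seconds
    if known_voiceprints:
        fields["known_voiceprints"] = json.dumps(known_voiceprints)
    if transcribe:
        fields["transcribe"] = "true"  # the field is the string "true"
    if min_overlap is not None:
        fields["min_overlap"] = str(min_overlap)
    if voiceprint_threshold is not None:
        fields["voiceprint_threshold"] = str(voiceprint_threshold)
    return fields
```

With the third-party requests library, for example, these would go in `data=` alongside `files={"file": ...}` and `verify=False` (or a CA bundle), matching the TLS note in §1.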

Real response (the 2-speaker test clip; visual named both, with transcript):

{
  "duration": 9.259,
  "speakers": [
    { "cluster": "Speaker_0", "name": "Alice", "source": "visual", "overlap_confidence": 0.9866,
      "fingerprint": [0.0028, 0.0173, "...192 floats"] },
    { "cluster": "Speaker_1", "name": "Bob",   "source": "visual", "overlap_confidence": 1.0,
      "fingerprint": [0.0020, 0.0056, "...192 floats"] }
  ],
  "segments": [
    { "start_ms": 0,    "end_ms": 4480, "speaker": "Alice", "text": "Good morning everyone. I think the energy thesis is strong this quarter." },
    { "start_ms": 4800, "end_ms": 9040, "speaker": "Bob",   "text": "I agree, but I am worried about the lockup terms and the fee load this time." }
  ],
  "fingerprints": { "Alice": ["...192 floats"], "Bob": ["...192 floats"] },
  "models": { "diarization": "nvidia/diar_sortformer_4spk-v1", "embedding": "nvidia/speakerverification_en_titanet_large" }
}

Name resolution per cluster, in order: (1) the visual-timeline name with the most temporal overlap (source: "visual"); (2) if none, the closest known_voiceprints match above voiceprint_threshold (source: "voiceprint", with match_similarity); (3) otherwise Unknown_N (source: "unmatched") — never mislabeled. The fingerprints map (keyed by the resolved name) is what you persist to grow your voiceprint library for the next call. When transcribe=false, segments are {start_s, end_s, speaker, confidence} instead of text blocks.
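If you would rather do the merge yourself from diarize-chunk output, the resolution order above can be approximated client-side. This is a sketch of the described order, not the backend's code: the function names are illustrative, and `overlap_confidence` is computed here as the winning name's overlap divided by the cluster's total speech time. That definition is an assumption, though on the sample clip it gives 0.9866 and 1.0, the same values as the response shown.

```python
import math


def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = math.sqrt(sum(x * x for x in a)), math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def resolve_names(segments, timeline, fingerprints=None, known=None,
                  min_overlap=0.0, voiceprint_threshold=0.5):
    """Name each cluster: visual overlap first, then voiceprint, else Unknown_N."""
    totals, votes = {}, {}
    for seg in segments:
        c = seg["speaker"]
        totals[c] = totals.get(c, 0.0) + (seg["end_s"] - seg["start_s"])
        for entry in timeline:
            ov = min(seg["end_s"], entry["end"]) - max(seg["start_s"], entry["start"])
            if ov > 0:
                per_name = votes.setdefault(c, {})
                per_name[entry["name"]] = per_name.get(entry["name"], 0.0) + ov
    resolved, unknown = {}, 0
    for c, total in totals.items():
        if votes.get(c) and total > 0:
            name, ov = max(votes[c].items(), key=lambda kv: kv[1])
            if ov / total >= min_overlap:
                resolved[c] = {"name": name, "source": "visual",
                               "overlap_confidence": round(ov / total, 4)}
                continue
        best, best_sim = None, voiceprint_threshold
        fp = (fingerprints or {}).get(c)
        if fp is not None:
            for kname, kfp in (known or {}).items():
                sim = _cosine(fp, kfp)
                if sim >= best_sim:
                    best, best_sim = kname, sim
        if best is not None:
            resolved[c] = {"name": best, "source": "voiceprint",
                           "match_similarity": best_sim}
        else:
            resolved[c] = {"name": f"Unknown_{unknown}", "source": "unmatched"}
            unknown += 1
    return resolved
```

Calling the endpoint remains the simpler path; the sketch is mainly useful for understanding why a cluster got the name it did.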

Verified live — visual match (both speakers named), voiceprint recovery (a camera-off speaker matched by voice), and unmatched (→ Unknown_0) all confirmed.

Dual-channel mode (recommended for Ten31 Transcripts)

If you capture two sample-aligned tracks — mic_file (the local user) + system_file (everyone else, from screen capture) — send them instead of file. This is strictly better than mixing to mono: the diarizer over-segments a mono mix (a stereo clip of two clean voices comes back as 3 speakers), whereas the two channels let each model get the easiest possible mono input.

Extra form fields for dual mode:

field	required	notes
`mic_file` + `system_file`	yes (dual)	the two aligned mono-16k tracks
`self_name`	no	the local user's name (mic channel). Default `"Me"`.
`self_vad`	no	JSON `[{"start","end"}]` — windows where the mic is active and louder than system. If omitted, computed server-side per-window.

How it works: the mic track → your words, gated to windows where the mic is genuinely you speaking (the mic also picks up the remote audio as quiet bleed, so this loudness gate is essential — without it the bleed gets transcribed as you). The system track → diarized (it only has to separate the remote people) and named via the timeline + voiceprints. Your clean voiceprint is enrolled from the mic track and injected into the library, so a system cluster that's you dialed in from a second device (dual-login) resolves to you, not a stranger. You also free a Sortformer speaker slot (you no longer consume one).
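If you want to supply self_vad yourself, the loudness gate can be approximated with a per-window RMS comparison. This is an illustrative sketch only: the server-side computation is not specified in this doc, and the window length, noise floor, and loudness ratio below are made-up defaults you would need to tune. Inputs are plain lists of aligned int16 samples.

```python
import math


def _rms(samples):
    return math.sqrt(sum(s * s for s in samples) / len(samples)) if samples else 0.0


def self_vad_windows(mic, system, sample_rate=16000, window_s=0.5,
                     floor=200.0, ratio=1.5):
    """Return [{"start", "end"}] windows (seconds) where the mic dominates.

    floor: minimum mic RMS to count as speech (hypothetical value).
    ratio: how much louder than the system track the mic must be (hypothetical).
    """
    win = max(1, int(sample_rate * window_s))
    n = min(len(mic), len(system))
    windows = []
    for start in range(0, n, win):
        end = min(start + win, n)
        mic_level = _rms(mic[start:end])
        sys_level = _rms(system[start:end])
        if mic_level >= floor and mic_level > ratio * sys_level:
            t0, t1 = start / sample_rate, end / sample_rate
            if windows and abs(windows[-1]["end"] - t0) < 1e-9:
                windows[-1]["end"] = t1  # extend the previous adjacent window
            else:
                windows.append({"start": t0, "end": t1})
    return windows
```

Serialize the result with `json.dumps` and send it as the self_vad field; omit the field to let the server compute its own windows.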

curl -k -X POST https://<host>/api/audio/label-merge \
  -F "mic_file=@mic.wav" -F "system_file=@system.wav" \
  -F "self_name=Alice" -F 'timeline=[...]' -F "transcribe=true" \
  -F 'known_voiceprints={"Alice":[...],"Bob":[...]}'   # include your own

Response is the same shape with "mode":"dual_channel"; speakers includes a {"name":self_name,"source":"mic_channel"} entry, and fingerprints[self_name] is your clean mic-enrolled voiceprint to store.

Validated on a real misattributing call: dual-channel fixed both mono-mix misattributions (a remote "Go Bitcoin" no longer credited to the user; a local "There" recovered from Unknown), and correctly split overlapping speech — two people saying "Hello" at once that the coarse ground truth itself had conflated.

One known limit: if loud remote bleed masks a quiet local word, the mic-track ASR can miss it entirely (we can't attribute a word that was never transcribed). A cleaner mic (headphones, so there's no speaker bleed) avoids it; channel-subtraction echo-cancellation is a possible future enhancement since the tracks are sample-aligned.

5. Anything else

No OpenAPI/Swagger yet. This doc + the curl examples are the contract.
Health / discovery:
- GET /api/status — per-service health (parakeet, etc.).
- GET /api/endpoints — service-discovery JSON (base URLs + ready flags).
- GET /v1/models — lists the STT model + diarizer.
Errors: JSON body, conventional status codes — 400 malformed, 413 too large, 503 + Retry-After if the backend briefly wedges (retry after the interval; transcription auto-recovers). Most error bodies are {"detail": "..."}.
Long calls: chunk into ~2–3 min pieces, send sequentially, diarize each with /api/audio/diarize-chunk, and stitch speakers across chunks via the fingerprints. (The operator's other apps use exactly this pattern.)
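The chunk-and-stitch pattern needs two small pieces of bookkeeping: planning the chunk boundaries, and shifting each chunk's segment times back onto the full-call timeline. A minimal sketch, assuming 150 s chunks (inside the 2–3 min guidance) and a local-to-global speaker mapping you have already derived from the fingerprints; the function names are illustrative.

```python
def plan_chunks(duration_s: float, chunk_s: float = 150.0):
    """Split [0, duration_s) into consecutive (start, end) windows in seconds."""
    chunks = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        chunks.append((start, end))
        start = end
    return chunks


def offset_segments(segments, offset_s: float, mapping=None):
    """Shift one chunk's diarize-chunk segments onto the full-call timeline.

    mapping: optional {local_label: global_label} from fingerprint matching.
    """
    mapping = mapping or {}
    return [
        {
            **seg,
            "start_s": seg["start_s"] + offset_s,
            "end_s": seg["end_s"] + offset_s,
            "speaker": mapping.get(seg["speaker"], seg["speaker"]),
        }
        for seg in segments
    ]
```

Cut the audio at each planned boundary, send the chunks one at a time, then call `offset_segments` with each chunk's start time to assemble a single timeline.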

Backend: NVIDIA Parakeet TDT 0.6B (STT) + Sortformer 4spk-v1 (diarization) + TitaNet (voice fingerprints) on DGX Spark, fronted by Spark Control. All on the operator's LAN — nothing leaves the box.
