Keysat 8d839e3714 v0.13.0:4 - redaction gateway, embeddings proxy, expanded audio API
- Add redaction gateway (redaction_gateway.py, redaction/ scrub + tests)
- Add embeddings proxy and spark_embed service (Dockerfile + main.py)
- Expand audio_proxy with speaker-aware handling; deep_health/health/server updates
- Package: configureSparks action + sparkConfig model updates, manifest/main wiring
- Docs: AUDIO_API, EMBEDDINGS, REDACTION_GATEWAY; HANDOFF and runbook/known-issues refresh
2026-06-11 17:45:57 -05:00

# Spark Control — Audio API reference (transcription + diarization)
For external clients (e.g. the **Ten31 Transcripts** capture app) integrating with
the transcription and diarization endpoints. All examples are **real responses**
from the live deployment.
---
## 1. Connection / auth
- **Base URL:** `https://<spark-control-host>` (the operator's Start9 LAN address,
e.g. `https://<spark-control-host>:62419`). A `.local` form also exists (survives IP
changes); the operator can provide it.
- **TLS:** Start9's self-signed Root CA. On the LAN, set `verify=False` /
`rejectUnauthorized:false` (curl `-k`), or install the Start9 Root CA into your
trust store. Same story as every other Spark Control endpoint.
- **Auth:** **none on the LAN** today — the endpoints sit behind StartOS access
control + TLS on a trusted network. No bearer token / API key. (If you need
per-client auth later, the operator can add it; it's not there now.)
- **Limits:**
- Max upload: **200 MB** per request (`413` if exceeded). For long calls, chunk
(see §4).
- Request timeout: transcription ~300 s, diarization ~600 s per request.
- **Send requests sequentially, not in parallel.** Concurrent audio requests can
trip a GPU FFT race on the backend (returns `503` + `Retry-After`). One in
flight at a time is the safe pattern.
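
A minimal sketch of the one-in-flight pattern with `503` + `Retry-After` handling. The `send` callable, its `(status, headers, body)` return shape, the attempt limit, and the fallback wait are all assumptions made for illustration; plug in whatever HTTP client you use.

```python
import time
from typing import Callable, Iterable, Mapping, Tuple

# (status_code, response_headers, body): a hypothetical shape for this sketch.
Response = Tuple[int, Mapping[str, str], bytes]


def send_sequentially(
    jobs: Iterable[str],
    send: Callable[[str], Response],
    max_attempts: int = 5,
    default_wait_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> list:
    """Send one request at a time; on 503, wait for Retry-After, then retry."""
    results = []
    for job in jobs:
        for attempt in range(max_attempts):
            status, headers, body = send(job)
            if status != 503:
                break
            # Retry-After is assumed to be a number of seconds; fall back to
            # a default if the header is missing or not a plain number.
            try:
                wait_s = float(headers.get("Retry-After", default_wait_s))
            except ValueError:
                wait_s = default_wait_s
            if attempt < max_attempts - 1:
                sleep(wait_s)
        # The next job starts only after this one has fully finished.
        results.append((job, status, body))
    return results
```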
---
## 2. Audio format
- The backend (NVIDIA Parakeet / NeMo) **decodes common formats** (WAV, FLAC, MP3,
m4a) and internally resamples to **16 kHz mono**. So **16 kHz mono WAV is ideal**;
anything decodable also works (stereo is downmixed).
- **Single mixed-mono file** is what the endpoints expect. Diarization (Sortformer)
separates speakers *from one mixed stream*, so for diarization you want everyone
in one file — **mix your system-audio track + mic track to one mono WAV** before
sending.
- **Your two-track capture is an asset:** your mic track is, by definition, *you*:
a known identity. Two clean options:
1. **Mix both tracks → diarize the mix** (simplest; pairs perfectly with your
visual-timeline name-merge — see §4 note).
2. **Diarize only the system track** (the other participants) and label your own
mic track as the user directly (no diarization needed for your own voice).
- **Upload mechanism:** `multipart/form-data`, file field name **`file`**
(OpenAI-compatible). Not base64, not a path/URL: send the bytes.
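
For the "mix both tracks" option, a sketch of one way to do it with the Python standard library only. It assumes both tracks are already 16-bit PCM, mono, sample-aligned, and at the same rate; the function names are illustrative, not part of any Spark Control client.

```python
import sys
import wave
from array import array


def mix_to_mono(mic: array, system: array) -> array:
    """Average two sample-aligned 16-bit mono tracks into one mono track."""
    n = max(len(mic), len(system))
    mixed = array("h", [0] * n)
    for i in range(n):
        a = mic[i] if i < len(mic) else 0        # pad the shorter track
        b = system[i] if i < len(system) else 0  # with silence
        mixed[i] = (a + b) // 2  # an average of two int16 values fits in int16
    return mixed


def write_wav(path, samples: array, rate: int = 16000) -> None:
    """Write 16-bit mono PCM samples to a WAV file (16 kHz by default)."""
    out = array("h", samples)
    if sys.byteorder == "big":
        out.byteswap()  # WAV PCM data is little-endian
    with wave.open(path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(out.tobytes())
```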
---
## 3. Transcription endpoint
**`POST /v1/audio/transcriptions`** — OpenAI-compatible. **Synchronous** (returns the
result; no job/polling).
Multipart fields:
| field | required | notes |
|---|---|---|
| `file` | **yes** | the audio bytes |
| `model` | no | default `parakeet-tdt-0.6b-v3` (one STT model server-side; you don't need to pick) |
| `response_format` | no | `json` (default, just text) · `verbose_json` (timestamps) · `text` |
| `language` | no | default auto/en |
| `temperature`, `prompt` | no | passthrough |
```bash
curl -k -X POST https://<host>/v1/audio/transcriptions \
-F "file=@call.wav" -F "response_format=verbose_json"
```
**Real `verbose_json` response** — includes **word-level AND segment-level
timestamps** (seconds):
```json
{
"task": "transcribe",
"language": "en",
"duration": 9.259,
"text": "Good morning everyone. I think the energy thesis is strong this quarter. I agree, but I am worried about the lockup terms and the fee load this time.",
"segments": [
{ "start": 0.0, "end": 1.28, "text": "Good morning everyone." },
{ "start": 1.44, "end": 4.48, "text": "I think the energy thesis is strong this quarter." }
],
"words": [
{ "start": 0.0, "end": 0.32, "text": "Good" },
{ "start": 0.32, "end": 0.72, "text": "morning" },
{ "start": 0.8, "end": 1.28, "text": "everyone." }
]
}
```
(`json` → `{"text": "..."}`; `text` → plain text body.)
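
As a small illustration of consuming the `verbose_json` shape shown above, the sketch below turns the `segments` array into timestamped lines. The output format and the function name are arbitrary choices for the example.

```python
import json


def format_segments(verbose_json: str) -> list:
    """Render each segment of a verbose_json body as '[start-end] text'."""
    payload = json.loads(verbose_json)
    lines = []
    for seg in payload.get("segments", []):
        # start/end are seconds, as in the sample response.
        lines.append(f"[{seg['start']:.2f}-{seg['end']:.2f}] {seg['text']}")
    return lines
```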
---
## 4. Diarization
Two endpoints, both **synchronous**, both returning **anonymous clusters**
(`Speaker_0`, `Speaker_1`, …) with timestamps. NVIDIA **Sortformer** owns
segmentation; **TitaNet** produces a voiceprint per speaker.
### `POST /api/audio/diarize-chunk` — segmentation + voice fingerprints
Multipart: `file` (required). Designed to be called **per chunk** for long calls;
returns a 192-dim fingerprint per local speaker so you can re-cluster the same
person across chunks.
```bash
curl -k -X POST https://<host>/api/audio/diarize-chunk -F "file=@call.wav"
```
**Real response:**
```json
{
"duration": 9.259,
"segments": [
{ "start_s": 0.0, "end_s": 1.52, "speaker": "Speaker_0", "confidence": 0.931 },
{ "start_s": 1.6, "end_s": 4.56, "speaker": "Speaker_0", "confidence": 0.9662 },
{ "start_s": 4.88, "end_s": 9.04, "speaker": "Speaker_1", "confidence": 0.9681 }
],
"speakers_detected": ["Speaker_0", "Speaker_1"],
"fingerprints": {
"Speaker_0": [0.0028, 0.0173, -0.0114, "...192 floats"],
"Speaker_1": [0.0020, 0.0056, -0.0045, "...192 floats"]
},
"models": {
"diarization": "nvidia/diar_sortformer_4spk-v1",
"embedding": "nvidia/speakerverification_en_titanet_large"
}
}
```
- `confidence` ∈ [0,1] = mean probability the assigned speaker was active over the
segment (threshold it to render uncertain segments as "Speaker_0?").
- `speaker` labels are **local to this chunk** — use the fingerprints + cosine
similarity (NeMo default distance threshold ~0.7) to merge `chunkA.Speaker_0` with
`chunkB.Speaker_2` when they're the same voice.
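
A sketch of the cross-chunk merge using plain cosine similarity. Treating `0.7` as a *similarity* cutoff is an assumption made here (the note above calls it a distance threshold), so tune it against your own audio; the function names are illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_speakers(known, chunk, threshold=0.7):
    """Map each chunk-local label to the best-matching known label, or None.

    known / chunk: {label: fingerprint}, e.g. the `fingerprints` object
    returned by two diarize-chunk calls.
    """
    mapping = {}
    for label, fp in chunk.items():
        best_label, best_sim = None, threshold
        for known_label, known_fp in known.items():
            sim = cosine_similarity(fp, known_fp)
            if sim >= best_sim:
                best_label, best_sim = known_label, sim
        mapping[label] = best_label  # None means "new voice in this chunk"
    return mapping
```

A label that maps to `None` is a voice not seen in earlier chunks; add its fingerprint to `known` under a fresh global label before processing the next chunk.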
### `POST /api/audio/transcribe-with-speakers` — ASR + diarization merged
Multipart: `file` (required). Runs transcription + diarization and **merges by
timestamp** into speaker-attributed text blocks (`start_ms`/`end_ms`).
```json
{
"duration": 9.259, "language": "en",
"speakers_detected": ["Speaker_0", "Speaker_1"],
"segments": [
{ "start_ms": 0, "end_ms": 4480, "speaker": "Speaker_0",
"text": "Good morning everyone. I think the energy thesis is strong this quarter." },
{ "start_ms": 4800, "end_ms": 9040, "speaker": "Speaker_1",
"text": "I agree, but I am worried about the lockup terms and the fee load this time." }
],
"models": { "transcription": "parakeet", "diarization": "nvidia/diar_sortformer_4spk-v1" }
}
```
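
If you prefer to merge client-side from a `verbose_json` transcript and a `diarize-chunk` result, one plausible approach is to give each ASR segment the speaker with the most temporal overlap and coalesce adjacent segments from the same speaker. This is a sketch under that assumption, not the backend's actual implementation; the `"Unknown"` fallback label is hypothetical.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Seconds shared by two intervals (0.0 if they do not intersect)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def merge_by_timestamp(asr_segments, diar_segments):
    """asr_segments: [{"start","end","text"}] in seconds (verbose_json shape).
    diar_segments: [{"start_s","end_s","speaker"}] (diarize-chunk shape).
    Returns speaker-attributed blocks with start_ms / end_ms."""
    blocks = []
    for seg in asr_segments:
        totals = {}
        for d in diar_segments:
            o = overlap(seg["start"], seg["end"], d["start_s"], d["end_s"])
            if o > 0:
                totals[d["speaker"]] = totals.get(d["speaker"], 0.0) + o
        speaker = max(totals, key=totals.get) if totals else "Unknown"
        if blocks and blocks[-1]["speaker"] == speaker:
            # Same speaker keeps talking: extend the current block.
            blocks[-1]["end_ms"] = round(seg["end"] * 1000)
            blocks[-1]["text"] += " " + seg["text"]
        else:
            blocks.append({
                "start_ms": round(seg["start"] * 1000),
                "end_ms": round(seg["end"] * 1000),
                "speaker": speaker,
                "text": seg["text"],
            })
    return blocks
```

Fed the two ASR segments and three diarization segments from the sample responses in this document, this yields a single `Speaker_0` block spanning 0 to 4480 ms.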
### Your key question — can it accept a *prior* (named timeline) to label clusters?
**Yes — that's exactly what `/api/audio/label-merge` does (§4.5 below).** You POST
the audio plus your visual `(start, end, name)` timeline; the backend diarizes,
runs the majority-temporal-overlap vote, and returns **named** segments — no
client-side merge needed. (The two endpoints above still return anonymous clusters
if you'd rather do the merge yourself.) Note the diarizer is fixed at **≤4 speakers
per chunk** (NVIDIA Sortformer, not pyannote) and takes **no `num_speakers` hint**.
For >4-person calls, chunk the audio; your visual timeline actually helps
disambiguate across chunks.
**The fingerprints compound.** Every named cluster comes back with its 192-dim
voiceprint. Persist those (keyed by name) and pass them back as `known_voiceprints`
on the next call — `label-merge` will recover a speaker by **voice** even when the
visual cue is missing (camera off, a bad OCR frame). Your visual capture *enrolls a
voice library for free*.
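
Because the backend is stateless, the voiceprint library lives on the client. A sketch of one way to maintain it: overwrite each name with its latest fingerprint and skip `Unknown_N` entries. Both choices are assumptions for this example (averaging per name would be another option), and the helper names are illustrative.

```python
import json


def update_library(library: dict, response: dict) -> dict:
    """Return a copy of `library` updated with named fingerprints from a
    label-merge response (the `fingerprints` map keyed by resolved name)."""
    updated = dict(library)
    for name, fingerprint in response.get("fingerprints", {}).items():
        if name.startswith("Unknown_"):
            continue  # assumption: unnamed clusters are not worth enrolling
        updated[name] = fingerprint  # latest voiceprint wins
    return updated


def known_voiceprints_field(library: dict) -> str:
    """Serialize the library as the `known_voiceprints` form field value."""
    return json.dumps(library)
```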
---
## 4.5 `POST /api/audio/label-merge` — named segments from a visual timeline
Diarize + **name the clusters** from your screen-derived timeline (majority temporal
overlap), with an optional **voiceprint fallback** for anyone the visual track
missed. **Synchronous. Stateless** — you own the timeline and the voiceprint
library; the backend just diarizes + merges and persists nothing.
`multipart/form-data` fields:
| field | required | notes |
|---|---|---|
| `file` | **yes** | mixed-mono audio |
| `timeline` | **yes** | JSON array: `[{"start":0.0,"end":4.5,"name":"Alice","confidence":0.9}, ...]` (seconds) |
| `known_voiceprints` | no | JSON object `{"Alice":[192 floats], "Bob":[...]}` — named voiceprints from past calls, used to label clusters with no visual overlap |
| `transcribe` | no | `"true"` to also return per-segment text (default false) |
| `min_overlap` | no | min fraction of a cluster's time that must overlap the winning name (default `0.0` = any overlap wins) |
| `voiceprint_threshold` | no | cosine similarity to accept a voiceprint match (default `0.5`) |
```bash
curl -k -X POST https://<host>/api/audio/label-merge \
-F "file=@call.wav" \
-F 'timeline=[{"start":0,"end":4.5,"name":"Alice"},{"start":4.8,"end":9.3,"name":"Bob"}]' \
-F "transcribe=true"
```
**Real response** (the 2-speaker test clip; visual named both, with transcript):
```json
{
"duration": 9.259,
"speakers": [
{ "cluster": "Speaker_0", "name": "Alice", "source": "visual", "overlap_confidence": 0.9866,
"fingerprint": [0.0028, 0.0173, "...192 floats"] },
{ "cluster": "Speaker_1", "name": "Bob", "source": "visual", "overlap_confidence": 1.0,
"fingerprint": [0.0020, 0.0056, "...192 floats"] }
],
"segments": [
{ "start_ms": 0, "end_ms": 4480, "speaker": "Alice", "text": "Good morning everyone. I think the energy thesis is strong this quarter." },
{ "start_ms": 4800, "end_ms": 9040, "speaker": "Bob", "text": "I agree, but I am worried about the lockup terms and the fee load this time." }
],
"fingerprints": { "Alice": ["...192 floats"], "Bob": ["...192 floats"] },
"models": { "diarization": "nvidia/diar_sortformer_4spk-v1", "embedding": "nvidia/speakerverification_en_titanet_large" }
}
```
**Name resolution per cluster, in order:** (1) the visual-timeline name with the
most temporal overlap (`source: "visual"`); (2) if none, the closest
`known_voiceprints` match above `voiceprint_threshold` (`source: "voiceprint"`,
with `match_similarity`); (3) otherwise `Unknown_N` (`source: "unmatched"`) — never
mislabeled. The `fingerprints` map (keyed by the resolved name) is what you persist
to grow your voiceprint library for the next call. When `transcribe=false`, segments
are `{start_s, end_s, speaker, confidence}` instead of text blocks.
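
The resolution order above can be sketched as follows. This is an illustration of the described rules, not the server code: the input shapes (`clusters` as lists of `(start, end)` spans, `fingerprints` keyed by cluster label) and the use of `>=` at both thresholds are assumptions.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def overlap(a, b):
    """Seconds shared by two (start, end) spans."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def resolve_names(clusters, timeline, fingerprints, known_voiceprints,
                  min_overlap=0.0, voiceprint_threshold=0.5):
    """Return {cluster_label: (name, source)} using visual, then voice."""
    resolved, unknown_count = {}, 0
    for label, spans in clusters.items():
        total = sum(end - start for start, end in spans)
        votes = {}
        for span in spans:
            for entry in timeline:
                o = overlap(span, (entry["start"], entry["end"]))
                if o > 0:
                    votes[entry["name"]] = votes.get(entry["name"], 0.0) + o
        # (1) Visual timeline: the name with the most temporal overlap.
        if votes and total > 0:
            name = max(votes, key=votes.get)
            if votes[name] / total >= min_overlap:
                resolved[label] = (name, "visual")
                continue
        # (2) Voiceprint fallback: closest known voice above the threshold.
        best_name, best_sim = None, voiceprint_threshold
        fp = fingerprints.get(label)
        if fp is not None:
            for known_name, known_fp in known_voiceprints.items():
                sim = cosine(fp, known_fp)
                if sim >= best_sim:
                    best_name, best_sim = known_name, sim
        if best_name is not None:
            resolved[label] = (best_name, "voiceprint")
        else:
            # (3) Never guess: leave the cluster explicitly unnamed.
            resolved[label] = (f"Unknown_{unknown_count}", "unmatched")
            unknown_count += 1
    return resolved
```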
**Verified live** — visual match (both speakers named), voiceprint recovery (a
camera-off speaker matched by voice), and unmatched (→ `Unknown_0`) all confirmed.
### Dual-channel mode (recommended for Ten31 Transcripts)
If you capture two sample-aligned tracks — **`mic_file`** (the local user) + **`system_file`**
(everyone else, from screen capture) — send them *instead of* `file`. This is strictly
better than mixing to mono: the diarizer over-segments a mono mix (a stereo clip of two
clean voices comes back as **3** speakers), whereas the two channels let each model get
the easiest possible mono input.
Extra form fields for dual mode:
| field | required | notes |
|---|---|---|
| `mic_file` + `system_file` | **yes (dual)** | the two aligned mono-16k tracks |
| `self_name` | no | the local user's name (mic channel). Default `"Me"`. |
| `self_vad` | no | JSON `[{"start","end"}]` — windows where the mic is active *and louder than* system. If omitted, computed server-side per-window. |
How it works: the **mic track** → your words, gated to windows where the mic is genuinely
you speaking (the mic also picks up the remote audio as quiet bleed, so this loudness gate
is essential; without it the bleed gets transcribed as you). The **system track** → diarized
(it only has to separate the *remote* people) and named via the timeline +
voiceprints. Your clean voiceprint is **enrolled from the mic track** and injected into the
library, so a system cluster that's you dialed in from a second device (dual-login) resolves
to you, not a stranger. You also free a Sortformer speaker slot (you no longer consume one).
```bash
curl -k -X POST https://<host>/api/audio/label-merge \
-F "mic_file=@mic.wav" -F "system_file=@system.wav" \
-F "self_name=Alice" -F 'timeline=[...]' -F "transcribe=true" \
-F 'known_voiceprints={"Alice":[...],"Bob":[...]}' # include your own
```
Response is the same shape with `"mode":"dual_channel"`; `speakers` includes a
`{"name":self_name,"source":"mic_channel"}` entry, and `fingerprints[self_name]` is your
clean mic-enrolled voiceprint to store.
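
If you want to supply `self_vad` yourself rather than rely on the server-side computation, the loudness gate can be approximated with per-window RMS. The window length and silence floor below are guesses for illustration, not the backend's values, and the sketch assumes both tracks are sample-aligned sequences of PCM samples.

```python
import math


def rms(samples):
    """Root-mean-square level of a sequence of PCM samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def self_vad(mic, system, rate=16000, window_s=0.5, floor=200.0):
    """Return [{"start","end"}] windows (seconds) where the mic is active
    AND louder than the system track."""
    win = max(1, int(rate * window_s))
    n = min(len(mic), len(system))
    spans = []
    for i in range(0, n, win):
        mic_level = rms(mic[i:i + win])
        sys_level = rms(system[i:i + win])
        if mic_level > floor and mic_level > sys_level:
            start, end = i / rate, min(i + win, n) / rate
            if spans and spans[-1]["end"] == start:
                spans[-1]["end"] = end  # extend a contiguous active span
            else:
                spans.append({"start": start, "end": end})
    return spans
```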
**Validated on a real misattributing call:** dual-channel fixed both mono-mix
misattributions (a remote "Go Bitcoin" no longer credited to the user; a local "There"
recovered from `Unknown`), and **correctly split overlapping speech** — two people saying
"Hello" at once that the coarse ground truth itself had conflated.
> **One known limit:** if *loud* remote bleed masks a *quiet* local word, the mic-track ASR
> can miss it entirely (we can't attribute a word that was never transcribed). A cleaner mic
> (headphones, so there's no speaker bleed) avoids it; channel-subtraction echo-cancellation
> is a possible future enhancement since the tracks are sample-aligned.
---
## 5. Anything else
- **No OpenAPI/Swagger yet.** This doc + the curl examples are the contract.
- **Health / discovery:**
- `GET /api/status` — per-service health (`parakeet`, etc.).
- `GET /api/endpoints` — service-discovery JSON (base URLs + ready flags).
- `GET /v1/models` — lists the STT model + diarizer.
- **Errors:** JSON body, conventional status codes — `400` malformed, `413` too
large, `503` + `Retry-After` if the backend briefly wedges (retry after the
interval; transcription auto-recovers). Most error bodies are `{"detail": "..."}`.
- **Long calls:** chunk into ~23 min pieces, send **sequentially**, diarize each
with `/api/audio/diarize-chunk`, and stitch speakers across chunks via the
fingerprints. (The operator's other apps use exactly this pattern.)
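
A sketch of planning chunk boundaries for a long recording. The chunk length is left as a required argument rather than hard-coded; pick a value that stays within the upload and timeout limits in §1. The function name is illustrative.

```python
def plan_chunks(duration_s: float, chunk_s: float) -> list:
    """Split [0, duration_s) into consecutive (start, end) windows in seconds.

    Each window would be cut from the recording, sent on its own
    (sequentially), and its chunk-local speakers stitched to earlier
    chunks via the returned fingerprints.
    """
    if chunk_s <= 0:
        raise ValueError("chunk_s must be positive")
    chunks = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        chunks.append((start, end))
        start = end
    return chunks
```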
---
*Backend: NVIDIA Parakeet TDT 0.6B (STT) + Sortformer 4spk-v1 (diarization) +
TitaNet (voice fingerprints) on DGX Spark, fronted by Spark Control. All on the
operator's LAN — nothing leaves the box.*