v0.13.0:3 - proxy /v1/chat/completions through Spark Control to vLLM
Recap Relay dev caught that all audio endpoints route through Spark
Control but chat-completions didn't — clients had to know about both
SC AND the direct vLLM URL on Spark 1. Closes that last gap.
New endpoints:
POST /v1/chat/completions — OpenAI-shape, forwards to vLLM on Spark 1
POST /v1/completions — legacy OpenAI completions, same path
Implementation (image/app/llm_proxy.py):
- Dumb forwarder: request body passed through verbatim, response body
streamed back chunk-by-chunk. No transformation. vLLM already speaks
the same shape; adding any logic here would just create skew.
- Streaming: parses body for `stream: true` and uses httpx.AsyncClient
.stream() + FastAPI StreamingResponse if so. Non-streaming path is
a simple post-and-return.
- 30-minute timeout to accommodate large-context completions (default
httpx 5s would kill anything substantial).
- On upstream non-200 in streaming mode: emits one SSE `error` event
so the client's parser doesn't hang on an empty stream forever.
- On upstream connection error: HTTP 502 with "vllm unreachable" detail.
Now clients can use ONE host for everything:
POST https://spark-control/api/audio/diarize-chunk
POST https://spark-control/v1/audio/transcriptions
POST https://spark-control/v1/chat/completions
GET https://spark-control/api/endpoints (still works for clients that
prefer the direct URLs)
No parakeet container changes. No Reapply patches needed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -1,10 +1,10 @@
|
||||
import { VersionInfo, IMPOSSIBLE } from '@start9labs/start-sdk'
|
||||
|
||||
export const v0_1_0 = VersionInfo.of({
|
||||
version: '0.13.0:2',
|
||||
version: '0.13.0:3',
|
||||
releaseNotes: {
|
||||
en_US:
|
||||
'v0.13.0:2 — per-segment confidence in diarize-chunk. Sortformer outputs per-frame per-speaker sigmoid probabilities (~12.6 fps) that we previously discarded. Now: for each diarization segment, compute mean probability of the assigned speaker across the segment\'s frames → confidence in [0, 1]. Recap Relay (and other consumers) can threshold this to render uncertain segments as "Speaker_0?" with a question mark, or to skip them entirely. Endpoint shape is otherwise unchanged — segments[].confidence is a new field, value may be None on derivation failure. Click Reapply patches on the Speech Models card after install to pick up the updated diarizer.py + main.py.',
|
||||
'v0.13.0:3 — chat-completions proxy. Adds POST /v1/chat/completions (and /v1/completions for the legacy endpoint) to Spark Control that forwards to whichever vLLM is currently loaded on Spark 1. Supports SSE streaming when stream=true in the request body. Request body is passed through unchanged — any vLLM-supported field works (model, messages, max_tokens, temperature, response_format, chat_template_kwargs, tools, ...). Closes the last gap that forced clients to know about both Spark Control AND the direct vLLM URL — recap-relay and friends can now use one trusted host for everything (transcribe, diarize, analyze) with one cert and one allowlist. 30-min request timeout to accommodate large-context completions. No parakeet container changes; no Reapply patches needed.',
|
||||
},
|
||||
migrations: {
|
||||
up: async ({ effects }) => {},
|
||||
|
||||
Reference in New Issue
Block a user