v0.9.0:2 - audio proxy: turn Parakeet wedge 500 into clean 503 + immediate auto-restart

Parakeet's recurring CUDA wedge (CUBLAS_STATUS_*_ERROR mid-attention)
fires reliably on Open WebUI's WebM/Opus->MP3 audio. Previously the
proxy relayed the upstream 500 verbatim, Open WebUI showed "Server
connection error" with no signal to retry, and recovery took up to
5 minutes (waiting for the next periodic deep-health probe).

Now the proxy:
  1. Detects 500 from /v1/audio/transcriptions
  2. Fires deep_health.run_one("parakeet") as a background asyncio task
     (which contains the same wedge-detect + rate-limited auto-restart
     logic, but runs immediately instead of waiting for the next tick)
  3. Returns 503 with a clear detail message and Retry-After: 60

The client (Open WebUI, Home Assistant, etc.) gets a proper retry
signal; the auto-restart triggers inside seconds; the next attempt
~60s later succeeds. Rate-limiting (3 restarts per 30 min) is
inherited from the deep-health module so this can't cause restart
storms.

server.py: pass deep_health into build_audio_router().
audio_proxy.py: new 503-with-restart branch; signature now accepts
                deep_health as an optional dependency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Keysat
2026-05-17 18:07:35 -05:00
parent b37d7e998b
commit 197655a62b
3 changed files with 36 additions and 5 deletions
+31 -2
View File
@@ -8,10 +8,16 @@ Endpoints exposed on spark-control's port (same as the dashboard):
Both downstream services already speak HTTP on the LAN; this module just adapts
request/response shapes so OpenAI clients don't need a custom integration.
When Parakeet returns a 500 (commonly the recurring CUDA wedge), the proxy
returns a clearer 503 with Retry-After=60, and fires the deep-health probe in
the background — which detects the wedge and triggers a rate-limited container
restart inside seconds. The client's next attempt ~60s later then succeeds.
"""
from __future__ import annotations
import asyncio
import logging
from typing import Optional
from typing import Any, Optional
import httpx
from fastapi import APIRouter, Form, HTTPException, Request, UploadFile, File
@@ -56,7 +62,13 @@ class SpeechRequest(BaseModel):
encoding: Optional[str] = "LINEAR_PCM"
def build_router(settings: Settings) -> APIRouter:
def build_router(settings: Settings, deep_health: Any = None) -> APIRouter:
"""Build the audio proxy router.
If `deep_health` is provided, 500s from Parakeet trigger an immediate
background probe (which contains the same wedge-detect → auto-restart
logic as the 5-minute periodic loop, but fires now instead of waiting).
"""
router = APIRouter()
def _parakeet_base() -> str:
@@ -176,6 +188,23 @@ def build_router(settings: Settings) -> APIRouter:
except httpx.HTTPError as e:
raise HTTPException(502, f"parakeet unreachable: {e}")
if r.status_code == 500:
# Parakeet 500s are almost always the CUDA wedge (CUBLAS_*_ERROR
# mid-attention). Kick deep-health to detect+restart in the
# background, and return a clean retry signal to the client.
err_snippet = r.text[:400]
logger.warning("parakeet 500 — firing deep-health probe in background. detail=%s", err_snippet)
if deep_health is not None:
try:
asyncio.create_task(deep_health.run_one("parakeet"))
except Exception as e:
logger.error("failed to schedule deep-health probe: %s", e)
raise HTTPException(
status_code=503,
detail="Parakeet returned a transient error (likely CUDA wedge). Auto-restart triggered; retry in ~60s.",
headers={"Retry-After": "60"},
)
if r.status_code != 200:
raise HTTPException(r.status_code, r.text[:500])
return Response(content=r.content, media_type=r.headers.get("content-type", "application/json"))