Webhook DLQ — list failed deliveries and manually retry

Closes the silent-loss hole in outbound webhook delivery. The worker
in src/webhooks.rs retries failed deliveries with exponential backoff
up to 10 attempts, then sets next_attempt_at = NULL and walks away.
Before this commit, those "dead-lettered" rows sat in
webhook_deliveries indefinitely, with no surface for the operator to
discover, inspect, or recover them. A subscriber that was down for
>6h during a license-issuance burst would silently lose those events.

What's new:

- repo::DeliveryStatusFilter — enum with parse() so query strings
  map cleanly to SQL predicates.
- repo::list_deliveries — endpoint_id + status + limit, newest first.
- repo::requeue_delivery — resets attempt_count=0, clears delivered_at
  and last_error, sets next_attempt_at=now. The worker picks it up on
  the next 5s tick.

- src/api/webhook_deliveries.rs — admin module with two handlers:
  - GET /v1/admin/webhook-deliveries?endpoint_id=…&status=…&limit=…
  - POST /v1/admin/webhook-deliveries/:id/retry  (audit-logged as
    webhook_delivery.retry; 404 on missing id)
- Routes registered in src/api/mod.rs alongside the existing
  webhook_endpoints CRUD.

- tests/api.rs gains webhook_dlq_lists_failed_and_retry_requeues:
  seeds three deliveries directly via SQL (one each: delivered,
  pending, dead-lettered), exercises the list filter, runs the retry,
  and asserts that the row moves from failed to pending, that an
  audit row is written, that a bad id returns 404, and that a bad
  status filter returns 400.
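
For illustration only, the filter enum described above could look
roughly like the sketch below. Only the `All` variant and an
Option-returning `parse()` are visible in this diff; the other variant
names, the case-sensitive matching, the `sql_predicate` helper, and the
exact predicates are assumptions, inferred from the
"next_attempt_at = NULL means dead-lettered" convention described
in this message.

```rust
/// Hypothetical stand-in for `repo::DeliveryStatusFilter`; the real
/// definition lives in the repo module and is not shown in this diff.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DeliveryStatusFilter {
    Pending,
    Delivered,
    Failed,
    All,
}

impl DeliveryStatusFilter {
    /// Maps a query-string value to a filter. Returns `None` for
    /// anything unrecognised, which the handler turns into a 400.
    pub fn parse(s: &str) -> Option<Self> {
        match s {
            "pending" => Some(Self::Pending),
            "delivered" => Some(Self::Delivered),
            "failed" => Some(Self::Failed),
            "all" => Some(Self::All),
            _ => None,
        }
    }

    /// Assumed SQL predicate per variant (hypothetical helper; the
    /// column semantics are inferred, not copied from the repo code).
    pub fn sql_predicate(self) -> &'static str {
        match self {
            // Not yet delivered, and the worker still has it scheduled.
            Self::Pending => "delivered_at IS NULL AND next_attempt_at IS NOT NULL",
            Self::Delivered => "delivered_at IS NOT NULL",
            // Dead-lettered: never delivered and no retry scheduled.
            Self::Failed => "delivered_at IS NULL AND next_attempt_at IS NULL",
            Self::All => "1 = 1",
        }
    }
}
```

Keeping the predicates as static strings means no user input is ever
spliced into the SQL; the query string only selects a variant.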

Worker code is unchanged. The DLQ is operator-actionable infrastructure
on top of the existing retry semantics.
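
To make the row-state transition concrete, here is a minimal in-memory
sketch of the requeue semantics listed above. The `Delivery` struct,
the `Status` enum, and both methods are hypothetical; the real change
is applied by repo::requeue_delivery against the database, and the
status rules are inferred from the column conventions in this message.

```rust
use std::time::SystemTime;

/// Hypothetical in-memory mirror of the relevant webhook_deliveries
/// columns; the real table has more fields.
#[derive(Debug, Clone, PartialEq)]
pub struct Delivery {
    pub attempt_count: u32,
    pub delivered_at: Option<SystemTime>,
    pub last_error: Option<String>,
    pub next_attempt_at: Option<SystemTime>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Status {
    Delivered,
    Pending,
    Failed,
}

impl Delivery {
    /// Assumed classification: delivered if delivered_at is set,
    /// pending if a retry is scheduled, otherwise dead-lettered.
    pub fn status(&self) -> Status {
        match (self.delivered_at, self.next_attempt_at) {
            (Some(_), _) => Status::Delivered,
            (None, Some(_)) => Status::Pending,
            (None, None) => Status::Failed,
        }
    }

    /// Mirrors the described requeue: reset the attempt counter,
    /// clear delivered_at and last_error, schedule for `now`.
    pub fn requeue(&mut self, now: SystemTime) {
        self.attempt_count = 0;
        self.delivered_at = None;
        self.last_error = None;
        self.next_attempt_at = Some(now);
    }
}
```

Under these assumptions a requeued row is indistinguishable from a
freshly enqueued one, which is why the worker needs no changes.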

Test count: 23 (9 unit + 4 migration + 10 API), up from 22.
Grant
2026-05-08 09:38:58 -05:00
parent e2b296ce29
commit f9ef1a854c
4 changed files with 387 additions and 0 deletions
@@ -70,6 +70,7 @@ pub mod session_layer;
pub mod tier;
pub mod validate;
pub mod webhook;
pub mod webhook_deliveries;
pub mod webhook_endpoints;
use crate::btcpay::client::BtcpayClient;
@@ -304,6 +305,16 @@ pub fn router(state: AppState) -> Router {
            "/v1/admin/webhook-endpoints/:id",
            axum::routing::delete(webhook_endpoints::delete),
        )
        // Webhook delivery history (the dead-letter inspection +
        // manual-retry surface; see webhook_deliveries.rs for why).
        .route(
            "/v1/admin/webhook-deliveries",
            get(webhook_deliveries::list),
        )
        .route(
            "/v1/admin/webhook-deliveries/:id/retry",
            post(webhook_deliveries::retry),
        )
        // Discount / referral codes.
        .route(
            "/v1/admin/discount-codes",
@@ -0,0 +1,104 @@
//! Admin views over the outbound webhook delivery queue.
//!
//! Companion to `webhook_endpoints.rs`: that module manages the
//! configured subscriber URLs; this one exposes the row-level history
//! of attempts (success, in-flight retries, dead-lettered failures)
//! and lets operators manually re-queue a dead delivery for another
//! pass through the worker.
//!
//! Why this exists: the worker in `crate::webhooks` retries failed
//! deliveries with exponential backoff up to 10 attempts, then sets
//! `next_attempt_at = NULL` and walks away. Before this module, those
//! "dead-lettered" rows were invisible: operators had no surface to
//! discover, inspect, or recover them. A subscriber endpoint that was
//! down for >6h during a license-issuance burst would silently lose
//! those events.

use crate::api::admin::{request_context, require_admin};
use crate::api::AppState;
use crate::db::repo::{self, DeliveryStatusFilter};
use crate::error::{AppError, AppResult};
use axum::{
    extract::{Path, Query, State},
    http::HeaderMap,
    Json,
};
use serde::Deserialize;
use serde_json::{json, Value};

const DEFAULT_LIMIT: i64 = 100;
const MAX_LIMIT: i64 = 500;

#[derive(Debug, Deserialize)]
pub struct ListDeliveriesQuery {
    /// Filter by configured endpoint id. Omit for all endpoints.
    pub endpoint_id: Option<String>,
    /// One of `pending` | `delivered` | `failed` | `all`. Defaults to
    /// `all`. The `failed` filter is the dead-letter queue: rows
    /// where the worker exhausted retries.
    pub status: Option<String>,
    /// Cap on rows returned. Defaults to 100; max 500.
    pub limit: Option<i64>,
}

pub async fn list(
    State(state): State<AppState>,
    headers: HeaderMap,
    Query(q): Query<ListDeliveriesQuery>,
) -> AppResult<Json<Value>> {
    require_admin(&state, &headers)?;
    let status = match q.status.as_deref() {
        Some(s) => DeliveryStatusFilter::parse(s).ok_or_else(|| {
            AppError::BadRequest(format!(
                "invalid status filter '{s}'; expected pending|delivered|failed|all"
            ))
        })?,
        None => DeliveryStatusFilter::All,
    };
    let limit = q
        .limit
        .unwrap_or(DEFAULT_LIMIT)
        .clamp(1, MAX_LIMIT);
    let rows = repo::list_deliveries(
        &state.db,
        q.endpoint_id.as_deref(),
        status,
        limit,
    )
    .await?;
    Ok(Json(json!({ "deliveries": rows })))
}

/// Manual re-queue for a dead-lettered (or otherwise stuck)
/// delivery. The worker will pick it up on the next 5s tick.
///
/// 404 if the delivery id doesn't exist; 200 on success with the
/// updated row in the body so the SPA can re-render the list with
/// the new state immediately.
pub async fn retry(
    State(state): State<AppState>,
    headers: HeaderMap,
    Path(id): Path<String>,
) -> AppResult<Json<Value>> {
    let actor_hash = require_admin(&state, &headers)?;
    let (ip, ua) = request_context(&headers);
    let delivery = repo::requeue_delivery(&state.db, &id)
        .await?
        .ok_or_else(|| AppError::NotFound(format!("webhook delivery '{id}'")))?;
    let _ = repo::insert_audit(
        &state.db,
        "admin_api_key",
        Some(&actor_hash),
        "webhook_delivery.retry",
        Some("webhook_delivery"),
        Some(&id),
        ip.as_deref(),
        ua.as_deref(),
        &json!({
            "endpoint_id": delivery.endpoint_id,
            "event_type": delivery.event_type,
        }),
    )
    .await;
    Ok(Json(json!(delivery)))
}
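
As a note on the `limit` handling in the `list` handler above, the
standalone sketch below isolates the default-then-clamp step so its
edge cases are easy to see. `effective_limit` is a hypothetical helper
name introduced for illustration; the commit performs the same
expression inline.

```rust
const DEFAULT_LIMIT: i64 = 100;
const MAX_LIMIT: i64 = 500;

/// Hypothetical helper: applies the default when the query omits
/// `limit`, then clamps the value into 1..=MAX_LIMIT.
fn effective_limit(requested: Option<i64>) -> i64 {
    requested.unwrap_or(DEFAULT_LIMIT).clamp(1, MAX_LIMIT)
}
```

Clamping rather than rejecting means a zero, negative, or oversized
`limit` never produces a 400; the request is quietly served with the
nearest allowed value.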