// Post-cluster polish pass: after transcribe + diarize + clustering
// have produced a speaker-labeled transcript, AND after pipelined
// analyze has produced section objects (titles + summaries), run a
// two-stage LLM pass that:
//
//   Stage 1 — Global name inference. One LLM call with the
//     speaker-labeled transcript + episode metadata (channel name,
//     title, description) → JSON map { Speaker_A: "Matt Hill",
//     Speaker_B: "Sarah Jones", Speaker_C: null }. The "_C: null"
//     case is essential: when the LLM can't confidently identify a
//     speaker, it must return null instead of guessing.
//
//   Stage 2 — Per-window summary polish. N parallel LLM calls, one
//     per analyze window. Each call sees that window's sections
//     (original summaries) + that window's transcript with speaker
//     labels + the global name map from Stage 1, and rewrites each
//     section's SUMMARY to attribute statements to specific
//     speakers ("Matt Hill explains..." vs "the discussion
//     centers..."). Section TITLES and start/end indices are kept
//     unchanged — polish only touches summary text.
//
// Why two stages: name inference benefits from the FULL transcript
// view (name introductions like "welcome Matt" tend to appear in
// window 1 but Matt keeps speaking throughout); per-window polish
// benefits from parallelism (matches the existing analyze pattern).
// Running them as one batched call would either lose parallelism
// or send the full transcript N times.
//
// Failure modes:
//   - Stage 1 returns invalid JSON → all names default to null;
//     Stage 2 still runs and produces "Speaker A explains..." etc.
//   - Stage 2 fails for a particular window → keep the original
//     analyze summary for that window's sections. Per-window
//     failure shouldn't kill the whole polish.
//   - Both stages fail → fall back to the unpolished analyzeResult.
//     The caller sees the same output as a polish-disabled run.
//
// Cost: Stage 1 ~5-10s; Stage 2 ~10-15s (parallel); total ~15-25s
// added to end of pipeline. On a 200s pipelined pipeline that's a
// ~10% slowdown for the speaker-attribution UX win.

import { recordCall } from "./audit-log.js";

const STAGE_1_MAX_ATTEMPTS = 3;
const STAGE_2_MAX_ATTEMPTS = 3;

// ─── Default prompts (operator-editable via Settings tab) ───────────
//
// Same three-layer override pattern as the analyze + transcribe
// prompts: per-session operator override → operator-promoted default
// → these hardcoded defaults. Both are validated on save —
// `DEFAULT_NAME_INFERENCE_PROMPT_TEMPLATE` must contain {{transcript}}
// and JSON output instructions; `DEFAULT_SUMMARY_POLISH_PROMPT_TEMPLATE`
// must contain {{sections}} and JSON output instructions. Template
// variables (interpolated at request time):
//
// Name inference prompt:
//   {{channel}}       — operator-supplied or yt-dlp-extracted channel name
//   {{title}}         — episode/video title
//   {{description}}   — episode description (capped at 800 chars)
//   {{speakerStats}}  — pre-formatted block listing each speaker's
//                       chip letter, total speaking time, turn count
//   {{transcript}}    — speaker-labeled bracketed transcript, capped
//                       at 25k chars (middle truncated when over)
//   {{speakerKeys}}   — JSON-schema-friendly key list for the
//                       response shape (one line per Speaker_X)
//
// Summary polish prompt:
//   {{speakerRoster}} — pre-formatted block listing each speaker
//                       with their inferred name (or "(unknown)")
//                       and stats
//   {{transcript}}    — this window's slice of the labeled transcript
//   {{sections}}      — pre-formatted block listing each section
//                       with title + original summary + time range

export const DEFAULT_NAME_INFERENCE_PROMPT_TEMPLATE = `You are identifying real-world speaker names in an interview/podcast/meeting transcript. The transcript below has been pre-tagged with speaker labels like [A], [B], [C] — these are anonymous labels assigned by voice clustering. Your job: infer the real names of each speaker from contextual clues in the transcript.

EPISODE METADATA:
- Channel/show: {{channel}}
- Episode title: {{title}}
- Description: {{description}}

{{operatorContext}}SPEAKER STATISTICS (cluster output):
{{speakerStats}}

TRANSCRIPT (each line is "[ ] text"):
{{transcript}}

INSTRUCTIONS:
1. For each Speaker_X in the speaker statistics, infer the real name from contextual clues:
   - Direct introductions ("welcome Matt", "I'm joined by Sarah")
   - Self-introductions ("my name is", "I'm Sarah, founder of...")
   - References between speakers ("what do you think Matt?", "as Sarah was saying")
   - Channel name or episode title hints
   - Operator hints in the OPERATOR HINTS section above, IF PRESENT — but see rule 6 for how to weight those.
2. Use the speaker statistics to help — the host typically takes more turns; guests take fewer.
3. Use first + last name if confidently identifiable. Use first name only if that's all you have.
4. RETURN null IF YOU CANNOT CONFIDENTLY IDENTIFY THE SPEAKER. Do not guess. A null is better than a wrong name.
5. For brief speakers (under 30s of speaking time, e.g. an intro music VO or a passing comment) it's expected that you'll often return null.
6. WEIGHTING OPERATOR HINTS: When an OPERATOR HINTS section appears above, treat it as informed suggestion, NOT authoritative truth. The operator may have listed people who turned out not to speak, omitted people who did, or guessed wrong on who matches which voice. ALWAYS verify hints against the transcript. Specifically:
   - A name in the hints is only a candidate; if the transcript provides no signal that THIS Speaker_X is that person, return null instead of guessing.
   - If the transcript clearly identifies a speaker as someone NOT in the hints, use the transcript's name.
   - If the hints describe what each named person did ("Steve gave the update, John asked questions"), use that as a soft signal for mapping names to chip letters, but still verify with the transcript before committing.
   - It is better to leave a speaker as null than to confidently map a hint to the wrong chip letter.

Respond with ONLY valid JSON in this exact format, no other text:
{
  "speakers": {
    {{speakerKeys}}
  }
}`;

export const DEFAULT_SUMMARY_POLISH_PROMPT_TEMPLATE = `You wrote section summaries for a podcast/interview transcript window. We've now identified the speakers via voice clustering and (where possible) inferred their real names. Your job: rewrite each section's SUMMARY to attribute statements to specific speakers where it improves clarity, naturalness, and information density.

SPEAKERS (from voice clustering across the full episode):
{{speakerRoster}}

WINDOW TRANSCRIPT (this window's slice; each line is labeled with the speaker's name, or a chip letter when their name is unknown):
{{transcript}}

ORIGINAL SECTIONS IN THIS WINDOW (re-write the summary of each):
{{sections}}

INSTRUCTIONS:
1. The SPEAKERS roster and the WINDOW TRANSCRIPT are the AUTHORITATIVE source of who said what. The ORIGINAL summaries were written in an earlier pass and may attribute statements to OUTDATED or WRONG speaker names — your job includes CORRECTING those.
2. Rewrite each section's SUMMARY so every speaker attribution matches the transcript + roster. If an original summary credits one person with a statement that, per the transcript, someone else actually made, REPLACE the name with the correct one. Never keep a name that does not appear in the roster.
3. Use real names when available ("Matt Hill explains..."); fall back to a chip letter only for a speaker who has no name ("Speaker A explains...").
4. Keep summaries 1-3 sentences — same length range as the original.
5. KEEP THE TITLE EXACTLY AS GIVEN. Do not rewrite titles.
6. Return the sections in the SAME ORDER as given, with the SAME INDEX numbers in the array.
7. If a section is primarily one speaker, lean into their name ("Matt explains..."). If it is back-and-forth, name both ("Matt and Sarah debate...").
8. If the transcript for a section genuinely has no speaker signal, keep the original summary's wording — but still fix or drop any name in it that conflicts with the roster. DO NOT invent attribution that the transcript does not support.

Respond with ONLY valid JSON in this exact format:
{
  "sections": [
    { "index": 0, "summary": "Polished summary text..." },
    { "index": 1, "summary": "..." }
  ]
}

Return only the sections in this window. Use the same indices as the input ([0], [1], ...). Only the summary field — title and indices stay as given.`;

// Substitute {{key}} placeholders in a template. Unknown keys are
// left as the literal {{key}} so an operator's edit that drops a
// variable doesn't crash the run — the model just sees the placeholder.
function fillTemplate(template, vars) {
  // Tolerate a missing vars object, and match own properties only so
  // a placeholder such as {{constructor}} or {{toString}} is not
  // filled from Object.prototype.
  const values = vars || {};
  return String(template || "").replace(/\{\{\s*(\w+)\s*\}\}/g, (_match, key) => {
    return Object.prototype.hasOwnProperty.call(values, key)
      ? String(values[key])
      : `{{${key}}}`;
  });
}

// Build a transcript representation with speaker labels prefixed.
// Each line: `[A 0:08] So Matt, tell us how you got started…`
//   - The bracketed prefix is `[