Compare commits

..

22 Commits

Author SHA1 Message Date
Grant Gilliam b4bf88ed2c Chunk overlap + overlap-aware stitching
Chunks were contiguous (start = prev end) with a naïve offset-concat stitch — no
overlap. That cut sentences at boundaries, denied the diarizer context at edges, and
let one voice split across chunks (the MH/Unknown_0 problem). Now each ~150s body is
sliced with a 15s margin on both sides ([bodyStart-15, bodyEnd+15]); the stitcher
keeps a segment only in the chunk that owns its MIDPOINT (body region) and drops it
from the neighbour's margin — so boundary-spanning speech is seen whole by the
backend and kept exactly once.

- SessionPackager.PlannedChunk gains bodyStart/bodyEnd; planChunks adds overlapSeconds.
- TranscriptAssembler.ChunkResult carries body bounds (defaults keep-all → no-overlap
  behaviour preserved for existing callers); assemble dedups by midpoint-in-body.
- TranscriptPipeline passes body bounds through.

Complements (doesn't replace) the fragment-smoothing + reconciliation safety nets;
this is the upstream fix. ~+20% backend audio per interior chunk. 63/63 XCTest
(new: overlap window layout + boundary-segment dedup).
2026-06-08 13:03:56 -05:00
Grant Gilliam a25ef16a2a Fix mis-attributed fragments + LLM naming guardrails + re-process saved sessions
Investigating Grant's real 38-min group call: 'Marty' was a GARBAGE cluster (192
segs, 0.37s mean, 186 ≤2 words, 125 single words flanked by the same other speaker —
diarization micro-fragments split mid-sentence, then LLM-named 'Marty'). Same for
'Message'/'HI'.

- SpeakerReconciler.smoothFragments: dissolve non-self clusters whose MEDIAN segment
  duration ≤ 1s (≥3 segs) — reassign each fragment to the temporally-nearest real
  speaker. (Median, not max, so one stray long segment can't rescue a fragment
  cluster — the bug in the first cut.) On the real call: 7 speakers (3 junk) → 4 real
  (Marty/Message/HI absorbed into Grant/Jonathan/Me/MH). Runs before LLM naming.
- LLM naming guardrails: forbid assigning the self name or ANY already-taken name to
  another voice (fixes 'Grant' = the user's name pinned on a remote speaker); prompt
  demands self-intro / direct-address evidence (mention ≠ presence), 'precision over
  coverage', one name per speaker.
- Open saved session now offers Open Editor vs Re-process, so newer logic can be
  applied to past calls (+ always-visible progress from the prior fix).

NOTE: the self-name guardrail needs the app to KNOW the user's name — selfName is still
'Me', so set it in Settings (e.g. 'Grant') so the LLM can't reuse it. 62/62 XCTest.
2026-06-08 12:45:17 -05:00
Grant Gilliam c8e7a27150 Open saved session: visible progress + clear errors (no silent no-op)
The status line only rendered inside the last-in-memory-session block, so 'Open
saved session' processed invisibly — looked like nothing happened. Now: the
transcript status (with a spinner) is always shown, the processing(0,0) reconcile
phase reads 'Working… (this can take a few minutes)', and invalid picks surface an
alert (not a recorded session / already processing / unreadable transcript) instead
of doing nothing.
2026-06-08 12:16:52 -05:00
Grant Gilliam 3e3a1dea2e Ring-based speaker attribution (fixes real large-tile detection)
Real Teams/Signal frames exposed a geometry bug: estimating a tile's SIZE from its
name width (×3) produces a tiny box on big real tiles, so the speaking border ring
fell entirely outside it → zero points → 'not speaking' (Joe Payne's clear blue
border went undetected). Pure nearest-name fails too (the top edge of a lower tile
is closer to the upper tile's bottom-anchored name).

Fix: cluster the highlight pixels into connected RINGS (GridCallAnalyzer.connectedComponents,
spatial-hashed union-find), then attribute each ring to the OCR'd name inside its
bounding box. The ring *is* the tile, so detection is independent of tile-size
estimation, and multiple simultaneous borders (lag/persist/crosstalk) become separate
rings naturally — exactly the multi-ring case Grant flagged. minRingSpan rejects specks.

Validated on real frames: Teams now detects 'Joe Payne' (was empty); Signal detects
'JA' in the group grid. (Signal _002 has a border but no rendered name that frame —
inherent Signal intermittency; voice + reconciliation cover it.) 59/59 synthetic
XCTest still green (white + coloured, single + crosstalk).
2026-06-08 12:03:36 -05:00
Grant Gilliam 1e4ab30ab7 Speaker reconciliation + open/re-process any saved session
Reconciliation (the marry-the-signals layer): after transcription, before the recap,
SpeakerReconciler (1) MERGES non-self clusters whose voiceprints are highly similar
(cosine >= 0.82) — fixes a person split across chunks (the real 1-on-1 failure: one
remote came back as 'MH' + 'Unknown_0'); and (2) NAMES remaining non-self clusters
from transcript CONTENT via the gateway LLM (people addressed by name / self-intros),
conservative + confidence-gated, keeping the placeholder when unrevealed. The
mic-channel self is protected and never reassigned. Voice does the segmentation; the
fingerprint-merge fixes splits; the LLM adds the content signal visual/voiceprint lack.

- SpeakerReconciler: pure cosine merge (tested) + LLM content-naming pass; rewrites
  speakers.json before recap. SessionController.finishBackend shares one model lookup
  for reconcile + recap. Gated by settings.reconcileSpeakers (default on).
- Open saved session: menu 'Open saved session…' → folder picker. Edits it if already
  transcribed, else reconstructs inputs from disk (visual_timeline vision segs +
  channel self-spans) and runs transcribe → reconcile → recap, then opens the editor.
  Lets you evaluate/correct ANY past call, not just the in-memory last one.

Note (from real Signal data): visual naming is unreliable on Signal (sparse, misread
initials, lowercase/center names) — so reconciliation + the editor (which teaches
voiceprints on confirm) carry it; the editor remains the human arbiter. 59/59 XCTest.
2026-06-08 11:54:41 -05:00
Grant Gilliam 625c25cfc8 gitignore example-screenshots/ (personal call frames) 2026-06-08 11:01:34 -05:00
Grant Gilliam eb8a15b82e Configurable recap templates (categories per meeting type, in Settings)
Takeaways categories are no longer hardcoded — they're editable templates. A
template = the always-on TLDR + an ordered list of sections, each with a title, a
type (attributed items / bulleted list / paragraph), and an instruction (the prompt
text for that category). The analyzer assembles the LLM prompt FROM the template
and parses generically, so adding/removing/renaming a category needs zero code and
the output always renders.

- RecapTemplate / TemplateSection / SectionKind + TopicGranularity; built-in
  defaults (Internal Meeting, 1:1, Company/Sales Call), all editable.
- Generic extras: RecapExtras{tldr, primarySpeakers, sections:[RenderedSection]} +
  RecapItem{text,who,when,note} replaces the fixed MeetingExtras. Analyzer builds
  per-section sec_N fields + parses by kind; renderer + remap are generic.
- Topic granularity (coarse/auto/fine) answers 'should chunking be configurable' —
  it scales the target topic count; raw window sizes stay as tuned defaults.
- AppSettings persists templates + defaultTemplateId (seeded once). Settings gets a
  default-template picker + 'Manage…' → TemplatesView (CRUD, edit sections/
  instructions, set default, **Preview prompt** for full transparency).
- Recap editor gains a template picker; Regenerate uses the chosen template. Auto
  recap uses the default template.

54/54 XCTest (template prompt build, generic parse/remap/render updated).
2026-06-06 19:26:03 -05:00
Grant Gilliam 174fa91e48 Recap editor: Regenerate recap (re-run LLM on corrected transcript)
Adds a 'Regenerate recap' action so corrected speaker names flow into freshly
written summaries/extras (not just find-replaced). regenerate() commits the
corrections (rewrite speakers.json + reconcile voiceprints), re-runs RecapAnalyzer
on the corrected transcript via the gateway LLM, and rewrites recap.json +
transcript.md + recap.html. save() and regenerate() share commitCorrections();
both rebaseline the speaker set afterward so further edits map cleanly. Editor view
gains the button + progress spinner; RecapEditModel takes the gateway baseURL/skipTLS.

52/52 XCTest; builds clean.
2026-06-06 16:48:18 -05:00
Grant Gilliam c481c0103a Speaker corrections: rename / merge / reassign + voice learning
Native editor to fix speaker-ID errors after transcription (modeled on recap-relay's
correction UX): rename a speaker in the legend, merge two speakers, or reassign an
individual transcript line. Saving rewrites speakers.json, re-renders transcript.md +
recap.html, and updates the voiceprint memory — so a correction compounds: naming an
"Unknown" speaker teaches that voice for future calls.

- SpeakerEditing (pure, tested): replaceSpeaker (rename = merge-onto-existing),
  reassign, netNameMap (compose ops), and remap (apply a name map to a recap's
  structured fields + whole-word free text, so summaries/extras update without re-LLM).
- RecapEditModel (@MainActor): loads speakers.json (+ optional recap.json +
  cluster_fingerprints.json); on save writes the resolved speakers.json, re-renders,
  and reconciles voiceprints — merge keeps the survivor's print; rename/name-an-Unknown
  enrolls the cluster's fingerprint under the new name.
- TranscriptEditorView (SwiftUI) + EditorWindow (AppKit window for the LSUIElement app);
  menu gains "Edit speakers".
- Pipeline now persists cluster_fingerprints.json (every cluster incl. Unknown) and
  recap.json (RecapFile) so the editor can learn voices + re-render offline.
- RecapModels made Codable; TranscriptAssembler exposes allFingerprints;
  VoiceprintStore gains enroll() + merge().

52/52 XCTest (6 new, incl. a full rename→artifacts→voiceprint round-trip on disk).
2026-06-06 15:12:23 -05:00
Grant Gilliam 9240718e5e Recap: readable transcript + topic sections + meeting extras (gateway LLM)
New 'Recap' phase — turns speakers.json into a human-readable recap, leveraging
recap-relay's proven logic/prompts but calling the Spark gateway's OpenAI-compatible
/v1/chat/completions directly (same host/TLS as label-merge; Qwen3-35B). We start
from already-named speakers (label-merge), so recap-relay's speaker clustering +
name-inference are skipped entirely.

- GatewayLLMClient: /v1/chat/completions (JSON mode), model discovery via
  /api/endpoints, TLS-skip reuse, 503 retry, sequential.
- RecapAnalyzer: speakers.json → numbered [N] (MM:SS) Name: text transcript →
  time-windowed analyze (single window for short calls, 18min/2min overlap for long)
  → stitch/dedup topic sections → meeting extras (TLDR/decisions/action_items/
  open_questions/key_quotes). Defensive JSON parsing of LLM output.
- RecapRenderer: writes transcript.md + a self-contained dark-theme recap.html
  (topic sections w/ collapsible transcripts, extras panels, speaker color chips,
  full timestamped speaker-attributed transcript, print styles).
- SessionController.buildRecap: best-effort after speakers.json (gated by
  settings.recapEnabled); surfaces recapURL → menu 'Open recap'. Skips silently if
  the gateway has no LLM. Settings toggle added.

Validated END-TO-END on the real Meet session against the live gateway: dual-channel
transcription → 3 topic sections + accurate TLDR + key quotes; 'Go Bitcoin'
correctly attributed to the remote speaker. 46/46 XCTest (10 new).
2026-06-06 14:36:18 -05:00
Grant Gilliam cecc853dd8 Client: dual-channel label-merge (mic_file + system_file)
The backend shipped dual-channel mode; wire the client to it. We already capture
mic (you) and system (others) separately, so send them as two files instead of the
mono mix — fixing the misattribution at the source.

- SparkControlClient: labelMergeDual(mic_file, system_file, self_name, self_vad);
  multipart generalized to N files; shared POST/retry/decode extracted.
- SessionPackager.rebasedSelfVadData: chunk-local [{start,end}] for self_vad;
  sliceAudio reused for both tracks.
- TranscriptPipeline.process: dual-channel chunking (slice mic+system, rebase
  timeline + self_vad per chunk) when system audio is healthy; mono mixed-file
  fallback (self folded into the timeline) otherwise.
- VisualCapture.finish: write the full visual_timeline.json (remote + self merged)
  but return REMOTE (vision) segments only — self travels via the mic channel.
- TranscriptAssembler: rank mic_channel highest (the user's own track wins).
- VoiceprintStore: store the clean mic_channel self voiceprint.
- SessionController: pass mic/system URLs + remote timeline + channel self-spans +
  self_name + systemHealthy; self_vad.json now reflects the channel-verified spans.

Validated END-TO-END against the live backend on the real misattributing session:
'Go Bitcoin' (remote) is now attributed to Unknown_0, NOT the user; the user's own
lines come back source=mic_channel; per-channel ASR recovered fuller remote text.
36/36 XCTest (4 new: self_vad rebase, mic_channel ranking + voiceprint storage).
2026-06-06 13:15:29 -05:00
Grant Gilliam 71f0a3b74e Channel-verified self identity: the mic track is you
Grant's insight + proven on real session audio: we capture self (mic) and others
(system) as separate tracks, then throw the separation away by mixing to mono — so
the backend has to re-guess who's who. Analysis of a real call showed the channels
are cleanly separated (envelope corr 0.015, NO echo); Caitlyn's 'Go Bitcoin' was
11.8x louder in system than mic, yet the mono mix + noisy visual named it 'Grant'.

ChannelSelfVAD marks self-speech as windows where the mic is active AND louder than
system (mic > system x1.5). Benefits: (1) self is identified by CHANNEL, not by the
on-screen name — set one name in Settings, no per-platform matching; (2) a remote
speaker (or room echo) can never be mislabeled as self. Computed at finalize from
the two finished WAVs; the live capture path is untouched. Falls back to mic-VAD if
tracks can't be read. SessionController feeds these spans to the backend timeline.

Validated on the real session: 16 self spans; 'Go Bitcoin' (72-74s) correctly
EXCLUDED, Grant's 49.9-53.3s / 62.6-64s correctly INCLUDED. 33/33 XCTest (5 new).
2026-06-06 12:24:29 -05:00
Grant Gilliam e6955425dc Filter OCR to participant-name labels (kill visual-timeline noise)
Real Meet capture revealed the visual pipeline was treating ALL on-screen text as
participant names: meeting URL, clock, 'Add others' button, lobby 'Your meeting's
ready' dialog, 'Joined as …@gmail.com', etc. 46 of 52 'visual segments' in a real
session were phantom speakers. (The backend was unaffected — it diarizes from audio
and ignores names that match no voice cluster — but the visual_timeline.json and the
segment count were junk.)

GridCallAnalyzer.isLikelyName now gates OCR strings to things shaped like a name:
2–30 chars, 1–3 Title-Cased alphabetic words, no digits/URL/email/glyph punctuation.
Errs toward dropping (a missed name just loses a hint; audio diarization still runs).
Unit-tested against the EXACT 19 OCR strings from the real session: keeps the 5
real names, drops all 14 chrome strings. 28/28 XCTest.
2026-06-06 12:01:57 -05:00
Grant Gilliam dba306c900 Fix Signal (Electron) call detection: resolve mic-using helper to its app
Signal 1:1 (and group) calls didn't auto-record. Root cause confirmed on-device:
Signal is Electron and holds the mic in a HELPER process
(org.whispersystems.signal-desktop.helper.Renderer, a child of the main app).
detectViaMicAttribution only matched PIDs listed in NSWorkspace.runningApplications
against the main bundle ID, so the helper's mic use was never attributed to Signal.
(Zoom worked = single native process; Meet worked = browser resolved.)

Fix: iterate the mic-using PIDs and resolve each to its owning app by walking the
parent-process chain (sysctl KERN_PROC_PID → ppid) until an NSRunningApplication is
found. Helper PIDs that return nil directly now resolve to the main app. Validated
against the live Signal helpers: pids 2383/2372 → org.whispersystems.signal-desktop.
Superset of the old behavior, so Zoom/Meet detection is preserved (browser case now
also more robust); our own recording is still skipped (selfPID).
2026-06-06 11:50:58 -05:00
Grant Gilliam 11b50907e3 Per-platform colour-border sensitivity (Teams violet, Meet glow)
Cross-platform research (Grant) flagged that the colour-border cue differs by app;
checking the real brand colours against the detector found a concrete bug: the
global 0.5 saturation threshold MISSES Teams' violet ring (#6264A7 ≈ 0.41, light
variants ~0.27) entirely and Meet's lighter blue glow (#8ab4f8 ≈ 0.44). Those
adapters would have detected nothing.

- FrameSampler.saturatedPoints: add a tunable threshold + optional hue-band gate
  (degrees) so a lowered threshold doesn't pick up warm video.
- GridCallAnalyzer.Config: colorSaturation / colorMinBrightness / colorHueRange,
  plumbed to the colour-border path (defaults preserve prior behaviour).
- MeetAdapter sat→0.35 (catch the glow); TeamsAdapter sat→0.22 + hue 215–275°
  (catch the faint violet, reject other colours); ZoomAdapter sat 0.45 + hue
  40–150° (vivid green/yellow). Values are first-pass pending real-fixture
  calibration; the hue gate is the main calibration lever.

Tests: Teams now detects the faint violet ring and rejects a green one; Meet/Zoom
vivid cases still pass. 27/27 XCTest.
2026-06-06 10:51:12 -05:00
Grant Gilliam 9d789645b6 Surface whether visual capture ran on the last session
Visual capture falls back to audio-only silently, so the user couldn't tell if
it attached on a real call. SessionInfo now carries visualSegmentCount (nil =
audio-only; a count = visual ran, with that many vision-detected speaker
segments), shown in the menu as '… · N visual segments' or '… · audio-only'.
Makes the pending live-call validation unambiguous.
2026-06-06 10:21:44 -05:00
Grant Gilliam a52a296122 Wire visual capture into the recording lifecycle (failure-isolated)
Visual capture now runs alongside audio: on call start the session picks the
app's adapter, captures the call window on the SAME monotonic clock as the audio
(AudioRecorder.sharedT0Host), and on stop writes visual_timeline.json and hands
the backend the visual segments with mic-VAD self-spans merged. Any visual
failure (no adapter, no window, Screen Recording denied) leaves the session
recording audio-only — the proven path is never blocked or broken.

- CallDetector now emits DetectedCall{app, bundleID, windowID}: the exact
  CGWindowID of the matched Meet browser window (native apps → nil → largest).
- VisualCapture wraps VisualObserver + AdapterRegistry, writes visual_timeline.json.
- AudioRecorder.sharedT0Host() exposes the shared t0 for frame alignment.

Hardened per a 3-lens adversarial review (concurrency / failure-isolation /
data-flow), all 6 confirmed findings fixed:
- P0 (critical): startVisual could adopt a stale capture into a DIFFERENT session
  (cross-session SCStream leak + visual_timeline.json written to the wrong
  folder). Now gated on session identity — generation + recorder ===, still
  .recording — with fail-closed adoption; otherwise the stream is cancelled.
- P1: observer captured the browser's largest window, not the detected Meet
  window. Now targets the exact CGWindowID (pickWindowIndex, unit-tested),
  largest-area only as fallback.
- P2: a startVisual orphaned by a concurrent stop could leak a stream on quit.
  inFlightVisual is registered before the await and drained in prepareForTermination.
- P3: trailing visual gap/segment ends could exceed duration_sec. Clamped in
  VisualCapture (clampSegments/clampGaps, unit-tested).
- P4: capture pixel size used NSScreen.main scale; now uses the scale of the
  display actually hosting the window (OCR clarity on secondary displays).
- VisualObserver.stop() bounds stopCapture() with a 3s timeout (mirrors audio) so
  a wedged stream can't hang finalization.

25/25 XCTest pass. Live validation on real calls still pending.
2026-06-06 10:18:52 -05:00
Grant Gilliam dd3a5ce33d Adapters: add Meet, Zoom, Teams (coloured border) + adapter registry
Front-loads the remaining visual adapters per the Signal→Meet→Zoom priority.
All three reuse GridCallAnalyzer's coloured-border (saturated) detection path
and share the new bottom-left name anchor:

- GridCallAnalyzer: generalise nameAtBottom:Bool into a NameAnchor enum
  (.bottomCenter for Signal's centered footer, .bottomLeft for Meet/Zoom/Teams
  where the name hugs the tile's bottom-left corner, .center for completeness).
  tileRect estimates the tile up-and-right of a bottom-left name.
- MeetAdapter (Google-blue ring, browser-hosted), ZoomAdapter (green/yellow
  border), TeamsAdapter (violet ring): coloured-border on, white-border off,
  bottom-left names. Geometry constants are first-pass pending real fixtures.
- AdapterRegistry.adapter(for:) maps CallDetector.DetectedApp → AppAdapter so
  VisualObserver can be constructed when live visual capture is wired in;
  unmapped apps degrade to audio-only.

Synthetic 4-tile tests: Meet picks each blue-bordered speaker with no
adjacent-tile bleed, Zoom picks the green-bordered speaker, and Signal's
white-only detector correctly ignores a coloured border. 18/18 XCTest pass.
2026-06-06 09:57:53 -05:00
Grant Gilliam 06bb8233a1 Signal: detect the white speaking border (not a coloured one)
Signal's active-speaker cue is a 3px #ffffff rounded border (saturation ≈ 0),
which the saturation-based highlight detector could never see. Per the
Signal-Desktop source review:

- FrameSampler.thinWhitePoints: grid-sample near-white pixels that sit on a
  THIN structure (a non-white pixel within edgeGap on some axis) so a border/
  ring counts but a solid white blob (face, bright video) does not.
- GridCallAnalyzer: combine coloured (saturated) + white (thin) highlight
  pixels; exclude name-text regions so the white footer name can't be mistaken
  for the border; estimate the tile UP from the name footer (nameAtBottom);
  attribute each highlight pixel to exactly one tile by containment (nearest
  centre as tiebreak) so a border can't bleed into an adjacent tile.
- SignalAdapter: white border on, coloured off, name-at-bottom geometry.

Synthetic 4-tile harness now isolates each speaker with no adjacent-tile bleed;
all 15 XCTest cases pass. Real-screenshot geometry calibration still pending.
2026-06-06 09:52:10 -05:00
Grant Gilliam c46d7f81d8 Phases 2-6: detection, visual timeline, backend hand-off, voiceprints
Phase 2 (call detection): CallDetector using CoreAudio per-process mic
attribution (anarlog technique) — robust start+stop for Zoom/Teams/Signal/Meet,
ignoring our own recording; auto-record toggle. Built; pending live multi-app
confirmation by the user.

Phase 3 (visual timeline foundation): AppAdapter protocol + SpeakerObservation,
TimelineBuilder (hysteresis/overlap/self-merge/aliases), VisualTimeline (schema
1.1), TextRecognizer (Vision OCR), FrameSampler + GridCallAnalyzer (name OCR +
saturated-highlight active-speaker attribution), SignalAdapter, VisualObserver
(window capture; frames released, never saved; minimized->visual_gap, idle != gap).
Synthetic-frame tested; adapter geometry pending real Signal fixtures + live
VisualObserver validation.

Phase 5 (backend hand-off): SparkControlClient (multipart label-merge, sequential,
TLS-skip, 503 Retry-After/413), SessionPackager (chunk plan + WAV slice + timeline
slice/rebase), TranscriptAssembler + SpeakersFile, TranscriptPipeline. Validated
END-TO-END against the live backend (chunk -> label-merge -> speakers.json).

Phase 6 (voiceprints): VoiceprintStore (known_voiceprints, persist named
fingerprints, skip Unknown). Wired: 'Send to backend' button + transcript status,
auto-send toggle (default off) + self-name setting.

All adversarial-review findings fixed. App + XCTest suite build; tests pass.
2026-06-06 00:15:49 -05:00
Grant Gilliam 9218af0c45 Phase 1: dual-track audio capture → mixed-mono 16 kHz WAV + mic VAD
AudioRecorder captures system audio (ScreenCaptureKit) + mic (AVAudioEngine) on a
single serial ioQueue, one shared monotonic t0, time-driven writers (pad gaps /
trim overlaps) so tracks stay aligned, and an energy mic-VAD for 'self' spans.
AudioMixer sums the aligned tracks into mixed_mono_16k.wav. SessionController
drives a serialized start/stop state machine, writes the session folder +
self_vad.json, exposes live level meters, and finalizes on quit.

Hardening from review: ioQueue single-domain (no races), stop() never hangs
(mic-first teardown + bounded stopCapture), layout-agnostic mic deep-copy,
discard-only video output to keep SCStream alive, VAD lockstep on committed
frames, stable signing team in project.yml, single-instance enforcement.
2026-06-05 21:30:11 -05:00
Grant Gilliam 22447627c6 Phase 0: menu-bar scaffold, permissions, backend health check
Native SwiftUI menu-bar app (LSUIElement, macOS 13+), generated from project.yml
via XcodeGen. Includes:
- PermissionsManager (Microphone / Screen Recording / Accessibility) + UI
- SparkControlHealth: GET /api/status over self-signed TLS (InsecureTrustDelegate)
- AppSettings persistence (host, TLS-skip, output folder, adapter toggles)
- Menu-bar panel + Settings, app sandbox & hardened runtime off (LAN tool)
2026-06-05 19:33:53 -05:00
42 changed files with 233 additions and 1319 deletions
-1
View File
@@ -1 +0,0 @@
{}
-18
View File
@@ -17,21 +17,3 @@ build/
# Personal call screenshots / fixtures (faces, contact names) — never commit # Personal call screenshots / fixtures (faces, contact names) — never commit
example-screenshots/ example-screenshots/
# Local signing identity (Apple Team ID) — keep out of source; template is committed
Config/Signing.xcconfig
# Local env files (e.g. SPARK_BACKEND_URL for dev/harness runs) — never commit
.env
.env.*
!.env.example
# Claude Code — deny by default, allow-list shared wiring.
# .claude/ also accumulates worktrees, editor configs, and OS cruft; commit
# only the shared parts so new local scratch (or a stray secret) stays out.
.claude/*
!.claude/rules/
!.claude/agents/
!.claude/commands/
!.claude/skills/
!.claude/settings.json
-96
View File
@@ -1,96 +0,0 @@
# AGENTS.md — Ten31 Transcripts
Native macOS **menu-bar app** that detects video calls, records dual-track audio + watches the call window for active-speaker cues, and sends audio + a visual timeline to a self-hosted **SparkControl** backend that does transcription/diarization/naming — producing named transcripts and recaps.
> **Inbox check:** At session start, if `~/Projects/standards/INBOX.md` exists, scan it for items tagged `(ten31-transcripts)` and surface them before proposing next steps; triage with `/triage`.
## Stack (versions that matter)
- **Swift 5.0**, **SwiftUI** + AppKit, macOS **13.0** deployment target. `LSUIElement` (menu-bar only, no Dock icon).
- Project is generated by **XcodeGen** from `project.yml` (`brew install xcodegen`). `*.xcodeproj` is **gitignored** — regenerate, don't edit.
- Full Xcode lives at `/Applications/Xcode.app`, but `xcode-select` points at CommandLineTools → **set `DEVELOPER_DIR` for every `xcodebuild`**.
- Bundle id `xyz.ten31.transcripts`; `DEVELOPMENT_TEAM` (Apple Team ID) is set in a **gitignored `Config/Signing.xcconfig`** (copy `Config/Signing.xcconfig.example` and set your team). Keep it stable — a constant signing identity is what preserves TCC grants across rebuilds.
- Backend: SparkControl gateway at `$SPARK_BACKEND_URL` (a private LAN backend — IP or `.local` host; Start9 self-signed cert. Install the StartOS Root CA in the System keychain so normal TLS validation succeeds; skip-TLS is an opt-in, **host-scoped** escape hatch, **off by default** — see `InsecureTrustDelegate`). Resolution order: a value saved in **Settings → SparkControl backend** (UserDefaults) wins, else the `SPARK_BACKEND_URL` env var, else the placeholder default in `AppSettings.swift`. Diarization = Sortformer/TitaNet (**mono-only**, ~4 speakers/chunk); LLM = Qwen3 via OpenAI-compatible `/v1/chat/completions`; audio via `/api/audio/label-merge`.
## Commands
First time on a machine — create the local signing config (else `xcodegen generate`/signing won't find a team):
```
cp Config/Signing.xcconfig.example Config/Signing.xcconfig # then set DEVELOPMENT_TEAM
```
Regenerate the Xcode project (after adding/removing/renaming any source file):
```
xcodegen generate
```
Build + run all tests:
```
DEVELOPER_DIR=/Applications/Xcode.app/Contents/Developer xcodebuild test \
-project Ten31Transcripts.xcodeproj -scheme Ten31Transcripts \
-destination 'platform=macOS' -derivedDataPath /tmp/ten31-dd
```
Run a **single** test (target/class/method):
```
DEVELOPER_DIR=/Applications/Xcode.app/Contents/Developer xcodebuild test \
-project Ten31Transcripts.xcodeproj -scheme Ten31Transcripts \
-destination 'platform=macOS' -derivedDataPath /tmp/ten31-dd \
-only-testing:Ten31TranscriptsTests/SpeakerReconcilerTests/testCosine
```
Build only: replace `test` with `build`. **Lint/format:** none configured (no SwiftLint/SwiftFormat/Makefile); adding one is tracked in `ROADMAP.md`.
Build a standalone app and install/run it (Xcode does **not** need to stay open):
```
DEVELOPER_DIR=/Applications/Xcode.app/Contents/Developer xcodebuild \
-project Ten31Transcripts.xcodeproj -scheme Ten31Transcripts \
-configuration Release -derivedDataPath /tmp/ten31-release build
ditto /tmp/ten31-release/Build/Products/Release/Ten31Transcripts.app /Applications/Ten31Transcripts.app
open /Applications/Ten31Transcripts.app
```
**Fast validation harness** (preferred for visual/backend logic): compile the specific `Ten31Transcripts/**.swift` files plus a `main.swift` with `xcrun --sdk macosx swiftc -O ... main.swift -o x` and run against real fixtures (`example-screenshots/`) or saved sessions. Top-level code must live in the file literally named `main.swift`.
## Layout (day one)
- `Ten31Transcripts/App/``@main` entry + `AppDelegate`.
- `Ten31Transcripts/Session/``SessionController` (state machine), `TranscriptPipeline`, `SessionPackager` (chunking), `TranscriptAssembler`, `SpeakerReconciler`, `ChunkPlan` (`ChunkMode`), `SpeakersFile`, `SessionNaming` (pure folder-name + recap-title logic).
- `Ten31Transcripts/Visual/``VisualCapture`/`VisualObserver` (ScreenCaptureKit, ~3fps), `GridCallAnalyzer` (+ `FrameSampler`, `TextRecognizer`, `TimelineBuilder`, `VisualTimeline`, `SpeakerObservation`).
- `Ten31Transcripts/Adapters/` — per-app screen-readers (`MeetAdapter`, `ZoomAdapter`, `TeamsAdapter`, `SignalAdapter`) + `AdapterRegistry`.
- `Ten31Transcripts/Audio/``AudioRecorder`, `MicVAD`, `ChannelSelfVAD`, `AudioMixer`, `MonoTrackWriter`, `Resampler`.
- `Ten31Transcripts/Backend/``SparkControlClient`, `GatewayLLMClient`, `VoiceprintStore`, `SparkControlHealth`, `InsecureTrustDelegate` (TLS skip).
- `Ten31Transcripts/Recap/``RecapAnalyzer`, `RecapRenderer` (writes `transcript.md` + `recap.html`), `RecapModels`, `RecapTemplate`, `SpeakerEditing`, `RecapEditModel`.
- `Ten31Transcripts/{Detection,Permissions,Settings,UI,Support}/``CallDetector`/`AudioInputProcesses`/`MicActivityMonitor`; `PermissionsManager`; `AppSettings` (UserDefaults); SwiftUI views + AppKit window hosts; `Info.plist` + entitlements.
- `Ten31TranscriptsTests/` — XCTest. `example-screenshots/` — real fixtures (gitignored). `docs/`, `README.md`.
- **Runtime output** (default `~/Ten31Transcripts/sessions/<ts>_<app>/`, configurable in Settings): `mic.wav`, `system.wav`, `mixed_mono_16k.wav`, `self_vad.json`, `visual_timeline.json`, `speakers.json` (output), `cluster_fingerprints.json`, `recap.{html,json}`, `transcript.md`. The folder is created at session start as `<yyyy-MM-dd'T'HH-mm-ss>_<app>`; on stop the user can name the meeting and it's renamed to `<date>_<name>_<app>` (skipping keeps the auto stamp).
## Conventions
- Match the surrounding file's style; small reviewable diffs; comments explain **why**, not what.
- Write/extend XCTest alongside non-trivial changes; pure logic (chunking, reconciliation, analyzer math) is unit-tested offline.
- Commits: imperative mood, concise; authored by Grant. Push to the self-hosted Gitea remote `origin` (branch `main`, over SSH) after committing, with my approval; the remote URL lives in `.git/config`, kept out of source. Work on `main` — don't create feature branches unless I ask.
- **Gitea push gotcha:** `origin`'s URL uses a raw `.local` mDNS host that intermittently fails to resolve (`Could not resolve hostname`, or a push that connects then stalls). The `gitea-home` SSH alias (in `~/.ssh/config`) points at the **same** Gitea server (port 59916, user `git`) via a reliable HostName — the sibling `standards` repo uses it. Reliable fallback: `git push gitea-home:grant/ten31-transcripts.git main` then `git update-ref refs/remotes/origin/main main`. Repointing `origin` to the alias would make this permanent (not yet done).
- Never commit recordings, transcripts, screenshots, or the generated `*.xcodeproj`.
- No API keys/tokens/passwords in the repo. The backend host (`$SPARK_BACKEND_URL`) and the Apple Team ID (`Config/Signing.xcconfig`, gitignored) are kept out of source — real values live in Settings/UserDefaults and the local xcconfig. Build env vars: `DEVELOPER_DIR` (required) and optional `SPARK_BACKEND_URL`.
- **Git history scrubbed (2026-06-13):** the private backend host + LAN IP were purged from all commits via `git filter-repo` (replaced with the `your-spark-backend.local` placeholder) and force-pushed; 0 hits across refs. Pre-rewrite backup bundle: `../ten31-transcripts-prehistory-rewrite.bundle`. A **second rewrite the same day** purged two backend LAN IPs that had slipped into a docs/test commit, replacing them with RFC 5737 documentation IPs (`192.0.2.1`/`192.0.2.2`) and force-pushing; 0 hits across refs; backup bundle `../ten31-transcripts-pre-ip-scrub.bundle`. The Apple Team ID was intentionally **not** scrubbed (it's public in every signed binary) — don't re-flag it.
## Always
- Set `DEVELOPER_DIR=/Applications/Xcode.app/Contents/Developer` on every `xcodebuild`.
- Run `xcodegen generate` after adding/removing/renaming source files.
- Treat the backend as the owner of transcription, diarization, and speaker naming; the app only records, watches, packages, and reconciles hints.
- Identify **self by the mic channel** + the single name in Settings → Your name, and keep that name reserved so the LLM never assigns it to another speaker.
- Treat visual active-speaker cues as **naming hints over audio diarization** (the backbone): prefer sparse-but-correct detection over dense-but-wrong.
- Send the backend dual-channel (`mic_file` + `system_file`) when the system track is healthy, else the mono `mixed_mono_16k.wav`; keep backend calls **sequential** (one in flight).
- After any code change, rebuild Release + `ditto` to `/Applications` — the installed copy does **not** auto-update.
## Never
- **Never write video frames to disk** — analyze in-memory and release immediately (privacy non-negotiable).
- **Never add Co-Authored-By / "Generated with" / any AI or tool attribution** to commits or PRs.
- Never commit secrets, recordings, transcripts, or `example-screenshots/` (faces + contact names).
- Never do per-platform display-name matching for self (Zoom/Meet/Signal names differ) — channel + one canonical name only.
- Never treat a solid camera-off avatar tile (Meet's orange/magenta fill) as an active speaker — the real cue is a thin **hollow** coloured ring; require thin-edge + hue gate (see `GridCallAnalyzer.isHollow`, `FrameSampler.thinColoredPoints`).
- Never collapse adjacent same-speaker transcript segments (reverted by request) — one line per diarized utterance.
- Never let a session-folder name put the meeting name where the app label is parsed from: the app must stay the **last** `_`-segment (`SessionController.appLabel(from:)` reads `.split("_").last`; `SessionNaming` enforces this and disambiguates collisions on the name segment). Renames happen at `finish()`-time after files are closed — re-derive track URLs from the (possibly moved) folder, never from `RecordingResult`'s start-time paths.
- Never send call audio to a raw IP the user didn't configure. Offline backend checks: a `.local` mDNS host can't be resolved by a plain `swiftc`/URLSession binary (`-1009`) — use the **real app** or `curl`; but a **configured raw IP _is_ reachable from a plain swiftc URLSession binary** (that's how the TLS fix was verified offline).
- Never force-push a shared branch, and never push without my approval. (Work on `main` — don't create feature branches unless I ask.)
## Current state
Present tense; overwritten each session. `main` clean and pushed (HEAD `a5c227e`, pushed via the `gitea-home` alias — origin's `.local` host wouldn't resolve); `/Applications/Ten31Transcripts.app` rebuilt + installed from HEAD. **Full suite re-run: 91 pass** (was 73; +18 `SessionNamingTests`).
- **This session (2026-06-17) — meeting-name prompt + folder rename:** on stop, an NSAlert asks for a meeting name (Save/Skip) and the session folder is renamed `<ts>_<app>``<date>_<name>_<app>` (HH-MM-SS dropped; Skip/blank keeps the stamp). Pure logic in `SessionNaming` (sanitize, leaf compose, `recapTitle` for both forms); app label stays the last `_`-segment; collisions disambiguate on the name segment; `finish()` re-derives track URLs post-rename; quit never prompts and aborts an open prompt. Reviewer-reviewed; its P1 (quit-during-modal) + two P2s fixed.
- **Backend connected end-to-end:** real LAN URL saved in Settings → SparkControl backend (off-repo: `defaults read xyz.ten31.transcripts backendBaseURL`); committed default stays the placeholder.
- **Working:** backend hand-off (live), call detection (Meet/Zoom/Teams/Signal), dual-track capture, dual-channel + chunked send, speaker reconciliation, recap, speaker editor, configurable chunk length, standalone Settings, meeting-name prompt + readable folders.
- **Verify next (real app):** the naming prompt + rename is unit-tested + builds but **not yet exercised on a live stop** — run a real recording, stop, name it, confirm the folder renames and backend output lands in the renamed folder.
- **Next up:** (a) repoint `origin` to `gitea-home` so pushes stop hitting the flaky `.local` host (see Conventions); (b) **backend URL primary→fallback** + the `mmss()` NaN/∞ guard freebie (sketch first; keep real IPs out of source — use `192.0.2.x`).
- **In progress / unverified:** the Meet visual fix (reject solid camera-off tiles) still has no clean end-to-end run — re-process the saved Meet session + a fresh Meet call (needs real app + backend).
- **Known bugs / loose end:** sparse Meet speaking-detection (faint blue border); sub-second junk "self" mic fragments; desktop-mic vs phone doesn't unify by voiceprint. Doc loose end: `docs/01 §5`/`docs/02 §2.4` still list "AppleScript" as a Meet name source though the code uses window titles.
-1
View File
@@ -1 +0,0 @@
AGENTS.md
-4
View File
@@ -1,4 +0,0 @@
// Template for Config/Signing.xcconfig (which is gitignored).
// Copy to Config/Signing.xcconfig and set your Apple Developer Team ID
// (Xcode ▸ Settings ▸ Accounts, or `security find-identity -p codesigning -v`).
DEVELOPMENT_TEAM = YOUR_APPLE_TEAM_ID
-57
View File
@@ -1,57 +0,0 @@
# Evaluation — ten31-transcripts — 2026-06-13
Intent: A native macOS menu-bar app (Swift/SwiftUI/AppKit, macOS 13+, generated by XcodeGen) that auto-detects conference calls (Meet/Zoom/Teams/Signal), records dual-track audio while watching the call window via ScreenCaptureKit for active-speaker cues, and hands audio + a visual speaker timeline to a self-hosted SparkControl backend that performs transcription/diarization/speaker-naming — producing named transcripts and recaps.
Agents run: evaluator, security-auditor, exerciser, doc-auditor. Skipped: start9-spec-checker (no StartOS-wrapper markers found), reviewer (working tree clean — no diff to review).
## Verdict
This is a genuinely well-engineered personal tool: it builds cleanly with the documented `DEVELOPER_DIR` toolchain, all 69 tests pass in ~1s (claim verified empirically), the architecture is disciplined (the app records/watches/packages/reconciles and correctly delegates all ML to the backend), and secrets hygiene is verifiably clean — the documented 2026-06-13 history scrub survives a full-ref grep with zero leaked hosts or IPs. The headline risk is the TLS trust model: certificate validation is bypassed **globally for any host, on by default**, so anyone on the LAN can MITM the full upload of call audio, the visual timeline, and stored voiceprints — and the same bypass makes a reproducible recap-time crash (`mmss()` on a malformed `Double`) attacker-reachable. The second-largest issue is documentation: the README still describes "Phase 0 (scaffold)" for an app that has shipped through Phase 6, and the `docs/` specs have diverged from the dual-channel API and the recap phase. Code-wise this is close to ready for its single-user, LAN-only purpose; the fixes are well-scoped and mostly small. Fix the TLS model first — it gates the safety of every backend-integration test that follows.
## Cross-referenced findings
- **TLS bypass scope — contradiction resolved against the evaluator.** The evaluator rated Security 4 partly on the basis that the TLS-skip is "intentional/scoped" (`InsecureTrustDelegate.swift:9`). The security-auditor read the implementation and found it is **not** scoped: `URLCredential(trust:)` is returned for any host without a host/fingerprint/CA check (`InsecureTrustDelegate.swift:22`), and it is default-on (`AppSettings.swift:109`). The auditor's direct evidence wins; the Security lens is adjusted down accordingly (see Scorecard).
- **One attack chain, two agents.** The exerciser independently reproduced (twice) a fatal crash in `RecapAnalyzer.mmss()` on `Double.nan`/`Double.infinity` (`RecapAnalyzer.swift:137`), reachable when the backend returns e.g. `"duration": 1e400`. The security-auditor's P1 global TLS bypass is exactly what lets an on-LAN attacker *be* that backend. These are not two unrelated findings — the P1 bypass converts the P2 crash from "trust the backend" to "any LAN attacker can crash the app at recap time." Listed once each below, but they share an exploit path.
- **README staleness — corroborated by two agents.** Both the evaluator (P2) and the doc-auditor (multiple lines) independently flagged that `README.md` describes Phase 0 while the code is at Phase 6+, and both flagged the matching stale source comment at `AppSettings.swift:7`. Merged into one finding; the doc-auditor adds that the drift extends into the `docs/` design specs.
- **Test count — claim verified, not just asserted.** The evaluator and exerciser both built and ran the suite; "69 tests pass" (AGENTS.md) is confirmed by execution, not by counting `func test` declarations.
## Priority queue
- [P1] Global, unscoped TLS bypass trusts any certificate from any host (default-on) — anyone on the LAN can ARP/DNS-spoof the unauthenticated `.local` mDNS name and receive the full mic+system audio, visual timeline, and voiceprints, then return attacker-chosen transcripts — `InsecureTrustDelegate.swift:22`, wired at `SparkControlClient.swift:85`/`GatewayLLMClient.swift:36`/`SparkControlHealth.swift:35` — security-auditor
- [P2] Skip-TLS defaults to ON, so the P1 MITM window is open from first launch before any user choice — `AppSettings.swift:109` (`... as? Bool ?? true`) — security-auditor
- [P2] `RecapAnalyzer.mmss()` fatally crashes on NaN/±Infinity (reproduced twice); a malformed/MITM'd backend `duration` decodes to `Double.infinity` and aborts the app at recap-render time — `RecapAnalyzer.swift:137` (`Int(sec.rounded())`) — exerciser (exploit path opened by the P1 finding)
- [P2] README is stale by six phases — claims "Phase 0 (scaffold)… no audio capture, call detection, screen reading, or backend hand-off yet" for an app that has all of it; the same lie is in source comment `AppSettings.swift:7``README.md:7,49,51,56-66` vs. `Ten31Transcripts/{Audio,Detection,Visual,Session,Recap}/` — evaluator + doc-auditor
- [P2] `SessionController` (670 lines, the most concurrency-dense file: generations, in-flight task adoption, pending-auto-stop) has zero unit tests, while comparable pure logic is well covered — `SessionController.swift:256-282` — evaluator
- [P3] `docs/` design specs drifted from the implemented backend path: the dual-channel fields (`mic_file`/`system_file`/`self_name`/`self_vad`) are undocumented and the recap/LLM phase is absent — `docs/03_DATA_CONTRACTS.md:109-116`, `docs/02_ARCHITECTURE.md:51,197`, `docs/01_PROJECT_BRIEF.md:31,83,94`, `docs/04_BUILD_PLAN.md` (no recap phase) vs. `SparkControlClient.swift:106-130` / `RecapAnalyzer.swift:8-12` — doc-auditor
- [P3] `docs/01_PROJECT_BRIEF.md:142-153` §7 lists open items 25 (send trigger, retention, voiceprint-update policy, signing) that are already resolved in code — `AppSettings.swift:46`, `VoiceprintStore.swift:25`, `Config/Signing.xcconfig` — doc-auditor
- [P3] `docs/02_ARCHITECTURE.md:214-216` §2.10 claims MenuBarUI features (recent-sessions list with resend/delete, voiceprint manager) that are absent from the actual UI (`MenuBarView` surfaces only the single last session) — doc-auditor
- [P3] AGENTS.md Layout listings are incomplete: `Audio/` omits `AudioMixer`/`MonoTrackWriter`/`Resampler`, `Detection/` omits `AudioInputProcesses`/`MicActivityMonitor``AGENTS.md:50,53` — doc-auditor
- [P3] The `manifest.json` per-file `sha256` integrity contract is specified but never written by the pipeline — spec-vs-reality gap — `docs/03_DATA_CONTRACTS.md:61-63` — evaluator
- [P3] Env-var precedence footgun: a saved UserDefaults backend URL permanently shadows `SPARK_BACKEND_URL`, so the env var silently has no effect once Settings is touched (already noted in ROADMAP) — `AppSettings.swift:105-107`, `ROADMAP.md:23` — evaluator
- [P3] `SessionController` owns three jobs — recording state machine, backend-processing orchestration, and the saved-session/NSOpenPanel UI flow; extract the open/reprocess UI before the file grows — `SessionController.swift:467-535` — evaluator
- [P3] Unused, scary-looking `NSAppleEventsUsageDescription` entitlement string ("reads the active browser tab's URL") with no AppleEvents code path (Meet detection uses `CGWindowListCopyWindowInfo` titles only) — drop it — `Info.plist:33` — security-auditor
- [P3] Backend is unauthenticated by design — any LAN device that reaches it can drive transcription; consider a shared bearer token even on LAN — `docs/03_DATA_CONTRACTS.md:89` — security-auditor
- [P3] App Sandbox OFF + Hardened Runtime OFF (intentional, required for cross-app observation) leaves the app unconfined; keep the zero-dependency posture as a deliberate compensating control and document it as such — `project.yml:38` + entitlements — security-auditor
## Scorecard
The evaluator's six-lens table, with two lenses adjusted where another agent's evidence contradicts the evaluator's stated basis (adjustments noted):
| Lens | Score /5 | Notes |
|---|---|---|
| Architecture | 5 | Clean layering; ML delegated to backend per intent; pure/testable seams split from I/O. The single 670-line `SessionController` is the only concentration (P3 to extract). |
| Security | **3** (was 4) | **Adjusted down.** The evaluator's "TLS-skip is intentional/scoped" basis is contradicted by the security-auditor's read: the bypass is global/any-host (`InsecureTrustDelegate.swift:22`) and default-on. Otherwise strong — zero deps, no shell-out, verified-clean secrets, the "never write frames" privacy claim holds in code. |
| Performance | 5 | Idles near-zero; frames released immediately; grid-sampled vision with reused `CIContext`; sequential backend calls honor the single-GPU constraint. |
| Testing | 4 | 69 tests pass (verified by execution); they target the real load-bearing logic. Gap: the `SessionController` concurrency state machine is untested. |
| Code quality | 5 | Consistent style, comments explain *why*, zero warnings, no `try!`. One latent robustness ding: the `mmss()` NaN/∞ fatal (P2). |
| Documentation | **3** (was 4) | **Adjusted down.** The evaluator scored 4 calling `docs/` "excellent and true," but the doc-auditor's claim-by-claim pass found drift well beyond the README — the dual-channel API and the entire recap phase are undocumented across `docs/01-04`, and the build plan never mentions recap. |
## Disagreements & gaps
- **TLS scope (resolved).** Evaluator said "scoped" and scored Security 4; security-auditor read `InsecureTrustDelegate.swift:22` and found it global + default-on (P1). Resolved in favor of the auditor's direct evidence; Security adjusted to 3.
- **Documentation breadth (resolved).** Evaluator sampled `docs/` and judged them accurate (lens 4); doc-auditor did a claim-by-claim pass and found material drift in the specs, not just the README. Resolved in favor of the doc-auditor for the lens; adjusted to 3.
- **Shared blind spot (all runtime-capable agents).** None could exercise live end-to-end behavior — the SparkControl `.local` backend is unreachable from any of these environments by design, and the real on-call visual-cue accuracy needs the gitignored `example-screenshots/`. The Meet visual fix (reject solid camera-off tiles) therefore remains **unverified end-to-end**, which AGENTS.md "Current state" itself acknowledges. No agent could close this; it requires a real call on the user's machine.
## Suggested order of work
1. **Fix the TLS trust model first** — scope the override to the configured backend host and pin the Start9 root CA (or the leaf SPKI hash); default skip-TLS to `false`. This is the P1, and it is the precondition that makes any later backend-integration test trustworthy (it currently gates the P2 crash's reachability).
2. **Harden `Double`→`Int` conversions on backend-decoded values** — give `mmss()` a finite-guard fallback and audit sibling call sites; closes the recap-time crash chain that step 1 also narrows.
3. **Rewrite `README.md` to match the shipped app** and fix the `AppSettings.swift:7` "Phase 0" comment — the single highest-leverage doc change (first thing any newcomer reads).
4. **Reconcile the `docs/` specs** — document the dual-channel fields in `docs/03` §4 and `docs/02`, add the recap phase to `docs/01/02/04`, and close the already-resolved §7 open items.
5. **Add `SessionController` state-machine tests** (auto-start-then-immediate-call-end via `pendingAutoStop`; the visual-adoption generation guard) — do this *before* the next refactor so it has a safety net.
6. **Then extract the saved-session/open-panel UI** out of `SessionController` into a small coordinator.
7. **Run one real call end-to-end** on the user's machine to validate the unverified Meet visual fix and confirm `speakers.json` + `transcript.md` + `recap.html` are written correctly — only meaningful after step 1 makes that path safe.
+38 -116
View File
@@ -1,146 +1,68 @@
# Ten31 Transcripts # Ten31 Transcripts
Native macOS menu-bar app that auto-detects conference calls, records dual-track Native macOS menu-bar app that auto-detects conference calls, records local audio,
audio while watching the call window for active-speaker cues, and hands the audio builds a visual-derived speaker timeline, and hands audio + timeline to the
plus a visual speaker timeline to a self-hosted **SparkControl** backend that does SparkControl backend for naming/transcription. See `docs/` for the full spec.
the transcription, diarization, and speaker naming — producing named transcripts
and meeting recaps.
It runs as a menu-bar-only app (no Dock icon). All machine-learning work lives on This repo is at **Phase 0** (scaffold, permissions, backend health check).
the backend; the app only records, watches, packages, and reconciles hints.
## How it works
1. **Detect** — a call in Google Meet, Zoom, Teams, or Signal starts; `CallDetector`
notices and (optionally) auto-starts a session.
2. **Record + watch** — dual-track audio (your mic + system output) is captured while
`ScreenCaptureKit` samples the call window (~3 fps) to read names and spot the
active speaker. Video frames are analyzed in memory and released immediately —
**never written to disk**.
3. **Package + send** — audio is chunked and sent to the backend, dual-channel
(`mic_file` + `system_file`) when the system track is healthy, else a mono mix.
The visual timeline rides along as naming hints. Backend calls are sequential
(one in flight) to respect the single-GPU backend.
4. **Transcribe + name** — the backend diarizes (Sortformer/TitaNet) and an LLM
(Qwen3, via an OpenAI-compatible endpoint) assigns names, helped by the visual
hints and your stored voiceprints.
5. **Reconcile + recap** — the app reconciles speaker hints, then writes a readable
`transcript.md` and an HTML `recap.html`. A built-in speaker editor lets you fix
names after the fact.
**You** are identified by the mic channel plus the single name in *Settings → Your
name* — that name is reserved so the LLM never assigns it to anyone else. (There's
no per-platform display-name matching; your Zoom/Meet/Signal names can all differ.)
## One-time setup ## One-time setup
1. **Install Xcode** from the Mac App Store (free; large download). Open it once and 1. **Install Xcode** from the Mac App Store (free; ~40 GB). Open it once and
accept the license prompt. accept the license prompt.
2. **Install XcodeGen** (generates the Xcode project from `project.yml`): 2. **Install XcodeGen** (generates the Xcode project from `project.yml`):
```sh ```sh
brew install xcodegen brew install xcodegen
``` ```
3. **Set your signing team.** The Apple Team ID is kept out of source in a gitignored 3. **Generate the project:**
`Config/Signing.xcconfig`. Copy the template and set your team:
```sh
cp Config/Signing.xcconfig.example Config/Signing.xcconfig # then set DEVELOPMENT_TEAM
```
`xcodegen` wires it in via `configFiles`, so **Signing & Capabilities** shows the
team automatically. Keep the value stable so macOS preserves the app's permission
(TCC) grants across rebuilds. Edit the xcconfig, not Xcode — `xcodegen generate`
overwrites Xcode-side changes.
4. **Generate the project** (re-run any time you add/remove/rename a source file):
```sh ```sh
xcodegen generate xcodegen generate
``` ```
This creates `Ten31Transcripts.xcodeproj` (gitignored — regenerate, don't edit). This creates `Ten31Transcripts.xcodeproj` (git-ignored — regenerate any time).
4. **Open it:**
```sh
open Ten31Transcripts.xcodeproj
```
5. Signing is preconfigured: `project.yml` sets `DEVELOPMENT_TEAM` to the free
personal team `BK4Y6CXN35` with automatic signing, so **Signing & Capabilities
should already show the team** — no manual selection needed. (If you ever sign
with a different Apple ID, update `DEVELOPMENT_TEAM` in `project.yml`, not in
Xcode — `xcodegen generate` overwrites Xcode-side changes.)
6. Press **Run** (⌘R).
## Build & run > **Note:** after adding files in a new phase, re-run `xcodegen generate` and let
> Xcode reload the project. The signing team persists because it lives in
> `project.yml`, so macOS permissions stay granted across rebuilds.
The simplest path is to open `Ten31Transcripts.xcodeproj` and press **Run** (⌘R). ## What Phase 0 does
To build a standalone app and install it (Xcode doesn't need to stay open) — note the - Launches as a menu-bar-only app (no Dock icon).
`DEVELOPER_DIR` prefix: full Xcode lives at `/Applications/Xcode.app` but - Menu panel shows live status for the three permissions it needs — **Microphone**,
`xcode-select` may point at the Command Line Tools, so set it on **every** **Screen Recording**, **Accessibility** — with Grant / Open Settings buttons.
`xcodebuild`: - Shows a **backend health check** (`GET /api/status`) against the configured host.
- **Settings:** backend base URL, skip-TLS toggle (on by default for the
self-signed cert), output folder, and adapter toggles (inert this phase).
```sh No audio capture, call detection, screen reading, or backend hand-off yet — those
DEVELOPER_DIR=/Applications/Xcode.app/Contents/Developer xcodebuild \ arrive in Phases 16 (`docs/04_BUILD_PLAN.md`).
-project Ten31Transcripts.xcodeproj -scheme Ten31Transcripts \
-configuration Release -derivedDataPath /tmp/ten31-release build
ditto /tmp/ten31-release/Build/Products/Release/Ten31Transcripts.app /Applications/Ten31Transcripts.app
open /Applications/Ten31Transcripts.app
```
The installed copy does **not** auto-update — rebuild and `ditto` again after changes.
Run the test suite:
```sh
DEVELOPER_DIR=/Applications/Xcode.app/Contents/Developer xcodebuild test \
-project Ten31Transcripts.xcodeproj -scheme Ten31Transcripts \
-destination 'platform=macOS' -derivedDataPath /tmp/ten31-dd
```
## Permissions
The menu panel shows live status for the three permissions the app needs, each with
Grant / Open Settings buttons:
- **Microphone** — to record your side of the call.
- **Screen Recording** — to capture system audio and watch the call window.
- **Accessibility** — to read window/participant information.
## Backend setup
Point the app at your SparkControl backend in **Settings → SparkControl backend**.
The resolution order is: the value saved in Settings (UserDefaults) wins, else the
`SPARK_BACKEND_URL` env var, else a neutral placeholder default. The committed
default is only a placeholder (`https://your-spark-backend.local`) — your real LAN
URL lives in Settings and never touches source.
The backend sits behind a Start9 self-signed Root CA. The supported path is to
**install the StartOS Root CA in your System keychain**, after which normal TLS
validation succeeds. *Skip TLS verification* is an opt-in escape hatch, **off by
default** and **scoped to the configured backend host** — it never becomes
"trust any server."
## Output
Each session writes to `~/Ten31Transcripts/sessions/<timestamp>_<app>/` (configurable
in Settings):
```
mic.wav system.wav mixed_mono_16k.wav # audio (dual-track + mono mix)
self_vad.json visual_timeline.json # self voice-activity + visual hints
speakers.json cluster_fingerprints.json # reconciled speakers + voiceprints
transcript.md recap.html recap.json # final outputs
```
## Project layout ## Project layout
``` ```
project.yml # XcodeGen recipe → generates the .xcodeproj project.yml # XcodeGen recipe → generates the .xcodeproj
Ten31Transcripts/ Ten31Transcripts/
App/ @main entry + AppDelegate App/ Ten31TranscriptsApp.swift, AppDelegate.swift
Detection/ CallDetector — which app is in a call UI/ MenuBarView, SettingsView, PermissionRow
Audio/ dual-track capture, mixing, resampling, self-VAD Permissions/PermissionsManager.swift
Visual/ ScreenCaptureKit capture + grid analysis → speaker timeline Backend/ SparkControlHealth.swift, InsecureTrustDelegate.swift
Adapters/ per-app screen-readers (Meet, Zoom, Teams, Signal) + registry Settings/ AppSettings.swift
Session/ SessionController state machine, packaging, reconciliation Support/ Info.plist, Ten31Transcripts.entitlements
Backend/ SparkControl + LLM clients, voiceprint store, TLS handling Ten31TranscriptsTests/ # placeholder; real tests land in Phase 3
Recap/ transcript.md + recap.html rendering, speaker editor
Permissions/ Settings/ UI/ Support/ (permissions, AppSettings, views, Info.plist)
Ten31TranscriptsTests/ # XCTest — pure logic (chunking, reconciliation, analyzer math)
docs/ # architecture & data-contract design notes
``` ```
## Notes ## Notes
- **App Sandbox is off** and **Hardened Runtime is off** — this is a personal, - **App Sandbox is off** and **Hardened Runtime is off** — this is a personal,
LAN-only tool that must observe other apps. Revisit only if distributing. LAN-only tool that must observe other apps. Revisit only if distributing.
- **Privacy:** video frames are never written to disk; recordings, transcripts, and - The default backend host is `https://immense-voyage.local:62419` (editable in
screenshots are gitignored and never committed. Settings).
- `AGENTS.md` is the canonical reference for build commands, conventions, and current
state; `ROADMAP.md` holds the backlog; `docs/` holds the architecture and
data-contract design notes.
-34
View File
@@ -1,34 +0,0 @@
# ROADMAP — Ten31 Transcripts
Longer-term backlog and deferred decisions. Near-term status + the next few steps live in `AGENTS.md` → Current state.
## Visual detection
- Improve Meet faint-blue-border detection (currently sparse): infer tile columns from name-label spacing for reliable per-tile geometry, and/or key on the audio-wave pill.
- Geometric screen-share exclusion: ignore OCR text in the shared-screen region (needs layout detection). Today only the domain filter + stuck-span guard catch share-text-as-speaker.
- Speaker-view / spotlight layout: detect the one-dominant-tile case (active speaker is the large tile with no border) instead of assuming a grid.
- Apply Meet's thin-edge + hollow-ring + hue gating to Zoom/Teams if real fixtures show solid-tile false positives there.
- 1:1 Signal: audio-pill fallback (no active border ever appears in 1:1).
- Accessibility-tree name source for Electron/Meet (cleaner than OCR); `AppAdapter.namesFromAccessibility` hook exists but returns nil.
## Platform support
- Jitsi: add call detection + a `JitsiAdapter` (Jitsi Meet is browser-based like Google Meet — needs `CallDetector` title recognition, an adapter for participant-name reading, and active-speaker visual cues). New platform alongside Meet/Zoom/Teams/Signal.
## Audio / speakers
- Self mic-channel cleanup: tighten self-VAD / smooth self so sub-second junk "self" fragments stop surviving (self is currently protected from fragment-smoothing).
- Adaptive chunk sizing from the backend's first-chunk speaker count, instead of the visual participant estimate.
## App / UX
- Per-app recording control: call detection is all-or-nothing; the adapter toggle only gates visual capture, not whether the app records.
- Constrain recap reading width on very wide windows (long line length in the summary band).
## Tooling / repo
- Decide whether to add a linter/formatter (SwiftLint/SwiftFormat) — none configured today.
- `SPARK_BACKEND_URL` is read only at `AppSettings.init` and is shadowed by any value already saved in Settings (UserDefaults wins). So once a backend URL has been saved, the env var has no effect — a stale stored value can override it in dev/CI/harness runs. If that bites, treat an empty/placeholder stored URL as absent so the env var can still win.
## Quality / debt (from the 2026-06-13 independent eval — full queue + evidence in `EVALUATION.md`)
- Guard `RecapAnalyzer.mmss()` (`:137`) against NaN/∞ — a malformed backend `duration` aborts the app at recap render (eval P2). Cheap; fold into the next backend change.
- Add `SessionController` state-machine tests (`pendingAutoStop`, visual-adoption generation guard) before refactoring; then extract its saved-session / open-panel UI (eval P2/P3).
- Smaller P3s in `EVALUATION.md`: whether to actually emit the `manifest.json` per-file `sha256` (now documented as not-emitted in `docs/03` §2); unauthenticated LAN backend (consider a bearer token).
## Deferred decisions
- Cross-device self unification (same person, desktop mic vs phone speakerphone) does not work by voiceprint and is treated as a separate identity; revisit only if a reliable signal emerges (mic-channel-as-self remains the robust path).
@@ -32,16 +32,6 @@ struct MeetAdapter: AppAdapter {
// The bright ring (#1a73e8) is ~0.89 sat but the lighter glow (#8ab4f8) is // The bright ring (#1a73e8) is ~0.89 sat but the lighter glow (#8ab4f8) is
// ~0.44, below the 0.5 default lower the threshold so the glow registers. // ~0.44, below the 0.5 default lower the threshold so the glow registers.
config.colorSaturation = 0.35 config.colorSaturation = 0.35
// Meet's active cue is a thin BLUE (210°) ring + audio pill. Detect thin blue
// EDGES only, gated to blue: this rejects solid camera-off avatar tiles (orange
// 30°, magenta 340°), which otherwise read as "speaking" for the whole call
// and collapse every remote voice onto one name. Validated on real fixtures.
config.coloredBorderThinOnly = true
config.colorHueRange = 180...240
// Meet's blue border is faint; real rings measure 0.200.30 interior fill while
// solid tiles measure 0.36, so allow a higher fill here than the 0.2 default to
// recover real borders without readmitting the solid-tile false positives.
config.maxInteriorFill = 0.3
config.tileExpandX = 3.0 config.tileExpandX = 3.0
config.tileExpandY = 5.0 config.tileExpandY = 5.0
self.analyzer = GridCallAnalyzer(config: config) self.analyzer = GridCallAnalyzer(config: config)
@@ -3,8 +3,9 @@ import SwiftUI
/// Menu-bar-only app entry point. /// Menu-bar-only app entry point.
/// ///
/// `LSUIElement` (set in Info.plist) keeps the app out of the Dock; the /// `LSUIElement` (set in Info.plist) keeps the app out of the Dock; the
/// `MenuBarExtra` scene provides the status-bar item and its panel, which wires /// `MenuBarExtra` scene provides the status-bar item and its panel. Phase 0 only
/// up permissions, settings, recording control, and the backend health check. /// wires up permissions, settings, and a backend health check no audio,
/// capture, or call detection yet.
@main @main
struct Ten31TranscriptsApp: App { struct Ten31TranscriptsApp: App {
@NSApplicationDelegateAdaptor(AppDelegate.self) private var appDelegate @NSApplicationDelegateAdaptor(AppDelegate.self) private var appDelegate
+1 -1
View File
@@ -14,7 +14,7 @@ struct RecordingResult {
let systemNote: String? let systemNote: String?
} }
/// Dual-track local audio capture. /// Dual-track local audio capture for Phase 1.
/// ///
/// - System audio via `SCStream` (`capturesAudio`); its audio handler runs on /// - System audio via `SCStream` (`capturesAudio`); its audio handler runs on
/// `ioQueue`. A discard-only video output runs on `screenQueue` purely to keep /// `ioQueue`. A discard-only video output runs on `screenQueue` purely to keep
+2 -2
View File
@@ -13,8 +13,8 @@ struct VADSpan: Equatable {
/// internal sample cursor always equals the mic file position, and span times /// internal sample cursor always equals the mic file position, and span times
/// land on the same instants as `mixed_mono_16k.wav`. /// land on the same instants as `mixed_mono_16k.wav`.
/// ///
/// `TimelineBuilder` folds these in as high-confidence pre-seeded "self" /// Phase 3's `TimelineBuilder` will fold these in as high-confidence pre-seeded
/// segments. Thresholds are intentionally simple. /// "self" segments. Thresholds are intentionally simple and will be tuned later.
/// ///
/// Single-threaded: all calls happen on `AudioRecorder.ioQueue`. /// Single-threaded: all calls happen on `AudioRecorder.ioQueue`.
final class MicVAD { final class MicVAD {
@@ -33,9 +33,7 @@ final class GatewayLLMClient {
config.timeoutIntervalForRequest = 600 config.timeoutIntervalForRequest = 600
config.timeoutIntervalForResource = 900 config.timeoutIntervalForResource = 900
config.waitsForConnectivity = false config.waitsForConnectivity = false
let delegate: URLSessionDelegate? = skipTLS let delegate: URLSessionDelegate? = skipTLS ? InsecureTrustDelegate() : nil
? InsecureTrustDelegate(allowedHost: URL(string: self.baseURL)?.host)
: nil
self.urlSession = URLSession(configuration: config, delegate: delegate, delegateQueue: nil) self.urlSession = URLSession(configuration: config, delegate: delegate, delegateQueue: nil)
} }
@@ -1,42 +1,19 @@
import Foundation import Foundation
/// URLSession delegate that bypasses certificate validation for **one host only** /// URLSession delegate that trusts the server certificate without validation.
/// the configured SparkControl backend.
/// ///
/// SparkControl sits behind a Start9 self-signed Root CA on the LAN. The supported /// SparkControl sits behind a Start9 self-signed Root CA on the LAN, so default
/// path is to install that CA in the System keychain; default trust evaluation then /// trust evaluation rejects it. This delegate is used **only** when the
/// succeeds and this delegate is never used. It exists only as an opt-in escape /// "Skip TLS verification" setting is on. It trusts any server certificate
/// hatch (the "Skip TLS verification" setting, off by default) for a machine where /// acceptable for a personal tool on a trusted local network and nothing else.
/// the CA isn't installed. Even then it trusts a certificate only when the challenge
/// host equals `allowedHost` a server-trust challenge from any other host falls
/// back to default validation, so the bypass can never become "trust any server".
final class InsecureTrustDelegate: NSObject, URLSessionDelegate { final class InsecureTrustDelegate: NSObject, URLSessionDelegate {
/// The single host the bypass is scoped to (the configured backend host). When
/// nil only reachable via a malformed base URL the gate never fires and every
/// challenge falls back to default validation: the safe degenerate case.
private let allowedHost: String?
init(allowedHost: String?) {
self.allowedHost = allowedHost
}
/// The security gate: the trust override may fire only for a server-trust
/// challenge whose host matches `allowedHost`. Pure and synchronous so the
/// host-scoping can be unit-tested without fabricating a `SecTrust`; the
/// credential itself is built only when this is true *and* a serverTrust exists.
func allowsTrustOverride(for space: URLProtectionSpace) -> Bool {
guard let allowedHost else { return false }
return space.authenticationMethod == NSURLAuthenticationMethodServerTrust
&& space.host == allowedHost
}
func urlSession( func urlSession(
_ session: URLSession, _ session: URLSession,
didReceive challenge: URLAuthenticationChallenge, didReceive challenge: URLAuthenticationChallenge,
completionHandler: @escaping (URLSession.AuthChallengeDisposition, URLCredential?) -> Void completionHandler: @escaping (URLSession.AuthChallengeDisposition, URLCredential?) -> Void
) { ) {
guard guard
allowsTrustOverride(for: challenge.protectionSpace), challenge.protectionSpace.authenticationMethod == NSURLAuthenticationMethodServerTrust,
let serverTrust = challenge.protectionSpace.serverTrust let serverTrust = challenge.protectionSpace.serverTrust
else { else {
completionHandler(.performDefaultHandling, nil) completionHandler(.performDefaultHandling, nil)
@@ -82,9 +82,7 @@ final class SparkControlClient {
config.timeoutIntervalForRequest = 600 // diarization can take up to ~600s config.timeoutIntervalForRequest = 600 // diarization can take up to ~600s
config.timeoutIntervalForResource = 900 config.timeoutIntervalForResource = 900
config.waitsForConnectivity = false config.waitsForConnectivity = false
let delegate: URLSessionDelegate? = skipTLS let delegate: URLSessionDelegate? = skipTLS ? InsecureTrustDelegate() : nil
? InsecureTrustDelegate(allowedHost: URL(string: self.baseURL)?.host)
: nil
self.urlSession = URLSession(configuration: config, delegate: delegate, delegateQueue: nil) self.urlSession = URLSession(configuration: config, delegate: delegate, delegateQueue: nil)
} }
@@ -1,10 +1,10 @@
import Foundation import Foundation
import Combine import Combine
/// Performs the backend reachability check: `GET {baseURL}/api/status`. /// Performs the Phase 0 backend reachability check: `GET {baseURL}/api/status`.
/// ///
/// This is a thin slice; the full upload path (label-merge, multipart, sequential /// This is a thin slice the full `SparkControlClient` (label-merge, multipart,
/// queueing, retries) lives in `SparkControlClient`. /// sequential queueing, retries) arrives in Phase 5.
@MainActor @MainActor
final class SparkControlHealth: ObservableObject { final class SparkControlHealth: ObservableObject {
@@ -32,9 +32,7 @@ final class SparkControlHealth: ObservableObject {
config.timeoutIntervalForRequest = 8 config.timeoutIntervalForRequest = 8
config.waitsForConnectivity = false config.waitsForConnectivity = false
let delegate: URLSessionDelegate? = skipTLS let delegate: URLSessionDelegate? = skipTLS ? InsecureTrustDelegate() : nil
? InsecureTrustDelegate(allowedHost: url.host)
: nil
let session = URLSession(configuration: config, delegate: delegate, delegateQueue: nil) let session = URLSession(configuration: config, delegate: delegate, delegateQueue: nil)
defer { session.finishTasksAndInvalidate() } defer { session.finishTasksAndInvalidate() }
+57 -102
View File
@@ -82,10 +82,6 @@ enum RecapRenderer {
// MARK: - HTML // MARK: - HTML
/// Mirror of recap-relay's job-output view: a header, an optional band of recap
/// cards (summary + takeaways), then a two-pane split topic list on the left,
/// full diarized transcript on the right, click a topic to jump + highlight its
/// range. Self-contained (data baked in; the click handler is inline JS).
static func html(file: SpeakersFile, result: RecapResult, title: String, static func html(file: SpeakersFile, result: RecapResult, title: String,
entries: [RecapAnalyzer.Entry]) -> String { entries: [RecapAnalyzer.Entry]) -> String {
let speakers = RecapAnalyzer.orderedSpeakerNames(entries) let speakers = RecapAnalyzer.orderedSpeakerNames(entries)
@@ -95,78 +91,66 @@ enum RecapRenderer {
return "<span class=\"chip\" style=\"background:\(c)\">\(esc(name))</span>" return "<span class=\"chip\" style=\"background:\(c)\">\(esc(name))</span>"
} }
// Header: title, meta line, speaker legend. var body = ""
let sub = "\(esc(file.app)) · \(RecapAnalyzer.mmss(file.durationSec))" let sub = "\(esc(file.app)) · \(RecapAnalyzer.mmss(file.durationSec))"
+ (speakers.isEmpty ? "" : " · \(speakers.count) speaker\(speakers.count == 1 ? "" : "s")") + (speakers.isEmpty ? "" : " · \(speakers.count) speaker\(speakers.count == 1 ? "" : "s")")
var header = "<div class=\"header\"><div class=\"htext\"><h1>\(esc(title))</h1><div class=\"meta\">\(sub)</div></div>" body += "<header><h1>\(esc(title))</h1><div class=\"sub\">\(sub)</div>"
if !speakers.isEmpty { if !speakers.isEmpty {
header += "<div class=\"legend\">" + speakers.map { chip($0) }.joined() + "</div>" body += "<div class=\"legend\">" + speakers.map { chip($0) }.joined() + "</div>"
} }
header += "</div>" body += "</header>"
// Recap cards band (summary + template takeaways).
var cards = ""
if let x = result.extras { if let x = result.extras {
if !x.tldr.isEmpty { if !x.tldr.isEmpty {
cards += card("Summary", "<p>\(esc(x.tldr))</p>" body += card("Summary", "<p>\(esc(x.tldr))</p>"
+ (x.primarySpeakers.isEmpty ? "" : "<p class=\"muted\">Primary: \(x.primarySpeakers.map(esc).joined(separator: ", "))</p>")) + (x.primarySpeakers.isEmpty ? "" : "<p class=\"muted\">Primary: \(x.primarySpeakers.map(esc).joined(separator: ", "))</p>"))
} }
for section in x.sections where !section.isEmpty { for section in x.sections where !section.isEmpty {
switch section.kind { switch section.kind {
case .paragraph: case .paragraph:
cards += card(section.title, "<p>\(esc(section.paragraph))</p>") body += card(section.title, "<p>\(esc(section.paragraph))</p>")
case .bullets: case .bullets:
cards += card(section.title, "<ul>" + section.bullets.map { "<li>\(esc($0))</li>" }.joined() + "</ul>") body += card(section.title, "<ul>" + section.bullets.map { "<li>\(esc($0))</li>" }.joined() + "</ul>")
case .items: case .items:
let lis = section.items.map { item -> String in let lis = section.items.map { item -> String in
var s = "<li>\(esc(item.text))" var s = "<li>\(esc(item.text))"
if let who = item.who { s += " <span class=\"who\">\(esc(who))</span>" } if let who = item.who { s += " <strong>\(esc(who))</strong>" }
if let note = item.note { s += " <span class=\"note\">(\(esc(note)))</span>" } if let note = item.note { s += " <span class=\"muted\">(\(esc(note)))</span>" }
if let when = item.when { s += " <span class=\"ts-badge\">\(RecapAnalyzer.mmss(Double(when)))</span>" } if let when = item.when { s += " <span class=\"ts\">\(RecapAnalyzer.mmss(Double(when)))</span>" }
return s + "</li>" return s + "</li>"
}.joined() }.joined()
cards += card(section.title, "<ul>\(lis)</ul>") body += card(section.title, "<ul>\(lis)</ul>")
} }
} }
} }
let band = cards.isEmpty ? "" : "<div class=\"band\">\(cards)</div>"
// Left pane: topic cards (click to jump). data-start/data-end index entries. if !result.sections.isEmpty {
var left = "<div class=\"left\">" var topics = ""
if result.sections.isEmpty { for (i, sec) in result.sections.enumerated() {
left += "<div class=\"empty\">No topic sections.</div>" let range = entries.indices.contains(sec.startIndex) && entries.indices.contains(sec.endIndex)
} else { ? "<span class=\"ts\">\(RecapAnalyzer.mmss(entries[sec.startIndex].offset))\(RecapAnalyzer.mmss(entries[sec.endIndex].end))</span>" : ""
for sec in result.sections { topics += "<details class=\"topic\"><summary><span class=\"tnum\">\(i + 1)</span> \(esc(sec.title)) \(range)</summary>"
let s = max(0, min(sec.startIndex, entries.count - 1)) if !sec.summary.isEmpty { topics += "<p>\(esc(sec.summary))</p>" }
let e = max(s, min(sec.endIndex, entries.count - 1)) topics += "<div class=\"turns\">" + turnsHtml(sec, entries: entries, chip: chip) + "</div></details>"
let time = entries.indices.contains(s) && entries.indices.contains(e)
? "<span class=\"chunk-time\">\(RecapAnalyzer.mmss(entries[s].offset))\(RecapAnalyzer.mmss(entries[e].end))</span>" : ""
left += "<div class=\"chunk\" data-start=\"\(s)\" data-end=\"\(e)\" onclick=\"jump(this)\">"
+ "<div class=\"chunk-title\">\(esc(sec.title))\(time)</div>"
+ (sec.summary.isEmpty ? "" : "<div class=\"chunk-summary\">\(esc(sec.summary))</div>")
+ "</div>"
} }
body += card("Topics", topics)
} }
left += "</div>"
// Right pane: full diarized transcript, one line per turn (id=entry-i). let full = entries.map { "<div class=\"turn\"><span class=\"ts\">\(RecapAnalyzer.mmss($0.offset))</span> \(chip($0.speaker)) <span class=\"txt\">\(esc($0.text))</span></div>" }.joined()
var right = "<div class=\"right\">" body += "<details class=\"topic\" open><summary>Full Transcript</summary><div class=\"turns\">\(full)</div></details>"
if entries.isEmpty {
right += "<div class=\"empty\">No transcript.</div>"
} else {
for (i, en) in entries.enumerated() {
right += "<div class=\"transcript-line\" id=\"entry-\(i)\">"
+ "<span class=\"ts-badge\">\(RecapAnalyzer.mmss(en.offset))</span>"
+ chip(en.speaker)
+ "<span class=\"ts-text\">\(esc(en.text))</span></div>"
}
}
right += "</div>"
let body = header + band + "<div class=\"split\">\(left)\(right)</div>"
return htmlShell(title: esc(title), body: body) return htmlShell(title: esc(title), body: body)
} }
private static func turnsHtml(_ sec: TopicSection, entries: [RecapAnalyzer.Entry],
chip: (String) -> String) -> String {
guard sec.startIndex <= sec.endIndex, entries.indices.contains(sec.startIndex), entries.indices.contains(sec.endIndex)
else { return "" }
return entries[sec.startIndex...sec.endIndex].map {
"<div class=\"turn\"><span class=\"ts\">\(RecapAnalyzer.mmss($0.offset))</span> \(chip($0.speaker)) <span class=\"txt\">\(esc($0.text))</span></div>"
}.joined()
}
private static func card(_ title: String, _ inner: String) -> String { private static func card(_ title: String, _ inner: String) -> String {
"<section class=\"card\"><h2>\(esc(title))</h2>\(inner)</section>" "<section class=\"card\"><h2>\(esc(title))</h2>\(inner)</section>"
} }
@@ -192,63 +176,34 @@ enum RecapRenderer {
<meta name="viewport" content="width=device-width, initial-scale=1"> <meta name="viewport" content="width=device-width, initial-scale=1">
<title>\(title)</title> <title>\(title)</title>
<style> <style>
:root{--bg:#0a0e1a;--panel:#111827;--panel-2:#1e293b;--line:#1e293b;--line-2:#334155; :root{--bg:#15171c;--card:#1d2026;--fg:#e6e8ec;--muted:#9aa0aa;--line:#2a2e36;--accent:#5b8def;}
--fg:#e2e8f0;--fg-dim:#94a3b8;--fg-faint:#64748b;--accent:#818cf8;--accent-soft:#a5b4fc;}
*{box-sizing:border-box} *{box-sizing:border-box}
body{margin:0;background:var(--bg);color:var(--fg);min-height:100vh; body{margin:0;background:var(--bg);color:var(--fg);font:15px/1.55 -apple-system,BlinkMacSystemFont,"Segoe UI",sans-serif;}
font-family:-apple-system,BlinkMacSystemFont,"Segoe UI",Helvetica,Arial,sans-serif;font-size:13px;line-height:1.55} main{max-width:820px;margin:0 auto;padding:32px 20px 80px;}
.header{padding:14px 24px;background:var(--panel);border-bottom:1px solid var(--line); header h1{margin:0 0 4px;font-size:24px}
display:flex;align-items:center;gap:16px;flex-wrap:wrap} .sub{color:var(--muted);font-size:13px}
.header .htext{min-width:0} .legend{margin-top:12px;display:flex;flex-wrap:wrap;gap:6px}
.header h1{margin:0;font-size:16px;font-weight:700;color:var(--fg)} .chip{display:inline-block;padding:1px 8px;border-radius:10px;color:#fff;font-size:12px;font-weight:600}
.header .meta{font-size:11px;color:var(--fg-faint);margin-top:2px;font-variant-numeric:tabular-nums} .card{background:var(--card);border:1px solid var(--line);border-radius:12px;padding:16px 18px;margin-top:18px}
.legend{margin-left:auto;display:flex;flex-wrap:wrap;gap:6px;justify-content:flex-end} .card h2{margin:0 0 10px;font-size:16px;color:var(--accent)}
.chip{display:inline-block;padding:1px 8px;border-radius:999px;color:#fff;font-size:10px;font-weight:700;white-space:nowrap} .muted{color:var(--muted)}
.band{padding:16px 24px;display:grid;gap:12px} ul{margin:0;padding-left:18px} li{margin:4px 0}
.card{background:var(--panel);border:1px solid var(--line);border-radius:10px;padding:14px 16px} ul.actions{list-style:none;padding-left:0}
.card h2{margin:0 0 8px;font-size:11px;font-weight:700;text-transform:uppercase;letter-spacing:.04em;color:var(--accent-soft)} .ts{color:var(--muted);font-variant-numeric:tabular-nums;font-size:12px;margin-right:4px}
.card p{margin:0 0 8px} blockquote{margin:0 0 12px;padding:8px 12px;border-left:3px solid var(--accent);background:#0e0f13;border-radius:0 8px 8px 0}
.card p:last-child{margin-bottom:0} blockquote cite{display:block;color:var(--muted);font-size:12px;margin-top:4px;font-style:normal}
.card .muted{color:var(--fg-dim);font-size:12px} details.topic{border-top:1px solid var(--line);padding:10px 0}
.card ul{margin:0;padding-left:18px} details.topic > summary{cursor:pointer;font-weight:600;list-style:none}
.card li{margin:5px 0;color:var(--fg)} details.topic > summary::-webkit-details-marker{display:none}
.card .who{color:var(--accent-soft);font-weight:600} .tnum{display:inline-block;min-width:20px;color:var(--accent);font-weight:700}
.card .note{color:var(--fg-faint)} .turns{margin-top:10px}
.split{display:flex;min-height:calc(100vh - 56px)} .turn{margin:6px 0;display:flex;gap:8px;align-items:baseline;flex-wrap:wrap}
.left{flex:0 0 42%;max-width:42%;border-right:1px solid var(--line);overflow-y:auto;padding:16px;background:var(--bg)} .turn .txt{flex:1;min-width:60%}
.right{flex:1;min-width:0;overflow-y:auto;padding:16px;background:var(--panel)} @media print{body{background:#fff;color:#000}.card,blockquote{background:#fff;border-color:#ccc}details.topic{}.chip{border:1px solid #999}}
@media(max-width:900px){.split{flex-direction:column}.left,.right{flex:none;max-width:100%;border-right:none}
.left{border-bottom:1px solid var(--line)}}
.chunk{padding:12px 14px;margin-bottom:8px;background:var(--panel);border:1px solid var(--line);
border-radius:10px;cursor:pointer;transition:border-color .15s,background .15s}
.chunk:hover{border-color:var(--accent)}
.chunk.active{border-color:var(--accent);background:rgba(129,140,248,.06);box-shadow:0 2px 16px rgba(129,140,248,.10)}
.chunk-title{font-size:13px;font-weight:700;color:var(--fg);margin-bottom:4px}
.chunk-time{font-size:10px;color:var(--fg-faint);margin-left:6px;font-weight:500;font-family:"SF Mono",Menlo,monospace}
.chunk-summary{font-size:12px;color:var(--fg-dim);line-height:1.5}
.transcript-line{display:flex;gap:10px;padding:4px 8px;border-radius:6px;line-height:1.6;align-items:baseline;scroll-margin-top:16px}
.transcript-line.hl{background:rgba(129,140,248,.10)}
.ts-badge{flex:0 0 auto;font-family:"SF Mono",Menlo,monospace;font-size:11px;color:var(--accent-soft);min-width:52px}
.ts-text{flex:1;font-size:13px;color:var(--fg)}
.empty{padding:32px 16px;text-align:center;color:var(--fg-faint)}
.foot{padding:14px 24px;color:var(--fg-faint);font-size:11px;border-top:1px solid var(--line)}
@media print{body{background:#fff;color:#000}.header,.right,.left,.card,.chunk{background:#fff;border-color:#ccc}
.split{display:block}.left,.right{max-width:100%}.chip{border:1px solid #999}}
</style></head> </style></head>
<body>\(body) <body><main>\(body)
<div class="foot">Ten31 Transcripts · generated on-device</div> <footer class="sub" style="margin-top:40px">Ten31 Transcripts · generated on-device</footer>
<script> </main></body></html>
function jump(el){
document.querySelectorAll('.chunk.active').forEach(function(x){x.classList.remove('active')});
el.classList.add('active');
var s=+el.dataset.start, e=+el.dataset.end;
var t=document.getElementById('entry-'+s);
if(t) t.scrollIntoView({behavior:'smooth',block:'start'});
document.querySelectorAll('.transcript-line.hl').forEach(function(x){x.classList.remove('hl')});
for(var i=s;i<=e;i++){var x=document.getElementById('entry-'+i); if(x) x.classList.add('hl');}
}
</script>
</body></html>
""" """
} }
} }
-51
View File
@@ -1,51 +0,0 @@
import Foundation
/// How long each diarization *body* chunk should be. Smaller chunks keep fewer
/// simultaneous speakers inside one window Sortformer resolves at most ~4 speakers
/// per chunk, and the dual-channel split already spends the local user on the mic
/// track, so the system (remote) channel is what can saturate on a big call. The
/// cost of going smaller: weaker cross-chunk voiceprints, more cross-chunk speaker
/// splitting (the reconciler re-merges some), and more backend round-trips.
enum ChunkMode: String, CaseIterable, Identifiable, Codable {
case auto, standard, largeGroup, fine
var id: String { rawValue }
var label: String {
switch self {
case .auto: return "Auto (by call size)"
case .standard: return "Standard · 2.5 min"
case .largeGroup: return "Large group · 60 sec"
case .fine: return "Fine · 90 sec"
}
}
/// Fixed body length, or nil for `.auto` (resolved from the participant count).
var fixedBodySeconds: Double? {
switch self {
case .auto: return nil
case .standard: return 150
case .largeGroup: return 60
case .fine: return 90
}
}
/// More than this many detected participants makes `.auto` pick the short body,
/// so one chunk is less likely to exceed Sortformer's ~4-speaker resolution.
static let autoLargeThreshold = 4
/// Resolve the body length in seconds. `.auto` drops to 60s when more than
/// `autoLargeThreshold` participants were detected, else uses the 2.5-min default;
/// with no count available (audio-only) it stays at the 2.5-min default.
func bodySeconds(participantCount: Int?) -> Double {
if let fixed = fixedBodySeconds { return fixed }
if let n = participantCount, n > Self.autoLargeThreshold { return 60 }
return 150
}
/// Overlap margin scaled to the body length (~12%, clamped 815s) so a 60s chunk
/// isn't dominated by a fixed 15s margin while a 2.5-min chunk keeps the full 15s.
static func overlapSeconds(forBody body: Double) -> Double {
max(8, min(15, (body * 0.12).rounded()))
}
}
@@ -99,11 +99,6 @@ final class SessionController: ObservableObject {
/// Bumped each time a start/stop Task is spawned (Task is a value type, so this /// Bumped each time a start/stop Task is spawned (Task is a value type, so this
/// is how `prepareForTermination` detects a newly-spawned transition). /// is how `prepareForTermination` detects a newly-spawned transition).
private var lifecycleGeneration = 0 private var lifecycleGeneration = 0
/// The meeting-name prompt currently on screen, if any, so a quit can end it
/// instead of blocking termination on user input (set in `askMeetingName`).
private weak var activeNamingAlert: NSAlert?
/// Set once `prepareForTermination` begins, so we skip the post-stop naming prompt.
private var isTerminating = false
init(settings: AppSettings) { init(settings: AppSettings) {
self.settings = settings self.settings = settings
@@ -261,9 +256,6 @@ final class SessionController: ObservableObject {
private func startVisual(t0Host: Double, generation: Int, recorder: AudioRecorder) async { private func startVisual(t0Host: Double, generation: Int, recorder: AudioRecorder) async {
guard let capture = pendingCapture else { return } // manual recording audio-only guard let capture = pendingCapture else { return } // manual recording audio-only
pendingCapture = nil pendingCapture = nil
// Honor the per-app adapter switch: if the user turned this app's adapter off,
// skip screen-reading entirely and record audio-only (transcription still runs).
guard settings.adapterEnabled[capture.app.rawValue] ?? true else { return }
guard let vc = VisualCapture(app: capture.app, bundleID: capture.bundleID, guard let vc = VisualCapture(app: capture.app, bundleID: capture.bundleID,
windowID: capture.windowID, t0Host: t0Host) else { return } windowID: capture.windowID, t0Host: t0Host) else { return }
// Register the live capture before the await so a quit (prepareForTermination) // Register the live capture before the await so a quit (prepareForTermination)
@@ -329,9 +321,6 @@ final class SessionController: ObservableObject {
lifecycleTask = Task { lifecycleTask = Task {
let result = await recorder.stop() let result = await recorder.stop()
let visual = await self.stopVisualAndTimeline(result, folder: folder) let visual = await self.stopVisualAndTimeline(result, folder: folder)
// Interactive stop only: ask for a meeting name and give the folder a
// readable name before `finish()` captures it for backend processing.
self.promptMeetingNameAndRename()
self.finish(result, timeline: visual.timeline, selfSpans: visual.selfSpans, visualRan: visual.visualRan) self.finish(result, timeline: visual.timeline, selfSpans: visual.selfSpans, visualRan: visual.visualRan)
} }
} }
@@ -346,18 +335,13 @@ final class SessionController: ObservableObject {
if let folder = currentFolder { if let folder = currentFolder {
writeSelfSpans(spans: selfSpans, result: result, to: folder) writeSelfSpans(spans: selfSpans, result: result, to: folder)
let visualCount = visualRan ? timeline.count : nil // `timeline` is the remote vision segments let visualCount = visualRan ? timeline.count : nil // `timeline` is the remote vision segments
// Re-derive the track URLs from `folder`: a meeting-name rename may have
// moved the session after `result` captured its original paths.
let micURL = folder.appendingPathComponent("mic.wav")
let systemURL = folder.appendingPathComponent("system.wav")
let mixedURL = folder.appendingPathComponent("mixed_mono_16k.wav")
lastSession = SessionInfo( lastSession = SessionInfo(
folder: folder, mixedURL: mixedURL, folder: folder, mixedURL: result.mixedURL,
duration: result.duration, selfSpanCount: selfSpans.count, duration: result.duration, selfSpanCount: selfSpans.count,
visualSegmentCount: visualCount) visualSegmentCount: visualCount)
lastProcess = ProcessInputs( lastProcess = ProcessInputs(
folder: folder, sessionId: folder.lastPathComponent, app: currentLabel, folder: folder, sessionId: folder.lastPathComponent, app: currentLabel,
micURL: micURL, systemURL: systemURL, mixedURL: mixedURL, micURL: result.micURL, systemURL: result.systemURL, mixedURL: result.mixedURL,
timeline: timeline, selfSpans: selfSpans, selfName: settings.selfName, timeline: timeline, selfSpans: selfSpans, selfName: settings.selfName,
systemHealthy: result.systemNote == nil) systemHealthy: result.systemNote == nil)
} }
@@ -391,15 +375,12 @@ final class SessionController: ObservableObject {
let settings = self.settings let settings = self.settings
let pipeline = TranscriptPipeline(baseURL: settings.backendBaseURL, let pipeline = TranscriptPipeline(baseURL: settings.backendBaseURL,
skipTLS: settings.skipTLSVerification, voiceprints: voiceprints) skipTLS: settings.skipTLSVerification, voiceprints: voiceprints)
// Resolve the diarization chunk length from the setting; "Auto" uses the
// participant count the visual capture saw for this session.
let chunkSeconds = settings.chunk.bodySeconds(participantCount: Self.participantCount(in: inputs.folder))
do { do {
let speakers = try await pipeline.process( let speakers = try await pipeline.process(
sessionFolder: inputs.folder, sessionId: inputs.sessionId, app: inputs.app, sessionFolder: inputs.folder, sessionId: inputs.sessionId, app: inputs.app,
micURL: inputs.micURL, systemURL: inputs.systemURL, mixedURL: inputs.mixedURL, micURL: inputs.micURL, systemURL: inputs.systemURL, mixedURL: inputs.mixedURL,
timeline: inputs.timeline, selfSpans: inputs.selfSpans, selfName: inputs.selfName, timeline: inputs.timeline, selfSpans: inputs.selfSpans, selfName: inputs.selfName,
systemHealthy: inputs.systemHealthy, chunkSeconds: chunkSeconds, systemHealthy: inputs.systemHealthy,
progress: { done, total in await MainActor.run { self.transcriptStatus = .processing(done, total) } }) progress: { done, total in await MainActor.run { self.transcriptStatus = .processing(done, total) } })
self.transcriptStatus = .done(speakers: speakers.speakers.count, segments: speakers.segments.count) self.transcriptStatus = .done(speakers: speakers.speakers.count, segments: speakers.segments.count)
try Task.checkCancellation() try Task.checkCancellation()
@@ -432,13 +413,24 @@ final class SessionController: ObservableObject {
guard settings.recapEnabled, !resolved.segments.isEmpty else { return } guard settings.recapEnabled, !resolved.segments.isEmpty else { return }
let analyzer = RecapAnalyzer(llm: llm, model: model) let analyzer = RecapAnalyzer(llm: llm, model: model)
guard let result = try? await analyzer.recap(file: resolved, template: settings.defaultTemplate) else { return } guard let result = try? await analyzer.recap(file: resolved, template: settings.defaultTemplate) else { return }
let title = SessionNaming.recapTitle(app: inputs.app, sessionId: inputs.sessionId) let title = Self.recapTitle(app: inputs.app, sessionId: inputs.sessionId)
try? RecapRenderer.write(file: resolved, result: result, title: title, to: inputs.folder) try? RecapRenderer.write(file: resolved, result: result, title: title, to: inputs.folder)
try? RecapFile(title: title, result: result).write(to: inputs.folder.appendingPathComponent("recap.json")) try? RecapFile(title: title, result: result).write(to: inputs.folder.appendingPathComponent("recap.json"))
let url = inputs.folder.appendingPathComponent("recap.html") let url = inputs.folder.appendingPathComponent("recap.html")
if FileManager.default.fileExists(atPath: url.path) { self.recapURL = url } if FileManager.default.fileExists(atPath: url.path) { self.recapURL = url }
} }
/// Friendly recap title, e.g. "Google Meet call 2026-06-06 11:43".
private static func recapTitle(app: String, sessionId: String) -> String {
let appName = CallDetector.DetectedApp(rawValue: app)?.display ?? app.capitalized
let stamp = sessionId.split(separator: "_").first.map(String.init) ?? sessionId
let parts = stamp.split(separator: "T")
let date = parts.first.map(String.init) ?? ""
let timeBits = parts.count > 1 ? parts[1].split(separator: "-") : []
let time = timeBits.count >= 2 ? "\(timeBits[0]):\(timeBits[1])" : ""
return "\(appName) call — \(date) \(time)".trimmingCharacters(in: .whitespaces)
}
// MARK: - Speaker corrections // MARK: - Speaker corrections
/// True once the last session has a transcribed `speakers.json` to correct. /// True once the last session has a transcribed `speakers.json` to correct.
@@ -536,16 +528,6 @@ final class SessionController: ObservableObject {
} }
} }
/// Detected participant count from a session's visual timeline, for "Auto" chunk
/// sizing. Nil when there's no visual timeline (audio-only) so callers keep the
/// default body length. Counts everyone OCR'd on the call, not just who spoke.
private static func participantCount(in folder: URL) -> Int? {
guard let data = try? Data(contentsOf: folder.appendingPathComponent("visual_timeline.json")),
let vt = try? JSONDecoder().decode(VisualTimeline.self, from: data),
!vt.participants.isEmpty else { return nil }
return vt.participants.count
}
/// The remote (vision) visual-timeline segments saved for a session, if any. /// The remote (vision) visual-timeline segments saved for a session, if any.
private static func remoteTimeline(in folder: URL) -> [VisualTimeline.Segment] { private static func remoteTimeline(in folder: URL) -> [VisualTimeline.Segment] {
guard let data = try? Data(contentsOf: folder.appendingPathComponent("visual_timeline.json")), guard let data = try? Data(contentsOf: folder.appendingPathComponent("visual_timeline.json")),
@@ -586,11 +568,6 @@ final class SessionController: ObservableObject {
/// its WAV headers are finalized before the process exits. Handles quit while /// its WAV headers are finalized before the process exits. Handles quit while
/// `.starting` and `.finishing`, not just `.recording`. /// `.starting` and `.finishing`, not just `.recording`.
func prepareForTermination() async { func prepareForTermination() async {
isTerminating = true
// If the meeting-name prompt is open, end its modal loop so quit isn't blocked
// waiting on the user the session keeps its auto timestamped name. (Falls
// back to the user answering the on-screen dialog if the abort isn't serviced.)
if activeNamingAlert != nil { NSApp.abortModal() }
// Cancel any in-flight backend transcription (audio is already saved; the // Cancel any in-flight backend transcription (audio is already saved; the
// user can resend). The pipeline's checkCancellation + defer clean up chunks. // user can resend). The pipeline's checkCancellation + defer clean up chunks.
processTask?.cancel() processTask?.cancel()
@@ -656,59 +633,6 @@ final class SessionController: ObservableObject {
return f.string(from: Date()) return f.string(from: Date())
} }
/// Ask the user to name the just-finished recording, then rename its folder to
/// a readable `<date>_<name>_<app>` (dropping the HH-MM-SS auto stamp). Skipping
/// or leaving it blank keeps the timestamped name. Must run BEFORE `finish()` so
/// the renamed folder is what flows to backend processing. The recorder and
/// visual capture have both finished by now, so every session file is closed and
/// the move is safe. Never called from the quit path we don't block a quit on
/// a prompt.
private func promptMeetingNameAndRename() {
// A quit can begin while we're finishing don't put a blocking prompt in its
// way; keep the auto timestamped name and let termination drain.
guard !isTerminating, let folder = currentFolder,
let name = askMeetingName() else { return } // nil = skipped / blank
let base = folder.deletingLastPathComponent()
let date = SessionNaming.datePrefix(ofSessionNamed: folder.lastPathComponent)
let fm = FileManager.default
var counter = 0
while counter < 100 {
guard let leaf = SessionNaming.renamedLeaf(
date: date, app: currentLabel, meetingName: name, counter: counter) else { return }
let target = base.appendingPathComponent(leaf, isDirectory: true)
if fm.fileExists(atPath: target.path) { counter += 1; continue } // disambiguate
do {
try fm.moveItem(at: folder, to: target)
currentFolder = target
} catch {
NSLog("Session rename to “\(leaf)” failed: \(error.localizedDescription)") // keep the original folder
}
return
}
NSLog("Session rename: kept “\(folder.lastPathComponent)” — 100 name collisions")
}
/// Modal prompt for a meeting name. Registers the alert so `prepareForTermination`
/// can end it on quit. Returns the trimmed name, or nil if the user skipped, left
/// it empty, or a quit aborted the prompt (caller keeps the auto folder name).
private func askMeetingName() -> String? {
let alert = NSAlert()
alert.messageText = "Name this recording"
alert.informativeText = "Give the meeting a name so its folder is easy to find in your sessions. Leave blank to keep the timestamped name."
alert.addButton(withTitle: "Save") // .alertFirstButtonReturn
alert.addButton(withTitle: "Skip") // .alertSecondButtonReturn
let field = NSTextField(frame: NSRect(x: 0, y: 0, width: 240, height: 24))
field.placeholderString = "Meeting name"
alert.accessoryView = field
alert.window.initialFirstResponder = field
NSApp.activate(ignoringOtherApps: true)
activeNamingAlert = alert
defer { activeNamingAlert = nil }
guard alert.runModal() == .alertFirstButtonReturn else { return nil }
let text = field.stringValue.trimmingCharacters(in: .whitespacesAndNewlines)
return text.isEmpty ? nil : text
}
/// Debug artifact: the channel-verified "self" spans actually sent to the backend /// Debug artifact: the channel-verified "self" spans actually sent to the backend
/// as `self_vad` (mic active AND louder than system). Lets us eyeball self detection. /// as `self_vad` (mic active AND louder than system). Lets us eyeball self detection.
private func writeSelfSpans(spans: [VADSpan], result: RecordingResult, to folder: URL) { private func writeSelfSpans(spans: [VADSpan], result: RecordingResult, to folder: URL) {
@@ -1,71 +0,0 @@
import Foundation
/// Pure helpers for session-folder names. A session folder is created at start
/// with an auto name `<yyyy-MM-dd'T'HH-mm-ss>_<app>`; when the user names the
/// recording on stop it's renamed to `<yyyy-MM-dd>_<name>_<app>` (no HH-MM-SS),
/// which is far easier to scan in `sessions/`. The app label always stays the
/// LAST `_`-separated segment so `SessionController.appLabel(from:)` keeps working
/// even when the meeting name itself contains spaces or underscores.
enum SessionNaming {
/// Filesystem- and parse-safe meeting name: trims, turns path separators into
/// dashes, drops control characters, collapses whitespace runs, removes leading
/// dots (no hidden/`.`/`..` folders), and caps the length. Returns "" if nothing
/// usable is left, which callers treat as "skip the rename".
static func sanitize(_ raw: String) -> String {
var s = raw.trimmingCharacters(in: .whitespacesAndNewlines)
// Path-hostile separators (`/` and the classic Mac `:`, plus `\`) dash.
s = s.components(separatedBy: CharacterSet(charactersIn: "/:\\")).joined(separator: "-")
// Strip control characters outright.
s = s.components(separatedBy: .controlCharacters).joined()
// Collapse internal whitespace runs to single spaces.
s = s.split(whereSeparator: { $0 == " " || $0 == "\t" }).joined(separator: " ")
while s.hasPrefix(".") { s.removeFirst() }
s = s.trimmingCharacters(in: .whitespaces)
if s.count > 60 { s = String(s.prefix(60)).trimmingCharacters(in: .whitespaces) }
return s
}
/// The date prefix of a session leaf name, e.g. `2026-06-17T09-59-48_signal`
/// `2026-06-17`. Already-renamed leaves (`2026-06-17_name_signal`) return the
/// same date, so this is safe to call on either form.
static func datePrefix(ofSessionNamed leaf: String) -> String {
let head = leaf.split(separator: "_").first.map(String.init) ?? leaf
return head.split(separator: "T").first.map(String.init) ?? head
}
/// Compose the renamed leaf `<date>_<name>_<app>`. A positive `counter`
/// disambiguates a collision by suffixing the NAME segment (`<name>-2`) so the
/// trailing `_<app>` stays parseable. Returns nil when the name sanitizes to
/// empty (the caller keeps the auto timestamped name).
static func renamedLeaf(date: String, app: String, meetingName: String, counter: Int = 0) -> String? {
let clean = sanitize(meetingName)
guard !clean.isEmpty else { return nil }
let suffix = counter > 0 ? "-\(counter + 1)" : ""
return "\(date)_\(clean)\(suffix)_\(app)"
}
/// Friendly recap title from a session id, understanding both folder forms:
/// `2026-06-06T11-43-02_meet` "Google Meet call 2026-06-06 11:43"
/// `2026-06-06_Weekly sync_meet` "Weekly sync Google Meet (2026-06-06)"
static func recapTitle(app: String, sessionId: String) -> String {
let appName = CallDetector.DetectedApp(rawValue: app)?.display ?? app.capitalized
var parts = sessionId.split(separator: "_").map(String.init)
if parts.count > 1 { parts.removeLast() } // drop the trailing "_<app>"
let head = parts.first ?? sessionId
let tBits = head.split(separator: "T").map(String.init)
let date = tBits.first ?? head
let time: String = {
guard tBits.count > 1 else { return "" }
let b = tBits[1].split(separator: "-")
return b.count >= 2 ? "\(b[0]):\(b[1])" : ""
}()
let when = [date, time].filter { !$0.isEmpty }.joined(separator: " ")
// Rejoin with "_" the faithful inverse of split("_") so a name that
// itself contained underscores survives the round-trip through the folder name.
let name = parts.count > 1 ? parts[1...].joined(separator: "_") : ""
if name.isEmpty {
return "\(appName) call — \(when)".trimmingCharacters(in: .whitespaces)
}
return "\(name)\(appName) (\(when))".trimmingCharacters(in: .whitespaces)
}
}
@@ -28,7 +28,6 @@ final class TranscriptPipeline {
selfSpans: [VADSpan], selfSpans: [VADSpan],
selfName: String, selfName: String,
systemHealthy: Bool, systemHealthy: Bool,
chunkSeconds: Double = 150,
progress: ((Int, Int) async -> Void)? = nil) async throws -> SpeakersFile { progress: ((Int, Int) async -> Void)? = nil) async throws -> SpeakersFile {
let fm = FileManager.default let fm = FileManager.default
let dual = systemHealthy let dual = systemHealthy
@@ -37,12 +36,7 @@ final class TranscriptPipeline {
let duration = dual let duration = dual
? max(SessionPackager.duration(of: micURL), SessionPackager.duration(of: systemURL)) ? max(SessionPackager.duration(of: micURL), SessionPackager.duration(of: systemURL))
: SessionPackager.duration(of: mixedURL) : SessionPackager.duration(of: mixedURL)
// Chunk to the requested body length; overlap and the single-chunk threshold let plan = SessionPackager.planChunks(durationSec: duration)
// scale with it (a 60s body shouldn't be cut by a fixed 15s margin or stay
// unchunked below the 2.5-min default threshold).
let overlap = ChunkMode.overlapSeconds(forBody: chunkSeconds)
let plan = SessionPackager.planChunks(durationSec: duration, chunkSeconds: chunkSeconds,
overlapSeconds: overlap, thresholdSec: chunkSeconds * 1.2)
// Zero-duration / empty session a valid empty speakers.json, no backend call. // Zero-duration / empty session a valid empty speakers.json, no backend call.
if plan.isEmpty || duration <= 0 { if plan.isEmpty || duration <= 0 {
@@ -56,20 +50,13 @@ final class TranscriptPipeline {
try? fm.createDirectory(at: chunksDir, withIntermediateDirectories: true) try? fm.createDirectory(at: chunksDir, withIntermediateDirectories: true)
defer { try? fm.removeItem(at: chunksDir) } // cleanup on success OR throw defer { try? fm.removeItem(at: chunksDir) } // cleanup on success OR throw
// Defensive: drop any visual span covering most of the call in one unbroken
// segment the signature of a stuck/false active-speaker cue (e.g. a solid
// camera-off tile read as "speaking" the whole call). Such a span would
// dominate the backend's name attribution and collapse every voice onto one
// name. Also salvages sessions captured before the adapter fix landed.
let vis = Self.dropStuckSpans(timeline, duration: duration)
// Start from stored voiceprints; accumulate this call's prints across chunks // Start from stored voiceprints; accumulate this call's prints across chunks
// for within-call unification (the store only persists high-confidence ones). // for within-call unification (the store only persists high-confidence ones).
var known = voiceprints.knownVoiceprints() var known = voiceprints.knownVoiceprints()
var results: [TranscriptAssembler.ChunkResult] = [] var results: [TranscriptAssembler.ChunkResult] = []
// Mono fallback needs self folded into the timeline; dual sends it separately. // Mono fallback needs self folded into the timeline; dual sends it separately.
let monoTimeline = dual ? vis let monoTimeline = dual ? timeline
: vis + Self.timeline(fromSelfSpans: selfSpans, selfName: selfName) : timeline + Self.timeline(fromSelfSpans: selfSpans, selfName: selfName)
for chunk in plan { for chunk in plan {
try Task.checkCancellation() try Task.checkCancellation()
@@ -83,7 +70,7 @@ final class TranscriptPipeline {
try SessionPackager.sliceAudio(from: micURL, startSec: chunk.start, endSec: chunk.end, to: micChunk) try SessionPackager.sliceAudio(from: micURL, startSec: chunk.start, endSec: chunk.end, to: micChunk)
try SessionPackager.sliceAudio(from: systemURL, startSec: chunk.start, endSec: chunk.end, to: sysChunk) try SessionPackager.sliceAudio(from: systemURL, startSec: chunk.start, endSec: chunk.end, to: sysChunk)
guard fm.fileExists(atPath: micChunk.path), fm.fileExists(atPath: sysChunk.path) else { continue } guard fm.fileExists(atPath: micChunk.path), fm.fileExists(atPath: sysChunk.path) else { continue }
let timelineData = try SessionPackager.rebasedTimelineData(vis, start: chunk.start, end: chunk.end) let timelineData = try SessionPackager.rebasedTimelineData(timeline, start: chunk.start, end: chunk.end)
let selfVadData = try SessionPackager.rebasedSelfVadData(selfSpans, start: chunk.start, end: chunk.end) let selfVadData = try SessionPackager.rebasedSelfVadData(selfSpans, start: chunk.start, end: chunk.end)
response = try await client.labelMergeDual( response = try await client.labelMergeDual(
micURL: micChunk, systemURL: sysChunk, selfName: selfName, selfVad: selfVadData, micURL: micChunk, systemURL: sysChunk, selfName: selfName, selfVad: selfVadData,
@@ -121,19 +108,9 @@ final class TranscriptPipeline {
return assembled.speakersFile return assembled.speakersFile
} }
/// Build the `label-merge` timeline from mic-VAD self spans; the visual /// Build the `label-merge` timeline from mic-VAD self spans (Phase 1/2). Once
/// adapters' segments are merged in alongside these. /// the visual adapters land (Phase 34), their segments are merged in too.
static func timeline(fromSelfSpans spans: [VADSpan], selfName: String) -> [VisualTimeline.Segment] { static func timeline(fromSelfSpans spans: [VADSpan], selfName: String) -> [VisualTimeline.Segment] {
spans.map { .init(start: $0.start, end: $0.end, name: selfName, confidence: $0.confidence, source: "mic_vad") } spans.map { .init(start: $0.start, end: $0.end, name: selfName, confidence: $0.confidence, source: "mic_vad") }
} }
/// Drop visual (vision-source) spans whose single unbroken duration covers at
/// least `maxFraction` of the whole call no one legitimately speaks that long
/// without a break, so it's a stuck/false cue. Self spans (mic_vad) are kept.
static func dropStuckSpans(_ timeline: [VisualTimeline.Segment], duration: Double,
maxFraction: Double = 0.6) -> [VisualTimeline.Segment] {
guard duration > 0 else { return timeline }
let limit = maxFraction * duration
return timeline.filter { $0.source != "vision" || ($0.end - $0.start) < limit }
}
} }
+4 -26
View File
@@ -3,8 +3,8 @@ import Combine
/// User-facing settings, persisted to `UserDefaults`. /// User-facing settings, persisted to `UserDefaults`.
/// ///
/// Covers the backend host + TLS handling, output folder, your name, chunk /// Phase 0 scope: backend host + TLS-skip, output folder, and adapter toggles.
/// length, per-app adapter toggles, and the auto-record/auto-send/recap flags. /// The adapter toggles persist but do nothing yet (adapters arrive in Phase 34).
@MainActor @MainActor
final class AppSettings: ObservableObject { final class AppSettings: ObservableObject {
@@ -60,15 +60,6 @@ final class AppSettings: ObservableObject {
didSet { defaults.set(reconcileSpeakers, forKey: Keys.reconcileSpeakers) } didSet { defaults.set(reconcileSpeakers, forKey: Keys.reconcileSpeakers) }
} }
/// Diarization chunk length (raw value of `ChunkMode`). `.auto` shrinks chunks on
/// large calls so a window is less likely to exceed Sortformer's ~4-speaker cap.
@Published var chunkMode: String {
didSet { defaults.set(chunkMode, forKey: Keys.chunkMode) }
}
/// Typed accessor for `chunkMode`.
var chunk: ChunkMode { ChunkMode(rawValue: chunkMode) ?? .auto }
/// User-editable recap templates (takeaways categories per meeting type). /// User-editable recap templates (takeaways categories per meeting type).
@Published var recapTemplates: [RecapTemplate] { @Published var recapTemplates: [RecapTemplate] {
didSet { persist(recapTemplates, forKey: Keys.recapTemplates) } didSet { persist(recapTemplates, forKey: Keys.recapTemplates) }
@@ -92,24 +83,13 @@ final class AppSettings: ObservableObject {
private let defaults: UserDefaults private let defaults: UserDefaults
/// Neutral placeholder. The real (private LAN) backend host is never committed
/// it's entered in Settings (persisted to UserDefaults) or seeded from the
/// `SPARK_BACKEND_URL` env var for dev/CI/harness runs.
static let defaultBackendURL = "https://your-spark-backend.local"
init(defaults: UserDefaults = .standard) { init(defaults: UserDefaults = .standard) {
self.defaults = defaults self.defaults = defaults
// Precedence: a value the user saved in Settings wins; else the env var
// (handy when launching from Xcode/terminal); else the placeholder.
self.backendBaseURL = defaults.string(forKey: Keys.backendBaseURL) self.backendBaseURL = defaults.string(forKey: Keys.backendBaseURL)
?? ProcessInfo.processInfo.environment["SPARK_BACKEND_URL"] ?? "https://immense-voyage.local:62419"
?? Self.defaultBackendURL
// Off by default: install the Start9 Root CA in the System keychain and the self.skipTLSVerification = defaults.object(forKey: Keys.skipTLS) as? Bool ?? true
// backend's cert validates normally. The bypass is an opt-in escape hatch and,
// when on, is scoped to the configured host (see `InsecureTrustDelegate`).
self.skipTLSVerification = defaults.object(forKey: Keys.skipTLS) as? Bool ?? false
self.outputFolderPath = defaults.string(forKey: Keys.outputFolder) self.outputFolderPath = defaults.string(forKey: Keys.outputFolder)
?? "~/Ten31Transcripts" ?? "~/Ten31Transcripts"
@@ -124,7 +104,6 @@ final class AppSettings: ObservableObject {
self.autoSendOnStop = defaults.object(forKey: Keys.autoSend) as? Bool ?? false self.autoSendOnStop = defaults.object(forKey: Keys.autoSend) as? Bool ?? false
self.recapEnabled = defaults.object(forKey: Keys.recapEnabled) as? Bool ?? true self.recapEnabled = defaults.object(forKey: Keys.recapEnabled) as? Bool ?? true
self.reconcileSpeakers = defaults.object(forKey: Keys.reconcileSpeakers) as? Bool ?? true self.reconcileSpeakers = defaults.object(forKey: Keys.reconcileSpeakers) as? Bool ?? true
self.chunkMode = defaults.string(forKey: Keys.chunkMode) ?? ChunkMode.auto.rawValue
let loaded = (defaults.data(forKey: Keys.recapTemplates)) let loaded = (defaults.data(forKey: Keys.recapTemplates))
.flatMap { try? JSONDecoder().decode([RecapTemplate].self, from: $0) } .flatMap { try? JSONDecoder().decode([RecapTemplate].self, from: $0) }
@@ -147,7 +126,6 @@ final class AppSettings: ObservableObject {
static let autoSend = "autoSendOnStop" static let autoSend = "autoSendOnStop"
static let recapEnabled = "recapEnabled" static let recapEnabled = "recapEnabled"
static let reconcileSpeakers = "reconcileSpeakers" static let reconcileSpeakers = "reconcileSpeakers"
static let chunkMode = "chunkMode"
static let recapTemplates = "recapTemplates" static let recapTemplates = "recapTemplates"
static let defaultTemplate = "defaultTemplateId" static let defaultTemplate = "defaultTemplateId"
} }
+2
View File
@@ -30,6 +30,8 @@
<string>Ten31</string> <string>Ten31</string>
<key>NSMicrophoneUsageDescription</key> <key>NSMicrophoneUsageDescription</key>
<string>Ten31 Transcripts records your microphone during calls to build the local audio track.</string> <string>Ten31 Transcripts records your microphone during calls to build the local audio track.</string>
<key>NSAppleEventsUsageDescription</key>
<string>Ten31 Transcripts reads the active browser tab's URL to detect Google Meet calls.</string>
<key>NSLocalNetworkUsageDescription</key> <key>NSLocalNetworkUsageDescription</key>
<string>Ten31 Transcripts connects to your SparkControl server on the local network.</string> <string>Ten31 Transcripts connects to your SparkControl server on the local network.</string>
<key>NSAppTransportSecurity</key> <key>NSAppTransportSecurity</key>
-29
View File
@@ -31,35 +31,6 @@ final class EditorWindow {
} }
} }
/// Hosts the app Settings in a standalone resizable window. Far roomier than the
/// old in-popover NavigationLink, which cramped the form into the 320pt menu-bar
/// panel and hid most controls below a non-obvious scroll.
@MainActor
final class SettingsWindow {
static let shared = SettingsWindow()
private var window: NSWindow?
func show(settings: AppSettings) {
if let window {
NSApp.activate(ignoringOtherApps: true)
window.makeKeyAndOrderFront(nil)
return
}
let w = NSWindow(
contentRect: NSRect(x: 0, y: 0, width: 520, height: 660),
styleMask: [.titled, .closable, .resizable, .miniaturizable],
backing: .buffered, defer: false)
w.title = "Settings"
w.isReleasedWhenClosed = false
w.center()
w.contentViewController = NSHostingController(
rootView: SettingsView().environmentObject(settings))
window = w
NSApp.activate(ignoringOtherApps: true)
w.makeKeyAndOrderFront(nil)
}
}
/// Hosts the recap-templates manager in its own resizable window. /// Hosts the recap-templates manager in its own resizable window.
@MainActor @MainActor
final class TemplatesWindow { final class TemplatesWindow {
+6 -2
View File
@@ -10,6 +10,7 @@ struct MenuBarView: View {
@EnvironmentObject private var session: SessionController @EnvironmentObject private var session: SessionController
var body: some View { var body: some View {
NavigationStack {
VStack(alignment: .leading, spacing: 12) { VStack(alignment: .leading, spacing: 12) {
header header
Divider() Divider()
@@ -23,6 +24,7 @@ struct MenuBarView: View {
} }
.padding(14) .padding(14)
.frame(width: 320) .frame(width: 320)
}
.onAppear { permissions.refresh() } .onAppear { permissions.refresh() }
.task { await refreshHealth() } .task { await refreshHealth() }
} }
@@ -173,7 +175,7 @@ struct MenuBarView: View {
private var header: some View { private var header: some View {
VStack(alignment: .leading, spacing: 2) { VStack(alignment: .leading, spacing: 2) {
Text("Ten31 Transcripts").font(.headline) Text("Ten31 Transcripts").font(.headline)
Text("Setup & status") Text("Phase 0 · setup & status")
.font(.caption) .font(.caption)
.foregroundStyle(.secondary) .foregroundStyle(.secondary)
} }
@@ -225,7 +227,9 @@ struct MenuBarView: View {
private var footer: some View { private var footer: some View {
HStack { HStack {
Button("Settings…") { SettingsWindow.shared.show(settings: settings) } NavigationLink("Settings…") {
SettingsView()
}
Spacer() Spacer()
Button("Quit") { NSApplication.shared.terminate(nil) } Button("Quit") { NSApplication.shared.terminate(nil) }
} }
+5 -31
View File
@@ -7,21 +7,6 @@ struct SettingsView: View {
var body: some View { var body: some View {
Form { Form {
Section("Your name") {
TextField("Your name", text: $settings.selfName)
.textFieldStyle(.roundedBorder)
if isDefaultName {
Label("Still set to the default. Enter your real name so your own voice is labeled correctly — and so the AI never gives your name to someone else.",
systemImage: "exclamationmark.triangle.fill")
.font(.caption)
.foregroundStyle(.orange)
} else {
Text("Labels your microphone channel as you in every transcript, and reserves this name so its never assigned to another speaker.")
.font(.caption)
.foregroundStyle(.secondary)
}
}
Section("SparkControl backend") { Section("SparkControl backend") {
TextField("Base URL", text: $settings.backendBaseURL) TextField("Base URL", text: $settings.backendBaseURL)
.textFieldStyle(.roundedBorder) .textFieldStyle(.roundedBorder)
@@ -37,14 +22,10 @@ struct SettingsView: View {
} }
Section("Transcription") { Section("Transcription") {
TextField("Your name", text: $settings.selfName)
.textFieldStyle(.roundedBorder)
Toggle("Auto-send recordings to backend", isOn: $settings.autoSendOnStop) Toggle("Auto-send recordings to backend", isOn: $settings.autoSendOnStop)
Toggle("Reconcile speakers (merge splits + name from content)", isOn: $settings.reconcileSpeakers) Toggle("Reconcile speakers (merge splits + name from content)", isOn: $settings.reconcileSpeakers)
Picker("Chunk length", selection: $settings.chunkMode) {
ForEach(ChunkMode.allCases) { Text($0.label).tag($0.rawValue) }
}
Text("How finely audio is split for diarization. Shorter chunks keep fewer simultaneous speakers per window (the diarizer resolves ~4 at a time), at some cost to speed and voice matching. Auto uses 60-sec chunks when more than \(ChunkMode.autoLargeThreshold) people are detected on the call, else 2.5 min.")
.font(.caption)
.foregroundStyle(.secondary)
Toggle("Build readable recap (topics + highlights)", isOn: $settings.recapEnabled) Toggle("Build readable recap (topics + highlights)", isOn: $settings.recapEnabled)
HStack { HStack {
Picker("Default recap template", selection: $settings.defaultTemplateId) { Picker("Default recap template", selection: $settings.defaultTemplateId) {
@@ -52,7 +33,7 @@ struct SettingsView: View {
} }
Button("Manage…") { TemplatesWindow.shared.show(settings: settings) } Button("Manage…") { TemplatesWindow.shared.show(settings: settings) }
} }
Text("Auto-send transcribes on stop; the recap writes transcript.md + recap.html. Templates define the takeaways categories per meeting type.") Text("Your name labels your mic channel. Auto-send transcribes on stop; the recap writes transcript.md + recap.html. Templates define the takeaways categories per meeting type.")
.font(.caption) .font(.caption)
.foregroundStyle(.secondary) .foregroundStyle(.secondary)
} }
@@ -69,7 +50,7 @@ struct SettingsView: View {
} }
Section("Adapters") { Section("Adapters") {
Text("Screen-reading for active-speaker cues. Turn one off to record that app audio-only — transcription still runs, but speakers arent identified from the screen.") Text("Inert in Phase 0 — these toggles only persist for now.")
.font(.caption) .font(.caption)
.foregroundStyle(.secondary) .foregroundStyle(.secondary)
ForEach(AppSettings.adapterKeys, id: \.key) { adapter in ForEach(AppSettings.adapterKeys, id: \.key) { adapter in
@@ -78,17 +59,10 @@ struct SettingsView: View {
} }
} }
.formStyle(.grouped) .formStyle(.grouped)
.frame(minWidth: 460, idealWidth: 520, maxWidth: .infinity, .frame(width: 320)
minHeight: 520, idealHeight: 660, maxHeight: .infinity)
.navigationTitle("Settings") .navigationTitle("Settings")
} }
/// True while the user still has the placeholder name drives the inline nudge.
private var isDefaultName: Bool {
let n = settings.selfName.trimmingCharacters(in: .whitespacesAndNewlines)
return n.isEmpty || n.caseInsensitiveCompare("Me") == .orderedSame
}
private func binding(for key: String) -> Binding<Bool> { private func binding(for key: String) -> Binding<Bool> {
Binding( Binding(
get: { settings.adapterEnabled[key] ?? true }, get: { settings.adapterEnabled[key] ?? true },
@@ -120,43 +120,6 @@ struct FrameSampler {
return points return points
} }
/// Grid-sampled saturated pixels that lie on a THIN structure (a non-saturated
/// pixel within `edgeGap` on some axis) the coloured counterpart of
/// `thinWhitePoints`. This keeps a thin speaking BORDER/ring/pill but drops the
/// solid interior of a colour FILL (e.g. Meet's orange/magenta camera-off avatar
/// tiles), whose pixels are surrounded by the same colour. Pair with `hueRange`
/// to keep only the cue's colour (Meet's blue ring) and reject the thin edges a
/// solid tile still has against the background (orange/magenta boundaries).
func thinColoredPoints(threshold: Double = 0.35, minBrightness: Double = 60,
hueRange: ClosedRange<Double>? = nil,
edgeGap: Int = 6, gridStep: Int = 4) -> [CGPoint] {
func isCue(_ x: Int, _ y: Int) -> Bool {
guard x >= 0, x < width, y >= 0, y < height else { return false }
let i = (y * width + x) * 4
let r = Double(pixels[i]), g = Double(pixels[i + 1]), b = Double(pixels[i + 2])
let mx = max(r, g, b), mn = min(r, g, b)
let sat = mx > 0 ? (mx - mn) / mx : 0
guard sat > threshold, mx > minBrightness else { return false }
if let hr = hueRange { return hr.contains(Self.hueDegrees(r, g, b, mx, mn)) }
return true
}
var points: [CGPoint] = []
var y = edgeGap
while y < height - edgeGap {
var x = edgeGap
while x < width - edgeGap {
if isCue(x, y) {
let thin = !isCue(x - edgeGap, y) || !isCue(x + edgeGap, y)
|| !isCue(x, y - edgeGap) || !isCue(x, y + edgeGap)
if thin { points.append(CGPoint(x: x, y: y)) }
}
x += gridStep
}
y += gridStep
}
return points
}
/// HSV hue in degrees (0360) from RGB and its precomputed max/min channels. /// HSV hue in degrees (0360) from RGB and its precomputed max/min channels.
private static func hueDegrees(_ r: Double, _ g: Double, _ b: Double, _ mx: Double, _ mn: Double) -> Double { private static func hueDegrees(_ r: Double, _ g: Double, _ b: Double, _ mx: Double, _ mn: Double) -> Double {
let d = mx - mn let d = mx - mn
+2 -41
View File
@@ -35,21 +35,11 @@ struct GridCallAnalyzer {
var colorSaturation: Double = 0.5 var colorSaturation: Double = 0.5
var colorMinBrightness: Double = 60 var colorMinBrightness: Double = 60
var colorHueRange: ClosedRange<Double>? = nil var colorHueRange: ClosedRange<Double>? = nil
// When true, the coloured highlight is detected from THIN edges only (drops
// solid colour fills like Meet's camera-off avatar tiles). Pair with a tight
// `colorHueRange` so a solid tile's thin background boundary is rejected too.
var coloredBorderThinOnly = false
var minTextConfidence: Float = 0.3 var minTextConfidence: Float = 0.3
var maxNameLength = 40 var maxNameLength = 40
var minHighlightPoints = 6 var minHighlightPoints = 6
var highlightShareOfMax = 0.35 var highlightShareOfMax = 0.35
var minRingSpan: Double = 60 // a speaking border spans a sizable box, not a speck var minRingSpan: Double = 60 // a speaking border spans a sizable box, not a speck
// A real active-speaker cue is a thin RING (border) with an EMPTY interior.
// A solid camera-off avatar tile (Meet's orange/magenta fill) or a screen-share
// fill is a filled BLOB its highlight points spread through the interior. Reject
// a component when more than this fraction of its points fall in the central
// 60%×60% of its bbox (a hollow ring 0; a solid fill 0.36). Set 1 to disable.
var maxInteriorFill: Double = 0.2
} }
var config = Config() var config = Config()
@@ -78,11 +68,7 @@ struct GridCallAnalyzer {
// Highlight pixels: coloured (saturated) and/or white (thin near-white). // Highlight pixels: coloured (saturated) and/or white (thin near-white).
var highlight: [CGPoint] = [] var highlight: [CGPoint] = []
if config.detectColoredBorder { if config.detectColoredBorder {
highlight += config.coloredBorderThinOnly highlight += sampler.saturatedPoints(threshold: config.colorSaturation,
? sampler.thinColoredPoints(threshold: config.colorSaturation,
minBrightness: config.colorMinBrightness,
hueRange: config.colorHueRange)
: sampler.saturatedPoints(threshold: config.colorSaturation,
minBrightness: config.colorMinBrightness, minBrightness: config.colorMinBrightness,
hueRange: config.colorHueRange) hueRange: config.colorHueRange)
} }
@@ -103,8 +89,7 @@ struct GridCallAnalyzer {
var speakingBBox: [Int: CGRect] = [:] // tile index -> the ring bbox marking it speaking var speakingBBox: [Int: CGRect] = [:] // tile index -> the ring bbox marking it speaking
for ring in rings where ring.count >= config.minHighlightPoints { for ring in rings where ring.count >= config.minHighlightPoints {
let bb = Self.boundingBox(ring) let bb = Self.boundingBox(ring)
guard bb.width >= config.minRingSpan, bb.height >= config.minRingSpan else { continue } // a ring, not a speck guard bb.width >= config.minRingSpan, bb.height >= config.minRingSpan else { continue } // a ring, not a blob
guard Self.isHollow(ring, bbox: bb, maxInteriorFill: config.maxInteriorFill) else { continue } // a ring, not a filled tile
for (i, tile) in tiles.enumerated() where bb.contains(CGPoint(x: tile.textRect.midX, y: tile.textRect.midY)) { for (i, tile) in tiles.enumerated() where bb.contains(CGPoint(x: tile.textRect.midX, y: tile.textRect.midY)) {
speakingBBox[i] = bb speakingBBox[i] = bb
} }
@@ -143,18 +128,6 @@ struct GridCallAnalyzer {
return Array(groups.values) return Array(groups.values)
} }
/// True if `pts` form a hollow ring (border) rather than a filled blob: at most
/// `maxInteriorFill` of the points fall in the central 60%×60% of `bbox`. A thin
/// border has an empty interior ( 0); a solid camera-off avatar tile or a
/// screen-share fill spreads points through the interior ( 0.36). Disabled when
/// `maxInteriorFill >= 1`.
static func isHollow(_ pts: [CGPoint], bbox: CGRect, maxInteriorFill: Double) -> Bool {
guard maxInteriorFill < 1, !pts.isEmpty else { return true }
let inner = bbox.insetBy(dx: bbox.width * 0.2, dy: bbox.height * 0.2)
let innerCount = pts.reduce(into: 0) { if inner.contains($1) { $0 += 1 } }
return Double(innerCount) / Double(pts.count) <= maxInteriorFill
}
static func boundingBox(_ pts: [CGPoint]) -> CGRect { static func boundingBox(_ pts: [CGPoint]) -> CGRect {
var minX = Double.greatestFiniteMagnitude, minY = minX, maxX = -minX, maxY = -minX var minX = Double.greatestFiniteMagnitude, minY = minX, maxX = -minX, maxY = -minX
for p in pts { minX = min(minX, p.x); minY = min(minY, p.y); maxX = max(maxX, p.x); maxY = max(maxY, p.y) } for p in pts { minX = min(minX, p.x); minY = min(minY, p.y); maxX = max(maxX, p.x); maxY = max(maxY, p.y) }
@@ -193,11 +166,7 @@ struct GridCallAnalyzer {
} }
private func cleaned(_ s: String) -> String { private func cleaned(_ s: String) -> String {
// Trim whitespace and any trailing punctuation OCR tacks on, so "Mark." folds
// into "Mark" rather than becoming a separate phantom speaker.
s.trimmingCharacters(in: .whitespacesAndNewlines) s.trimmingCharacters(in: .whitespacesAndNewlines)
.trimmingCharacters(in: CharacterSet(charactersIn: ".,;:·•-"))
.trimmingCharacters(in: .whitespacesAndNewlines)
} }
/// True if `s` looks like a participant name label rather than UI chrome. Call /// True if `s` looks like a participant name label rather than UI chrome. Call
@@ -212,14 +181,6 @@ struct GridCallAnalyzer {
if s.rangeOfCharacter(from: CharacterSet(charactersIn: "@:/\\|+*=<>#0123456789")) != nil { if s.rangeOfCharacter(from: CharacterSet(charactersIn: "@:/\\|+*=<>#0123456789")) != nil {
return false return false
} }
// Reject domain-like screen-share text (e.g. "WERUNBTC.COM", OCR'd "WERUNBTC.GOM"):
// a token whose final dotted segment is a 24 letter suffix. Real names don't end
// in a TLD; this keeps "Cait's Phone" and initials like "MO".
let lower = s.lowercased()
if let dot = lower.lastIndex(of: "."), lower.index(after: dot) < lower.endIndex {
let suffix = lower[lower.index(after: dot)...]
if (2...4).contains(suffix.count) && suffix.allSatisfy({ $0.isLetter }) { return false }
}
let words = s.split(separator: " ") let words = s.split(separator: " ")
guard (1...3).contains(words.count) else { return false } guard (1...3).contains(words.count) else { return false }
let allowed = CharacterSet.letters.union(CharacterSet(charactersIn: "'.-")) let allowed = CharacterSet.letters.union(CharacterSet(charactersIn: "'.-"))
@@ -15,15 +15,9 @@ final class TimelineBuilder {
private let closeFrames: Int private let closeFrames: Int
private var aliases: [String: String] = [:] // normalized variant -> canonical private var aliases: [String: String] = [:] // normalized variant -> canonical
private var states: [String: NameState] = [:] private var states: [String: NameState] = [:]
private var observed: Set<String> = [] // every tile name seen (speaking or not)
private var lastFrameT: Double = 0 private var lastFrameT: Double = 0
private(set) var segments: [VisualTimeline.Segment] = [] private(set) var segments: [VisualTimeline.Segment] = []
/// Every distinct participant name the adapter has OCR'd, whether or not they were
/// ever detected speaking the call-size signal (drives "Auto" chunk sizing and a
/// complete participant roster, since speaking-detection is intentionally sparse).
var observedNames: [String] { observed.sorted() }
init(openFrames: Int = 2, closeFrames: Int = 2) { init(openFrames: Int = 2, closeFrames: Int = 2) {
self.openFrames = max(1, openFrames) self.openFrames = max(1, openFrames)
self.closeFrames = max(1, closeFrames) self.closeFrames = max(1, closeFrames)
@@ -40,9 +34,6 @@ final class TimelineBuilder {
func ingest(_ observations: [SpeakerObservation], at t: TimeInterval) { func ingest(_ observations: [SpeakerObservation], at t: TimeInterval) {
lastFrameT = t lastFrameT = t
// Record every tile seen (speaking or not) for the participant roster / call size.
for obs in observations where !obs.name.isEmpty { observed.insert(canonical(obs.name)) }
// Best confidence per canonical name that is speaking this frame. // Best confidence per canonical name that is speaking this frame.
var speaking: [String: Double] = [:] var speaking: [String: Double] = [:]
for obs in observations where obs.speaking && !obs.name.isEmpty { for obs in observations where obs.speaking && !obs.name.isEmpty {
@@ -102,57 +93,9 @@ final class TimelineBuilder {
closeSegment(name: name, state: st) closeSegment(name: name, state: st)
states[name]?.open = false states[name]?.open = false
} }
segments = Self.canonicalizeByFrequency(segments)
segments.sort { $0.start < $1.start } segments.sort { $0.start < $1.start }
} }
/// Fold rare OCR misspellings into the dominant name they're a typo of: a name with
/// little total time is remapped to a much longer-running name with the same initial
/// within a small edit distance (e.g. "Matt Odel"/"MattOdell"/"Mare" "Matt Odell"/
/// "Mark"). Conservative by design it won't merge two well-attested speakers, only
/// a transient variant into its clearly-dominant canonical. Pure/testable.
static func canonicalizeByFrequency(_ segs: [VisualTimeline.Segment],
minorMaxSec: Double = 5, dominanceRatio: Double = 8,
maxEdits: Int = 2) -> [VisualTimeline.Segment] {
var dur: [String: Double] = [:]
for s in segs { dur[s.name, default: 0] += s.end - s.start }
let names = Array(dur.keys)
var remap: [String: String] = [:]
for minor in names {
let md = dur[minor]!
guard md <= minorMaxSec, let mInit = minor.first else { continue }
var best: String?, bestDur = 0.0
for major in names where major != minor {
let Md = dur[major]!
guard Md >= md * dominanceRatio, Md > bestDur, major.first == mInit else { continue }
if levenshtein(minor.lowercased(), major.lowercased()) <= maxEdits { best = major; bestDur = Md }
}
if let b = best { remap[minor] = b }
}
guard !remap.isEmpty else { return segs }
return segs.map { s in
remap[s.name].map { VisualTimeline.Segment(start: s.start, end: s.end, name: $0,
confidence: s.confidence, source: s.source) } ?? s
}
}
/// Levenshtein edit distance (small strings names).
static func levenshtein(_ a: String, _ b: String) -> Int {
let x = Array(a), y = Array(b)
if x.isEmpty { return y.count }; if y.isEmpty { return x.count }
var prev = Array(0...y.count)
var cur = [Int](repeating: 0, count: y.count + 1)
for i in 1...x.count {
cur[0] = i
for j in 1...y.count {
cur[j] = x[i-1] == y[j-1] ? prev[j-1]
: Swift.min(prev[j-1], prev[j], cur[j-1]) + 1
}
swap(&prev, &cur)
}
return prev[y.count]
}
// MARK: - Internal // MARK: - Internal
private struct NameState { private struct NameState {
+1 -4
View File
@@ -75,10 +75,7 @@ final class VisualCapture {
}, to: durationSec) }, to: durationSec)
let artifact = (vision + selfSegs).sorted { $0.start < $1.start } let artifact = (vision + selfSegs).sorted { $0.start < $1.start }
// Roster = everyone OCR'd (speaking or not) the names that produced segments, let names = Set(artifact.map { $0.name })
// so the participant count reflects true call size even when few people were
// detected speaking. Drives "Auto" chunk sizing downstream.
let names = Set(artifact.map { $0.name }).union(observer.participantNames())
let participants = names.sorted().map { let participants = names.sorted().map {
VisualTimeline.Participant(name: $0, isSelf: $0 == selfName ? true : nil, aliases: nil) VisualTimeline.Participant(name: $0, isSelf: $0 == selfName ? true : nil, aliases: nil)
} }
@@ -114,10 +114,6 @@ final class VisualObserver: NSObject, SCStreamDelegate, SCStreamOutput {
queue.sync { builder.mergeSelfSpans(spans, selfName: selfName) } queue.sync { builder.mergeSelfSpans(spans, selfName: selfName) }
} }
/// Every distinct participant name OCR'd over the session (read on the builder's
/// queue; safe to call after `stop`).
func participantNames() -> [String] { queue.sync { builder.observedNames } }
// MARK: - SCStreamOutput (on `queue`) // MARK: - SCStreamOutput (on `queue`)
func stream(_ stream: SCStream, didOutputSampleBuffer sampleBuffer: CMSampleBuffer, func stream(_ stream: SCStream, didOutputSampleBuffer sampleBuffer: CMSampleBuffer,
+1 -1
View File
@@ -62,7 +62,7 @@ struct VisualTimeline: Codable {
} }
/// The flat array `label-merge` wants: `[{start,end,name,confidence}]`, /// The flat array `label-merge` wants: `[{start,end,name,confidence}]`,
/// dropping `source`. Slice/rebase to chunk-local seconds happens at chunking time. /// dropping `source`. Slice/rebase to chunk-local seconds happens in Phase 5.
func flatTimelineData() throws -> Data { func flatTimelineData() throws -> Data {
let flat = segments.map { seg -> [String: Any] in let flat = segments.map { seg -> [String: Any] in
["start": seg.start, "end": seg.end, "name": seg.name, "confidence": seg.confidence] ["start": seg.start, "end": seg.end, "name": seg.name, "confidence": seg.confidence]
@@ -138,37 +138,16 @@ final class GridCallAnalyzerTests: XCTestCase {
func testNameFilterAgainstRealMeetOCR() { func testNameFilterAgainstRealMeetOCR() {
// The exact strings OCR pulled from a real Meet session only the first // The exact strings OCR pulled from a real Meet session only the first
// group are participants; the rest are UI chrome that must NOT become speakers. // group are participants; the rest are UI chrome that must NOT become speakers.
let names = ["Grant Gilliam", "Caitlyn Viggiano", "Cait's Phone", "Grant", "Me", "Matt Odell"] let names = ["Grant Gilliam", "Caitlyn Viggiano", "Cait's Phone", "Grant", "Me"]
let junk = ["11:43 AM | rvo-rmjg-rdq", "@ Embassy Er", "Admit 1 guest", let junk = ["11:43 AM | rvo-rmjg-rdq", "@ Embassy Er", "Admit 1 guest",
"Joined as grant.gilliam@gmail.com", "Others may see your video differently", "Joined as grant.gilliam@gmail.com", "Others may see your video differently",
"Others might still see your full video.", "Your meeting's ready", "efforot", "Others might still see your full video.", "Your meeting's ready", "efforot",
"g* Add others", "g+ Add others", "meet.google.com/rvo-rmjg-rdq", "g* Add others", "g+ Add others", "meet.google.com/rvo-rmjg-rdq",
"permission before they can join.", "the meeting", "G", "permission before they can join.", "the meeting", "G"]
// Screen-share domain text OCR'd as a name (incl. OCR'd TLDs).
"WERUNBTC.COM", "WERUNBTG.COM", "WERUNBTC.GOM"]
for n in names { XCTAssertTrue(GridCallAnalyzer.isLikelyName(n), "should keep name: \(n)") } for n in names { XCTAssertTrue(GridCallAnalyzer.isLikelyName(n), "should keep name: \(n)") }
for j in junk { XCTAssertFalse(GridCallAnalyzer.isLikelyName(j), "should drop junk: \(j)") } for j in junk { XCTAssertFalse(GridCallAnalyzer.isLikelyName(j), "should drop junk: \(j)") }
} }
func testHollowRingKeptFilledTileRejected() {
// A thin ring (border): points only on the perimeter of a 120×120 box.
var ring: [CGPoint] = []
for t in stride(from: 0.0, through: 120, by: 4) {
ring.append(.init(x: t, y: 0)); ring.append(.init(x: t, y: 120))
ring.append(.init(x: 0, y: t)); ring.append(.init(x: 120, y: t))
}
let rbb = GridCallAnalyzer.boundingBox(ring)
XCTAssertTrue(GridCallAnalyzer.isHollow(ring, bbox: rbb, maxInteriorFill: 0.2))
// A solid fill (camera-off avatar tile): points across the whole box.
var blob: [CGPoint] = []
for x in stride(from: 0.0, through: 120, by: 4) {
for y in stride(from: 0.0, through: 120, by: 4) { blob.append(.init(x: x, y: y)) }
}
let bbb = GridCallAnalyzer.boundingBox(blob)
XCTAssertFalse(GridCallAnalyzer.isHollow(blob, bbox: bbb, maxInteriorFill: 0.2))
}
func testWhiteBorderDetectorIgnoresColouredBorder() { func testWhiteBorderDetectorIgnoresColouredBorder() {
// Signal looks only for the white border, so a coloured (Meet) border must // Signal looks only for the white border, so a coloured (Meet) border must
// not register as a Signal speaker. // not register as a Signal speaker.
@@ -1,35 +0,0 @@
import XCTest
@testable import Ten31Transcripts
/// The TLS bypass is an opt-in escape hatch scoped to the configured backend host.
/// These cover the security gate (`allowsTrustOverride`) so a regression can't widen
/// it back to "trust any server". The gate is pure, so no network or SecTrust needed.
final class InsecureTrustDelegateTests: XCTestCase {
private func space(host: String,
method: String = NSURLAuthenticationMethodServerTrust) -> URLProtectionSpace {
URLProtectionSpace(host: host, port: 62419, protocol: "https",
realm: nil, authenticationMethod: method)
}
func testFiresForMatchingHost() {
let d = InsecureTrustDelegate(allowedHost: "192.0.2.1")
XCTAssertTrue(d.allowsTrustOverride(for: space(host: "192.0.2.1")))
}
func testRejectsMismatchedHost() {
let d = InsecureTrustDelegate(allowedHost: "192.0.2.1")
XCTAssertFalse(d.allowsTrustOverride(for: space(host: "evil.example.com")))
}
func testNilAllowedHostNeverFires() {
let d = InsecureTrustDelegate(allowedHost: nil)
XCTAssertFalse(d.allowsTrustOverride(for: space(host: "192.0.2.1")))
}
func testOnlyServerTrustMethodFires() {
// Matching host but a non-server-trust challenge (e.g. HTTP Basic) must not override.
let d = InsecureTrustDelegate(allowedHost: "192.0.2.1")
XCTAssertFalse(d.allowsTrustOverride(
for: space(host: "192.0.2.1", method: NSURLAuthenticationMethodHTTPBasic)))
}
}
-39
View File
@@ -37,45 +37,6 @@ final class Phase5Tests: XCTestCase {
XCTAssertEqual(asm.speakersFile.segments[0].start, 152, accuracy: 0.01) XCTAssertEqual(asm.speakersFile.segments[0].start, 152, accuracy: 0.01)
} }
func testChunkModeResolvesBodyLength() {
// Fixed presets ignore participant count.
XCTAssertEqual(ChunkMode.standard.bodySeconds(participantCount: 99), 150)
XCTAssertEqual(ChunkMode.largeGroup.bodySeconds(participantCount: 2), 60)
XCTAssertEqual(ChunkMode.fine.bodySeconds(participantCount: nil), 90)
// Auto: >4 detected 60s, 4 150s, unknown 150s.
XCTAssertEqual(ChunkMode.auto.bodySeconds(participantCount: 6), 60)
XCTAssertEqual(ChunkMode.auto.bodySeconds(participantCount: 4), 150)
XCTAssertEqual(ChunkMode.auto.bodySeconds(participantCount: nil), 150)
}
func testChunkOverlapScalesWithBody() {
XCTAssertEqual(ChunkMode.overlapSeconds(forBody: 150), 15) // capped
XCTAssertEqual(ChunkMode.overlapSeconds(forBody: 60), 8) // floored (60*0.12=7.28)
XCTAssertEqual(ChunkMode.overlapSeconds(forBody: 90), 11) // 90*0.12=10.811
}
func testPlanChunksShortBodyChunksAShortCall() {
// A 100s call would be ONE chunk at the 2.5-min default, but at a 60s body it
// splits so "Large group" actually re-chunks medium calls.
let c = SessionPackager.planChunks(durationSec: 100, chunkSeconds: 60,
overlapSeconds: 8, thresholdSec: 72)
XCTAssertEqual(c.count, 2)
XCTAssertEqual(c[0].bodyStart, 0); XCTAssertEqual(c[0].bodyEnd, 60)
XCTAssertEqual(c[1].bodyStart, 60); XCTAssertEqual(c[1].bodyEnd, 100)
}
func testDropStuckSpansRemovesWholeCallCue() {
let segs = [
VisualTimeline.Segment(start: 0, end: 1900, name: "Grant Gilliam", confidence: 1, source: "vision"), // stuck whole-call tile
VisualTimeline.Segment(start: 100, end: 130, name: "Matt Odell", confidence: 0.9, source: "vision"), // real
VisualTimeline.Segment(start: 0, end: 1900, name: "Grant", confidence: 1, source: "mic_vad"), // self span: keep
]
let out = TranscriptPipeline.dropStuckSpans(segs, duration: 1976)
XCTAssertFalse(out.contains { $0.name == "Grant Gilliam" }) // 96% of call in one span dropped
XCTAssertTrue(out.contains { $0.name == "Matt Odell" }) // short real span kept
XCTAssertTrue(out.contains { $0.source == "mic_vad" }) // self never dropped
}
func testRebaseClipsAndRebases() throws { func testRebaseClipsAndRebases() throws {
let segs = [ let segs = [
VisualTimeline.Segment(start: 140, end: 160, name: "A", confidence: 0.9, source: "vision"), VisualTimeline.Segment(start: 140, end: 160, name: "A", confidence: 0.9, source: "vision"),
@@ -1,110 +0,0 @@
import XCTest
@testable import Ten31Transcripts
final class SessionNamingTests: XCTestCase {
// MARK: sanitize
func testSanitizeTrimsAndKeepsSpaces() {
XCTAssertEqual(SessionNaming.sanitize(" Weekly Sync "), "Weekly Sync")
}
func testSanitizeReplacesPathSeparators() {
XCTAssertEqual(SessionNaming.sanitize("9/10 standup"), "9-10 standup")
XCTAssertEqual(SessionNaming.sanitize("a:b\\c"), "a-b-c")
}
func testSanitizeCollapsesWhitespaceRuns() {
XCTAssertEqual(SessionNaming.sanitize("board 1:1"), "board 1-1")
}
func testSanitizeStripsLeadingDots() {
XCTAssertEqual(SessionNaming.sanitize("...hidden"), "hidden")
XCTAssertEqual(SessionNaming.sanitize(".."), "")
}
func testSanitizeEmptyForBlankOrWhitespace() {
XCTAssertEqual(SessionNaming.sanitize(""), "")
XCTAssertEqual(SessionNaming.sanitize(" \n\t "), "")
}
func testSanitizeCapsLength() {
let long = String(repeating: "x", count: 200)
XCTAssertEqual(SessionNaming.sanitize(long).count, 60)
}
func testSanitizeStripsControlCharacters() {
XCTAssertEqual(SessionNaming.sanitize("a\u{0000}b\u{001F}c"), "abc")
}
// MARK: datePrefix
func testDatePrefixFromAutoName() {
XCTAssertEqual(SessionNaming.datePrefix(ofSessionNamed: "2026-06-17T09-59-48_signal"), "2026-06-17")
}
func testDatePrefixFromRenamedName() {
XCTAssertEqual(SessionNaming.datePrefix(ofSessionNamed: "2026-06-17_Weekly sync_signal"), "2026-06-17")
}
// MARK: renamedLeaf
func testRenamedLeafBasic() {
XCTAssertEqual(
SessionNaming.renamedLeaf(date: "2026-06-17", app: "signal", meetingName: "Weekly sync"),
"2026-06-17_Weekly sync_signal")
}
func testRenamedLeafAppStaysLastSegment() {
// The meeting name may contain underscores; the app must remain parseable as
// the final "_"-segment (what SessionController.appLabel reads).
let leaf = SessionNaming.renamedLeaf(date: "2026-06-17", app: "meet", meetingName: "q3_planning")
XCTAssertEqual(leaf, "2026-06-17_q3_planning_meet")
XCTAssertEqual(leaf?.split(separator: "_").last.map(String.init), "meet")
}
func testRenamedLeafNilForBlankName() {
XCTAssertNil(SessionNaming.renamedLeaf(date: "2026-06-17", app: "signal", meetingName: " "))
}
func testRenamedLeafCounterDisambiguatesNameSegment() {
// A collision suffixes the NAME, not the whole leaf, so "_app" stays last.
let leaf = SessionNaming.renamedLeaf(date: "2026-06-17", app: "signal", meetingName: "sync", counter: 1)
XCTAssertEqual(leaf, "2026-06-17_sync-2_signal")
XCTAssertEqual(leaf?.split(separator: "_").last.map(String.init), "signal")
}
func testRenamedLeafAppStaysLastAtMaxCollisionDepth() {
// The 100-collision cap is counter 099; the app must still parse out last.
let leaf = SessionNaming.renamedLeaf(date: "2026-06-17", app: "signal", meetingName: "q3_sync", counter: 99)
XCTAssertEqual(leaf, "2026-06-17_q3_sync-100_signal")
XCTAssertEqual(leaf?.split(separator: "_").last.map(String.init), "signal")
}
// MARK: recapTitle
func testRecapTitleAutoNamePreservesLegacyFormat() {
XCTAssertEqual(
SessionNaming.recapTitle(app: "meet", sessionId: "2026-06-06T11-43-02_meet"),
"Google Meet call — 2026-06-06 11:43")
}
func testRecapTitleNamedSession() {
XCTAssertEqual(
SessionNaming.recapTitle(app: "meet", sessionId: "2026-06-06_Weekly sync_meet"),
"Weekly sync — Google Meet (2026-06-06)")
}
func testRecapTitleNamePreservesUnderscores() {
// A meeting name with underscores must survive the split/join round-trip.
XCTAssertEqual(
SessionNaming.recapTitle(app: "meet", sessionId: "2026-06-06_q3_planning_meet"),
"q3_planning — Google Meet (2026-06-06)")
}
func testRecapTitleUnknownAppCapitalizes() {
XCTAssertEqual(
SessionNaming.recapTitle(app: "manual", sessionId: "2026-06-06T11-43-02_manual"),
"Manual call — 2026-06-06 11:43")
}
}
@@ -11,27 +11,6 @@ final class VisualObserverTests: XCTestCase {
(id, CGRect(x: 0, y: 0, width: w, height: h)) (id, CGRect(x: 0, y: 0, width: w, height: h))
} }
func testCanonicalizeFoldsOcrMisspellingsIntoDominantName() {
func seg(_ s: Double, _ e: Double, _ n: String) -> VisualTimeline.Segment {
.init(start: s, end: e, name: n, confidence: 0.9, source: "vision")
}
let segs = [
seg(0, 1689, "Matt Odell"), // dominant
seg(1700, 1702, "Matt Odel"), // OCR typo fold
seg(1702, 1702.3, "MattOdell"), // dropped-space typo fold
seg(0, 1155, "Mark"), // dominant
seg(1200, 1201, "Mare"), // OCR typo fold into Mark
seg(0, 4, "Sidisel"), // screen junk, no near-twin kept (dropped later, no voice match)
]
let names = Set(TimelineBuilder.canonicalizeByFrequency(segs).map { $0.name })
XCTAssertTrue(names.contains("Matt Odell"))
XCTAssertTrue(names.contains("Mark"))
XCTAssertFalse(names.contains("Matt Odel"))
XCTAssertFalse(names.contains("MattOdell"))
XCTAssertFalse(names.contains("Mare"))
XCTAssertTrue(names.contains("Sidisel"))
}
func testPrefersMatchingWindowIDOverLargest() { func testPrefersMatchingWindowIDOverLargest() {
// The Meet window (id 42) is NOT the largest must still be chosen by ID. // The Meet window (id 42) is NOT the largest must still be chosen by ID.
let candidates = [c(7, 1600, 1000), c(42, 800, 600), c(9, 1200, 900)] let candidates = [c(7, 1600, 1000), c(42, 800, 600), c(9, 1200, 900)]
+34 -46
View File
@@ -7,9 +7,9 @@
> returns named transcript segments. A growing **voiceprint library** recovers > returns named transcript segments. A growing **voiceprint library** recovers
> speakers even when the visual cue is missing. > speakers even when the visual cue is missing.
Master context document. Read this first, then `02_ARCHITECTURE.md` and Master context document. Read this first, then `02_ARCHITECTURE.md`,
`03_DATA_CONTRACTS.md`. The SparkControl API is fully specified in `03_DATA_CONTRACTS.md`, `04_BUILD_PLAN.md`. The SparkControl API is now fully
`03_DATA_CONTRACTS.md`. specified — see `03_DATA_CONTRACTS.md` (and the source `AUDIO_API.md`).
--- ---
@@ -20,30 +20,25 @@ A lightweight, always-running **menu-bar app on macOS** that:
1. **Detects** when the user joins a call in Google Meet, Zoom, Microsoft Teams, 1. **Detects** when the user joins a call in Google Meet, Zoom, Microsoft Teams,
or Signal. or Signal.
2. **Records two local audio tracks** — system audio (everyone else) and the 2. **Records two local audio tracks** — system audio (everyone else) and the
user's microphone (the user). It sends the backend **dual-channel** user's microphone (the user) — and **mixes them to one 16 kHz mono WAV** for
(`mic_file` + `system_file`) when the system track is healthy, falling back to the backend.
a **mixed-mono 16 kHz WAV** otherwise.
3. **Watches the call window** at ~24 fps and, per app, reads participant 3. **Watches the call window** at ~24 fps and, per app, reads participant
**names** and the **active-speaker cue**, producing a **names** and the **active-speaker cue**, producing a
`(start, end, name, confidence)` **visual timeline** — its best guess at who `(start, end, name, confidence)` **visual timeline** — its best guess at who
was talking when. was talking when.
4. **Discards every video frame after extraction.** No video is ever written to 4. **Discards every video frame after extraction.** No video is ever written to
disk. Only audio + the derived timeline persist locally. disk. Only audio + the derived timeline persist locally.
5. On call end, **POSTs the audio + the visual timeline (+ the known voiceprint 5. On call end, **POSTs the mixed audio + the visual timeline (+ the known
library) to `POST /api/audio/label-merge`** on SparkControl, which returns voiceprint library) to `POST /api/audio/label-merge`** on SparkControl, which
**named, speaker-attributed transcript segments** and a **voiceprint per returns **named, speaker-attributed transcript segments** and a **voiceprint
speaker**. per speaker**.
6. **Persists the returned voiceprints** keyed by name, so the next call can pass 6. **Persists the returned voiceprints** keyed by name, so the next call can pass
them as `known_voiceprints` and recover a speaker by voice when the visual cue them as `known_voiceprints` and recover a speaker by voice when the visual cue
is absent (camera off, a bad OCR frame). is absent (camera off, a bad OCR frame).
7. **Renders the result locally** — a readable `transcript.md` plus an HTML
`recap.html` (topics + meeting extras, generated via the backend's LLM
endpoint), with an in-app editor for fixing speaker names after the fact.
The app's job ends at producing the named transcript and recap from SparkControl's The app's job ends at receiving and storing the named segments from SparkControl.
segments. **All transcription, diarization, name-merge, and LLM analysis happen on **All transcription, diarization, and the name-merge happen on the backend.** Do
the backend.** Do not build transcription, diarization, or the merge vote in this not build transcription, diarization, or the merge vote in this app.
app.
## 2. Why the visual timeline still matters (the core idea) ## 2. Why the visual timeline still matters (the core idea)
@@ -73,25 +68,19 @@ few calls the system can name regulars even with cameras off.
**In scope (this app):** **In scope (this app):**
- Call detection for Meet / Zoom / Teams / Signal. - Call detection for Meet / Zoom / Teams / Signal.
- Dual-track local audio capture; **dual-channel send** (mic + system) with a - Dual-track local audio capture + mix-to-mono for the backend.
mix-to-mono fallback for the backend.
- Low-fps window capture → OCR (names) + active-speaker cue detection. - Low-fps window capture → OCR (names) + active-speaker cue detection.
- Per-app "adapter" modules encapsulating each app's UI quirks. - Per-app "adapter" modules encapsulating each app's UI quirks.
- Building the visual timeline; **mic-VAD self-labeling** (the mic track is the - Building the visual timeline; **mic-VAD self-labeling** (the mic track is the
user, so hot-mic spans pre-seed the user's name into the timeline). user, so hot-mic spans pre-seed the user's name into the timeline).
- Chunking long calls (~23 min) and calling `label-merge` **sequentially**. - Chunking long calls (~23 min) and calling `label-merge` **sequentially**.
- A local **voiceprint store** (persist + replay named voiceprints). - A local **voiceprint store** (persist + replay named voiceprints).
- Storing the backend's named segments and **rendering** them — `transcript.md` - Storing the backend's named transcript segments locally.
plus an HTML `recap.html` (recap analysis via the backend LLM) — with an in-app - A minimal menu-bar UI: status, manual start/stop, recent sessions, adapter
speaker-name editor. toggles, backend host/health, output folder.
- A minimal menu-bar UI: status, manual start/stop, the last session (reveal,
resend, open recap, edit speakers), adapter toggles, backend host/health,
output folder.
**Out of scope (owned by the backend):** **Out of scope (owned by the backend):**
- Transcription, diarization, the name-merge vote, and LLM summarization — these - Transcription, diarization, the name-merge vote, summarization/analysis.
run on the backend; the app only orchestrates the recap call and renders the
result.
**Explicitly not doing:** saving video; cloud anything. Everything stays on the **Explicitly not doing:** saving video; cloud anything. Everything stays on the
operator's LAN. operator's LAN.
@@ -102,14 +91,14 @@ operator's LAN.
|---|---|---| |---|---|---|
| Language / framework | Native Swift + SwiftUI menu-bar app (`LSUIElement`) | System audio, window capture, Vision all native; one codebase. | | Language / framework | Native Swift + SwiftUI menu-bar app (`LSUIElement`) | System audio, window capture, Vision all native; one codebase. |
| Audio capture | ScreenCaptureKit (system audio) + AVFoundation (mic) | No virtual audio device; works with headphones; macOS 13+. | | Audio capture | ScreenCaptureKit (system audio) + AVFoundation (mic) | No virtual audio device; works with headphones; macOS 13+. |
| Backend audio format | **Dual-channel (mic + system)** when the system track is healthy, else **mixed-mono 16 kHz WAV** | Separate tracks let the backend attribute the user's mic channel directly; the diarizer can still split the mono fallback. | | Backend audio format | **Mixed-mono 16 kHz WAV** | Diarizer separates speakers from one mixed stream; 16 kHz is ideal. |
| Call detection | CoreAudio "mic running somewhere" + known-app / Meet-tab heuristic | Clean live-mic signal + app disambiguation. | | Call detection | CoreAudio "mic running somewhere" + known-app / Meet-tab heuristic | Clean live-mic signal + app disambiguation. |
| Speaker naming | **Backend, via `POST /api/audio/label-merge`** | One call does diarize + overlap-vote naming + transcription. No client merge. | | Speaker naming | **Backend, via `POST /api/audio/label-merge`** | One call does diarize + overlap-vote naming + transcription. No client merge. |
| Identity recovery | **Local voiceprint library** replayed as `known_voiceprints` | Recovers camera-off / OCR-missed speakers by voice; compounds over calls. | | Identity recovery | **Local voiceprint library** replayed as `known_voiceprints` | Recovers camera-off / OCR-missed speakers by voice; compounds over calls. |
| Self-identity | mic-VAD → pre-seed user's name in timeline | The mic track is the user; gives the backend a strong prior + enrolls the user's voiceprint immediately. | | Self-identity | mic-VAD → pre-seed user's name in timeline | The mic track is the user; gives the backend a strong prior + enrolls the user's voiceprint immediately. |
| Requests | **Sequential, one audio request in flight** | Parallel audio requests trip a backend GPU race (`503 + Retry-After`). | | Requests | **Sequential, one audio request in flight** | Parallel audio requests trip a backend GPU race (`503 + Retry-After`). |
| Long calls | Chunk ~23 min, sequential, stitch via names+voiceprints | Diarizer caps at **4 speakers/chunk**; voiceprints + names unify across chunks. | | Long calls | Chunk ~23 min, sequential, stitch via names+voiceprints | Diarizer caps at **4 speakers/chunk**; voiceprints + names unify across chunks. |
| Transport / TLS | `multipart/form-data`, file field `file` (mono) or `mic_file` + `system_file` (dual-channel); self-signed Start9 cert (trust the Root CA — supported default; host-scoped skip-verify is an off-by-default escape hatch); **no auth on LAN** | Matches every other SparkControl endpoint. | | Transport / TLS | `multipart/form-data`, file field `file`; self-signed Start9 cert (skip verify or trust the Root CA); **no auth on LAN** | Matches every other SparkControl endpoint. |
| Timing | Batch after call (sync endpoints, no polling) | Endpoints are synchronous; no job/poll machinery needed. | | Timing | Batch after call (sync endpoints, no polling) | Endpoints are synchronous; no job/poll machinery needed. |
### On forking Hyprnote ### On forking Hyprnote
@@ -139,25 +128,24 @@ SparkControl, on the operator's Start9 LAN, fronting two DGX Sparks:
- **★ Primary endpoint for this app:** `POST /api/audio/label-merge` — diarize + - **★ Primary endpoint for this app:** `POST /api/audio/label-merge` — diarize +
name from the visual timeline (+ voiceprint fallback), optionally transcribe, name from the visual timeline (+ voiceprint fallback), optionally transcribe,
in one synchronous call. in one synchronous call.
- **LLM (recap):** Qwen3 via OpenAI-compatible `POST /v1/chat/completions`
generates the readable recap (topics + meeting extras) from the transcript.
- Health/discovery: `GET /api/status`, `GET /api/endpoints`, `GET /v1/models`. - Health/discovery: `GET /api/status`, `GET /api/endpoints`, `GET /v1/models`.
Full request/response shapes, curl examples, limits, and error formats are in Full request/response shapes, curl examples, limits, and error formats are in
`03_DATA_CONTRACTS.md`. `03_DATA_CONTRACTS.md`.
## 7. Settled decisions (were open at brief time) ## 7. Remaining open items (small)
1. **Base URL.** A private LAN host — a `.local` mDNS name (preferred over a raw 1. **Base URL — RESOLVED.** `https://192.168.1.72:62419`, also
IP, since it survives IP changes) — configured in Settings or via the `https://immense-voyage.local:62419` (prefer the `.local` form; it survives IP
`SPARK_BACKEND_URL` env var, never committed. A neutral placeholder ships as the changes). Ship the `.local` host as the default; keep it editable in settings.
default and stays editable in Settings. Service-discovery at `GET /api/endpoints`. Service-discovery at `GET /api/endpoints`.
2. **Send trigger.** Auto-send on call end is a setting (`autoSendOnStop`), **off 2. **Send trigger** — assume auto-POST on call end; expose a "hold for review"
by default** — the user reviews the session and sends manually unless they opt in. toggle if the user wants to eyeball the timeline first.
3. **Retention.** The session folder is kept after a successful hand-off (output 3. **Retention** — keep the session folder after a successful hand-off, or prune
location is configurable); nothing is pruned automatically. audio and keep only `speakers.json` + voiceprints? Default: keep everything,
4. **Voiceprint update policy.** Store/refresh the latest high-confidence vector user-configurable.
per name (`02_ARCHITECTURE.md §2.9`); a per-name running average is a possible 4. **Voiceprint update policy** — overwrite vs running-average a person's stored
later refinement. voiceprint across calls (see `02_ARCHITECTURE.md §2.9`). Start simple
5. **Signing.** A stable identity via `Config/Signing.xcconfig` (gitignored) keeps (store/refresh latest high-confidence), refine later.
macOS from re-prompting for permissions on each rebuild. 5. **Signing** — stable identity so macOS doesn't re-prompt for permissions on
each rebuild.
+6 -23
View File
@@ -64,9 +64,6 @@ pattern, the macOS APIs, and the SparkControl integration (now fully specified).
└────────────────┘ └────────────────────┘ └────────────────┘ └────────────────────┘
``` ```
(After `speakers.json`, a recap phase renders `transcript.md` + `recap.html` via
the backend LLM — see §2.11.)
## 2. Modules ## 2. Modules
### 2.1 `CallDetector` ### 2.1 `CallDetector`
@@ -179,10 +176,8 @@ Write the session folder and, if the call is longer than ~3 min, produce a
``` ```
### 2.7 `SparkControlClient` ### 2.7 `SparkControlClient`
Deliver to SparkControl. **Primary path = `POST /api/audio/label-merge`**. Sends Deliver to SparkControl. **Primary path = `POST /api/audio/label-merge`** with
**dual-channel** (`mic_file` + `system_file` + `self_name` + `self_vad`) when the `file`, `timeline`, `known_voiceprints`, `transcribe=true`.
system track is healthy, else the **mono** `file`; always with `timeline`,
`known_voiceprints`, `transcribe=true`.
- **Sequential only** — one audio request in flight (parallel ⇒ `503 + Retry-After`). - **Sequential only** — one audio request in flight (parallel ⇒ `503 + Retry-After`).
- **Self-signed TLS** — skip verification (`URLSession` delegate trusting the - **Self-signed TLS** — skip verification (`URLSession` delegate trusting the
Start9 cert) or trust the Root CA. **No auth on the LAN.** Start9 cert) or trust the Root CA. **No auth on the LAN.**
@@ -215,22 +210,10 @@ Local persistence of named voiceprints — the compounding-identity layer.
- Editable/clearable from the menu-bar UI (rename, delete a person, reset). - Editable/clearable from the menu-bar UI (rename, delete a person, reset).
### 2.10 `MenuBarUI` (SwiftUI, `LSUIElement`) ### 2.10 `MenuBarUI` (SwiftUI, `LSUIElement`)
Status (idle / detected / recording / finishing), manual start/stop with live Status (idle / detected / recording / uploading), manual start/stop, recent
mic/system level meters, and the **last session** — reveal in Finder, resend sessions (open folder, resend, delete), adapter toggles, **backend host + a
("Send to backend"), open recap, and edit speakers — plus "Open saved session…" health check** (`GET /api/status`), output folder, voiceprint manager, and a
to reprocess an existing folder. Also a **backend host + health check** permissions checklist (Screen Recording, Microphone, Accessibility).
(`GET /api/status`), adapter toggles, output folder, and a permissions checklist
(Microphone, Screen Recording, Accessibility). (No multi-session list or
voiceprint-manager UI yet — those are in `ROADMAP.md`.)
### 2.11 Recap (`RecapAnalyzer`, `RecapRenderer`)
After `speakers.json`, the recap phase turns the named transcript into the
human-readable deliverables. `RecapAnalyzer` calls the backend LLM
(`POST /v1/chat/completions`, Qwen3) for topics + meeting extras; `RecapRenderer`
writes `transcript.md` (one line per diarized utterance) and `recap.html` (+ a
`recap.json` sidecar). The in-app speaker editor (`SpeakerEditing` /
`RecapEditModel`) rewrites names across all outputs after the fact. All
language-model work stays on the backend; the app orchestrates and renders.
## 3. macOS frameworks & permissions ## 3. macOS frameworks & permissions
+16 -34
View File
@@ -1,7 +1,7 @@
# Data Contracts — Ten31 Transcripts # Data Contracts — Ten31 Transcripts
Companion to docs 01/02. Defines the files the app produces/stores and the **real Companion to docs 01/02. Defines the files the app produces/stores and the **real
SparkControl contract** (verified against the live backend). The `label-merge` SparkControl contract** (source of truth: `AUDIO_API.md`). The `label-merge`
endpoint is the app's primary integration point. endpoint is the app's primary integration point.
--- ---
@@ -69,33 +69,28 @@ When chunking, **slice to the chunk window and rebase to chunk-local seconds**
"app_version": "0.1.0" "app_version": "0.1.0"
} }
``` ```
(On the dual-channel path the backend gets `mic.wav` + `system.wav` directly; on (`mixed_mono_16k.wav` is the one the backend gets; the separate tracks are kept
the mono fallback it gets `mixed_mono_16k.wav`. The mic track is the user's known locally — the mic track is the user's known identity / VAD source.)
identity / VAD source. **Note:** the per-file `sha256` fields above are part of the
intended contract but are **not currently emitted** by the pipeline.)
--- ---
## 3. SparkControl — connection (real) ## 3. SparkControl — connection (real)
- **Base URL (confirmed):** a private LAN host — a `.local` mDNS name (preferred - **Base URL (confirmed):** `https://192.168.1.72:62419` — also reachable at
over a raw IP; it survives IP changes) — configured in Settings or via the `https://immense-voyage.local:62419` (the `.local` form survives IP changes;
`SPARK_BACKEND_URL` env var, and **never committed**. Service-discovery JSON is at **prefer it as the default**). Service-discovery JSON is at
`GET /api/endpoints` (returns current vLLM / Parakeet / Kokoro URLs). All audio `GET /api/endpoints` (returns current vLLM / Parakeet / Kokoro URLs). All audio
endpoints in §4–§5 hang off this base. **Make it a setting** so the host can endpoints in §4–§5 hang off this base. Still **make it a setting** so the host
change, and ship a neutral placeholder (`https://your-spark-backend.local`) as can change, but ship `https://immense-voyage.local:62419` as the default.
the default. - **TLS:** Start9 self-signed Root CA. Either skip verification (`URLSession`
- **TLS:** Start9 self-signed Root CA. Supported path: install the Start9 Root CA delegate trusting the cert; curl `-k`; `rejectUnauthorized:false`) **or** install
into the System keychain (default trust then succeeds). Skip-verification is an the Start9 Root CA into the trust store.
**off-by-default, host-scoped** escape hatch (`InsecureTrustDelegate`, scoped to
the configured backend host), not the default.
- **Auth:** **none on the LAN.** No token/key today. - **Auth:** **none on the LAN.** No token/key today.
- **Limits:** **200 MB/request** (`413` over); timeouts ~300 s (transcription), - **Limits:** **200 MB/request** (`413` over); timeouts ~300 s (transcription),
~600 s (diarization). **Send audio requests SEQUENTIALLY** — concurrent audio ~600 s (diarization). **Send audio requests SEQUENTIALLY** — concurrent audio
trips a GPU FFT race → `503 + Retry-After`. trips a GPU FFT race → `503 + Retry-After`.
- **Transport:** `multipart/form-data`. Audio file field is **`file`** on the mono - **Transport:** `multipart/form-data`, audio file field name **`file`** (bytes,
path, or **`mic_file`** + **`system_file`** on the dual-channel path (bytes, not not base64/path).
base64/path).
- **All endpoints are synchronous** (no job IDs / polling). - **All endpoints are synchronous** (no job IDs / polling).
- **Errors:** JSON `{"detail": "..."}`; `400` malformed, `413` too large, `503 + - **Errors:** JSON `{"detail": "..."}`; `400` malformed, `413` too large, `503 +
Retry-After` transient (retry after the interval). Retry-After` transient (retry after the interval).
@@ -109,16 +104,11 @@ Diarize + name clusters from the visual timeline (majority temporal overlap),
with voiceprint fallback, optionally transcribed. Synchronous. **Stateless** — with voiceprint fallback, optionally transcribed. Synchronous. **Stateless** —
the app owns the timeline and the voiceprint library. the app owns the timeline and the voiceprint library.
**Multipart fields** — two audio shapes: **mono** (`file`) or **dual-channel** **Multipart fields:**
(`mic_file` + `system_file`, preferred when the system track is healthy):
| field | required | notes | | field | required | notes |
|---|---|---| |---|---|---|
| `file` | mono path | mixed-mono WAV (the chunk, when chunking) | | `file` | **yes** | mixed-mono WAV (the chunk, when chunking) |
| `mic_file` | dual path | the user's mic track (chunk) — attributed to `self_name` | | `timeline` | **yes** | flat JSON array `[{"start","end","name","confidence"}]`, chunk-local seconds (§1.1) |
| `system_file` | dual path | the remote/system track (chunk) |
| `self_name` | dual path | the user's name; the mic channel is attributed to them |
| `self_vad` | no | chunk-local windows where the mic is genuinely the user (active + louder than system) |
| `timeline` | **yes** | flat JSON array `[{"start","end","name","confidence"}]`, chunk-local seconds (§1.1); on the dual path it names only the remote speakers |
| `known_voiceprints` | no | JSON `{"<name>":[192 floats], ...}` from `VoiceprintStore` | | `known_voiceprints` | no | JSON `{"<name>":[192 floats], ...}` from `VoiceprintStore` |
| `transcribe` | no | `"true"` to also return per-segment text (default false) | | `transcribe` | no | `"true"` to also return per-segment text (default false) |
| `min_overlap` | no | min fraction of a cluster's time overlapping the winning name (default `0.0`) | | `min_overlap` | no | min fraction of a cluster's time overlapping the winning name (default `0.0`) |
@@ -222,11 +212,3 @@ Loaded → `known_voiceprints` on every `label-merge` call. Updated from respons
`fingerprints` for `visual`/high-confidence `voiceprint` speakers only. Never `fingerprints` for `visual`/high-confidence `voiceprint` speakers only. Never
stores `Unknown_N`. Update policy (`02 §2.9`): start = store latest with stores `Unknown_N`. Update policy (`02 §2.9`): start = store latest with
`overlap_confidence ≥ ~0.8`; consider per-name running mean later. `overlap_confidence ≥ ~0.8`; consider per-name running mean later.
## 8. Recap outputs (`transcript.md`, `recap.{html,json}`)
After `speakers.json` is assembled, the recap phase renders the human-readable
deliverables: a `transcript.md` (one line per diarized utterance) and an HTML
`recap.html`, backed by a structured `recap.json`. The recap's topic/summary
content is generated by the **backend LLM** (`POST /v1/chat/completions`, Qwen3);
the app owns the rendering and the in-app **speaker-name editor**, which can rewrite
names across `speakers.json`, the transcript, and the recap after the fact.
-6
View File
@@ -1,11 +1,5 @@
# Build Plan — Ten31 Transcripts # Build Plan — Ten31 Transcripts
> **Status: COMPLETE (historical).** Phases 06 shipped and the app is in daily
> use; a recap phase (transcript + HTML recap via the backend LLM) was added after
> this plan was written. Kept as the original build log and as the map for the
> "Phase N" references in the code comments. Forward-looking work lives in
> `ROADMAP.md`; current status in `AGENTS.md`.
Companion to docs 0103. Phased plan for the Claude Code session, each phase with Companion to docs 0103. Phased plan for the Claude Code session, each phase with
a demoable milestone. Build in order; the risky/novel work (visual adapters) is a demoable milestone. Build in order; the risky/novel work (visual adapters) is
isolated for independent tuning. The SparkControl contract is now known isolated for independent tuning. The SparkControl contract is now known
+5 -8
View File
@@ -7,20 +7,17 @@ options:
createIntermediateGroups: true createIntermediateGroups: true
groupSortPosition: top groupSortPosition: top
# Signing identity (DEVELOPMENT_TEAM) is kept out of source in a gitignored xcconfig
# so the Team ID isn't committed. Copy Config/Signing.xcconfig.example to
# Config/Signing.xcconfig and set your team. Keeping the value stable is what makes
# macOS TCC grants (Mic / Screen Recording / Accessibility) persist across rebuilds.
configFiles:
Debug: Config/Signing.xcconfig
Release: Config/Signing.xcconfig
settings: settings:
base: base:
MARKETING_VERSION: "0.1.0" MARKETING_VERSION: "0.1.0"
CURRENT_PROJECT_VERSION: "1" CURRENT_PROJECT_VERSION: "1"
SWIFT_VERSION: "5.0" SWIFT_VERSION: "5.0"
CODE_SIGN_STYLE: Automatic CODE_SIGN_STYLE: Automatic
# Grant's free personal team (cert OU). Baked in so `xcodegen generate` keeps
# a STABLE signing identity across regenerations — macOS ties TCC permission
# grants (Mic / Screen Recording / Accessibility) to this identity, so a
# stable team is what makes those permissions persist across rebuilds.
DEVELOPMENT_TEAM: "BK4Y6CXN35"
targets: targets:
Ten31Transcripts: Ten31Transcripts: