Grant Gilliam 85ea8fde45 Rewrite README for the shipped app; fix stale AppSettings comment
The README still described "Phase 0 (scaffold)" — no audio capture, call detection, screen reading, or backend hand-off — for an app that ships all of it. Rewrite it to document the real detect/record/send/transcribe/recap pipeline, the standalone build+install commands, backend and Start9 Root CA setup (skip-TLS is off by default and host-scoped, not on by default), output files, and the real project layout. Also fix the matching "Phase 0" comment in AppSettings.
2026-06-16 21:54:54 -05:00

Ten31 Transcripts

Native macOS menu-bar app that auto-detects conference calls, records dual-track audio while watching the call window for active-speaker cues, and hands the audio plus a visual speaker timeline to a self-hosted SparkControl backend that does the transcription, diarization, and speaker naming — producing named transcripts and meeting recaps.

It runs as a menu-bar-only app (no Dock icon). All machine-learning work lives on the backend; the app only records, watches, packages, and reconciles hints.

How it works

  1. Detect — a call in Google Meet, Zoom, Teams, or Signal starts; CallDetector notices and (optionally) auto-starts a session.
  2. Record + watch — dual-track audio (your mic + system output) is captured while ScreenCaptureKit samples the call window (~3 fps) to read names and spot the active speaker. Video frames are analyzed in memory and released immediately — never written to disk.
  3. Package + send — audio is chunked and sent to the backend, dual-channel (mic_file + system_file) when the system track is healthy, else a mono mix. The visual timeline rides along as naming hints. Backend calls are sequential (one in flight) to respect the single-GPU backend.
  4. Transcribe + name — the backend diarizes (Sortformer/TitaNet) and an LLM (Qwen3, via an OpenAI-compatible endpoint) assigns names, helped by the visual hints and your stored voiceprints.
  5. Reconcile + recap — the app reconciles speaker hints, then writes a readable transcript.md and an HTML recap.html. A built-in speaker editor lets you fix names after the fact.
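The payload choice in step 3 can be sketched in shell. This is an illustration under assumptions, not the app's code (the app does this in Swift): the field names mic_file and system_file come from step 3, while the function name chunk_form_args, the mono field name file, and the per-chunk file names are hypothetical.

```shell
# Print the multipart form arguments for one chunk, one per line.
# mic_file/system_file are the field names given in the README; the mono
# field name "file" and the chunk file names are assumptions.
chunk_form_args() {
  local chunk_dir="$1" system_healthy="$2"
  if [ "$system_healthy" = "true" ]; then
    # Healthy system track: send both channels separately.
    printf '%s\n' "mic_file=@$chunk_dir/mic.wav" "system_file=@$chunk_dir/system.wav"
  else
    # Otherwise fall back to the single mono mix.
    printf '%s\n' "file=@$chunk_dir/mixed_mono_16k.wav"
  fi
}
```

Each printed line is in the form curl's -F option accepts. Sending chunks from a plain for loop, one request at a time, would match the one-in-flight rule; the backend endpoint is not documented here, so it is left out.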

You are identified by the mic channel plus the single name in Settings → Your name — that name is reserved so the LLM never assigns it to anyone else. (There's no per-platform display-name matching; your Zoom/Meet/Signal names can all differ.)

One-time setup

  1. Install Xcode from the Mac App Store (free; large download). Open it once and accept the license prompt.
  2. Install XcodeGen (generates the Xcode project from project.yml):
    brew install xcodegen
    
  3. Set your signing team. The Apple Team ID is kept out of source in a gitignored Config/Signing.xcconfig. Copy the template and set your team:
    cp Config/Signing.xcconfig.example Config/Signing.xcconfig   # then set DEVELOPMENT_TEAM
    
    xcodegen wires it in via configFiles, so Signing & Capabilities shows the team automatically. Keep the value stable so macOS preserves the app's permission (TCC) grants across rebuilds. Edit the xcconfig, not Xcode — xcodegen generate overwrites Xcode-side changes.
  4. Generate the project (re-run any time you add/remove/rename a source file):
    xcodegen generate
    
    This creates Ten31Transcripts.xcodeproj (gitignored — regenerate, don't edit).
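
Steps 2 to 4 can be wrapped in a small helper. This is a minimal sketch, assuming it is run from the repository root; the function names ensure_signing_config and setup_project are hypothetical, and only commands already shown above are used.

```shell
# Copy the signing template only if no local config exists yet, so an
# already-set DEVELOPMENT_TEAM is never overwritten.
ensure_signing_config() {
  local dir="${1:-Config}"
  if [ ! -f "$dir/Signing.xcconfig" ]; then
    cp "$dir/Signing.xcconfig.example" "$dir/Signing.xcconfig"
    echo "Created $dir/Signing.xcconfig; set DEVELOPMENT_TEAM before building."
  fi
}

# Install XcodeGen if it is missing, prepare signing, then generate the project.
setup_project() {
  command -v xcodegen >/dev/null 2>&1 || brew install xcodegen
  ensure_signing_config Config
  xcodegen generate
}
```

Calling setup_project again after adding, removing, or renaming a source file regenerates the project without touching an existing Signing.xcconfig.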

Build & run

The simplest path is to open Ten31Transcripts.xcodeproj and press Run (⌘R).

To build a standalone app and install it (so Xcode doesn't need to stay open), run the commands below. Note the DEVELOPER_DIR prefix: full Xcode lives at /Applications/Xcode.app, but xcode-select may point at the Command Line Tools, so set DEVELOPER_DIR on every xcodebuild invocation:

DEVELOPER_DIR=/Applications/Xcode.app/Contents/Developer xcodebuild \
  -project Ten31Transcripts.xcodeproj -scheme Ten31Transcripts \
  -configuration Release -derivedDataPath /tmp/ten31-release build
ditto /tmp/ten31-release/Build/Products/Release/Ten31Transcripts.app /Applications/Ten31Transcripts.app
open /Applications/Ten31Transcripts.app

The installed copy does not auto-update — rebuild and ditto again after changes.
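
To see why the DEVELOPER_DIR prefix matters on a given machine, the active developer directory can be inspected with xcode-select -p. The helper below is a sketch with a hypothetical name (xcode_developer_dir); it assumes full Xcode is installed at /Applications/Xcode.app, as stated above.

```shell
# Given the active developer directory (the output of `xcode-select -p`),
# print the DEVELOPER_DIR value to use for xcodebuild.
xcode_developer_dir() {
  local active="$1"
  case "$active" in
    # Already pointing inside a full Xcode bundle: keep it.
    */Xcode*.app/Contents/Developer) printf '%s\n' "$active" ;;
    # Command Line Tools or anything else: fall back to full Xcode.
    *) printf '%s\n' "/Applications/Xcode.app/Contents/Developer" ;;
  esac
}
```

For example, DEVELOPER_DIR="$(xcode_developer_dir "$(xcode-select -p)")" can stand in for the hard-coded prefix in the build command.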

Run the test suite:

DEVELOPER_DIR=/Applications/Xcode.app/Contents/Developer xcodebuild test \
  -project Ten31Transcripts.xcodeproj -scheme Ten31Transcripts \
  -destination 'platform=macOS' -derivedDataPath /tmp/ten31-dd

Permissions

The menu panel shows live status for the three permissions the app needs, each with Grant / Open Settings buttons:

  • Microphone — to record your side of the call.
  • Screen Recording — to capture system audio and watch the call window.
  • Accessibility — to read window/participant information.

Backend setup

Point the app at your SparkControl backend in Settings → SparkControl backend. The URL is resolved in this order: the value saved in Settings (UserDefaults) if present, otherwise the SPARK_BACKEND_URL environment variable, otherwise the committed default. That default is only a neutral placeholder (https://your-spark-backend.local); your real LAN URL lives in Settings and never touches source.
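
The precedence can be illustrated in shell. This is only a sketch of the ordering, not the app's Swift implementation: the function name resolve_backend_url is hypothetical, the saved Settings value is passed in as an argument rather than read from UserDefaults, and treating an empty value as "not set" is an assumption.

```shell
# Resolve the backend URL: saved Settings value, then SPARK_BACKEND_URL,
# then the committed placeholder. Empty strings count as unset (assumption).
resolve_backend_url() {
  local saved="${1:-}"
  local placeholder="https://your-spark-backend.local"
  if [ -n "$saved" ]; then
    printf '%s\n' "$saved"
  elif [ -n "${SPARK_BACKEND_URL:-}" ]; then
    printf '%s\n' "$SPARK_BACKEND_URL"
  else
    printf '%s\n' "$placeholder"
  fi
}
```

With nothing saved and no environment variable, the function prints the placeholder, which is a signal that the backend still needs to be configured.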

The backend's TLS certificate chains to a Start9 self-signed Root CA. The supported path is to install the StartOS Root CA in your System keychain, after which normal TLS validation succeeds. Skip TLS verification is an opt-in escape hatch: it is off by default and scoped to the configured backend host, so it never becomes "trust any server."
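
The host scoping can be pictured as a simple predicate. The sketch below is illustrative shell, not the app's TLS code: the names url_host and may_skip_tls are hypothetical, and the URL parsing is deliberately simple (no IPv6 literals, no user:password@ prefix).

```shell
# Extract the lowercase host from a URL such as https://host:port/path.
url_host() {
  local rest="${1#*://}"   # drop the scheme
  rest="${rest%%/*}"       # drop the path
  rest="${rest%%:*}"       # drop the port
  printf '%s\n' "$rest" | tr '[:upper:]' '[:lower:]'
}

# Succeed only when skipping is enabled AND the server presenting the
# certificate is the configured backend host.
may_skip_tls() {
  local skip_enabled="$1" backend_url="$2" challenge_host="$3"
  [ "$skip_enabled" = "true" ] || return 1
  [ "$(url_host "$backend_url")" = "$(printf '%s' "$challenge_host" | tr '[:upper:]' '[:lower:]')" ]
}
```

Under this rule, enabling the option for one backend never relaxes validation for any other host the app might contact.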

Output

Each session writes to ~/Ten31Transcripts/sessions/<timestamp>_<app>/ (configurable in Settings):

mic.wav  system.wav  mixed_mono_16k.wav    # audio (dual-track + mono mix)
self_vad.json  visual_timeline.json        # self voice-activity + visual hints
speakers.json  cluster_fingerprints.json   # reconciled speakers + voiceprints
transcript.md  recap.html  recap.json      # final outputs
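
A quick way to check a finished session is to list which of these files are absent. This is a sketch with a hypothetical function name (missing_session_files); it assumes every file listed above is expected in every session, which may not hold, for example, when the system track was unavailable.

```shell
# Print the name of each expected output file that is missing from a
# session directory, one per line. Prints nothing if all are present.
missing_session_files() {
  local dir="$1" f
  for f in mic.wav system.wav mixed_mono_16k.wav \
           self_vad.json visual_timeline.json \
           speakers.json cluster_fingerprints.json \
           transcript.md recap.html recap.json; do
    [ -e "$dir/$f" ] || printf '%s\n' "$f"
  done
}
```

Pass it the path of one session directory under ~/Ten31Transcripts/sessions/ (or wherever Settings points).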

Project layout

project.yml                # XcodeGen recipe → generates the .xcodeproj
Ten31Transcripts/
  App/         @main entry + AppDelegate
  Detection/   CallDetector — which app is in a call
  Audio/       dual-track capture, mixing, resampling, self-VAD
  Visual/      ScreenCaptureKit capture + grid analysis → speaker timeline
  Adapters/    per-app screen-readers (Meet, Zoom, Teams, Signal) + registry
  Session/     SessionController state machine, packaging, reconciliation
  Backend/     SparkControl + LLM clients, voiceprint store, TLS handling
  Recap/       transcript.md + recap.html rendering, speaker editor
  Permissions/ Settings/ UI/ Support/   (permissions, AppSettings, views, Info.plist)
Ten31TranscriptsTests/     # XCTest — pure logic (chunking, reconciliation, analyzer math)
docs/                      # architecture & data-contract design notes

Notes

  • App Sandbox is off and Hardened Runtime is off — this is a personal, LAN-only tool that must observe other apps. Revisit only if distributing.
  • Privacy: video frames are never written to disk; recordings, transcripts, and screenshots are gitignored and never committed.
  • AGENTS.md is the canonical reference for build commands, conventions, and current state; ROADMAP.md holds the backlog; docs/ holds the architecture and data-contract design notes.