# AGENTS.md: Ten31 Transcripts

Native macOS **menu-bar app** that detects video calls, records dual-track audio, and watches the call window for active-speaker cues. It sends the audio plus a visual timeline to a self-hosted **SparkControl** backend, which handles transcription, diarization, and naming, and produces named transcripts and recaps.

## Stack (versions that matter)
- **Swift 5.0**, **SwiftUI** + AppKit, macOS **13.0** deployment target. `LSUIElement` (menu-bar only, no Dock icon).
- Project is generated by **XcodeGen** from `project.yml` (`brew install xcodegen`). `*.xcodeproj` is **gitignored**: regenerate, don't edit.
- Full Xcode lives at `/Applications/Xcode.app`, but `xcode-select` points at CommandLineTools → **set `DEVELOPER_DIR` for every `xcodebuild`**.
- Bundle id `xyz.ten31.transcripts`; `DEVELOPMENT_TEAM` (Apple Team ID) is set in a **gitignored `Config/Signing.xcconfig`** (copy `Config/Signing.xcconfig.example` and set your team). Keep it stable: a constant signing identity is what preserves TCC grants across rebuilds.
- Backend: SparkControl gateway at `$SPARK_BACKEND_URL`, a private LAN backend (IP or `.local` host) with a Start9 self-signed cert. Install the StartOS Root CA in the System keychain so normal TLS validation succeeds; skip-TLS is an opt-in, **host-scoped** escape hatch, **off by default** (see `InsecureTrustDelegate`). Resolution order: a value saved in **Settings → SparkControl backend** (UserDefaults) wins, else the `SPARK_BACKEND_URL` env var, else the placeholder default in `AppSettings.swift`. Diarization = Sortformer/TitaNet (**mono-only**, ~4 speakers/chunk); LLM = Qwen3 via OpenAI-compatible `/v1/chat/completions`; audio via `/api/audio/label-merge`.

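The backend URL resolution order described above (saved setting, then env var, then placeholder) could be sketched as a pure function. This is a minimal illustration, not the code in `AppSettings.swift`: the function name, its parameters, and the whitespace trimming are assumptions made for the example.

```swift
import Foundation

/// Hypothetical helper (not the app's API): picks the backend base URL
/// string using the documented precedence.
func resolveBackendURL(saved: String?,
                       environment: [String: String],
                       placeholder: String) -> String {
    // 1. A value saved in Settings wins, if it is non-empty after trimming.
    if let saved = saved?.trimmingCharacters(in: .whitespacesAndNewlines),
       !saved.isEmpty {
        return saved
    }
    // 2. Otherwise fall back to the SPARK_BACKEND_URL environment variable.
    if let env = environment["SPARK_BACKEND_URL"]?
        .trimmingCharacters(in: .whitespacesAndNewlines),
       !env.isEmpty {
        return env
    }
    // 3. Otherwise use the placeholder default.
    return placeholder
}
```

At a call site, the inputs would presumably come from `UserDefaults.standard.string(forKey:)` and `ProcessInfo.processInfo.environment`. Keeping the function pure makes the precedence easy to unit-test offline, including the case where a saved URL shadows the env var.
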
## Commands
First time on a machine, create the local signing config (otherwise `xcodegen generate` and signing won't find a team):
```
cp Config/Signing.xcconfig.example Config/Signing.xcconfig   # then set DEVELOPMENT_TEAM
```
Regenerate the Xcode project (after adding/removing/renaming any source file):
```
xcodegen generate
```
Build + run all tests:
```
DEVELOPER_DIR=/Applications/Xcode.app/Contents/Developer xcodebuild test \
  -project Ten31Transcripts.xcodeproj -scheme Ten31Transcripts \
  -destination 'platform=macOS' -derivedDataPath /tmp/ten31-dd
```
Run a **single** test (target/class/method):
```
DEVELOPER_DIR=/Applications/Xcode.app/Contents/Developer xcodebuild test \
  -project Ten31Transcripts.xcodeproj -scheme Ten31Transcripts \
  -destination 'platform=macOS' -derivedDataPath /tmp/ten31-dd \
  -only-testing:Ten31TranscriptsTests/SpeakerReconcilerTests/testCosine
```
Build only: replace `test` with `build`. **Lint/format:** none configured (no SwiftLint/SwiftFormat/Makefile); adding one is tracked in `ROADMAP.md`.

Build a standalone app and install/run it (Xcode does **not** need to stay open):
```
DEVELOPER_DIR=/Applications/Xcode.app/Contents/Developer xcodebuild \
  -project Ten31Transcripts.xcodeproj -scheme Ten31Transcripts \
  -configuration Release -derivedDataPath /tmp/ten31-release build
ditto /tmp/ten31-release/Build/Products/Release/Ten31Transcripts.app /Applications/Ten31Transcripts.app
open /Applications/Ten31Transcripts.app
```
**Fast validation harness** (preferred for visual/backend logic): compile the specific `Ten31Transcripts/**.swift` files plus a `main.swift` with `xcrun --sdk macosx swiftc -O ... main.swift -o x` and run against real fixtures (`example-screenshots/`) or saved sessions. Top-level code must live in the file literally named `main.swift`.

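A harness `main.swift` for the fast-validation approach might look roughly like the sketch below. The `cosine` function is a stand-in written for this example (in a real run it would come from the app source files compiled alongside), and `check` is a hypothetical helper, so treat the names as assumptions.

```swift
// main.swift: top-level statements are only allowed in this file.
import Foundation

/// Stand-in for the function under test (illustrative, not the app's code).
/// Returns the cosine similarity of two equal-length vectors, or 0 when the
/// inputs are empty, mismatched, or have zero magnitude.
func cosine(_ a: [Double], _ b: [Double]) -> Double {
    guard a.count == b.count, !a.isEmpty else { return 0 }
    var dot = 0.0, na = 0.0, nb = 0.0
    for i in a.indices {
        dot += a[i] * b[i]
        na += a[i] * a[i]
        nb += b[i] * b[i]
    }
    let denom = (na * nb).squareRoot()
    return denom > 0 ? dot / denom : 0
}

/// Hypothetical check helper: prints a line per case and exits non-zero
/// on the first failure so the harness can be used from a shell script.
func check(_ condition: Bool, _ label: String) {
    if condition {
        print("ok   \(label)")
    } else {
        print("FAIL \(label)")
        exit(1)
    }
}

check(abs(cosine([1, 0], [1, 0]) - 1) < 1e-9, "identical vectors")
check(abs(cosine([1, 0], [0, 1])) < 1e-9, "orthogonal vectors")
check(cosine([], []) == 0, "empty input")
```

Exiting non-zero on failure keeps the harness scriptable without pulling in XCTest, which suits quick iteration on pure logic.
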
## Layout (day one)
- `Ten31Transcripts/App/`: `@main` entry + `AppDelegate`.
- `Ten31Transcripts/Session/`: `SessionController` (state machine), `TranscriptPipeline`, `SessionPackager` (chunking), `TranscriptAssembler`, `SpeakerReconciler`, `ChunkPlan` (`ChunkMode`), `SpeakersFile`.
- `Ten31Transcripts/Visual/`: `VisualCapture`/`VisualObserver` (ScreenCaptureKit, ~3fps), `GridCallAnalyzer` (+ `FrameSampler`, `TextRecognizer`, `TimelineBuilder`, `VisualTimeline`, `SpeakerObservation`).
- `Ten31Transcripts/Adapters/`: per-app screen-readers (`MeetAdapter`, `ZoomAdapter`, `TeamsAdapter`, `SignalAdapter`) + `AdapterRegistry`.
- `Ten31Transcripts/Audio/`: `AudioRecorder`, `MicVAD`, `ChannelSelfVAD`.
- `Ten31Transcripts/Backend/`: `SparkControlClient`, `GatewayLLMClient`, `VoiceprintStore`, `SparkControlHealth`, `InsecureTrustDelegate` (TLS skip).
- `Ten31Transcripts/Recap/`: `RecapAnalyzer`, `RecapRenderer` (writes `transcript.md` + `recap.html`), `RecapModels`, `RecapTemplate`, `SpeakerEditing`, `RecapEditModel`.
- `Ten31Transcripts/{Detection,Permissions,Settings,UI,Support}/`, in that order: `CallDetector`; `PermissionsManager`; `AppSettings` (UserDefaults); SwiftUI views + AppKit window hosts; `Info.plist` + entitlements.
- `Ten31TranscriptsTests/`: XCTest. `example-screenshots/`: real fixtures (gitignored). Also `docs/` and `README.md`.
- **Runtime output** (default `~/Ten31Transcripts/sessions/<ts>_<app>/`, configurable in Settings): `mic.wav`, `system.wav`, `mixed_mono_16k.wav`, `self_vad.json`, `visual_timeline.json`, `speakers.json` (output), `cluster_fingerprints.json`, `recap.{html,json}`, `transcript.md`.

## Conventions
- Match the surrounding file's style; small reviewable diffs; comments explain **why**, not what.
- Write/extend XCTest alongside non-trivial changes; pure logic (chunking, reconciliation, analyzer math) is unit-tested offline.
- Commits: imperative mood, concise; authored by Grant. Push to the self-hosted Gitea remote `origin` (branch `main`, over SSH) after committing; the remote URL lives in `.git/config`, kept out of source. Branch before committing; never commit to `main` without asking.
- Never commit recordings, transcripts, screenshots, or the generated `*.xcodeproj`.
- No API keys/tokens/passwords in the repo. The backend host (`$SPARK_BACKEND_URL`) and the Apple Team ID (`Config/Signing.xcconfig`, gitignored) are kept out of source; real values live in Settings/UserDefaults and the local xcconfig. Build env vars: `DEVELOPER_DIR` (required) and optional `SPARK_BACKEND_URL`.
- **Git history scrubbed (2026-06-13):** the private backend host + LAN IP were purged from all commits via `git filter-repo` (replaced with the `your-spark-backend.local` placeholder) and force-pushed; 0 hits across refs. Pre-rewrite backup bundle: `../ten31-transcripts-prehistory-rewrite.bundle`. A **second rewrite the same day** purged two backend LAN IPs that had slipped into a docs/test commit, replacing them with RFC 5737 documentation IPs (`192.0.2.1`/`192.0.2.2`) and force-pushing; 0 hits across refs; backup bundle `../ten31-transcripts-pre-ip-scrub.bundle`. The Apple Team ID was intentionally **not** scrubbed (it's public in every signed binary), so don't re-flag it.

## Always
- Set `DEVELOPER_DIR=/Applications/Xcode.app/Contents/Developer` on every `xcodebuild`.
- Run `xcodegen generate` after adding/removing/renaming source files.
- Treat the backend as the owner of transcription, diarization, and speaker naming; the app only records, watches, packages, and reconciles hints.
- Identify **self by the mic channel** + the single name in Settings → Your name, and keep that name reserved so the LLM never assigns it to another speaker.
- Treat visual active-speaker cues as **naming hints over audio diarization** (the backbone): prefer sparse-but-correct detection over dense-but-wrong.
- Send the backend dual-channel (`mic_file` + `system_file`) when the system track is healthy, else the mono `mixed_mono_16k.wav`; keep backend calls **sequential** (one in flight).
- After any code change, rebuild Release + `ditto` to `/Applications`; the installed copy does **not** auto-update.

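The hand-off rule above (dual-channel when the system track is healthy, otherwise mono, with one request in flight) could be sketched roughly as follows. `UploadPlan`, `planUpload`, and `sendSequentially` are illustrative names rather than the app's types, and what counts as a "healthy" system track is left to the caller.

```swift
/// Hypothetical description of what gets uploaded; the file names are the
/// session outputs listed in this document.
enum UploadPlan: Equatable {
    case dualChannel(micFile: String, systemFile: String)
    case mono(mixedFile: String)
}

/// Chooses dual-channel when the system track is usable, else the mono mix.
/// How "healthy" is determined is an assumption left outside this sketch.
func planUpload(systemTrackHealthy: Bool) -> UploadPlan {
    systemTrackHealthy
        ? .dualChannel(micFile: "mic.wav", systemFile: "system.wav")
        : .mono(mixedFile: "mixed_mono_16k.wav")
}

/// Sends items strictly one after another: each `send` is awaited before
/// the next begins, so at most one backend call is in flight.
func sendSequentially<T>(_ items: [T],
                         send: (T) async throws -> Void) async rethrows {
    for item in items {
        try await send(item)
    }
}
```

A plain awaited loop is used instead of a task group or an actor because actors are reentrant across suspension points and would not by themselves guarantee a single in-flight request.
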
## Never
- **Never write video frames to disk**: analyze in memory and release immediately (a privacy non-negotiable).
- **Never add Co-Authored-By / "Generated with" / any AI or tool attribution** to commits or PRs.
- Never commit secrets, recordings, transcripts, or `example-screenshots/` (faces + contact names).
- Never do per-platform display-name matching for self (Zoom/Meet/Signal names differ); use the channel + one canonical name only.
- Never treat a solid camera-off avatar tile (Meet's orange/magenta fill) as an active speaker. The real cue is a thin **hollow** coloured ring; require thin-edge + hue gate (see `GridCallAnalyzer.isHollow`, `FrameSampler.thinColoredPoints`).
- Never collapse adjacent same-speaker transcript segments (reverted by request); keep one line per diarized utterance.
- Never send call audio to a raw IP the user didn't configure. The backend host (`$SPARK_BACKEND_URL`) is a private `.local` mDNS name that a plain `swiftc` binary can't resolve via URLSession (error `-1009`), so use the **real app** for backend runs (or `curl` for health checks).
- Never commit to `main` or force-push a shared branch; branch first and ask.

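In the same spirit as only talking to the host the user configured, the host-scoped skip-TLS decision mentioned under Stack could be sketched as below. This is an assumption-laden illustration: the real logic lives in `InsecureTrustDelegate.allowsTrustOverride`, whose signature is not shown in this document, so `shouldOverrideTrust` and its parameters are hypothetical.

```swift
import Foundation

/// Hypothetical decision helper (not the app's API): a trust override is
/// allowed only when the user opted in AND the challenge comes from the
/// configured backend host. Everything else gets normal TLS validation.
func shouldOverrideTrust(challengeHost: String,
                         configuredBackend: URL?,
                         skipTLSEnabled: Bool) -> Bool {
    // Off by default: without the opt-in, never override.
    guard skipTLSEnabled,
          let configuredHost = configuredBackend?.host else {
        return false
    }
    // Host names are case-insensitive, so compare lowercased.
    return challengeHost.lowercased() == configuredHost.lowercased()
}
```

In a `URLSessionDelegate` challenge handler, a `false` result would typically map to `.performDefaultHandling`, so certificates for any other host are still validated normally.
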
## Current state
Present tense; overwritten each session. 73 tests pass; `/Applications/Ten31Transcripts.app` matches HEAD and runs; working tree clean and pushed to `origin`/`main`. A full independent evaluation ran 2026-06-13 → `EVALUATION.md` (committed at repo root; overwritten + re-committed each run for a reviewable diff); its findings are triaged into the lists below. The eval's P1 (TLS) is now **fixed** and verified against the live backend.
- **Working:** call detection (Meet/Zoom/Teams/Signal), dual-track capture, dual-channel + chunked backend hand-off, speaker reconciliation, recap (`transcript.md` + recap-relay-styled `recap.html`), speaker editor, configurable chunk length, standalone Settings window.
- **In progress:** the Meet visual fix (reject solid camera-off tiles) is unverified end-to-end. No clean run exists yet; the saved Meet session's `visual_timeline.json` predates the fix.
- **Done this session (was eval P1):** TLS validation is now **on by default** and the skip-TLS escape hatch is **scoped to the configured host** (`InsecureTrustDelegate.allowsTrustOverride`, covered by `InsecureTrustDelegateTests`). Supported path = the StartOS Root CA trusted in the System keychain; verified `URLSession` default validation returns 200 against both the primary backend IP and its fallback.
- **Work queue (next up):** wire the backend URL + primary→fallback into config. Today it's a single `backendBaseURL` with no fallback logic, and on this Mac no value is saved (so it resolves to the `your-spark-backend.local` placeholder); the real setup is a primary LAN IP with a fallback IP (both port 62419). The actual addresses live in Settings/UserDefaults, never source.
- **Known debt (P2, fix before wider use):**
  - `RecapAnalyzer.mmss()` fatally crashes on NaN/∞ (reproduced 2×); a malformed/MITM'd backend `duration` (e.g. `1e400` → `Double.infinity`) aborts the app at recap-render time. Add a finite-guard fallback (`RecapAnalyzer.swift:137`).
  - README is stale by six phases: it still says "Phase 0 (scaffold) / no audio capture, detection, or backend hand-off yet" for a shipped Phase-6 app; the same stale claim sits in source comment `AppSettings.swift:7`; and `README.md:49` still calls skip-TLS "on by default" (now off). Rewrite to match reality.
  - `SessionController` (670 lines, the most concurrency-dense file) has zero unit tests. Cover `pendingAutoStop` (auto-start-then-immediate-call-end) and the visual-adoption generation guard before any refactor.
- **Deferred (P3, later decision or bulk cleanup; full evidence in `EVALUATION.md`):** `docs/` specs drifted from the dual-channel API + recap phase; `docs/01` §7 lists already-resolved open items; `docs/02` §2.10 claims MenuBarUI features that don't exist; AGENTS.md Layout listings under `Audio/`/`Detection/` are incomplete; the `manifest.json` sha256 contract is specced but never written; env-var precedence footgun (saved URL shadows `SPARK_BACKEND_URL`); `SessionController` owns three jobs (extract the open-panel UI); unused `NSAppleEventsUsageDescription`; unauthenticated LAN backend (consider a shared bearer token).
- **Known bugs:** Meet speaking-detection is sparse (faint blue border); the mic channel emits some sub-second junk "self" fragments; the same person on desktop-mic vs phone-speakerphone does not unify by voiceprint.
- **Next (product validation; no agent could reach the live backend, so this stays manual):** (1) re-process the saved Meet session in the app, then read its `speakers.json` + `cluster_fingerprints.json` to confirm ~4 speakers recover; (2) record a fresh Meet call to validate the visual fix on a clean capture. (The old "confirm Your name = Grant" item is moot: the committed default is the generic `"Me"`; "Grant" only ever lives in local UserDefaults.)

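The finite-guard fallback called for in the P2 item on `RecapAnalyzer.mmss()` might look something like this sketch. The actual signature and formatting of `mmss()` are not shown in this document, so the free function, the `"--:--"` fallback string, the treatment of negative values, and the upper clamp are all assumptions.

```swift
/// Zero-pads a non-negative integer to at least two digits.
func pad2(_ n: Int) -> String {
    n < 10 ? "0\(n)" : "\(n)"
}

/// Hypothetical stand-in for a minutes:seconds formatter that cannot trap.
/// `Int(_:)` traps on NaN, infinity, and out-of-range doubles, so the
/// input is validated and clamped before the conversion.
func mmss(_ seconds: Double) -> String {
    // NaN, ±infinity, and negative durations fall back to a placeholder.
    guard seconds.isFinite, seconds >= 0 else { return "--:--" }
    // Arbitrary cap (assumption) so huge but finite values stay in Int range.
    let total = Int(min(seconds, 359_999).rounded(.down))
    return "\(pad2(total / 60)):\(pad2(total % 60))"
}
```

With this guard, a backend `duration` that decodes to `Double.infinity` renders as `--:--` instead of aborting the app at recap-render time.
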