Phase 0: menu-bar scaffold, permissions, backend health check

Native SwiftUI menu-bar app (LSUIElement, macOS 13+), generated from project.yml via XcodeGen. Includes:
- PermissionsManager (Microphone / Screen Recording / Accessibility) + UI
- SparkControlHealth: GET /api/status over self-signed TLS (InsecureTrustDelegate)
- AppSettings persistence (host, TLS-skip, output folder, adapter toggles)
- Menu-bar panel + Settings, app sandbox & hardened runtime off (LAN tool)
2026-06-05 19:33:53 -05:00
commit b2ae3a62b9
19 changed files with 1448 additions and 0 deletions
@@ -0,0 +1,16 @@
+# macOS
+.DS_Store
+
+# Xcode project is generated from project.yml by XcodeGen.
+# Regenerate any time with:  xcodegen generate
+*.xcodeproj
+*.xcworkspace
+xcuserdata/
+DerivedData/
+build/
+
+# Swift / SwiftPM
+.build/
+
+# App output (never commit recordings or transcripts)
+/Ten31Transcripts-output/
@@ -0,0 +1,64 @@
+# Ten31 Transcripts
+
+Native macOS menu-bar app that auto-detects conference calls, records local audio,
+builds a visual-derived speaker timeline, and hands audio + timeline to the
+SparkControl backend for naming/transcription. See `docs/` for the full spec.
+
+This repo is at **Phase 0** (scaffold, permissions, backend health check).
+
+## One-time setup
+
+1. **Install Xcode** from the Mac App Store (free; ~40 GB). Open it once and
+   accept the license prompt.
+2. **Install XcodeGen** (generates the Xcode project from `project.yml`):
+   ```sh
+   brew install xcodegen
+   ```
+3. **Generate the project:**
+   ```sh
+   xcodegen generate
+   ```
+   This creates `Ten31Transcripts.xcodeproj` (git-ignored — regenerate any time).
+4. **Open it:**
+   ```sh
+   open Ten31Transcripts.xcodeproj
+   ```
+5. In Xcode, select the **Ten31Transcripts** target → **Signing & Capabilities**:
+   - Check **Automatically manage signing**.
+   - For **Team**, pick your personal team (sign in with your Apple ID — free; no
+     paid developer account needed). A stable team keeps macOS from re-asking for
+     permissions on every rebuild.
+6. Press **Run** (⌘R).
+
+## What Phase 0 does
+
+- Launches as a menu-bar-only app (no Dock icon).
+- Menu panel shows live status for the three permissions it needs — **Microphone**,
+  **Screen Recording**, **Accessibility** — with Grant / Open Settings buttons.
+- Shows a **backend health check** (`GET /api/status`) against the configured host.
+- **Settings:** backend base URL, skip-TLS toggle (on by default for the
+  self-signed cert), output folder, and adapter toggles (inert this phase).
+
+No audio capture, call detection, screen reading, or backend hand-off yet — those
+arrive in Phases 1–6 (`docs/04_BUILD_PLAN.md`).
+
+## Project layout
+
+```
+project.yml                     # XcodeGen recipe → generates the .xcodeproj
+Ten31Transcripts/
+  App/        Ten31TranscriptsApp.swift, AppDelegate.swift
+  UI/         MenuBarView, SettingsView, PermissionRow
+  Permissions/PermissionsManager.swift
+  Backend/    SparkControlHealth.swift, InsecureTrustDelegate.swift
+  Settings/   AppSettings.swift
+  Support/    Info.plist, Ten31Transcripts.entitlements
+Ten31TranscriptsTests/          # placeholder; real tests land in Phase 3
+```
+
+## Notes
+
+- **App Sandbox is off** and **Hardened Runtime is off** — this is a personal,
+  LAN-only tool that must observe other apps. Revisit only if distributing.
+- The default backend host is `https://your-spark-backend.local:62419` (editable in
+  Settings).
@@ -0,0 +1,10 @@
+import AppKit
+
+final class AppDelegate: NSObject, NSApplicationDelegate {
+    func applicationDidFinishLaunching(_ notification: Notification) {
+        // Run as a menu-bar accessory (no Dock icon, no main window).
+        // LSUIElement in Info.plist already enforces this; set it explicitly too
+        // so behavior is unambiguous regardless of how the app is launched.
+        NSApp.setActivationPolicy(.accessory)
+    }
+}
@@ -0,0 +1,28 @@
+import SwiftUI
+
+/// Menu-bar-only app entry point.
+///
+/// `LSUIElement` (set in Info.plist) keeps the app out of the Dock; the
+/// `MenuBarExtra` scene provides the status-bar item and its panel. Phase 0 only
+/// wires up permissions, settings, and a backend health check — no audio,
+/// capture, or call detection yet.
+@main
+struct Ten31TranscriptsApp: App {
+    @NSApplicationDelegateAdaptor(AppDelegate.self) private var appDelegate
+
+    @StateObject private var settings = AppSettings()
+    @StateObject private var permissions = PermissionsManager()
+    @StateObject private var health = SparkControlHealth()
+
+    var body: some Scene {
+        MenuBarExtra {
+            MenuBarView()
+                .environmentObject(settings)
+                .environmentObject(permissions)
+                .environmentObject(health)
+        } label: {
+            Image(systemName: "waveform.circle")
+        }
+        .menuBarExtraStyle(.window)
+    }
+}
@@ -0,0 +1,24 @@
+import Foundation
+
+/// URLSession delegate that trusts the server certificate without validation.
+///
+/// SparkControl sits behind a Start9 self-signed Root CA on the LAN, so default
+/// trust evaluation rejects it. This delegate is used **only** when the
+/// "Skip TLS verification" setting is on. It trusts any server certificate —
+/// acceptable for a personal tool on a trusted local network and nothing else.
+final class InsecureTrustDelegate: NSObject, URLSessionDelegate {
+    func urlSession(
+        _ session: URLSession,
+        didReceive challenge: URLAuthenticationChallenge,
+        completionHandler: @escaping (URLSession.AuthChallengeDisposition, URLCredential?) -> Void
+    ) {
+        guard
+            challenge.protectionSpace.authenticationMethod == NSURLAuthenticationMethodServerTrust,
+            let serverTrust = challenge.protectionSpace.serverTrust
+        else {
+            completionHandler(.performDefaultHandling, nil)
+            return
+        }
+        completionHandler(.useCredential, URLCredential(trust: serverTrust))
+    }
+}
@@ -0,0 +1,66 @@
+import Foundation
+import Combine
+
+/// Performs the Phase 0 backend reachability check: `GET {baseURL}/api/status`.
+///
+/// This is a thin slice — the full `SparkControlClient` (label-merge, multipart,
+/// sequential queueing, retries) arrives in Phase 5.
+@MainActor
+final class SparkControlHealth: ObservableObject {
+
+    enum Status: Equatable {
+        case unknown
+        case checking
+        case online(String)
+        case offline(String)
+    }
+
+    @Published private(set) var status: Status = .unknown
+    @Published private(set) var lastChecked: Date?
+
+    func check(baseURL: String, skipTLS: Bool) async {
+        status = .checking
+
+        let trimmed = baseURL.trimmingCharacters(in: .whitespacesAndNewlines)
+        let base = trimmed.hasSuffix("/") ? String(trimmed.dropLast()) : trimmed
+        guard !base.isEmpty, let url = URL(string: base + "/api/status") else {
+            status = .offline("Invalid host URL")
+            return
+        }
+
+        let config = URLSessionConfiguration.ephemeral
+        config.timeoutIntervalForRequest = 8
+        config.waitsForConnectivity = false
+
+        let delegate: URLSessionDelegate? = skipTLS ? InsecureTrustDelegate() : nil
+        let session = URLSession(configuration: config, delegate: delegate, delegateQueue: nil)
+        defer { session.finishTasksAndInvalidate() }
+
+        do {
+            let (data, response) = try await session.data(from: url)
+            lastChecked = Date()
+            guard let http = response as? HTTPURLResponse else {
+                status = .offline("No HTTP response")
+                return
+            }
+            if (200..<300).contains(http.statusCode) {
+                status = .online(Self.summarize(data) ?? "Reachable")
+            } else {
+                status = .offline("HTTP \(http.statusCode)")
+            }
+        } catch {
+            lastChecked = Date()
+            status = .offline(error.localizedDescription)
+        }
+    }
+
+    /// Best-effort one-line summary of the `/api/status` body, if it's JSON.
+    private static func summarize(_ data: Data) -> String? {
+        guard let object = try? JSONSerialization.jsonObject(with: data) as? [String: Any] else {
+            return nil
+        }
+        if let s = object["status"] as? String { return s }
+        if let s = object["state"] as? String { return s }
+        return "Reachable"
+    }
+}
@@ -0,0 +1,89 @@
+import AVFoundation
+import CoreGraphics
+import ApplicationServices
+import AppKit
+import Combine
+
+enum PermissionState {
+    case granted
+    case denied
+    case notDetermined
+}
+
+/// Tracks and requests the three TCC permissions the app needs.
+///
+/// - Microphone: AVFoundation authorization (has a real "not determined" state).
+/// - Screen Recording: CoreGraphics preflight/request (binary granted/denied).
+/// - Accessibility: AXIsProcessTrusted (binary granted/denied).
+@MainActor
+final class PermissionsManager: ObservableObject {
+
+    @Published private(set) var microphone: PermissionState = .notDetermined
+    @Published private(set) var screenRecording: PermissionState = .notDetermined
+    @Published private(set) var accessibility: PermissionState = .notDetermined
+
+    init() {
+        refresh()
+    }
+
+    func refresh() {
+        microphone = Self.microphoneState()
+        screenRecording = CGPreflightScreenCaptureAccess() ? .granted : .denied
+        accessibility = AXIsProcessTrusted() ? .granted : .denied
+    }
+
+    // MARK: - Requests
+
+    func requestMicrophone() {
+        AVCaptureDevice.requestAccess(for: .audio) { _ in
+            Task { @MainActor in self.refresh() }
+        }
+    }
+
+    /// Triggers the system Screen Recording prompt on first call. The user must
+    /// still toggle the app on in System Settings; `refresh()` reflects it after.
+    func requestScreenRecording() {
+        _ = CGRequestScreenCaptureAccess()
+        refresh()
+    }
+
+    /// Shows the Accessibility trust prompt (deep-links to the right pane).
+    func requestAccessibility() {
+        // Literal is the value of `kAXTrustedCheckOptionPrompt`; used directly to
+        // stay robust across SDK import shapes of that constant.
+        let options = ["AXTrustedCheckOptionPrompt": true] as CFDictionary
+        _ = AXIsProcessTrustedWithOptions(options)
+        refresh()
+    }
+
+    func openSettings(_ pane: SettingsPane) {
+        guard let url = URL(string: pane.urlString) else { return }
+        NSWorkspace.shared.open(url)
+    }
+
+    // MARK: - Helpers
+
+    private static func microphoneState() -> PermissionState {
+        switch AVCaptureDevice.authorizationStatus(for: .audio) {
+        case .authorized:          return .granted
+        case .denied, .restricted: return .denied
+        case .notDetermined:       return .notDetermined
+        @unknown default:          return .notDetermined
+        }
+    }
+
+    enum SettingsPane {
+        case microphone
+        case screenRecording
+        case accessibility
+
+        var urlString: String {
+            let root = "x-apple.systempreferences:com.apple.preference.security?"
+            switch self {
+            case .microphone:      return root + "Privacy_Microphone"
+            case .screenRecording: return root + "Privacy_ScreenCapture"
+            case .accessibility:   return root + "Privacy_Accessibility"
+            }
+        }
+    }
+}
@@ -0,0 +1,66 @@
+import Foundation
+import Combine
+
+/// User-facing settings, persisted to `UserDefaults`.
+///
+/// Phase 0 scope: backend host + TLS-skip, output folder, and adapter toggles.
+/// The adapter toggles persist but do nothing yet (adapters arrive in Phase 3–4).
+@MainActor
+final class AppSettings: ObservableObject {
+
+    /// Adapters the app will eventually run, in display order.
+    static let adapterKeys: [(key: String, label: String)] = [
+        ("zoom",   "Zoom"),
+        ("teams",  "Microsoft Teams"),
+        ("signal", "Signal"),
+        ("meet",   "Google Meet"),
+    ]
+
+    @Published var backendBaseURL: String {
+        didSet { defaults.set(backendBaseURL, forKey: Keys.backendBaseURL) }
+    }
+
+    @Published var skipTLSVerification: Bool {
+        didSet { defaults.set(skipTLSVerification, forKey: Keys.skipTLS) }
+    }
+
+    @Published var outputFolderPath: String {
+        didSet { defaults.set(outputFolderPath, forKey: Keys.outputFolder) }
+    }
+
+    @Published var adapterEnabled: [String: Bool] {
+        didSet { defaults.set(adapterEnabled, forKey: Keys.adapterEnabled) }
+    }
+
+    /// Output folder as a resolved file URL (expands a leading `~`).
+    var outputFolderURL: URL {
+        URL(fileURLWithPath: (outputFolderPath as NSString).expandingTildeInPath,
+            isDirectory: true)
+    }
+
+    private let defaults: UserDefaults
+
+    init(defaults: UserDefaults = .standard) {
+        self.defaults = defaults
+
+        self.backendBaseURL = defaults.string(forKey: Keys.backendBaseURL)
+            ?? "https://your-spark-backend.local:62419"
+
+        self.skipTLSVerification = defaults.object(forKey: Keys.skipTLS) as? Bool ?? true
+
+        self.outputFolderPath = defaults.string(forKey: Keys.outputFolder)
+            ?? "~/Ten31Transcripts"
+
+        let stored = defaults.dictionary(forKey: Keys.adapterEnabled) as? [String: Bool]
+        self.adapterEnabled = stored ?? Dictionary(
+            uniqueKeysWithValues: Self.adapterKeys.map { ($0.key, true) }
+        )
+    }
+
+    private enum Keys {
+        static let backendBaseURL = "backendBaseURL"
+        static let skipTLS = "skipTLSVerification"
+        static let outputFolder = "outputFolderPath"
+        static let adapterEnabled = "adapterEnabled"
+    }
+}
@@ -0,0 +1,43 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
+<plist version="1.0">
+<dict>
+    <key>CFBundleDevelopmentRegion</key>
+    <string>en</string>
+    <key>CFBundleExecutable</key>
+    <string>$(EXECUTABLE_NAME)</string>
+    <key>CFBundleIdentifier</key>
+    <string>$(PRODUCT_BUNDLE_IDENTIFIER)</string>
+    <key>CFBundleInfoDictionaryVersion</key>
+    <string>6.0</string>
+    <key>CFBundleName</key>
+    <string>$(PRODUCT_NAME)</string>
+    <key>CFBundleDisplayName</key>
+    <string>Ten31 Transcripts</string>
+    <key>CFBundlePackageType</key>
+    <string>APPL</string>
+    <key>CFBundleShortVersionString</key>
+    <string>$(MARKETING_VERSION)</string>
+    <key>CFBundleVersion</key>
+    <string>$(CURRENT_PROJECT_VERSION)</string>
+    <key>LSMinimumSystemVersion</key>
+    <string>$(MACOSX_DEPLOYMENT_TARGET)</string>
+    <key>LSUIElement</key>
+    <true/>
+    <key>NSPrincipalClass</key>
+    <string>NSApplication</string>
+    <key>NSHumanReadableCopyright</key>
+    <string>Ten31</string>
+    <key>NSMicrophoneUsageDescription</key>
+    <string>Ten31 Transcripts records your microphone during calls to build the local audio track.</string>
+    <key>NSAppleEventsUsageDescription</key>
+    <string>Ten31 Transcripts reads the active browser tab's URL to detect Google Meet calls.</string>
+    <key>NSLocalNetworkUsageDescription</key>
+    <string>Ten31 Transcripts connects to your SparkControl server on the local network.</string>
+    <key>NSAppTransportSecurity</key>
+    <dict>
+        <key>NSAllowsLocalNetworking</key>
+        <true/>
+    </dict>
+</dict>
+</plist>
@@ -0,0 +1,11 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
+<plist version="1.0">
+<dict>
+    <!-- App Sandbox is intentionally OFF: the app must read other apps' windows,
+         audio, accessibility trees, and the browser's active-tab URL, which the
+         sandbox forbids. This is a personal, LAN-only tool. -->
+    <key>com.apple.security.app-sandbox</key>
+    <false/>
+</dict>
+</plist>
@@ -0,0 +1,116 @@
+import SwiftUI
+import AppKit
+
+/// The menu-bar panel: permission statuses, backend health, and a link to
+/// Settings. Shown when the user clicks the status-bar item.
+struct MenuBarView: View {
+    @EnvironmentObject private var settings: AppSettings
+    @EnvironmentObject private var permissions: PermissionsManager
+    @EnvironmentObject private var health: SparkControlHealth
+
+    var body: some View {
+        NavigationStack {
+            VStack(alignment: .leading, spacing: 12) {
+                header
+                Divider()
+                permissionsSection
+                Divider()
+                backendSection
+                Divider()
+                footer
+            }
+            .padding(14)
+            .frame(width: 320)
+        }
+        .onAppear { permissions.refresh() }
+        .task { await refreshHealth() }
+    }
+
+    private var header: some View {
+        VStack(alignment: .leading, spacing: 2) {
+            Text("Ten31 Transcripts").font(.headline)
+            Text("Phase 0 · setup & status")
+                .font(.caption)
+                .foregroundStyle(.secondary)
+        }
+    }
+
+    private var permissionsSection: some View {
+        VStack(alignment: .leading, spacing: 8) {
+            Text("Permissions").font(.subheadline).bold()
+            PermissionRow(
+                title: "Microphone",
+                state: permissions.microphone,
+                onGrant: permissions.requestMicrophone,
+                onOpenSettings: { permissions.openSettings(.microphone) }
+            )
+            PermissionRow(
+                title: "Screen Recording",
+                state: permissions.screenRecording,
+                onGrant: permissions.requestScreenRecording,
+                onOpenSettings: { permissions.openSettings(.screenRecording) }
+            )
+            PermissionRow(
+                title: "Accessibility",
+                state: permissions.accessibility,
+                onGrant: permissions.requestAccessibility,
+                onOpenSettings: { permissions.openSettings(.accessibility) }
+            )
+        }
+    }
+
+    private var backendSection: some View {
+        VStack(alignment: .leading, spacing: 8) {
+            HStack {
+                Text("Backend").font(.subheadline).bold()
+                Spacer()
+                Button("Check") { Task { await refreshHealth() } }
+                    .disabled(health.status == .checking)
+            }
+            Text(settings.backendBaseURL)
+                .font(.caption)
+                .foregroundStyle(.secondary)
+                .lineLimit(1)
+                .truncationMode(.middle)
+            HStack(spacing: 8) {
+                StatusDot(color: healthColor)
+                Text(healthText).font(.caption)
+            }
+        }
+    }
+
+    private var footer: some View {
+        HStack {
+            NavigationLink("Settings…") {
+                SettingsView()
+            }
+            Spacer()
+            Button("Quit") { NSApplication.shared.terminate(nil) }
+        }
+    }
+
+    private func refreshHealth() async {
+        await health.check(
+            baseURL: settings.backendBaseURL,
+            skipTLS: settings.skipTLSVerification
+        )
+    }
+
+    private var healthColor: Color {
+        switch health.status {
+        case .online:   return .green
+        case .offline:  return .red
+        case .checking: return .orange
+        case .unknown:  return .gray
+        }
+    }
+
+    private var healthText: String {
+        switch health.status {
+        case .unknown:            return "Not checked yet"
+        case .checking:           return "Checking…"
+        case .online(let detail): return "Online · \(detail)"
+        case .offline(let error): return "Offline · \(error)"
+        }
+    }
+}
@@ -0,0 +1,54 @@
+import SwiftUI
+
+/// Small status indicator dot.
+struct StatusDot: View {
+    let color: Color
+    var body: some View {
+        Circle().fill(color).frame(width: 9, height: 9)
+    }
+}
+
+/// One permission line: status dot, label, and a context-appropriate action.
+struct PermissionRow: View {
+    let title: String
+    let state: PermissionState
+    let onGrant: () -> Void
+    let onOpenSettings: () -> Void
+
+    var body: some View {
+        HStack(spacing: 8) {
+            StatusDot(color: dotColor)
+            Text(title)
+            Spacer()
+            actionButton
+        }
+    }
+
+    private var dotColor: Color {
+        switch state {
+        case .granted:       return .green
+        case .denied:        return .red
+        case .notDetermined: return .orange
+        }
+    }
+
+    @ViewBuilder
+    private var actionButton: some View {
+        switch state {
+        case .granted:
+            Image(systemName: "checkmark.circle.fill").foregroundStyle(.green)
+        case .notDetermined:
+            // Native prompt (Microphone). The request also registers the app.
+            Button("Grant", action: onGrant)
+        case .denied:
+            // Screen Recording / Accessibility report binary granted/denied, so
+            // "not yet asked" looks like denied. "Grant" calls the request API,
+            // which registers the app in the relevant list and shows the system
+            // prompt the first time; "Open Settings" is the manual fallback.
+            HStack(spacing: 6) {
+                Button("Grant", action: onGrant)
+                Button("Open Settings", action: onOpenSettings)
+            }
+        }
+    }
+}
@@ -0,0 +1,67 @@
+import SwiftUI
+import AppKit
+
+/// Settings panel (pushed from the menu-bar panel).
+struct SettingsView: View {
+    @EnvironmentObject private var settings: AppSettings
+
+    var body: some View {
+        Form {
+            Section("SparkControl backend") {
+                TextField("Base URL", text: $settings.backendBaseURL)
+                    .textFieldStyle(.roundedBorder)
+                Toggle("Skip TLS verification (self-signed cert)",
+                       isOn: $settings.skipTLSVerification)
+            }
+
+            Section("Output") {
+                HStack {
+                    Text(settings.outputFolderPath)
+                        .lineLimit(1)
+                        .truncationMode(.middle)
+                        .foregroundStyle(.secondary)
+                    Spacer()
+                    Button("Choose…", action: chooseFolder)
+                }
+            }
+
+            Section("Adapters") {
+                Text("Inert in Phase 0 — these toggles only persist for now.")
+                    .font(.caption)
+                    .foregroundStyle(.secondary)
+                ForEach(AppSettings.adapterKeys, id: \.key) { adapter in
+                    Toggle(adapter.label, isOn: binding(for: adapter.key))
+                }
+            }
+        }
+        .formStyle(.grouped)
+        .frame(width: 320)
+        .navigationTitle("Settings")
+    }
+
+    private func binding(for key: String) -> Binding<Bool> {
+        Binding(
+            get: { settings.adapterEnabled[key] ?? true },
+            set: { settings.adapterEnabled[key] = $0 }
+        )
+    }
+
+    private func chooseFolder() {
+        let panel = NSOpenPanel()
+        panel.canChooseDirectories = true
+        panel.canChooseFiles = false
+        panel.allowsMultipleSelection = false
+        panel.prompt = "Choose"
+        panel.directoryURL = settings.outputFolderURL
+
+        // The app is a menu-bar accessory and this is invoked from the transient
+        // MenuBarExtra(.window) popover. Use the async begin(...) API rather than
+        // runModal() — a nested modal loop can let the popover dismiss the panel
+        // out from under it. Activate first so the panel comes to the front.
+        NSApp.activate(ignoringOtherApps: true)
+        panel.begin { response in
+            guard response == .OK, let url = panel.url else { return }
+            settings.outputFolderPath = url.path
+        }
+    }
+}
@@ -0,0 +1,9 @@
+import XCTest
+
+/// Placeholder so the test target is wired and green from Phase 0.
+/// Real coverage (adapter fixture tests) arrives in Phase 3.
+final class Ten31TranscriptsTests: XCTestCase {
+    func testScaffoldBuilds() throws {
+        XCTAssertTrue(true)
+    }
+}
@@ -0,0 +1,151 @@
+# Project Brief — Ten31 Transcripts
+
+> Local macOS app that auto-detects conference calls, records local audio, and
+> produces a **visual-derived, timestamped speaker timeline** — then hands the
+> mixed audio + that timeline to the operator's **SparkControl** backend, which
+> diarizes, names the speakers (majority-overlap vote against the timeline), and
+> returns named transcript segments. A growing **voiceprint library** recovers
+> speakers even when the visual cue is missing.
+
+Master context document. Read this first, then `02_ARCHITECTURE.md`,
+`03_DATA_CONTRACTS.md`, `04_BUILD_PLAN.md`. The SparkControl API is now fully
+specified — see `03_DATA_CONTRACTS.md` (and the source `AUDIO_API.md`).
+
+---
+
+## 1. What we are building
+
+A lightweight, always-running **menu-bar app on macOS** that:
+
+1. **Detects** when the user joins a call in Google Meet, Zoom, Microsoft Teams,
+   or Signal.
+2. **Records two local audio tracks** — system audio (everyone else) and the
+   user's microphone (the user) — and **mixes them to one 16 kHz mono WAV** for
+   the backend.
+3. **Watches the call window** at ~2–4 fps and, per app, reads participant
+   **names** and the **active-speaker cue**, producing a
+   `(start, end, name, confidence)` **visual timeline** — its best guess at who
+   was talking when.
+4. **Discards every video frame after extraction.** No video is ever written to
+   disk. Only audio + the derived timeline persist locally.
+5. On call end, **POSTs the mixed audio + the visual timeline (+ the known
+   voiceprint library) to `POST /api/audio/label-merge`** on SparkControl, which
+   returns **named, speaker-attributed transcript segments** and a **voiceprint
+   per speaker**.
+6. **Persists the returned voiceprints** keyed by name, so the next call can pass
+   them as `known_voiceprints` and recover a speaker by voice when the visual cue
+   is absent (camera off, a bad OCR frame).
+
+The app's job ends at receiving and storing the named segments from SparkControl.
+**All transcription, diarization, and the name-merge happen on the backend.** Do
+not build transcription, diarization, or the merge vote in this app.
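+
+A minimal sketch of the mix-down in step 2, assuming both tracks are already mono
+float PCM at the same sample rate (e.g. 16 kHz). `mixToMono` is a hypothetical
+name; the real capture path (ScreenCaptureKit and AVFoundation buffers,
+resampling, WAV writing) is not shown here.
+
+```swift
+import Foundation
+
+/// Mixes two mono tracks into one by averaging time-aligned samples.
+/// Assumption: both inputs share one sample rate and use float PCM in [-1, 1].
+func mixToMono(system: [Float], mic: [Float]) -> [Float] {
+    let count = max(system.count, mic.count)
+    var mixed = [Float](repeating: 0, count: count)
+    for i in 0..<count {
+        // Treat the shorter track as silence once it runs out.
+        let s = i < system.count ? system[i] : 0
+        let m = i < mic.count ? mic[i] : 0
+        // Average, then clamp in case an input sample was out of range.
+        mixed[i] = min(1, max(-1, (s + m) / 2))
+    }
+    return mixed
+}
+```
+
+Averaging rather than summing keeps the result inside the valid range without a
+separate limiter, at the cost of roughly 6 dB of level.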
+
+## 2. Why the visual timeline still matters (the core idea)
+
+Audio diarization (NVIDIA Sortformer on the backend) is excellent at
+**segmentation** — precise speaker boundaries — but its clusters are **anonymous**
+(`Speaker_0`, `Speaker_1`…). It cannot name anyone.
+
+The screen *already knows the names*. Each app visually marks the active speaker
+(colored tile border, animated audio bars, a ring around an avatar) next to that
+person's name/initials.
+
+So responsibilities split cleanly:
+- **Audio (backend)** owns *segmentation* — the exact *when*.
+- **Visual capture (this app)** owns *identity* — the *who*.
+- **`label-merge` (backend)** fuses them: it diarizes, then assigns each cluster
+  the timeline name with the most temporal overlap. The visual track needn't be
+  perfect — it only needs to win the per-cluster vote.
+
+**New compounding layer — the voiceprint library.** Every named cluster comes
+back from the backend with a 192-dim TitaNet voiceprint. The app persists these
+keyed by name and replays them as `known_voiceprints` next time. Resolution order
+per cluster becomes: **visual overlap → voiceprint match → `Unknown_N`** (never
+mislabeled). So the screen capture *enrolls a voice library for free*, and over a
+few calls the system can name regulars even with cameras off.
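+
+The resolution order above can be illustrated with a short sketch. This logic
+lives on the backend and is **not** something this app should build; the code is
+only a reading aid. The cosine-similarity comparison, the `matchThreshold` value,
+and every name below (`VisualSpan`, `resolveName`, ...) are assumptions, not the
+documented SparkControl behavior.
+
+```swift
+import Foundation
+
+struct VisualSpan { let start: Double; let end: Double; let name: String }
+
+/// Seconds shared by two intervals (0 if they do not intersect).
+func overlap(_ aStart: Double, _ aEnd: Double, _ bStart: Double, _ bEnd: Double) -> Double {
+    max(0, min(aEnd, bEnd) - max(aStart, bStart))
+}
+
+/// Cosine similarity between two vectors (assumed voiceprint metric).
+func cosine(_ a: [Float], _ b: [Float]) -> Float {
+    let dot = zip(a, b).reduce(Float(0)) { $0 + $1.0 * $1.1 }
+    let na = a.reduce(Float(0)) { $0 + $1 * $1 }.squareRoot()
+    let nb = b.reduce(Float(0)) { $0 + $1 * $1 }.squareRoot()
+    return (na > 0 && nb > 0) ? dot / (na * nb) : 0
+}
+
+func resolveName(
+    clusterSegments: [(start: Double, end: Double)],
+    clusterVoiceprint: [Float],
+    timeline: [VisualSpan],
+    known: [String: [Float]],
+    unknownIndex: Int,
+    matchThreshold: Float = 0.7   // assumed value, for illustration only
+) -> String {
+    // 1. Visual overlap: sum overlap seconds per timeline name, take the largest.
+    var votes: [String: Double] = [:]
+    for seg in clusterSegments {
+        for span in timeline {
+            votes[span.name, default: 0] += overlap(seg.start, seg.end, span.start, span.end)
+        }
+    }
+    if let best = votes.max(by: { $0.value < $1.value }), best.value > 0 {
+        return best.key
+    }
+    // 2. Voiceprint match: closest known voiceprint, if it clears the threshold.
+    let scored = known.map { (name: $0.key, score: cosine(clusterVoiceprint, $0.value)) }
+    if let best = scored.max(by: { $0.score < $1.score }), best.score >= matchThreshold {
+        return best.name
+    }
+    // 3. Otherwise stay anonymous rather than risk a wrong name.
+    return "Unknown_\(unknownIndex)"
+}
+```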
+
+## 3. Hard scope boundaries
+
+**In scope (this app):**
+- Call detection for Meet / Zoom / Teams / Signal.
+- Dual-track local audio capture + mix-to-mono for the backend.
+- Low-fps window capture → OCR (names) + active-speaker cue detection.
+- Per-app "adapter" modules encapsulating each app's UI quirks.
+- Building the visual timeline; **mic-VAD self-labeling** (the mic track is the
+  user, so hot-mic spans pre-seed the user's name into the timeline).
+- Chunking long calls (~2–3 min) and calling `label-merge` **sequentially**.
+- A local **voiceprint store** (persist + replay named voiceprints).
+- Storing the backend's named transcript segments locally.
+- A minimal menu-bar UI: status, manual start/stop, recent sessions, adapter
+  toggles, backend host/health, output folder.
+
+**Out of scope (owned by the backend):**
+- Transcription, diarization, the name-merge vote, summarization/analysis.
+
+**Explicitly not doing:** saving video; cloud anything. Everything stays on the
+operator's LAN.
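+
+As an illustration of mic-VAD self-labeling, the sketch below turns hot-mic
+frames into timeline entries carrying the user's name. It assumes a simple RMS
+energy gate; the frame length, `rmsThreshold`, the fixed confidence of 1.0, and
+the names `TimelineEntry` and `selfLabelSpans` are all hypothetical, and a real
+VAD would likely add smoothing and hangover.
+
+```swift
+import Foundation
+
+struct TimelineEntry: Equatable {
+    let start: Double       // seconds
+    let end: Double         // seconds
+    let name: String
+    let confidence: Double
+}
+
+/// Emits one entry per run of consecutive frames whose RMS clears the threshold.
+func selfLabelSpans(
+    micSamples: [Float],
+    userName: String,
+    sampleRate: Double = 16_000,
+    frameLength: Int = 320,        // 20 ms at 16 kHz (assumed)
+    rmsThreshold: Float = 0.02     // assumed gate level
+) -> [TimelineEntry] {
+    precondition(frameLength > 0 && sampleRate > 0)
+    let secondsPerFrame = Double(frameLength) / sampleRate
+    let frameCount = micSamples.count / frameLength
+    var entries: [TimelineEntry] = []
+    var spanStart: Int? = nil
+
+    func close(_ from: Int, _ to: Int) {
+        entries.append(TimelineEntry(start: Double(from) * secondsPerFrame,
+                                     end: Double(to) * secondsPerFrame,
+                                     name: userName,
+                                     confidence: 1.0))
+    }
+
+    for frame in 0..<frameCount {
+        let slice = micSamples[(frame * frameLength)..<((frame + 1) * frameLength)]
+        let energy = slice.reduce(Float(0)) { $0 + $1 * $1 }
+        let rms = (energy / Float(frameLength)).squareRoot()
+        let hot = rms >= rmsThreshold
+        if hot, spanStart == nil {
+            spanStart = frame                 // a hot-mic span begins
+        } else if !hot, let s = spanStart {
+            close(s, frame)                   // the span ended on the previous frame
+            spanStart = nil
+        }
+    }
+    if let s = spanStart { close(s, frameCount) }   // span still open at the end
+    return entries
+}
+```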
+
+## 4. Key decisions (now resolved against the real contract)
+
+| Decision | Choice | Why |
+|---|---|---|
+| Language / framework | Native Swift + SwiftUI menu-bar app (`LSUIElement`) | System audio, window capture, Vision all native; one codebase. |
+| Audio capture | ScreenCaptureKit (system audio) + AVFoundation (mic) | No virtual audio device; works with headphones; macOS 13+. |
+| Backend audio format | **Mixed-mono 16 kHz WAV** | Diarizer separates speakers from one mixed stream; 16 kHz is ideal. |
+| Call detection | CoreAudio "mic running somewhere" + known-app / Meet-tab heuristic | Clean live-mic signal + app disambiguation. |
+| Speaker naming | **Backend, via `POST /api/audio/label-merge`** | One call does diarize + overlap-vote naming + transcription. No client merge. |
+| Identity recovery | **Local voiceprint library** replayed as `known_voiceprints` | Recovers camera-off / OCR-missed speakers by voice; compounds over calls. |
+| Self-identity | mic-VAD → pre-seed user's name in timeline | The mic track is the user; gives the backend a strong prior + enrolls the user's voiceprint immediately. |
+| Requests | **Sequential, one audio request in flight** | Parallel audio requests trip a backend GPU race (`503 + Retry-After`). |
+| Long calls | Chunk ~2–3 min, sequential, stitch via names+voiceprints | Diarizer caps at **4 speakers/chunk**; voiceprints + names unify across chunks. |
+| Transport / TLS | `multipart/form-data`, file field `file`; self-signed Start9 cert (skip verify or trust the Root CA); **no auth on LAN** | Matches every other SparkControl endpoint. |
+| Timing | Batch after call (sync endpoints, no polling) | Endpoints are synchronous; no job/poll machinery needed. |
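+
+The "Requests" and "Long calls" rows can be sketched together: plan fixed-length
+chunks, then send them one at a time, waiting out any `503 + Retry-After`. This
+is a sketch under assumptions, not the Phase 5 client: `planChunks`,
+`retryDelay`, `sendSequentially`, the 150 s default, the 5 s fallback, and the
+attempt cap are hypothetical, and only the delta-seconds form of `Retry-After`
+is parsed.
+
+```swift
+import Foundation
+
+/// Splits `duration` seconds into consecutive chunks of at most `chunkLength`.
+func planChunks(duration: Double, chunkLength: Double = 150) -> [Range<Double>] {
+    precondition(chunkLength > 0)
+    return stride(from: 0.0, to: duration, by: chunkLength).map {
+        $0..<min($0 + chunkLength, duration)
+    }
+}
+
+/// Seconds to wait before retrying, or nil if the response is not a 503.
+func retryDelay(statusCode: Int, retryAfter: String?, fallback: TimeInterval = 5) -> TimeInterval? {
+    guard statusCode == 503 else { return nil }
+    if let raw = retryAfter,
+       let seconds = TimeInterval(raw.trimmingCharacters(in: .whitespaces)),
+       seconds >= 0 {
+        return seconds
+    }
+    return fallback   // header missing or not in delta-seconds form
+}
+
+/// Sends chunks strictly one at a time; `send` stands in for the real upload.
+func sendSequentially(
+    chunks: [Range<Double>],
+    maxAttempts: Int = 5,
+    send: (Range<Double>) async throws -> (statusCode: Int, retryAfter: String?)
+) async throws {
+    for chunk in chunks {
+        for _ in 0..<maxAttempts {
+            let result = try await send(chunk)
+            guard let delay = retryDelay(statusCode: result.statusCode,
+                                         retryAfter: result.retryAfter) else { break }
+            try await Task.sleep(nanoseconds: UInt64(delay * 1_000_000_000))
+        }
+    }
+}
+```
+
+Because each `send` is awaited before the next begins, at most one audio request
+is ever in flight.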
+
+### On forking Hyprnote
+The recommendation is unchanged: audio capture is the trivial part (~200 lines of
+native Swift), and the rest (Vision screen-reading) is native too. Build native and
+use Hyprnote's capture/detection code only as a reference. Forking remains
+available as an override.
+
+## 5. Target apps & identifiers
+
+| App | Join via | Bundle ID(s) | Speaking cue / names |
+|---|---|---|---|
+| Zoom | Native | `us.zoom.xos` | Colored tile border; name label in tile. |
+| Microsoft Teams | Native (new) | `com.microsoft.teams2` (new), `com.microsoft.teams` (classic) | Colored ring/border; labeled. |
+| Signal | Native (Electron) | `org.whispersystems.signal-desktop` | Ring around avatar/initials; try Accessibility names first. |
+| Google Meet | Browser tab | `com.google.Chrome`, `com.apple.Safari`, `company.thebrowser.Browser` (Arc)… | Canvas video → Vision for the cue; DOM names → Accessibility/AppleScript; confirm via active-tab URL `meet.google.com`. |
+
+Four required adapters; adding a 5th must be one new file conforming to the
+`AppAdapter` protocol.
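+
+One possible shape for that protocol is sketched below, limited to call matching.
+The actual `AppAdapter` definition is not shown in this document, so the members
+here (`key`, `bundleIDs`, `matches`) are assumptions, and the frame-analysis
+hooks an adapter would also need are omitted. Bundle IDs and the Meet host come
+from the table above.
+
+```swift
+import Foundation
+
+/// Hypothetical adapter surface: identity plus a "does this app match?" check.
+protocol AppAdapter {
+    var key: String { get }                 // e.g. the settings toggle key
+    var bundleIDs: Set<String> { get }
+    func matches(bundleID: String, activeTabURL: URL?) -> Bool
+}
+
+extension AppAdapter {
+    /// Default: native apps match on bundle ID alone.
+    func matches(bundleID: String, activeTabURL: URL?) -> Bool {
+        bundleIDs.contains(bundleID)
+    }
+}
+
+struct ZoomAdapter: AppAdapter {
+    let key = "zoom"
+    let bundleIDs: Set<String> = ["us.zoom.xos"]
+}
+
+struct MeetAdapter: AppAdapter {
+    let key = "meet"
+    let bundleIDs: Set<String> = [
+        "com.google.Chrome", "com.apple.Safari", "company.thebrowser.Browser",
+    ]
+    /// Browser-hosted: also require the active tab to be a Meet URL.
+    func matches(bundleID: String, activeTabURL: URL?) -> Bool {
+        bundleIDs.contains(bundleID) && activeTabURL?.host == "meet.google.com"
+    }
+}
+
+/// Adding a fifth adapter would mean one new file plus one entry here.
+let adapters: [any AppAdapter] = [ZoomAdapter(), MeetAdapter()]
+
+func adapter(forBundleID bundleID: String, activeTabURL: URL?) -> (any AppAdapter)? {
+    adapters.first { $0.matches(bundleID: bundleID, activeTabURL: activeTabURL) }
+}
+```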
+
+## 6. The backend (do not rebuild) — now concrete
+
+SparkControl, on the operator's Start9 LAN, fronting two DGX Sparks:
+- **STT:** NVIDIA Parakeet TDT 0.6B — `POST /v1/audio/transcriptions` (OpenAI-compatible).
+- **Diarization:** NVIDIA Sortformer 4spk-v1 — `POST /api/audio/diarize-chunk`
+  (anonymous clusters + voiceprints) and `POST /api/audio/transcribe-with-speakers`.
+- **Embeddings:** NVIDIA TitaNet (192-dim voiceprints).
+- **★ Primary endpoint for this app:** `POST /api/audio/label-merge` — diarize +
+  name from the visual timeline (+ voiceprint fallback), optionally transcribe,
+  in one synchronous call.
+- Health/discovery: `GET /api/status`, `GET /api/endpoints`, `GET /v1/models`.
+
+Full request/response shapes, curl examples, limits, and error formats are in
+`03_DATA_CONTRACTS.md`.
+
+## 7. Remaining open items (small)
+
+1. **Base URL (RESOLVED).** `https://your-spark-backend.local:62419` (prefer the
+   `.local` hostname over a raw IP address; it survives IP changes). Ship the
+   `.local` host as the default; keep it editable in settings.
+   Service discovery is at `GET /api/endpoints`.
+2. **Send trigger** — assume auto-POST on call end; expose a "hold for review"
+   toggle if the user wants to eyeball the timeline first.
+3. **Retention** — keep the session folder after a successful hand-off, or prune
+   audio and keep only `speakers.json` + voiceprints? Default: keep everything,
+   user-configurable.
+4. **Voiceprint update policy** — overwrite vs running-average a person's stored
+   voiceprint across calls (see `02_ARCHITECTURE.md §2.9`). Start simple
+   (store/refresh latest high-confidence), refine later.
+5. **Signing** — stable identity so macOS doesn't re-prompt for permissions on
+   each rebuild.
@@ -0,0 +1,248 @@
+# Architecture — Ten31 Transcripts
+
+Companion to `01_PROJECT_BRIEF.md`. Module layout, data flow, the per-app adapter
+pattern, the macOS APIs, and the SparkControl integration (now fully specified).
+
+---
+
+## 1. High-level data flow
+
+```
+                    ┌─────────────────────────────────────────┐
+                    │              CallDetector                 │
+                    │  CoreAudio "mic running somewhere"        │
+                    │  + known-app / Meet-tab heuristic         │
+                    └───────────────┬───────────────────────────┘
+                                    │ callStarted(app, window)
+                                    ▼
+        ┌──────────────────────────────────────────────────────────┐
+        │                    SessionController                       │
+        │   owns one Session; shared t0; start/stop; on end package  │
+        └───────┬───────────────────────────┬──────────────────────┘
+                │                            │
+                ▼                            ▼
+   ┌────────────────────────┐   ┌───────────────────────────────────┐
+   │     AudioRecorder       │   │         VisualObserver            │
+   │  SCStream system audio  │   │  SCStream window frames @2–4fps    │
+   │  AVAudioEngine mic       │   │     │ (frames released, never saved)│
+   │  → mic.wav, system.wav   │   │     ▼                              │
+   │  → mixed_mono_16k.wav    │   │  AppAdapter (per app)              │
+   │  + mic VAD → self spans   │   │   OCR names + active-speaker cue   │
+   └────────────┬───────────┘   │   → SpeakerObservation              │
+                │               └──────────────┬────────────────────┘
+                │                              ▼
+                │                  ┌───────────────────────────┐
+                │   self spans ───▶│      TimelineBuilder        │
+                │                  │  debounce/coalesce + merge   │
+                │                  │  mic-VAD self spans          │
+                │                  │  → visual_timeline.json      │
+                │                  └──────────────┬────────────┘
+                │                                 │
+                ▼                                 ▼
+        ┌──────────────────────────────────────────────────────────┐
+        │                       SessionPackager                      │
+        │   mixed_mono_16k.wav + visual_timeline.json + manifest      │
+        │   + chunk plan (if call > ~3 min)                           │
+        └───────────────────────────┬──────────────────────────────┘
+                                     ▼
+                   ┌──────────────────────────────────────┐
+                   │          SparkControlClient            │
+                   │  per chunk, SEQUENTIAL:                 │
+                   │   POST /api/audio/label-merge           │
+                   │     file=chunk.wav                      │
+                   │     timeline=<chunk-local segments>     │
+                   │     known_voiceprints=<from store>      │
+                   │     transcribe=true                     │
+                   │  → named segments + per-speaker prints   │
+                   └───────────────┬───────────┬────────────┘
+                                   │           │
+                       offset+stitch│           │ fingerprints (keyed by name)
+                                   ▼           ▼
+                       ┌────────────────┐  ┌────────────────────┐
+                       │  speakers.json  │  │   VoiceprintStore   │
+                       │ (named, global) │  │  persist/replay     │
+                       └────────────────┘  └────────────────────┘
+```
+
+## 2. Modules
+
+### 2.1 `CallDetector`
+Fire `callStarted(app:window:)` / `callEnded()`.
+- **Mic active system-wide:** CoreAudio `kAudioDevicePropertyDeviceIsRunningSomewhere`
+  on the default input device (listener, not poll).
+- **App present/active:** `NSWorkspace` running/frontmost vs the bundle-ID table.
+- **Meet (browser):** when a browser is frontmost + mic live, read the active-tab
+  URL (AppleScript/Accessibility); confirm `meet.google.com`.
+- **Heuristic:** `mic_running` AND (`known_native_app_active` OR `browser+meet_tab`).
+  Debounce ~2 s before opening a session; end when the mic has been quiet for
+  more than N s and the app/tab leaves the foreground or quits.
+- Output: app id + the call window (`SCWindow`) for the `VisualObserver`.
+
+### 2.2 `AudioRecorder`
+- **System audio:** `SCStream` with `SCStreamConfiguration.capturesAudio = true`
+  (mixer-level; works with headphones; no BlackHole). macOS 13+.
+- **Mic:** `AVAudioEngine` input tap.
+- Outputs: `mic.wav`, `system.wav`, and the backend deliverable
+  **`mixed_mono_16k.wav`** (mic + system summed → mono → 16 kHz PCM WAV).
+- **Shared `t0`** (`CACurrentMediaTime`) stamped once; every audio sample and
+  visual observation is relative to it. Non-negotiable — the merge depends on it.
+- **Mic VAD:** run lightweight energy/VAD on the mic track to emit "the user is
+  speaking" spans. These feed `TimelineBuilder` as pre-seeded **self** segments
+  (high confidence) so the backend names the user even when their own tile isn't
+  read — and so the user's voiceprint enrolls on call one.
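+
+The mic-VAD step above can be sketched as a frame-wise RMS gate. This is only an
+illustration under assumptions: `selfSpans`, `Span`, the 20 ms frame length, and
+the `0.02` threshold are hypothetical names and values, not part of the spec, and
+a production VAD would likely add smoothing and hangover.
+
+```swift
+import Foundation
+
+/// A "user is speaking" span, in seconds relative to the start of the buffer.
+struct Span: Equatable { let start: Double; let end: Double }
+
+/// Energy-based VAD sketch: a frame counts as speech when its RMS >= threshold.
+/// `frameSec` and `threshold` are illustrative defaults, not tuned values.
+func selfSpans(samples: [Float], sampleRate: Double,
+               frameSec: Double = 0.02, threshold: Float = 0.02) -> [Span] {
+    let frameLen = max(1, Int(sampleRate * frameSec))
+    var spans: [Span] = []
+    var openStart: Double? = nil
+    var i = 0
+    while i < samples.count {
+        let end = min(i + frameLen, samples.count)
+        var sum: Float = 0
+        for s in samples[i..<end] { sum += s * s }
+        let rms = (sum / Float(end - i)).squareRoot()
+        let t = Double(i) / sampleRate
+        if rms >= threshold {
+            if openStart == nil { openStart = t }       // speech begins
+        } else if let s = openStart {
+            spans.append(Span(start: s, end: t))        // speech ended at this frame
+            openStart = nil
+        }
+        i = end
+    }
+    // Close a span that is still open at the end of the buffer.
+    if let s = openStart {
+        spans.append(Span(start: s, end: Double(samples.count) / sampleRate))
+    }
+    return spans
+}
+```
+
+The returned spans would still need the session `t0` offset applied before they
+are handed to `TimelineBuilder`.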
+
+### 2.3 `VisualObserver`
+- `SCStream` scoped (via `SCContentFilter`) to the **specific call window**.
+- Throttle to adapter fps (default 3). Hand each frame to the adapter; **release
+  immediately — never persist a frame.**
+- **Window visibility / focus is NOT required.** SCK captures a window's own
+  rendered content even when it's in the background, occluded by other apps, or
+  on another Space. The user can work in other apps during the call and visual
+  capture continues normally. (This is a key reason for window capture over
+  display capture — also more private.)
+- **Capture liveness — the one real failure mode.** Two states stop fresh frames:
+  1. **Minimized to the Dock** — macOS may freeze the window's backing buffer, so
+     SCK delivers stale/duplicate frames. Detect minimization
+     (`SCWindow.isOnScreen == false` / window state) and **pause visual analysis +
+     flag a `visual_gap` for that span** rather than emitting bogus observations.
+  2. **Browser tab switched away (Meet only)** — see §2.4 Meet note.
+  In both cases **audio keeps recording**, and the backend voiceprint fallback
+  still names previously-heard speakers — so a gap only costs naming precision for
+  *new, never-seen* speakers during that exact window. Record gaps in
+  `visual_timeline.json` (a `visual_gaps: [{start,end,reason}]` array) so the
+  cause is auditable; `TimelineBuilder` must not interpolate across a gap.
+
+### 2.4 `AppAdapter` (protocol) + four implementations
+```swift
+protocol AppAdapter {
+    static var bundleIDs: [String] { get }
+    var preferredFPS: Int { get }
+    func analyze(frame: CVPixelBuffer, at t: TimeInterval) -> [SpeakerObservation]
+    func namesFromAccessibility() -> [String]?   // optional
+}
+struct SpeakerObservation {
+    let name: String          // OCR'd / a11y name; "" if unknown
+    let speaking: Bool        // active-speaker cue detected
+    let bbox: CGRect
+    let confidence: Double     // 0..1
+    let t: TimeInterval        // relative to session t0
+}
+```
+Per-adapter cues:
+- **Zoom** (`us.zoom.xos`): colored tile border = active speaker; OCR the tile
+  name label; handle speaker + gallery layouts.
+- **Teams** (`com.microsoft.teams2`, classic `com.microsoft.teams`): colored
+  ring/border; labeled; like Zoom.
+- **Signal** (`org.whispersystems.signal-desktop`): ring around avatar/initials;
+  try `namesFromAccessibility()` first (Electron a11y tree), OCR fallback.
+- **Meet** (browser): **hybrid** — names via Accessibility/AppleScript (DOM text),
+  speaking cue via Vision (canvas/WebGL animated bars / tile highlight), fused by
+  tile position. Most likely to need iteration.
+  - **Tab-switch caveat (Meet-specific):** if Meet is a browser *tab* and the user
+    switches to a different tab **in the same window**, the browser stops rendering
+    the Meet tab → SCK captures a frozen last-frame (a `visual_gap`). Switching to a
+    different *app* is fine; switching tabs is not. Mitigations, in order: (1)
+    detect the active-tab URL leaving `meet.google.com` and flag a `visual_gap`
+    (don't emit stale observations); (2) prefer capturing Meet in a **dedicated
+    browser window / PWA / standalone window** so tab-switching can't blank it
+    — surface this as a one-time setup tip in the UI; (3) names still come from the
+    a11y/DOM tree where available, and audio + voiceprint fallback carry identity
+    through the gap regardless.
+
+Each adapter is **testable offline** against PNG/JPEG frame fixtures.
+
+### 2.5 `TimelineBuilder`
+Turn noisy per-frame observations into clean `(start, end, name, confidence)`
+segments.
+- Group by name; open a segment after K consecutive speaking frames (e.g. 2),
+  close after M quiet frames (e.g. 2) — hysteresis rides out UI-cue lag/flicker.
+- **Allow overlaps** (crosstalk). Do not force one speaker per instant.
+- Merge in the mic-VAD **self** spans (the user) with high confidence.
+- Normalize OCR name variants ("Sarah J" → "Sarah Jones") via a per-session
+  alias table.
+- Emit `visual_timeline.json` (schema in `03_DATA_CONTRACTS.md`). The flat
+  `segments` array maps directly onto the `timeline` field `label-merge` wants.
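+
+The open-after-K / close-after-M hysteresis above might look like the sketch
+below for a single speaker's frames. The `Segment` type and the `segments(for:…)`
+function are hypothetical, and confidence, alias normalization, and the
+`visual_gap` handling are deliberately left out.
+
+```swift
+import Foundation
+
+struct Segment: Equatable { let start: Double; let end: Double; let name: String }
+
+/// Hysteresis sketch for one name: open after `k` consecutive speaking frames,
+/// close after `m` consecutive quiet frames. `frames` must be in time order.
+func segments(for name: String, frames: [(t: Double, speaking: Bool)],
+              openAfter k: Int = 2, closeAfter m: Int = 2) -> [Segment] {
+    var out: [Segment] = []
+    var speakingRun = 0, quietRun = 0
+    var runStart: Double? = nil     // first frame of the current speaking run
+    var openStart: Double? = nil    // start of the currently open segment
+    var lastSpeaking = 0.0          // timestamp of the latest speaking frame
+    for f in frames {
+        if f.speaking {
+            if speakingRun == 0 { runStart = f.t }
+            speakingRun += 1
+            quietRun = 0
+            lastSpeaking = f.t
+            if openStart == nil, speakingRun >= k { openStart = runStart }
+        } else {
+            quietRun += 1
+            speakingRun = 0
+            if let s = openStart, quietRun >= m {
+                out.append(Segment(start: s, end: lastSpeaking, name: name))
+                openStart = nil
+            }
+        }
+    }
+    // Flush a segment still open when the frames run out.
+    if let s = openStart { out.append(Segment(start: s, end: lastSpeaking, name: name)) }
+    return out
+}
+```
+
+Running this once per name keeps overlapping speakers independent, which is what
+lets crosstalk survive into the timeline.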
+
+### 2.6 `SessionPackager`
+Write the session folder and, if the call is longer than ~3 min, produce a
+**chunk plan**: ~2–3 min windows on `mixed_mono_16k.wav`, each with its
+**timeline slice rebased to chunk-local seconds**.
+```
+~/Ten31Transcripts/sessions/2026-06-05T14-03_zoom/
+  mic.wav  system.wav  mixed_mono_16k.wav
+  visual_timeline.json
+  manifest.json
+  (chunks/ produced transiently if chunking)
+  speakers.json            # written after backend hand-off
+```
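+
+A chunk plan of this kind can be computed from the duration alone. The sketch
+below assumes fixed 150 s windows and a 180 s single-request cutoff;
+`ChunkWindow`, `chunkPlan`, and both defaults are illustrative choices within
+the "~2–3 min" guidance, not values fixed by the spec.
+
+```swift
+import Foundation
+
+struct ChunkWindow: Equatable { let index: Int; let start: Double; let end: Double }
+
+/// Sketch: split a call into back-to-back windows on the mixed track.
+/// Calls at or under `singleRequestMaxSec` go out as one request.
+func chunkPlan(durationSec: Double, chunkSec: Double = 150,
+               singleRequestMaxSec: Double = 180) -> [ChunkWindow] {
+    guard durationSec > singleRequestMaxSec else {
+        return [ChunkWindow(index: 0, start: 0, end: durationSec)]
+    }
+    var windows: [ChunkWindow] = []
+    var start = 0.0
+    while start < durationSec {
+        let end = min(start + chunkSec, durationSec)
+        windows.append(ChunkWindow(index: windows.count, start: start, end: end))
+        start = end
+    }
+    return windows
+}
+```
+
+A very short trailing window is possible with this scheme; merging it into the
+previous chunk would be a reasonable refinement.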
+
+### 2.7 `SparkControlClient`
+Deliver to SparkControl. **Primary path = `POST /api/audio/label-merge`** with
+`file`, `timeline`, `known_voiceprints`, `transcribe=true`.
+- **Sequential only** — one audio request in flight (parallel ⇒ `503 + Retry-After`).
+- **Self-signed TLS** — skip verification (`URLSession` delegate trusting the
+  Start9 cert) or trust the Root CA. **No auth on the LAN.**
+- **Per chunk:** call `label-merge` with that chunk's audio + rebased timeline +
+  the **accumulated** voiceprints; offset returned timestamps back to global and
+  append. Names unify across chunks because the same names/voiceprints are passed
+  each time; new voiceprints accumulate into the store.
+- Retry on `503` after `Retry-After`; on hard failure keep the session folder and
+  surface "Resend" in the UI.
+- Limits to respect: **200 MB/request** (`413`), transcription timeout ~300 s,
+  diarization ~600 s. Chunking keeps requests well under these.
+- See `03_DATA_CONTRACTS.md §4` for exact fields and a real response.
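+
+The status handling in the bullets above could be reduced to a small pure
+function that the sequential send loop consults after each response. This is a
+sketch: `SendOutcome`, `classify`, and the 5 s fallback delay are assumptions,
+and it only handles a `Retry-After` value given in seconds.
+
+```swift
+import Foundation
+
+enum SendOutcome: Equatable {
+    case success
+    case retry(after: TimeInterval)
+    case fail(String)
+}
+
+/// Sketch: map an HTTP status (plus optional headers/body detail) to an action.
+/// The fallback delay of 5 s is an illustrative default, not a backend value.
+func classify(status: Int, retryAfter: String?, detail: String?) -> SendOutcome {
+    switch status {
+    case 200..<300:
+        return .success
+    case 503:
+        let trimmed = retryAfter?.trimmingCharacters(in: .whitespaces)
+        let seconds = trimmed.flatMap { TimeInterval($0) } ?? 5
+        return .retry(after: max(0, seconds))
+    case 413:
+        return .fail("request too large (limit 200 MB)")
+    default:
+        return .fail(detail ?? "HTTP \(status)")
+    }
+}
+```
+
+Because only one audio request may be in flight, the caller would await each
+chunk's outcome (sleeping on `.retry`) before sending the next chunk.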
+
+### 2.8 Result assembly → `speakers.json`
+Concatenate the per-chunk `label-merge` results into one global, named,
+speaker-attributed transcript (timestamps offset to session time). This is the
+seam to the user's existing summarizer. The app does not analyze past this.
+
+### 2.9 `VoiceprintStore`
+Local persistence of named voiceprints — the compounding-identity layer.
+- File: `~/Ten31Transcripts/voiceprints.json` →
+  `{ "<name>": { "vector": [192 floats], "updated": <iso>, "calls": <int> } }`.
+- **On send:** load all entries → pass as `known_voiceprints` to `label-merge`.
+- **On response:** for each speaker resolved by **visual** (or a high-similarity
+  **voiceprint** match), store/refresh that name's vector. **Never** store
+  `Unknown_N`.
+- **Update policy (open, start simple):** overwrite with the latest
+  high-confidence vector, or keep a running mean per name. v1 = store latest with
+  `overlap_confidence ≥ ~0.8`; refine to averaging later (`01 §7.4`).
+- Editable/clearable from the menu-bar UI (rename, delete a person, reset).
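+
+The v1 "store latest" policy could be sketched as below. `ResolvedSpeaker` and
+`updateVoiceprints` are hypothetical names, and applying the same `0.8` cutoff
+to `match_similarity` for voiceprint matches is an assumption; the spec only
+states the threshold for `overlap_confidence`.
+
+```swift
+import Foundation
+
+/// One entry of the `speakers` array in a `label-merge` response (subset of fields).
+struct ResolvedSpeaker {
+    let name: String
+    let source: String                // "visual" | "voiceprint" | "unmatched"
+    let overlapConfidence: Double?
+    let matchSimilarity: Double?
+    let fingerprint: [Float]
+}
+
+/// Sketch of "store latest high-confidence vector"; never stores `Unknown_N`.
+func updateVoiceprints(_ store: inout [String: [Float]],
+                       with speakers: [ResolvedSpeaker],
+                       minConfidence: Double = 0.8) {
+    for s in speakers {
+        guard s.source != "unmatched", !s.name.hasPrefix("Unknown_") else { continue }
+        let score = (s.source == "visual") ? s.overlapConfidence : s.matchSimilarity
+        guard let value = score, value >= minConfidence else { continue }
+        store[s.name] = s.fingerprint   // overwrite with the latest vector
+    }
+}
+```
+
+Switching to a running mean later would only change the final assignment.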
+
+### 2.10 `MenuBarUI` (SwiftUI, `LSUIElement`)
+Status (idle / detected / recording / uploading), manual start/stop, recent
+sessions (open folder, resend, delete), adapter toggles, **backend host + a
+health check** (`GET /api/status`), output folder, voiceprint manager, and a
+permissions checklist (Screen Recording, Microphone, Accessibility).
+
+## 3. macOS frameworks & permissions
+
+| Need | Framework | Permission |
+|---|---|---|
+| System audio + window frames | ScreenCaptureKit | Screen Recording |
+| Microphone | AVFoundation / CoreAudio | Microphone |
+| Meet/Signal names, tab URL | Accessibility (AXUIElement) / AppleScript | Accessibility + Automation |
+| OCR + cue analysis | Vision (`VNRecognizeTextRequest`) | none |
+| App/tab detection | AppKit (`NSWorkspace`) | none |
+
+Stable signing identity avoids permission re-prompts on rebuild.
+
+## 4. Performance
+Window-scoped capture + 3 fps + Vision-on-Neural-Engine is light; audio is cheap;
+frames are released immediately so memory stays flat. The app idles near-zero
+until a call starts. Backend requests are sequential and chunked, so they never
+saturate the GPU.
+
+## 5. The merge — now done by the backend
+The app no longer implements the overlap vote. `label-merge` resolves each
+anonymous cluster in order:
+1. **visual** — timeline name with the most temporal overlap (`source: "visual"`,
+   `overlap_confidence`);
+2. **voiceprint** — closest `known_voiceprints` match above `voiceprint_threshold`
+   (`source: "voiceprint"`, `match_similarity`);
+3. **`Unknown_N`** (`source: "unmatched"`) — never guessed/mislabeled.
+The app's contribution is a good timeline (incl. mic-VAD self spans) and an
+ever-growing voiceprint library. `min_overlap` and `voiceprint_threshold` are
+tunable request params if precision needs adjusting.
@@ -0,0 +1,214 @@
+# Data Contracts — Ten31 Transcripts
+
+Companion to docs 01/02. Defines the files the app produces/stores and the **real
+SparkControl contract** (source of truth: `AUDIO_API.md`). The `label-merge`
+endpoint is the app's primary integration point.
+
+---
+
+## 1. `visual_timeline.json` (the app's primary output)
+
+Rich, app-native record of who-was-speaking-when, inferred from the screen.
+Times are **seconds relative to session `t0`** (the shared audio clock).
+
+```jsonc
+{
+  "schema_version": "1.1",
+  "session_id": "2026-06-05T14-03_zoom",
+  "app": "zoom",                       // zoom | teams | signal | meet
+  "adapter_version": "zoom-0.3.1",
+  "t0_unix": 1749135780.123,
+  "duration_sec": 2841.7,
+  "fps_sampled": 3,
+  "self_name": "Grant",                // user's name; segments from mic VAD
+  "participants": [
+    { "name": "Grant", "is_self": true },
+    { "name": "Sarah Jones", "aliases": ["Sarah J"] },
+    { "name": "Dmitri" }
+  ],
+  "segments": [
+    { "start": 0.0,  "end": 4.5, "name": "Grant",       "confidence": 0.97, "source": "mic_vad" },
+    { "start": 4.8,  "end": 9.3, "name": "Sarah Jones", "confidence": 0.82, "source": "vision" }
+  ],
+  "visual_gaps": [
+    { "start": 120.4, "end": 138.9, "reason": "minimized" }   // reason: minimized | tab_switched
+  ]
+}
+```
+- Segments **may overlap** (crosstalk) — do not flatten.
+- `source ∈ {vision, accessibility, fused, mic_vad}`.
+- The user (mic-VAD `is_self`) is pre-seeded so the backend names them and
+  enrolls their voiceprint on call one.
+
+### 1.1 Mapping to the `label-merge` `timeline` field
+`label-merge` wants a **flat JSON array** of `{start, end, name, confidence}`
+(seconds). Build it by taking `segments` and dropping `source`:
+```json
+[{"start":0.0,"end":4.5,"name":"Grant","confidence":0.97},
+ {"start":4.8,"end":9.3,"name":"Sarah Jones","confidence":0.82}]
+```
+When chunking, **slice to the chunk window and rebase to chunk-local seconds**
+(subtract chunk start) before sending.
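+
+Slicing and rebasing can be sketched as a clip-then-shift over the segments.
+`TimelineEntry` and `slice` are hypothetical names; the sketch assumes a segment
+straddling a chunk boundary is clipped to the window rather than dropped.
+
+```swift
+import Foundation
+
+/// One element of the flat `timeline` array (seconds).
+struct TimelineEntry: Equatable, Codable {
+    let start: Double
+    let end: Double
+    let name: String
+    let confidence: Double
+}
+
+/// Sketch: keep the part of each segment inside [chunkStart, chunkEnd) and
+/// rebase it to chunk-local seconds by subtracting `chunkStart`.
+func slice(_ segments: [TimelineEntry],
+           chunkStart: Double, chunkEnd: Double) -> [TimelineEntry] {
+    return segments.compactMap { seg -> TimelineEntry? in
+        let s = max(seg.start, chunkStart)
+        let e = min(seg.end, chunkEnd)
+        guard e > s else { return nil }     // no overlap with this chunk
+        return TimelineEntry(start: s - chunkStart, end: e - chunkStart,
+                             name: seg.name, confidence: seg.confidence)
+    }
+}
+```
+
+Since `TimelineEntry` is `Codable`, `JSONEncoder().encode(...)` on the result
+yields the flat array form for the `timeline` field.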
+
+## 2. `manifest.json`
+```jsonc
+{
+  "session_id": "2026-06-05T14-03_zoom",
+  "app": "zoom",
+  "t0_unix": 1749135780.123,
+  "duration_sec": 2841.7,
+  "audio": {
+    "mixed":  { "file": "mixed_mono_16k.wav", "sr": 16000, "channels": 1, "sha256": "..." },
+    "mic":    { "file": "mic.wav",    "sr": 16000, "channels": 1, "sha256": "..." },
+    "system": { "file": "system.wav", "sr": 16000, "channels": 1, "sha256": "..." }
+  },
+  "chunking": { "enabled": true, "chunk_sec": 150, "chunks": 19 },
+  "visual_timeline": "visual_timeline.json",
+  "backend_result": "speakers.json",
+  "adapter_versions": { "zoom": "0.3.1" },
+  "app_version": "0.1.0"
+}
+```
+(`mixed_mono_16k.wav` is the one the backend gets; the separate tracks are kept
+locally — the mic track is the user's known identity / VAD source.)
+
+---
+
+## 3. SparkControl — connection (real)
+
+- **Base URL (confirmed):** `https://your-spark-backend.local:62419`. Prefer the
+  `.local` hostname over a raw IP address as the default, since it survives IP
+  changes. Service-discovery JSON is at
+  `GET /api/endpoints` (returns current vLLM / Parakeet / Kokoro URLs). All audio
+  endpoints in §4–§5 hang off this base. Still **make it a setting** so the host
+  can change, but ship `https://your-spark-backend.local:62419` as the default.
+- **TLS:** Start9 self-signed Root CA. Either skip verification (`URLSession`
+  delegate trusting the cert; curl `-k`; `rejectUnauthorized:false`) **or** install
+  the Start9 Root CA into the trust store.
+- **Auth:** **none on the LAN.** No token/key today.
+- **Limits:** **200 MB/request** (`413` over); timeouts ~300 s (transcription),
+  ~600 s (diarization). **Send audio requests SEQUENTIALLY** — concurrent audio
+  trips a GPU FFT race → `503 + Retry-After`.
+- **Transport:** `multipart/form-data`, audio file field name **`file`** (bytes,
+  not base64/path).
+- **All endpoints are synchronous** (no job IDs / polling).
+- **Errors:** JSON `{"detail": "..."}`; `400` malformed, `413` too large, `503 +
+  Retry-After` transient (retry after the interval).
+- **Health/discovery:** `GET /api/status`, `GET /api/endpoints`, `GET /v1/models`.
+
+---
+
+## 4. ★ `POST /api/audio/label-merge` — the app's primary call
+
+Diarize + name clusters from the visual timeline (majority temporal overlap),
+with voiceprint fallback, optionally transcribed. Synchronous. **Stateless** —
+the app owns the timeline and the voiceprint library.
+
+**Multipart fields:**
+| field | required | notes |
+|---|---|---|
+| `file` | **yes** | mixed-mono WAV (the chunk, when chunking) |
+| `timeline` | **yes** | flat JSON array `[{"start","end","name","confidence"}]`, chunk-local seconds (§1.1) |
+| `known_voiceprints` | no | JSON `{"<name>":[192 floats], ...}` from `VoiceprintStore` |
+| `transcribe` | no | `"true"` to also return per-segment text (default false) |
+| `min_overlap` | no | min fraction of a cluster's time overlapping the winning name (default `0.0`) |
+| `voiceprint_threshold` | no | cosine sim to accept a voiceprint match (default `0.5`) |
+
+```bash
+curl -k -X POST https://<host>/api/audio/label-merge \
+  -F "file=@chunk_003.wav" \
+  -F 'timeline=[{"start":0,"end":4.5,"name":"Grant","confidence":0.97},{"start":4.8,"end":9.3,"name":"Sarah Jones","confidence":0.82}]' \
+  -F 'known_voiceprints={"Grant":[/*192*/],"Sarah Jones":[/*192*/]}' \
+  -F "transcribe=true"
+```
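+
+For the Swift client, the same request body could be assembled by hand roughly
+as follows. `multipartBody` is a hypothetical helper, and the `audio/wav`
+content type is an assumption; the field names come from the table above.
+
+```swift
+import Foundation
+
+/// Sketch: build a `multipart/form-data` body with text fields plus one file part.
+/// The caller must also set the request header
+/// `Content-Type: multipart/form-data; boundary=<boundary>`.
+func multipartBody(boundary: String,
+                   fields: [(name: String, value: String)],
+                   fileField: String, fileName: String, fileData: Data) -> Data {
+    var body = Data()
+    func append(_ s: String) { body.append(Data(s.utf8)) }
+    for f in fields {
+        append("--\(boundary)\r\n")
+        append("Content-Disposition: form-data; name=\"\(f.name)\"\r\n\r\n")
+        append("\(f.value)\r\n")
+    }
+    append("--\(boundary)\r\n")
+    append("Content-Disposition: form-data; name=\"\(fileField)\"; filename=\"\(fileName)\"\r\n")
+    append("Content-Type: audio/wav\r\n\r\n")
+    body.append(fileData)                   // raw bytes, not base64
+    append("\r\n--\(boundary)--\r\n")
+    return body
+}
+```
+
+A typical call would pass `timeline`, `known_voiceprints`, and `transcribe` as
+`fields` and the chunk WAV as the `file` part.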
+
+**Response (`transcribe=true`):**
+```jsonc
+{
+  "duration": 9.259,
+  "speakers": [
+    { "cluster": "Speaker_0", "name": "Grant",       "source": "visual",
+      "overlap_confidence": 0.9866, "fingerprint": [/*192 floats*/] },
+    { "cluster": "Speaker_1", "name": "Sarah Jones", "source": "voiceprint",
+      "match_similarity": 0.71,     "fingerprint": [/*192 floats*/] }
+  ],
+  "segments": [
+    { "start_ms": 0,    "end_ms": 4480, "speaker": "Grant",
+      "text": "Good morning everyone. I think the energy thesis is strong this quarter." },
+    { "start_ms": 4800, "end_ms": 9040, "speaker": "Sarah Jones",
+      "text": "I agree, but I am worried about the lockup terms and the fee load this time." }
+  ],
+  "fingerprints": { "Grant": [/*192*/], "Sarah Jones": [/*192*/] },
+  "models": { "diarization": "nvidia/diar_sortformer_4spk-v1",
+              "embedding": "nvidia/speakerverification_en_titanet_large" }
+}
+```
+- `transcribe=false` → segments are `{start_s, end_s, speaker, confidence}` (no text).
+- **Resolution order per cluster:** `visual` → `voiceprint` (with
+  `match_similarity`) → `Unknown_N` (`source:"unmatched"`, never mislabeled).
+- **Persist `fingerprints`** keyed by name into `VoiceprintStore` (skip `Unknown_N`).
+- **Diarizer caps at 4 speakers/chunk** and takes **no `num_speakers` hint** — chunk
+  for larger/longer calls; names + voiceprints unify speakers across chunks.
+
+---
+
+## 5. Other SparkControl endpoints (reference / fallback)
+
+### 5.1 `POST /v1/audio/transcriptions` (OpenAI-compatible, sync)
+Plain STT. Fields: `file` (req), `model` (default `parakeet-tdt-0.6b-v3`),
+`response_format` (`json` | `verbose_json` | `text`), `language`, `temperature`,
+`prompt`. `verbose_json` returns word- **and** segment-level timestamps + full
+`text`. Use only if you ever want transcript without speaker labels.
+
+### 5.2 `POST /api/audio/diarize-chunk` (sync)
+Field: `file`. Returns anonymous clusters `{start_s, end_s, speaker, confidence}`,
+`speakers_detected`, and a 192-dim `fingerprints` map per local speaker. Use if
+you ever want to do the merge client-side instead of via `label-merge` (not the
+default path).
+
+### 5.3 `POST /api/audio/transcribe-with-speakers` (sync)
+Field: `file`. ASR + diarization merged into anonymous speaker-attributed blocks
+`{start_ms, end_ms, speaker, text}`. (Anonymous — no naming. `label-merge` is the
+named version.)
+
+---
+
+## 6. `speakers.json` — final stored output (target end state)
+
+Per-chunk `label-merge` results concatenated, **timestamps offset back to session
+(global) seconds**, names unified across chunks. This is the hand-off to the
+user's downstream summarizer; the app stops here.
+
+```jsonc
+{
+  "session_id": "2026-06-05T14-03_zoom",
+  "app": "zoom",
+  "duration_sec": 2841.7,
+  "speakers": [
+    { "name": "Grant",       "source": "visual",     "overlap_confidence": 0.99 },
+    { "name": "Sarah Jones", "source": "voiceprint", "match_similarity": 0.71 },
+    { "name": "Unknown_0",   "source": "unmatched" }
+  ],
+  "segments": [
+    { "start": 0.0,   "end": 4.48,  "speaker": "Grant",       "text": "..." },
+    { "start": 4.80,  "end": 9.04,  "speaker": "Sarah Jones", "text": "..." },
+    { "start": 152.3, "end": 158.1, "speaker": "Unknown_0",   "text": "..." }
+  ],
+  "models": { "diarization": "nvidia/diar_sortformer_4spk-v1",
+              "embedding": "nvidia/speakerverification_en_titanet_large",
+              "transcription": "parakeet-tdt-0.6b-v3" }
+}
+```
+(Convert backend `start_ms`/`end_ms` → seconds; add the chunk offset.)
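+
+That conversion might be sketched as below. `BackendSegment`, `GlobalSegment`,
+and `stitch` are hypothetical names; the sketch assumes each chunk's offset is
+its window start on `mixed_mono_16k.wav`.
+
+```swift
+import Foundation
+
+/// A segment as returned by `label-merge` with `transcribe=true` (chunk-local ms).
+struct BackendSegment { let startMs: Int; let endMs: Int; let speaker: String; let text: String }
+
+/// A segment as stored in `speakers.json` (session-global seconds).
+struct GlobalSegment: Equatable { let start: Double; let end: Double; let speaker: String; let text: String }
+
+/// Sketch: ms → seconds, add the chunk offset, then order by start time.
+func stitch(_ chunks: [(offsetSec: Double, segments: [BackendSegment])]) -> [GlobalSegment] {
+    let all = chunks.flatMap { chunk in
+        chunk.segments.map { seg in
+            GlobalSegment(start: chunk.offsetSec + Double(seg.startMs) / 1000,
+                          end: chunk.offsetSec + Double(seg.endMs) / 1000,
+                          speaker: seg.speaker, text: seg.text)
+        }
+    }
+    return all.sorted { $0.start < $1.start }
+}
+```
+
+With a chunk starting at 150 s, a backend segment at 2300–8100 ms lands at
+roughly 152.3–158.1 s, matching the `Unknown_0` row in the example above.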
+
+## 7. `voiceprints.json` — local voiceprint library
+```jsonc
+{
+  "Grant":       { "vector": [/*192 floats*/], "updated": "2026-06-05T14:51Z", "calls": 12 },
+  "Sarah Jones": { "vector": [/*192 floats*/], "updated": "2026-06-02T09:10Z", "calls": 3 }
+}
+```
+Loaded → `known_voiceprints` on every `label-merge` call. Updated from response
+`fingerprints` for `visual`/high-confidence `voiceprint` speakers only. Never
+stores `Unknown_N`. Update policy (`02 §2.9`): start = store latest with
+`overlap_confidence ≥ ~0.8`; consider per-name running mean later.
@@ -0,0 +1,123 @@
+# Build Plan — Ten31 Transcripts
+
+Companion to docs 01–03. Phased plan for the Claude Code session, each phase with
+a demoable milestone. Build in order; the risky/novel work (visual adapters) is
+isolated for independent tuning. The SparkControl contract is now known
+(`03_DATA_CONTRACTS.md`), so Phase 5 wires the real endpoints.
+
+---
+
+## Ground rules
+- **Native Swift + SwiftUI**, menu-bar app (`LSUIElement = true`). macOS 13.0+
+  (ScreenCaptureKit system audio).
+- **Never write video frames to disk.** Process in-memory, release immediately;
+  assert this in `VisualObserver` review.
+- **One shared monotonic `t0`** for audio + visual timestamps — wire first.
+- **Backend deliverable = `mixed_mono_16k.wav`.** Keep `mic.wav`/`system.wav`
+  locally (mic = the user, and the VAD source for self-labeling).
+- **SparkControl calls are SEQUENTIAL** (one audio request in flight) over
+  self-signed TLS, **no auth**, primary endpoint `POST /api/audio/label-merge`.
+- Every adapter must be testable offline against image fixtures.
+
+---
+
+## Phase 0 — Scaffold, permissions, backend ping (milestone: launches + green)
+- Menu-bar shell; permissions onboarding (Screen Recording, Microphone,
+  Accessibility) with status + deep links.
+- Settings: output folder, adapter toggles, **SparkControl base URL** + TLS-skip
+  toggle.
+- `GET /api/status` health check shown in the menu.
+- **Milestone:** app sits in the menu bar; permissions green; backend reachable.
+
+## Phase 1 — Audio capture + mix (milestone: clean mixed-mono WAV)
+- `AudioRecorder`: SCK system audio → `system.wav`; AVAudioEngine mic → `mic.wav`;
+  shared `t0`; produce **`mixed_mono_16k.wav`**.
+- Mic VAD → user "self" speaking spans (held for the timeline).
+- Manual start/stop for now.
+- **Milestone:** record any call manually → tracks line up, mixed-mono plays back
+  clean, mic-VAD spans look right.
+
+## Phase 2 — Call detection (milestone: hands-free start/stop)
+- `CallDetector`: CoreAudio mic-running listener + `NSWorkspace` app/frontmost +
+  Meet active-tab-URL check; debounce.
+- Wire detector → `SessionController` → `AudioRecorder`.
+- **Milestone:** joining a call in any of the four apps auto-starts; leaving
+  auto-stops; a quick mic test does not false-trigger.
+
+## Phase 3 — VisualObserver + Zoom adapter (milestone: visual_timeline.json)
+- `VisualObserver`: window-scoped `SCStream` @3 fps → frames → adapter → release.
+  Background/occluded windows capture fine; detect **minimized** (and, for Meet,
+  **tab-switched**) → pause analysis + record a `visual_gap`, never emit stale
+  frames. Audio keeps recording through gaps.
+- `AppAdapter` protocol + `ZoomAdapter` (tile-border cue + OCR names; speaker &
+  gallery layouts).
+- `TimelineBuilder`: hysteresis, overlap-allowed, merge mic-VAD self spans, alias
+  normalization → `visual_timeline.json` (+ the flat `timeline` array form).
+- **Offline fixtures:** ~30 sample Zoom frames; unit-test/tune thresholds without
+  a live call.
+- **Milestone:** a real Zoom call yields a `visual_timeline.json` whose segments
+  visibly match who spoke (incl. the user from mic VAD).
+
+## Phase 4 — Remaining adapters (milestone: all four produce timelines)
+- `TeamsAdapter`, `SignalAdapter` (a11y names first), `MeetAdapter` (hybrid: a11y
+  names + Vision cue). Fixtures per app; per-adapter fps/threshold config.
+- **Milestone:** each app yields a sensible timeline; adapters independently
+  toggleable.
+
+## Phase 5 — SparkControl hand-off via label-merge (milestone: end-to-end named transcript)
+- `SessionPackager`: write session folder; if call > ~3 min, build the chunk plan
+  (~2–3 min windows) with **timeline sliced + rebased to chunk-local seconds**.
+- `SparkControlClient`:
+  - Self-signed TLS handling; multipart `file`; **sequential** requests.
+  - Per chunk → `POST /api/audio/label-merge` with `file`, chunk `timeline`,
+    `known_voiceprints` (from store), `transcribe=true`.
+  - Handle `503 + Retry-After` (wait, retry), `413`, `{"detail":...}` errors.
+  - Offset returned `start_ms/end_ms` back to global seconds; concatenate.
+- Assemble `speakers.json` (named, global, speaker-attributed transcript).
+- Failed sessions stay on disk with a "Resend" affordance.
+- **Milestone:** finish a call → `mixed_mono_16k.wav` + timeline delivered → a
+  `speakers.json` with real names + accurate boundaries lands locally. Test a
+  call longer than 3 min to exercise chunking + cross-chunk name unification.
+
+## Phase 6 — Voiceprint library (milestone: identity that compounds)
+- `VoiceprintStore` (`voiceprints.json`): load → `known_voiceprints`; on response,
+  store/refresh `fingerprint` for `visual`/high-confidence speakers; skip
+  `Unknown_N`.
+- Verify the three resolution paths against the backend: (a) **visual** name wins;
+  (b) **voiceprint** recovery for a camera-off / OCR-missed speaker; (c)
+  **`Unknown_N`** when neither matches.
+- Voiceprint manager in the UI (rename, delete, reset).
+- **Milestone:** a regular contact gets named on a later call **with their camera
+  off**, purely from a stored voiceprint.
+
+## Phase 7 — Polish (milestone: daily-driver quality)
+- Recent-sessions list (open/resend/delete); alias-table editor.
+- Optional on-screen "recording" indicator toggle.
+- Adapter health: warn when detection confidence drops sharply (catches UI
+  redesigns early); keep fixtures for quick re-tuning.
+- Login-item / launch-at-login.
+
+---
+
+## Risk register
+1. **UI fragility (highest ongoing cost):** apps redesign speaking cues / name
+   placement. Isolate all pixel/color/label logic in adapters; Phase-7
+   confidence-drop warning + fixtures keep re-tuning fast.
+2. **Active-speaker lag / crosstalk:** UI cue trails speech and flickers. Mitigated
+   by treating visual as a *prior* (the backend votes), hysteresis, and the
+   voiceprint fallback.
+3. **Meet-in-browser:** canvas video + DOM names; the hybrid adapter needs the most
+   iteration — budget extra time.
+4. **4-speaker-per-chunk cap (Sortformer):** within a single chunk, more than 4
+   distinct speakers can't all be separated. Chunking + voiceprints + the visual
+   timeline mitigate across chunks; for rare large calls, note it as a known limit.
+5. **Sequential GPU constraint:** never fire parallel audio requests; queue them.
+6. **Permission re-prompts on rebuild:** stable signing identity.
+
+## Definition of done (v1)
+Join a call in any of the four apps → app auto-records dual-track audio, mixes to
+mono, and builds a visual speaker timeline (incl. mic-VAD self spans) → on
+hang-up the audio + timeline go to `label-merge` (chunked + sequential as needed)
+→ a `speakers.json` with **real names and accurate boundaries** is produced and
+the **voiceprint library is updated**. Throughout, **no video is ever written to
+disk** and **no transcription/diarization/merge logic lives in this app.**
@@ -0,0 +1,49 @@
+name: Ten31Transcripts
+
+options:
+  bundleIdPrefix: xyz.ten31
+  deploymentTarget:
+    macOS: "13.0"
+  createIntermediateGroups: true
+  groupSortPosition: top
+
+settings:
+  base:
+    MARKETING_VERSION: "0.1.0"
+    CURRENT_PROJECT_VERSION: "1"
+    SWIFT_VERSION: "5.0"
+    CODE_SIGN_STYLE: Automatic
+    # Leave the team empty; pick your free personal team in Xcode's
+    # Signing & Capabilities tab on first open (see README).
+    DEVELOPMENT_TEAM: ""
+
+targets:
+  Ten31Transcripts:
+    type: application
+    platform: macOS
+    sources:
+      - path: Ten31Transcripts
+    settings:
+      base:
+        PRODUCT_NAME: Ten31Transcripts
+        PRODUCT_BUNDLE_IDENTIFIER: xyz.ten31.transcripts
+        INFOPLIST_FILE: Ten31Transcripts/Support/Info.plist
+        CODE_SIGN_ENTITLEMENTS: Ten31Transcripts/Support/Ten31Transcripts.entitlements
+        GENERATE_INFOPLIST_FILE: NO
+        ENABLE_HARDENED_RUNTIME: NO
+        MACOSX_DEPLOYMENT_TARGET: "13.0"
+    scheme:
+      testTargets:
+        - Ten31TranscriptsTests
+
+  Ten31TranscriptsTests:
+    type: bundle.unit-test
+    platform: macOS
+    sources:
+      - path: Ten31TranscriptsTests
+    settings:
+      base:
+        PRODUCT_BUNDLE_IDENTIFIER: xyz.ten31.transcripts.tests
+        GENERATE_INFOPLIST_FILE: YES
+    dependencies:
+      - target: Ten31Transcripts