Phase 0: menu-bar scaffold, permissions, backend health check
Native SwiftUI menu-bar app (LSUIElement, macOS 13+), generated from project.yml via XcodeGen. Includes: - PermissionsManager (Microphone / Screen Recording / Accessibility) + UI - SparkControlHealth: GET /api/status over self-signed TLS (InsecureTrustDelegate) - AppSettings persistence (host, TLS-skip, output folder, adapter toggles) - Menu-bar panel + Settings, app sandbox & hardened runtime off (LAN tool)
This commit is contained in:
+16
@@ -0,0 +1,16 @@
|
|||||||
|
# macOS
|
||||||
|
.DS_Store
|
||||||
|
|
||||||
|
# Xcode project is generated from project.yml by XcodeGen.
|
||||||
|
# Regenerate any time with: xcodegen generate
|
||||||
|
*.xcodeproj
|
||||||
|
*.xcworkspace
|
||||||
|
xcuserdata/
|
||||||
|
DerivedData/
|
||||||
|
build/
|
||||||
|
|
||||||
|
# Swift / SwiftPM
|
||||||
|
.build/
|
||||||
|
|
||||||
|
# App output (never commit recordings or transcripts)
|
||||||
|
/Ten31Transcripts-output/
|
||||||
@@ -0,0 +1,64 @@
|
|||||||
|
# Ten31 Transcripts
|
||||||
|
|
||||||
|
Native macOS menu-bar app that auto-detects conference calls, records local audio,
|
||||||
|
builds a visual-derived speaker timeline, and hands audio + timeline to the
|
||||||
|
SparkControl backend for naming/transcription. See `docs/` for the full spec.
|
||||||
|
|
||||||
|
This repo is at **Phase 0** (scaffold, permissions, backend health check).
|
||||||
|
|
||||||
|
## One-time setup
|
||||||
|
|
||||||
|
1. **Install Xcode** from the Mac App Store (free; ~40 GB). Open it once and
|
||||||
|
accept the license prompt.
|
||||||
|
2. **Install XcodeGen** (generates the Xcode project from `project.yml`):
|
||||||
|
```sh
|
||||||
|
brew install xcodegen
|
||||||
|
```
|
||||||
|
3. **Generate the project:**
|
||||||
|
```sh
|
||||||
|
xcodegen generate
|
||||||
|
```
|
||||||
|
This creates `Ten31Transcripts.xcodeproj` (git-ignored — regenerate any time).
|
||||||
|
4. **Open it:**
|
||||||
|
```sh
|
||||||
|
open Ten31Transcripts.xcodeproj
|
||||||
|
```
|
||||||
|
5. In Xcode, select the **Ten31Transcripts** target → **Signing & Capabilities**:
|
||||||
|
- Check **Automatically manage signing**.
|
||||||
|
- For **Team**, pick your personal team (sign in with your Apple ID — free; no
|
||||||
|
paid developer account needed). A stable team keeps macOS from re-asking for
|
||||||
|
permissions on every rebuild.
|
||||||
|
6. Press **Run** (⌘R).
|
||||||
|
|
||||||
|
## What Phase 0 does
|
||||||
|
|
||||||
|
- Launches as a menu-bar-only app (no Dock icon).
|
||||||
|
- Menu panel shows live status for the three permissions it needs — **Microphone**,
|
||||||
|
**Screen Recording**, **Accessibility** — with Grant / Open Settings buttons.
|
||||||
|
- Shows a **backend health check** (`GET /api/status`) against the configured host.
|
||||||
|
- **Settings:** backend base URL, skip-TLS toggle (on by default for the
|
||||||
|
self-signed cert), output folder, and adapter toggles (inert this phase).
|
||||||
|
|
||||||
|
No audio capture, call detection, screen reading, or backend hand-off yet — those
|
||||||
|
arrive in Phases 1–6 (`docs/04_BUILD_PLAN.md`).
|
||||||
|
|
||||||
|
## Project layout
|
||||||
|
|
||||||
|
```
|
||||||
|
project.yml # XcodeGen recipe → generates the .xcodeproj
|
||||||
|
Ten31Transcripts/
|
||||||
|
App/ Ten31TranscriptsApp.swift, AppDelegate.swift
|
||||||
|
UI/ MenuBarView, SettingsView, PermissionRow
|
||||||
|
Permissions/PermissionsManager.swift
|
||||||
|
Backend/ SparkControlHealth.swift, InsecureTrustDelegate.swift
|
||||||
|
Settings/ AppSettings.swift
|
||||||
|
Support/ Info.plist, Ten31Transcripts.entitlements
|
||||||
|
Ten31TranscriptsTests/ # placeholder; real tests land in Phase 3
|
||||||
|
```
|
||||||
|
|
||||||
|
## Notes
|
||||||
|
|
||||||
|
- **App Sandbox is off** and **Hardened Runtime is off** — this is a personal,
|
||||||
|
LAN-only tool that must observe other apps. Revisit only if distributing.
|
||||||
|
- The default backend host is `https://your-spark-backend.local:62419` (editable in
|
||||||
|
Settings).
|
||||||
@@ -0,0 +1,10 @@
|
|||||||
|
import AppKit
|
||||||
|
|
||||||
|
final class AppDelegate: NSObject, NSApplicationDelegate {
|
||||||
|
func applicationDidFinishLaunching(_ notification: Notification) {
|
||||||
|
// Run as a menu-bar accessory (no Dock icon, no main window).
|
||||||
|
// LSUIElement in Info.plist already enforces this; set it explicitly too
|
||||||
|
// so behavior is unambiguous regardless of how the app is launched.
|
||||||
|
NSApp.setActivationPolicy(.accessory)
|
||||||
|
}
|
||||||
|
}
|
||||||
@@ -0,0 +1,28 @@
|
|||||||
|
import SwiftUI
|
||||||
|
|
||||||
|
/// Menu-bar-only app entry point.
|
||||||
|
///
|
||||||
|
/// `LSUIElement` (set in Info.plist) keeps the app out of the Dock; the
|
||||||
|
/// `MenuBarExtra` scene provides the status-bar item and its panel. Phase 0 only
|
||||||
|
/// wires up permissions, settings, and a backend health check — no audio,
|
||||||
|
/// capture, or call detection yet.
|
||||||
|
@main
|
||||||
|
struct Ten31TranscriptsApp: App {
|
||||||
|
@NSApplicationDelegateAdaptor(AppDelegate.self) private var appDelegate
|
||||||
|
|
||||||
|
@StateObject private var settings = AppSettings()
|
||||||
|
@StateObject private var permissions = PermissionsManager()
|
||||||
|
@StateObject private var health = SparkControlHealth()
|
||||||
|
|
||||||
|
var body: some Scene {
|
||||||
|
MenuBarExtra {
|
||||||
|
MenuBarView()
|
||||||
|
.environmentObject(settings)
|
||||||
|
.environmentObject(permissions)
|
||||||
|
.environmentObject(health)
|
||||||
|
} label: {
|
||||||
|
Image(systemName: "waveform.circle")
|
||||||
|
}
|
||||||
|
.menuBarExtraStyle(.window)
|
||||||
|
}
|
||||||
|
}
|
||||||
@@ -0,0 +1,24 @@
|
|||||||
|
import Foundation
|
||||||
|
|
||||||
|
/// URLSession delegate that trusts the server certificate without validation.
|
||||||
|
///
|
||||||
|
/// SparkControl sits behind a Start9 self-signed Root CA on the LAN, so default
|
||||||
|
/// trust evaluation rejects it. This delegate is used **only** when the
|
||||||
|
/// "Skip TLS verification" setting is on. It trusts any server certificate —
|
||||||
|
/// acceptable for a personal tool on a trusted local network and nothing else.
|
||||||
|
final class InsecureTrustDelegate: NSObject, URLSessionDelegate {
|
||||||
|
func urlSession(
|
||||||
|
_ session: URLSession,
|
||||||
|
didReceive challenge: URLAuthenticationChallenge,
|
||||||
|
completionHandler: @escaping (URLSession.AuthChallengeDisposition, URLCredential?) -> Void
|
||||||
|
) {
|
||||||
|
guard
|
||||||
|
challenge.protectionSpace.authenticationMethod == NSURLAuthenticationMethodServerTrust,
|
||||||
|
let serverTrust = challenge.protectionSpace.serverTrust
|
||||||
|
else {
|
||||||
|
completionHandler(.performDefaultHandling, nil)
|
||||||
|
return
|
||||||
|
}
|
||||||
|
completionHandler(.useCredential, URLCredential(trust: serverTrust))
|
||||||
|
}
|
||||||
|
}
|
||||||
@@ -0,0 +1,66 @@
|
|||||||
|
import Foundation
|
||||||
|
import Combine
|
||||||
|
|
||||||
|
/// Performs the Phase 0 backend reachability check: `GET {baseURL}/api/status`.
|
||||||
|
///
|
||||||
|
/// This is a thin slice — the full `SparkControlClient` (label-merge, multipart,
|
||||||
|
/// sequential queueing, retries) arrives in Phase 5.
|
||||||
|
@MainActor
|
||||||
|
final class SparkControlHealth: ObservableObject {
|
||||||
|
|
||||||
|
enum Status: Equatable {
|
||||||
|
case unknown
|
||||||
|
case checking
|
||||||
|
case online(String)
|
||||||
|
case offline(String)
|
||||||
|
}
|
||||||
|
|
||||||
|
@Published private(set) var status: Status = .unknown
|
||||||
|
@Published private(set) var lastChecked: Date?
|
||||||
|
|
||||||
|
func check(baseURL: String, skipTLS: Bool) async {
|
||||||
|
status = .checking
|
||||||
|
|
||||||
|
let trimmed = baseURL.trimmingCharacters(in: .whitespacesAndNewlines)
|
||||||
|
let base = trimmed.hasSuffix("/") ? String(trimmed.dropLast()) : trimmed
|
||||||
|
guard !base.isEmpty, let url = URL(string: base + "/api/status") else {
|
||||||
|
status = .offline("Invalid host URL")
|
||||||
|
return
|
||||||
|
}
|
||||||
|
|
||||||
|
let config = URLSessionConfiguration.ephemeral
|
||||||
|
config.timeoutIntervalForRequest = 8
|
||||||
|
config.waitsForConnectivity = false
|
||||||
|
|
||||||
|
let delegate: URLSessionDelegate? = skipTLS ? InsecureTrustDelegate() : nil
|
||||||
|
let session = URLSession(configuration: config, delegate: delegate, delegateQueue: nil)
|
||||||
|
defer { session.finishTasksAndInvalidate() }
|
||||||
|
|
||||||
|
do {
|
||||||
|
let (data, response) = try await session.data(from: url)
|
||||||
|
lastChecked = Date()
|
||||||
|
guard let http = response as? HTTPURLResponse else {
|
||||||
|
status = .offline("No HTTP response")
|
||||||
|
return
|
||||||
|
}
|
||||||
|
if (200..<300).contains(http.statusCode) {
|
||||||
|
status = .online(Self.summarize(data) ?? "Reachable")
|
||||||
|
} else {
|
||||||
|
status = .offline("HTTP \(http.statusCode)")
|
||||||
|
}
|
||||||
|
} catch {
|
||||||
|
lastChecked = Date()
|
||||||
|
status = .offline(error.localizedDescription)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
/// Best-effort one-line summary of the `/api/status` body, if it's JSON.
|
||||||
|
private static func summarize(_ data: Data) -> String? {
|
||||||
|
guard let object = try? JSONSerialization.jsonObject(with: data) as? [String: Any] else {
|
||||||
|
return nil
|
||||||
|
}
|
||||||
|
if let s = object["status"] as? String { return s }
|
||||||
|
if let s = object["state"] as? String { return s }
|
||||||
|
return "Reachable"
|
||||||
|
}
|
||||||
|
}
|
||||||
@@ -0,0 +1,89 @@
|
|||||||
|
import AVFoundation
|
||||||
|
import CoreGraphics
|
||||||
|
import ApplicationServices
|
||||||
|
import AppKit
|
||||||
|
import Combine
|
||||||
|
|
||||||
|
enum PermissionState {
|
||||||
|
case granted
|
||||||
|
case denied
|
||||||
|
case notDetermined
|
||||||
|
}
|
||||||
|
|
||||||
|
/// Tracks and requests the three TCC permissions the app needs.
|
||||||
|
///
|
||||||
|
/// - Microphone: AVFoundation authorization (has a real "not determined" state).
|
||||||
|
/// - Screen Recording: CoreGraphics preflight/request (binary granted/denied).
|
||||||
|
/// - Accessibility: AXIsProcessTrusted (binary granted/denied).
|
||||||
|
@MainActor
|
||||||
|
final class PermissionsManager: ObservableObject {
|
||||||
|
|
||||||
|
@Published private(set) var microphone: PermissionState = .notDetermined
|
||||||
|
@Published private(set) var screenRecording: PermissionState = .notDetermined
|
||||||
|
@Published private(set) var accessibility: PermissionState = .notDetermined
|
||||||
|
|
||||||
|
init() {
|
||||||
|
refresh()
|
||||||
|
}
|
||||||
|
|
||||||
|
func refresh() {
|
||||||
|
microphone = Self.microphoneState()
|
||||||
|
screenRecording = CGPreflightScreenCaptureAccess() ? .granted : .denied
|
||||||
|
accessibility = AXIsProcessTrusted() ? .granted : .denied
|
||||||
|
}
|
||||||
|
|
||||||
|
// MARK: - Requests
|
||||||
|
|
||||||
|
func requestMicrophone() {
|
||||||
|
AVCaptureDevice.requestAccess(for: .audio) { _ in
|
||||||
|
Task { @MainActor in self.refresh() }
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
/// Triggers the system Screen Recording prompt on first call. The user must
|
||||||
|
/// still toggle the app on in System Settings; `refresh()` reflects it after.
|
||||||
|
func requestScreenRecording() {
|
||||||
|
_ = CGRequestScreenCaptureAccess()
|
||||||
|
refresh()
|
||||||
|
}
|
||||||
|
|
||||||
|
/// Shows the Accessibility trust prompt (deep-links to the right pane).
|
||||||
|
func requestAccessibility() {
|
||||||
|
// Literal is the value of `kAXTrustedCheckOptionPrompt`; used directly to
|
||||||
|
// stay robust across SDK import shapes of that constant.
|
||||||
|
let options = ["AXTrustedCheckOptionPrompt": true] as CFDictionary
|
||||||
|
_ = AXIsProcessTrustedWithOptions(options)
|
||||||
|
refresh()
|
||||||
|
}
|
||||||
|
|
||||||
|
func openSettings(_ pane: SettingsPane) {
|
||||||
|
guard let url = URL(string: pane.urlString) else { return }
|
||||||
|
NSWorkspace.shared.open(url)
|
||||||
|
}
|
||||||
|
|
||||||
|
// MARK: - Helpers
|
||||||
|
|
||||||
|
private static func microphoneState() -> PermissionState {
|
||||||
|
switch AVCaptureDevice.authorizationStatus(for: .audio) {
|
||||||
|
case .authorized: return .granted
|
||||||
|
case .denied, .restricted: return .denied
|
||||||
|
case .notDetermined: return .notDetermined
|
||||||
|
@unknown default: return .notDetermined
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
enum SettingsPane {
|
||||||
|
case microphone
|
||||||
|
case screenRecording
|
||||||
|
case accessibility
|
||||||
|
|
||||||
|
var urlString: String {
|
||||||
|
let root = "x-apple.systempreferences:com.apple.preference.security?"
|
||||||
|
switch self {
|
||||||
|
case .microphone: return root + "Privacy_Microphone"
|
||||||
|
case .screenRecording: return root + "Privacy_ScreenCapture"
|
||||||
|
case .accessibility: return root + "Privacy_Accessibility"
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
@@ -0,0 +1,66 @@
|
|||||||
|
import Foundation
|
||||||
|
import Combine
|
||||||
|
|
||||||
|
/// User-facing settings, persisted to `UserDefaults`.
|
||||||
|
///
|
||||||
|
/// Phase 0 scope: backend host + TLS-skip, output folder, and adapter toggles.
|
||||||
|
/// The adapter toggles persist but do nothing yet (adapters arrive in Phase 3–4).
|
||||||
|
@MainActor
|
||||||
|
final class AppSettings: ObservableObject {
|
||||||
|
|
||||||
|
/// Adapters the app will eventually run, in display order.
|
||||||
|
static let adapterKeys: [(key: String, label: String)] = [
|
||||||
|
("zoom", "Zoom"),
|
||||||
|
("teams", "Microsoft Teams"),
|
||||||
|
("signal", "Signal"),
|
||||||
|
("meet", "Google Meet"),
|
||||||
|
]
|
||||||
|
|
||||||
|
@Published var backendBaseURL: String {
|
||||||
|
didSet { defaults.set(backendBaseURL, forKey: Keys.backendBaseURL) }
|
||||||
|
}
|
||||||
|
|
||||||
|
@Published var skipTLSVerification: Bool {
|
||||||
|
didSet { defaults.set(skipTLSVerification, forKey: Keys.skipTLS) }
|
||||||
|
}
|
||||||
|
|
||||||
|
@Published var outputFolderPath: String {
|
||||||
|
didSet { defaults.set(outputFolderPath, forKey: Keys.outputFolder) }
|
||||||
|
}
|
||||||
|
|
||||||
|
@Published var adapterEnabled: [String: Bool] {
|
||||||
|
didSet { defaults.set(adapterEnabled, forKey: Keys.adapterEnabled) }
|
||||||
|
}
|
||||||
|
|
||||||
|
/// Output folder as a resolved file URL (expands a leading `~`).
|
||||||
|
var outputFolderURL: URL {
|
||||||
|
URL(fileURLWithPath: (outputFolderPath as NSString).expandingTildeInPath,
|
||||||
|
isDirectory: true)
|
||||||
|
}
|
||||||
|
|
||||||
|
private let defaults: UserDefaults
|
||||||
|
|
||||||
|
init(defaults: UserDefaults = .standard) {
|
||||||
|
self.defaults = defaults
|
||||||
|
|
||||||
|
self.backendBaseURL = defaults.string(forKey: Keys.backendBaseURL)
|
||||||
|
?? "https://your-spark-backend.local:62419"
|
||||||
|
|
||||||
|
self.skipTLSVerification = defaults.object(forKey: Keys.skipTLS) as? Bool ?? true
|
||||||
|
|
||||||
|
self.outputFolderPath = defaults.string(forKey: Keys.outputFolder)
|
||||||
|
?? "~/Ten31Transcripts"
|
||||||
|
|
||||||
|
let stored = defaults.dictionary(forKey: Keys.adapterEnabled) as? [String: Bool]
|
||||||
|
self.adapterEnabled = stored ?? Dictionary(
|
||||||
|
uniqueKeysWithValues: Self.adapterKeys.map { ($0.key, true) }
|
||||||
|
)
|
||||||
|
}
|
||||||
|
|
||||||
|
private enum Keys {
|
||||||
|
static let backendBaseURL = "backendBaseURL"
|
||||||
|
static let skipTLS = "skipTLSVerification"
|
||||||
|
static let outputFolder = "outputFolderPath"
|
||||||
|
static let adapterEnabled = "adapterEnabled"
|
||||||
|
}
|
||||||
|
}
|
||||||
@@ -0,0 +1,43 @@
|
|||||||
|
<?xml version="1.0" encoding="UTF-8"?>
|
||||||
|
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
|
||||||
|
<plist version="1.0">
|
||||||
|
<dict>
|
||||||
|
<key>CFBundleDevelopmentRegion</key>
|
||||||
|
<string>en</string>
|
||||||
|
<key>CFBundleExecutable</key>
|
||||||
|
<string>$(EXECUTABLE_NAME)</string>
|
||||||
|
<key>CFBundleIdentifier</key>
|
||||||
|
<string>$(PRODUCT_BUNDLE_IDENTIFIER)</string>
|
||||||
|
<key>CFBundleInfoDictionaryVersion</key>
|
||||||
|
<string>6.0</string>
|
||||||
|
<key>CFBundleName</key>
|
||||||
|
<string>$(PRODUCT_NAME)</string>
|
||||||
|
<key>CFBundleDisplayName</key>
|
||||||
|
<string>Ten31 Transcripts</string>
|
||||||
|
<key>CFBundlePackageType</key>
|
||||||
|
<string>APPL</string>
|
||||||
|
<key>CFBundleShortVersionString</key>
|
||||||
|
<string>$(MARKETING_VERSION)</string>
|
||||||
|
<key>CFBundleVersion</key>
|
||||||
|
<string>$(CURRENT_PROJECT_VERSION)</string>
|
||||||
|
<key>LSMinimumSystemVersion</key>
|
||||||
|
<string>$(MACOSX_DEPLOYMENT_TARGET)</string>
|
||||||
|
<key>LSUIElement</key>
|
||||||
|
<true/>
|
||||||
|
<key>NSPrincipalClass</key>
|
||||||
|
<string>NSApplication</string>
|
||||||
|
<key>NSHumanReadableCopyright</key>
|
||||||
|
<string>Ten31</string>
|
||||||
|
<key>NSMicrophoneUsageDescription</key>
|
||||||
|
<string>Ten31 Transcripts records your microphone during calls to build the local audio track.</string>
|
||||||
|
<key>NSAppleEventsUsageDescription</key>
|
||||||
|
<string>Ten31 Transcripts reads the active browser tab's URL to detect Google Meet calls.</string>
|
||||||
|
<key>NSLocalNetworkUsageDescription</key>
|
||||||
|
<string>Ten31 Transcripts connects to your SparkControl server on the local network.</string>
|
||||||
|
<key>NSAppTransportSecurity</key>
|
||||||
|
<dict>
|
||||||
|
<key>NSAllowsLocalNetworking</key>
|
||||||
|
<true/>
|
||||||
|
</dict>
|
||||||
|
</dict>
|
||||||
|
</plist>
|
||||||
@@ -0,0 +1,11 @@
|
|||||||
|
<?xml version="1.0" encoding="UTF-8"?>
|
||||||
|
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
|
||||||
|
<plist version="1.0">
|
||||||
|
<dict>
|
||||||
|
<!-- App Sandbox is intentionally OFF: the app must read other apps' windows,
|
||||||
|
audio, accessibility trees, and the browser's active-tab URL, which the
|
||||||
|
sandbox forbids. This is a personal, LAN-only tool. -->
|
||||||
|
<key>com.apple.security.app-sandbox</key>
|
||||||
|
<false/>
|
||||||
|
</dict>
|
||||||
|
</plist>
|
||||||
@@ -0,0 +1,116 @@
|
|||||||
|
import SwiftUI
|
||||||
|
import AppKit
|
||||||
|
|
||||||
|
/// The menu-bar panel: permission statuses, backend health, and a link to
|
||||||
|
/// Settings. Shown when the user clicks the status-bar item.
|
||||||
|
struct MenuBarView: View {
|
||||||
|
@EnvironmentObject private var settings: AppSettings
|
||||||
|
@EnvironmentObject private var permissions: PermissionsManager
|
||||||
|
@EnvironmentObject private var health: SparkControlHealth
|
||||||
|
|
||||||
|
var body: some View {
|
||||||
|
NavigationStack {
|
||||||
|
VStack(alignment: .leading, spacing: 12) {
|
||||||
|
header
|
||||||
|
Divider()
|
||||||
|
permissionsSection
|
||||||
|
Divider()
|
||||||
|
backendSection
|
||||||
|
Divider()
|
||||||
|
footer
|
||||||
|
}
|
||||||
|
.padding(14)
|
||||||
|
.frame(width: 320)
|
||||||
|
}
|
||||||
|
.onAppear { permissions.refresh() }
|
||||||
|
.task { await refreshHealth() }
|
||||||
|
}
|
||||||
|
|
||||||
|
private var header: some View {
|
||||||
|
VStack(alignment: .leading, spacing: 2) {
|
||||||
|
Text("Ten31 Transcripts").font(.headline)
|
||||||
|
Text("Phase 0 · setup & status")
|
||||||
|
.font(.caption)
|
||||||
|
.foregroundStyle(.secondary)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
private var permissionsSection: some View {
|
||||||
|
VStack(alignment: .leading, spacing: 8) {
|
||||||
|
Text("Permissions").font(.subheadline).bold()
|
||||||
|
PermissionRow(
|
||||||
|
title: "Microphone",
|
||||||
|
state: permissions.microphone,
|
||||||
|
onGrant: permissions.requestMicrophone,
|
||||||
|
onOpenSettings: { permissions.openSettings(.microphone) }
|
||||||
|
)
|
||||||
|
PermissionRow(
|
||||||
|
title: "Screen Recording",
|
||||||
|
state: permissions.screenRecording,
|
||||||
|
onGrant: permissions.requestScreenRecording,
|
||||||
|
onOpenSettings: { permissions.openSettings(.screenRecording) }
|
||||||
|
)
|
||||||
|
PermissionRow(
|
||||||
|
title: "Accessibility",
|
||||||
|
state: permissions.accessibility,
|
||||||
|
onGrant: permissions.requestAccessibility,
|
||||||
|
onOpenSettings: { permissions.openSettings(.accessibility) }
|
||||||
|
)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
private var backendSection: some View {
|
||||||
|
VStack(alignment: .leading, spacing: 8) {
|
||||||
|
HStack {
|
||||||
|
Text("Backend").font(.subheadline).bold()
|
||||||
|
Spacer()
|
||||||
|
Button("Check") { Task { await refreshHealth() } }
|
||||||
|
.disabled(health.status == .checking)
|
||||||
|
}
|
||||||
|
Text(settings.backendBaseURL)
|
||||||
|
.font(.caption)
|
||||||
|
.foregroundStyle(.secondary)
|
||||||
|
.lineLimit(1)
|
||||||
|
.truncationMode(.middle)
|
||||||
|
HStack(spacing: 8) {
|
||||||
|
StatusDot(color: healthColor)
|
||||||
|
Text(healthText).font(.caption)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
private var footer: some View {
|
||||||
|
HStack {
|
||||||
|
NavigationLink("Settings…") {
|
||||||
|
SettingsView()
|
||||||
|
}
|
||||||
|
Spacer()
|
||||||
|
Button("Quit") { NSApplication.shared.terminate(nil) }
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
private func refreshHealth() async {
|
||||||
|
await health.check(
|
||||||
|
baseURL: settings.backendBaseURL,
|
||||||
|
skipTLS: settings.skipTLSVerification
|
||||||
|
)
|
||||||
|
}
|
||||||
|
|
||||||
|
private var healthColor: Color {
|
||||||
|
switch health.status {
|
||||||
|
case .online: return .green
|
||||||
|
case .offline: return .red
|
||||||
|
case .checking: return .orange
|
||||||
|
case .unknown: return .gray
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
private var healthText: String {
|
||||||
|
switch health.status {
|
||||||
|
case .unknown: return "Not checked yet"
|
||||||
|
case .checking: return "Checking…"
|
||||||
|
case .online(let detail): return "Online · \(detail)"
|
||||||
|
case .offline(let error): return "Offline · \(error)"
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
@@ -0,0 +1,54 @@
|
|||||||
|
import SwiftUI
|
||||||
|
|
||||||
|
/// Small status indicator dot.
|
||||||
|
struct StatusDot: View {
|
||||||
|
let color: Color
|
||||||
|
var body: some View {
|
||||||
|
Circle().fill(color).frame(width: 9, height: 9)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
/// One permission line: status dot, label, and a context-appropriate action.
|
||||||
|
struct PermissionRow: View {
|
||||||
|
let title: String
|
||||||
|
let state: PermissionState
|
||||||
|
let onGrant: () -> Void
|
||||||
|
let onOpenSettings: () -> Void
|
||||||
|
|
||||||
|
var body: some View {
|
||||||
|
HStack(spacing: 8) {
|
||||||
|
StatusDot(color: dotColor)
|
||||||
|
Text(title)
|
||||||
|
Spacer()
|
||||||
|
actionButton
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
private var dotColor: Color {
|
||||||
|
switch state {
|
||||||
|
case .granted: return .green
|
||||||
|
case .denied: return .red
|
||||||
|
case .notDetermined: return .orange
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
@ViewBuilder
|
||||||
|
private var actionButton: some View {
|
||||||
|
switch state {
|
||||||
|
case .granted:
|
||||||
|
Image(systemName: "checkmark.circle.fill").foregroundStyle(.green)
|
||||||
|
case .notDetermined:
|
||||||
|
// Native prompt (Microphone). The request also registers the app.
|
||||||
|
Button("Grant", action: onGrant)
|
||||||
|
case .denied:
|
||||||
|
// Screen Recording / Accessibility report binary granted/denied, so
|
||||||
|
// "not yet asked" looks like denied. "Grant" calls the request API,
|
||||||
|
// which registers the app in the relevant list and shows the system
|
||||||
|
// prompt the first time; "Open Settings" is the manual fallback.
|
||||||
|
HStack(spacing: 6) {
|
||||||
|
Button("Grant", action: onGrant)
|
||||||
|
Button("Open Settings", action: onOpenSettings)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
@@ -0,0 +1,67 @@
|
|||||||
|
import SwiftUI
|
||||||
|
import AppKit
|
||||||
|
|
||||||
|
/// Settings panel (pushed from the menu-bar panel).
|
||||||
|
struct SettingsView: View {
|
||||||
|
@EnvironmentObject private var settings: AppSettings
|
||||||
|
|
||||||
|
var body: some View {
|
||||||
|
Form {
|
||||||
|
Section("SparkControl backend") {
|
||||||
|
TextField("Base URL", text: $settings.backendBaseURL)
|
||||||
|
.textFieldStyle(.roundedBorder)
|
||||||
|
Toggle("Skip TLS verification (self-signed cert)",
|
||||||
|
isOn: $settings.skipTLSVerification)
|
||||||
|
}
|
||||||
|
|
||||||
|
Section("Output") {
|
||||||
|
HStack {
|
||||||
|
Text(settings.outputFolderPath)
|
||||||
|
.lineLimit(1)
|
||||||
|
.truncationMode(.middle)
|
||||||
|
.foregroundStyle(.secondary)
|
||||||
|
Spacer()
|
||||||
|
Button("Choose…", action: chooseFolder)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
Section("Adapters") {
|
||||||
|
Text("Inert in Phase 0 — these toggles only persist for now.")
|
||||||
|
.font(.caption)
|
||||||
|
.foregroundStyle(.secondary)
|
||||||
|
ForEach(AppSettings.adapterKeys, id: \.key) { adapter in
|
||||||
|
Toggle(adapter.label, isOn: binding(for: adapter.key))
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
.formStyle(.grouped)
|
||||||
|
.frame(width: 320)
|
||||||
|
.navigationTitle("Settings")
|
||||||
|
}
|
||||||
|
|
||||||
|
private func binding(for key: String) -> Binding<Bool> {
|
||||||
|
Binding(
|
||||||
|
get: { settings.adapterEnabled[key] ?? true },
|
||||||
|
set: { settings.adapterEnabled[key] = $0 }
|
||||||
|
)
|
||||||
|
}
|
||||||
|
|
||||||
|
private func chooseFolder() {
|
||||||
|
let panel = NSOpenPanel()
|
||||||
|
panel.canChooseDirectories = true
|
||||||
|
panel.canChooseFiles = false
|
||||||
|
panel.allowsMultipleSelection = false
|
||||||
|
panel.prompt = "Choose"
|
||||||
|
panel.directoryURL = settings.outputFolderURL
|
||||||
|
|
||||||
|
// The app is a menu-bar accessory and this is invoked from the transient
|
||||||
|
// MenuBarExtra(.window) popover. Use the async begin(...) API rather than
|
||||||
|
// runModal() — a nested modal loop can let the popover dismiss the panel
|
||||||
|
// out from under it. Activate first so the panel comes to the front.
|
||||||
|
NSApp.activate(ignoringOtherApps: true)
|
||||||
|
panel.begin { response in
|
||||||
|
guard response == .OK, let url = panel.url else { return }
|
||||||
|
settings.outputFolderPath = url.path
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
@@ -0,0 +1,9 @@
|
|||||||
|
import XCTest
|
||||||
|
|
||||||
|
/// Placeholder so the test target is wired and green from Phase 0.
|
||||||
|
/// Real coverage (adapter fixture tests) arrives in Phase 3.
|
||||||
|
final class Ten31TranscriptsTests: XCTestCase {
|
||||||
|
func testScaffoldBuilds() throws {
|
||||||
|
XCTAssertTrue(true)
|
||||||
|
}
|
||||||
|
}
|
||||||
@@ -0,0 +1,151 @@
|
|||||||
|
# Project Brief — Ten31 Transcripts
|
||||||
|
|
||||||
|
> Local macOS app that auto-detects conference calls, records local audio, and
|
||||||
|
> produces a **visual-derived, timestamped speaker timeline** — then hands the
|
||||||
|
> mixed audio + that timeline to the operator's **SparkControl** backend, which
|
||||||
|
> diarizes, names the speakers (majority-overlap vote against the timeline), and
|
||||||
|
> returns named transcript segments. A growing **voiceprint library** recovers
|
||||||
|
> speakers even when the visual cue is missing.
|
||||||
|
|
||||||
|
Master context document. Read this first, then `02_ARCHITECTURE.md`,
|
||||||
|
`03_DATA_CONTRACTS.md`, `04_BUILD_PLAN.md`. The SparkControl API is now fully
|
||||||
|
specified — see `03_DATA_CONTRACTS.md` (and the source `AUDIO_API.md`).
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 1. What we are building
|
||||||
|
|
||||||
|
A lightweight, always-running **menu-bar app on macOS** that:
|
||||||
|
|
||||||
|
1. **Detects** when the user joins a call in Google Meet, Zoom, Microsoft Teams,
|
||||||
|
or Signal.
|
||||||
|
2. **Records two local audio tracks** — system audio (everyone else) and the
|
||||||
|
user's microphone (the user) — and **mixes them to one 16 kHz mono WAV** for
|
||||||
|
the backend.
|
||||||
|
3. **Watches the call window** at ~2–4 fps and, per app, reads participant
|
||||||
|
**names** and the **active-speaker cue**, producing a
|
||||||
|
`(start, end, name, confidence)` **visual timeline** — its best guess at who
|
||||||
|
was talking when.
|
||||||
|
4. **Discards every video frame after extraction.** No video is ever written to
|
||||||
|
disk. Only audio + the derived timeline persist locally.
|
||||||
|
5. On call end, **POSTs the mixed audio + the visual timeline (+ the known
|
||||||
|
voiceprint library) to `POST /api/audio/label-merge`** on SparkControl, which
|
||||||
|
returns **named, speaker-attributed transcript segments** and a **voiceprint
|
||||||
|
per speaker**.
|
||||||
|
6. **Persists the returned voiceprints** keyed by name, so the next call can pass
|
||||||
|
them as `known_voiceprints` and recover a speaker by voice when the visual cue
|
||||||
|
is absent (camera off, a bad OCR frame).
|
||||||
|
|
||||||
|
The app's job ends at receiving and storing the named segments from SparkControl.
|
||||||
|
**All transcription, diarization, and the name-merge happen on the backend.** Do
|
||||||
|
not build transcription, diarization, or the merge vote in this app.
|
||||||
|
|
||||||
|
## 2. Why the visual timeline still matters (the core idea)
|
||||||
|
|
||||||
|
Audio diarization (NVIDIA Sortformer on the backend) is excellent at
|
||||||
|
**segmentation** — precise speaker boundaries — but its clusters are **anonymous**
|
||||||
|
(`Speaker_0`, `Speaker_1`…). It cannot name anyone.
|
||||||
|
|
||||||
|
The screen *already knows the names*. Each app visually marks the active speaker
|
||||||
|
(colored tile border, animated audio bars, a ring around an avatar) next to that
|
||||||
|
person's name/initials.
|
||||||
|
|
||||||
|
So responsibilities split cleanly:
|
||||||
|
- **Audio (backend)** owns *segmentation* — the exact *when*.
|
||||||
|
- **Visual capture (this app)** owns *identity* — the *who*.
|
||||||
|
- **`label-merge` (backend)** fuses them: it diarizes, then assigns each cluster
|
||||||
|
the timeline name with the most temporal overlap. The visual track needn't be
|
||||||
|
perfect — it only needs to win the per-cluster vote.
|
||||||
|
|
||||||
|
**New compounding layer — the voiceprint library.** Every named cluster comes
|
||||||
|
back from the backend with a 192-dim TitaNet voiceprint. The app persists these
|
||||||
|
keyed by name and replays them as `known_voiceprints` next time. Resolution order
|
||||||
|
per cluster becomes: **visual overlap → voiceprint match → `Unknown_N`** (never
|
||||||
|
mislabeled). So the screen capture *enrolls a voice library for free*, and over a
|
||||||
|
few calls the system can name regulars even with cameras off.
|
||||||
|
|
||||||
|
## 3. Hard scope boundaries
|
||||||
|
|
||||||
|
**In scope (this app):**
|
||||||
|
- Call detection for Meet / Zoom / Teams / Signal.
|
||||||
|
- Dual-track local audio capture + mix-to-mono for the backend.
|
||||||
|
- Low-fps window capture → OCR (names) + active-speaker cue detection.
|
||||||
|
- Per-app "adapter" modules encapsulating each app's UI quirks.
|
||||||
|
- Building the visual timeline; **mic-VAD self-labeling** (the mic track is the
|
||||||
|
user, so hot-mic spans pre-seed the user's name into the timeline).
|
||||||
|
- Chunking long calls (~2–3 min) and calling `label-merge` **sequentially**.
|
||||||
|
- A local **voiceprint store** (persist + replay named voiceprints).
|
||||||
|
- Storing the backend's named transcript segments locally.
|
||||||
|
- A minimal menu-bar UI: status, manual start/stop, recent sessions, adapter
|
||||||
|
toggles, backend host/health, output folder.
|
||||||
|
|
||||||
|
**Out of scope (owned by the backend):**
|
||||||
|
- Transcription, diarization, the name-merge vote, summarization/analysis.
|
||||||
|
|
||||||
|
**Explicitly not doing:** saving video; cloud anything. Everything stays on the
|
||||||
|
operator's LAN.
|
||||||
|
|
||||||
|
## 4. Key decisions (now resolved against the real contract)
|
||||||
|
|
||||||
|
| Decision | Choice | Why |
|
||||||
|
|---|---|---|
|
||||||
|
| Language / framework | Native Swift + SwiftUI menu-bar app (`LSUIElement`) | System audio, window capture, Vision all native; one codebase. |
|
||||||
|
| Audio capture | ScreenCaptureKit (system audio) + AVFoundation (mic) | No virtual audio device; works with headphones; macOS 13+. |
|
||||||
|
| Backend audio format | **Mixed-mono 16 kHz WAV** | Diarizer separates speakers from one mixed stream; 16 kHz is ideal. |
|
||||||
|
| Call detection | CoreAudio "mic running somewhere" + known-app / Meet-tab heuristic | Clean live-mic signal + app disambiguation. |
|
||||||
|
| Speaker naming | **Backend, via `POST /api/audio/label-merge`** | One call does diarize + overlap-vote naming + transcription. No client merge. |
|
||||||
|
| Identity recovery | **Local voiceprint library** replayed as `known_voiceprints` | Recovers camera-off / OCR-missed speakers by voice; compounds over calls. |
|
||||||
|
| Self-identity | mic-VAD → pre-seed user's name in timeline | The mic track is the user; gives the backend a strong prior + enrolls the user's voiceprint immediately. |
|
||||||
|
| Requests | **Sequential, one audio request in flight** | Parallel audio requests trip a backend GPU race (`503 + Retry-After`). |
|
||||||
|
| Long calls | Chunk ~2–3 min, sequential, stitch via names+voiceprints | Diarizer caps at **4 speakers/chunk**; voiceprints + names unify across chunks. |
|
||||||
|
| Transport / TLS | `multipart/form-data`, file field `file`; self-signed Start9 cert (skip verify or trust the Root CA); **no auth on LAN** | Matches every other SparkControl endpoint. |
|
||||||
|
| Timing | Batch after call (sync endpoints, no polling) | Endpoints are synchronous; no job/poll machinery needed. |
|
||||||
|
|
||||||
|
### On forking Hyprnote
|
||||||
|
Unchanged recommendation: the audio capture is the trivial part (~200 lines of
|
||||||
|
native Swift) and the rest (Vision screen-reading) is native too. Build native;
|
||||||
|
use Hyprnote's capture/detection only as reference. Fork remains an override.
|
||||||
|
|
||||||
|
## 5. Target apps & identifiers
|
||||||
|
|
||||||
|
| App | Join via | Bundle ID(s) | Speaking cue / names |
|
||||||
|
|---|---|---|---|
|
||||||
|
| Zoom | Native | `us.zoom.xos` | Colored tile border; name label in tile. |
|
||||||
|
| Microsoft Teams | Native (new) | `com.microsoft.teams2` (new), `com.microsoft.teams` (classic) | Colored ring/border; labeled. |
|
||||||
|
| Signal | Native (Electron) | `org.whispersystems.signal-desktop` | Ring around avatar/initials; try Accessibility names first. |
|
||||||
|
| Google Meet | Browser tab | `com.google.Chrome`, `com.apple.Safari`, `company.thebrowser.Browser` (Arc)… | Canvas video → Vision for the cue; DOM names → Accessibility/AppleScript; confirm via active-tab URL `meet.google.com`. |
|
||||||
|
|
||||||
|
Four required adapters; adding a 5th must be one new file conforming to the
|
||||||
|
`AppAdapter` protocol.
|
||||||
|
|
||||||
|
## 6. The backend (do not rebuild) — now concrete
|
||||||
|
|
||||||
|
SparkControl, on the operator's Start9 LAN, fronting two DGX Sparks:
|
||||||
|
- **STT:** NVIDIA Parakeet TDT 0.6B — `POST /v1/audio/transcriptions` (OpenAI-compatible).
|
||||||
|
- **Diarization:** NVIDIA Sortformer 4spk-v1 — `POST /api/audio/diarize-chunk`
|
||||||
|
(anonymous clusters + voiceprints) and `POST /api/audio/transcribe-with-speakers`.
|
||||||
|
- **Embeddings:** NVIDIA TitaNet (192-dim voiceprints).
|
||||||
|
- **★ Primary endpoint for this app:** `POST /api/audio/label-merge` — diarize +
|
||||||
|
name from the visual timeline (+ voiceprint fallback), optionally transcribe,
|
||||||
|
in one synchronous call.
|
||||||
|
- Health/discovery: `GET /api/status`, `GET /api/endpoints`, `GET /v1/models`.
|
||||||
|
|
||||||
|
Full request/response shapes, curl examples, limits, and error formats are in
|
||||||
|
`03_DATA_CONTRACTS.md`.
|
||||||
|
|
||||||
|
## 7. Remaining open items (small)
|
||||||
|
|
||||||
|
1. **Base URL — RESOLVED.** `https://your-spark-backend.local:62419`, also
|
||||||
|
`https://your-spark-backend.local:62419` (prefer the `.local` form; it survives IP
|
||||||
|
changes). Ship the `.local` host as the default; keep it editable in settings.
|
||||||
|
Service-discovery at `GET /api/endpoints`.
|
||||||
|
2. **Send trigger** — assume auto-POST on call end; expose a "hold for review"
|
||||||
|
toggle if the user wants to eyeball the timeline first.
|
||||||
|
3. **Retention** — keep the session folder after a successful hand-off, or prune
|
||||||
|
audio and keep only `speakers.json` + voiceprints? Default: keep everything,
|
||||||
|
user-configurable.
|
||||||
|
4. **Voiceprint update policy** — overwrite vs running-average a person's stored
|
||||||
|
voiceprint across calls (see `02_ARCHITECTURE.md §2.9`). Start simple
|
||||||
|
(store/refresh latest high-confidence), refine later.
|
||||||
|
5. **Signing** — stable identity so macOS doesn't re-prompt for permissions on
|
||||||
|
each rebuild.
|
||||||
@@ -0,0 +1,248 @@
|
|||||||
|
# Architecture — Ten31 Transcripts
|
||||||
|
|
||||||
|
Companion to `01_PROJECT_BRIEF.md`. Module layout, data flow, the per-app adapter
|
||||||
|
pattern, the macOS APIs, and the SparkControl integration (now fully specified).
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 1. High-level data flow
|
||||||
|
|
||||||
|
```
|
||||||
|
┌─────────────────────────────────────────┐
|
||||||
|
│ CallDetector │
|
||||||
|
│ CoreAudio "mic running somewhere" │
|
||||||
|
│ + known-app / Meet-tab heuristic │
|
||||||
|
└───────────────┬───────────────────────────┘
|
||||||
|
│ callStarted(app, window)
|
||||||
|
▼
|
||||||
|
┌──────────────────────────────────────────────────────────┐
|
||||||
|
│ SessionController │
|
||||||
|
│ owns one Session; shared t0; start/stop; on end package │
|
||||||
|
└───────┬───────────────────────────┬──────────────────────┘
|
||||||
|
│ │
|
||||||
|
▼ ▼
|
||||||
|
┌────────────────────────┐ ┌───────────────────────────────────┐
|
||||||
|
│ AudioRecorder │ │ VisualObserver │
|
||||||
|
│ SCStream system audio │ │ SCStream window frames @2–4fps │
|
||||||
|
│ AVAudioEngine mic │ │ │ (frames released, never saved)│
|
||||||
|
│ → mic.wav, system.wav │ │ ▼ │
|
||||||
|
│ → mixed_mono_16k.wav │ │ AppAdapter (per app) │
|
||||||
|
│ + mic VAD → self spans │ │ OCR names + active-speaker cue │
|
||||||
|
└────────────┬───────────┘ │ → SpeakerObservation │
|
||||||
|
│ └──────────────┬────────────────────┘
|
||||||
|
│ ▼
|
||||||
|
│ ┌───────────────────────────┐
|
||||||
|
│ self spans ───▶│ TimelineBuilder │
|
||||||
|
│ │ debounce/coalesce + merge │
|
||||||
|
│ │ mic-VAD self spans │
|
||||||
|
│ │ → visual_timeline.json │
|
||||||
|
│ └──────────────┬────────────┘
|
||||||
|
│ │
|
||||||
|
▼ ▼
|
||||||
|
┌──────────────────────────────────────────────────────────┐
|
||||||
|
│ SessionPackager │
|
||||||
|
│ mixed_mono_16k.wav + visual_timeline.json + manifest │
|
||||||
|
│ + chunk plan (if call > ~3 min) │
|
||||||
|
└───────────────────────────┬──────────────────────────────┘
|
||||||
|
▼
|
||||||
|
┌──────────────────────────────────────┐
|
||||||
|
│ SparkControlClient │
|
||||||
|
│ per chunk, SEQUENTIAL: │
|
||||||
|
│ POST /api/audio/label-merge │
|
||||||
|
│ file=chunk.wav │
|
||||||
|
│ timeline=<chunk-local segments> │
|
||||||
|
│ known_voiceprints=<from store> │
|
||||||
|
│ transcribe=true │
|
||||||
|
│ → named segments + per-speaker prints │
|
||||||
|
└───────────────┬───────────┬────────────┘
|
||||||
|
│ │
|
||||||
|
offset+stitch│ │ fingerprints (keyed by name)
|
||||||
|
▼ ▼
|
||||||
|
┌────────────────┐ ┌────────────────────┐
|
||||||
|
│ speakers.json │ │ VoiceprintStore │
|
||||||
|
│ (named, global) │ │ persist/replay │
|
||||||
|
└────────────────┘ └────────────────────┘
|
||||||
|
```
|
||||||
|
|
||||||
|
## 2. Modules
|
||||||
|
|
||||||
|
### 2.1 `CallDetector`
|
||||||
|
Fire `callStarted(app:window:)` / `callEnded()`.
|
||||||
|
- **Mic active system-wide:** CoreAudio `kAudioDevicePropertyDeviceIsRunningSomewhere`
|
||||||
|
on the default input device (listener, not poll).
|
||||||
|
- **App present/active:** `NSWorkspace` running/frontmost vs the bundle-ID table.
|
||||||
|
- **Meet (browser):** when a browser is frontmost + mic live, read the active-tab
|
||||||
|
URL (AppleScript/Accessibility); confirm `meet.google.com`.
|
||||||
|
- **Heuristic:** `mic_running` AND (`known_native_app_active` OR `browser+meet_tab`).
|
||||||
|
Debounce ~2 s open; end when mic quiet > N s and the app/tab leaves foreground
|
||||||
|
or quits.
|
||||||
|
- Output: app id + the call window (`SCWindow`) for the `VisualObserver`.
|
||||||
|
|
||||||
|
### 2.2 `AudioRecorder`
|
||||||
|
- **System audio:** `SCStream` with `capturesAudio = true` (mixer-level; works with
|
||||||
|
headphones; no BlackHole). macOS 13+.
|
||||||
|
- **Mic:** `AVAudioEngine` input tap.
|
||||||
|
- Outputs: `mic.wav`, `system.wav`, and the backend deliverable
|
||||||
|
**`mixed_mono_16k.wav`** (mic + system summed → mono → 16 kHz PCM WAV).
|
||||||
|
- **Shared `t0`** (`CACurrentMediaTime`) stamped once; every audio sample and
|
||||||
|
visual observation is relative to it. Non-negotiable — the merge depends on it.
|
||||||
|
- **Mic VAD:** run lightweight energy/VAD on the mic track to emit "the user is
|
||||||
|
speaking" spans. These feed `TimelineBuilder` as pre-seeded **self** segments
|
||||||
|
(high confidence) so the backend names the user even when their own tile isn't
|
||||||
|
read — and so the user's voiceprint enrolls on call one.
|
||||||
|
|
||||||
|
### 2.3 `VisualObserver`
|
||||||
|
- `SCStream` scoped (via `SCContentFilter`) to the **specific call window**.
|
||||||
|
- Throttle to adapter fps (default 3). Hand each frame to the adapter; **release
|
||||||
|
immediately — never persist a frame.**
|
||||||
|
- **Window visibility / focus is NOT required.** SCK captures a window's own
|
||||||
|
rendered content even when it's in the background, occluded by other apps, or
|
||||||
|
on another Space. The user can work in other apps during the call and visual
|
||||||
|
capture continues normally. (This is a key reason for window capture over
|
||||||
|
display capture — also more private.)
|
||||||
|
- **Capture liveness — the one real failure mode.** Two states stop fresh frames:
|
||||||
|
1. **Minimized to the Dock** — macOS may freeze the window's backing buffer, so
|
||||||
|
SCK delivers stale/duplicate frames. Detect minimization
|
||||||
|
(`SCWindow.isOnScreen == false` / window state) and **pause visual analysis +
|
||||||
|
flag a `visual_gap` for that span** rather than emitting bogus observations.
|
||||||
|
2. **Browser tab switched away (Meet only)** — see §2.4 Meet note.
|
||||||
|
In both cases **audio keeps recording**, and the backend voiceprint fallback
|
||||||
|
still names previously-heard speakers — so a gap only costs naming precision for
|
||||||
|
*new, never-seen* speakers during that exact window. Record gaps in
|
||||||
|
`visual_timeline.json` (a `visual_gaps: [{start,end,reason}]` array) so the
|
||||||
|
cause is auditable; `TimelineBuilder` must not interpolate across a gap.
|
||||||
|
|
||||||
|
### 2.4 `AppAdapter` (protocol) + four implementations
|
||||||
|
```swift
|
||||||
|
protocol AppAdapter {
|
||||||
|
static var bundleIDs: [String] { get }
|
||||||
|
var preferredFPS: Int { get }
|
||||||
|
func analyze(frame: CVPixelBuffer, at t: TimeInterval) -> [SpeakerObservation]
|
||||||
|
func namesFromAccessibility() -> [String]? // optional
|
||||||
|
}
|
||||||
|
struct SpeakerObservation {
|
||||||
|
let name: String // OCR'd / a11y name; "" if unknown
|
||||||
|
let speaking: Bool // active-speaker cue detected
|
||||||
|
let bbox: CGRect
|
||||||
|
let confidence: Double // 0..1
|
||||||
|
let t: TimeInterval // relative to session t0
|
||||||
|
}
|
||||||
|
```
|
||||||
|
Per-adapter cues:
|
||||||
|
- **Zoom** (`us.zoom.xos`): colored tile border = active speaker; OCR the tile
|
||||||
|
name label; handle speaker + gallery layouts.
|
||||||
|
- **Teams** (`com.microsoft.teams2`): colored ring/border; labeled; like Zoom.
|
||||||
|
- **Signal** (`org.whispersystems.signal-desktop`): ring around avatar/initials;
|
||||||
|
try `namesFromAccessibility()` first (Electron a11y tree), OCR fallback.
|
||||||
|
- **Meet** (browser): **hybrid** — names via Accessibility/AppleScript (DOM text),
|
||||||
|
speaking cue via Vision (canvas/WebGL animated bars / tile highlight), fused by
|
||||||
|
tile position. Most likely to need iteration.
|
||||||
|
- **Tab-switch caveat (Meet-specific):** if Meet is a browser *tab* and the user
|
||||||
|
switches to a different tab **in the same window**, the browser stops rendering
|
||||||
|
the Meet tab → SCK captures a frozen last-frame (a `visual_gap`). Switching to a
|
||||||
|
different *app* is fine; switching tabs is not. Mitigations, in order: (1)
|
||||||
|
detect the active-tab URL leaving `meet.google.com` and flag a `visual_gap`
|
||||||
|
(don't emit stale observations); (2) prefer capturing Meet in a **dedicated
|
||||||
|
browser window / PWA / standalone window** so tab-switching can't blank it
|
||||||
|
— surface this as a one-time setup tip in the UI; (3) names still come from the
|
||||||
|
a11y/DOM tree where available, and audio + voiceprint fallback carry identity
|
||||||
|
through the gap regardless.
|
||||||
|
|
||||||
|
Each adapter is **testable offline** against PNG/JPEG frame fixtures.
|
||||||
|
|
||||||
|
### 2.5 `TimelineBuilder`
|
||||||
|
Turn noisy per-frame observations into clean `(start, end, name, confidence)`
|
||||||
|
segments.
|
||||||
|
- Group by name; open a segment after K consecutive speaking frames (e.g. 2),
|
||||||
|
close after M quiet frames (e.g. 2) — hysteresis rides out UI-cue lag/flicker.
|
||||||
|
- **Allow overlaps** (crosstalk). Do not force one speaker per instant.
|
||||||
|
- Merge in the mic-VAD **self** spans (the user) with high confidence.
|
||||||
|
- Normalize OCR name variants ("Sarah J" → "Sarah Jones") via a per-session
|
||||||
|
alias table.
|
||||||
|
- Emit `visual_timeline.json` (schema in `03_DATA_CONTRACTS.md`). The flat
|
||||||
|
`segments` array maps directly onto the `timeline` field `label-merge` wants.
|
||||||
|
|
||||||
|
### 2.6 `SessionPackager`
|
||||||
|
Write the session folder and, if the call is longer than ~3 min, produce a
|
||||||
|
**chunk plan**: ~2–3 min windows on `mixed_mono_16k.wav`, each with its
|
||||||
|
**timeline slice rebased to chunk-local seconds**.
|
||||||
|
```
|
||||||
|
~/Ten31Transcripts/sessions/2026-06-05T14-03_zoom/
|
||||||
|
mic.wav system.wav mixed_mono_16k.wav
|
||||||
|
visual_timeline.json
|
||||||
|
manifest.json
|
||||||
|
(chunks/ produced transiently if chunking)
|
||||||
|
speakers.json # written after backend hand-off
|
||||||
|
```
|
||||||
|
|
||||||
|
### 2.7 `SparkControlClient`
|
||||||
|
Deliver to SparkControl. **Primary path = `POST /api/audio/label-merge`** with
|
||||||
|
`file`, `timeline`, `known_voiceprints`, `transcribe=true`.
|
||||||
|
- **Sequential only** — one audio request in flight (parallel ⇒ `503 + Retry-After`).
|
||||||
|
- **Self-signed TLS** — skip verification (`URLSession` delegate trusting the
|
||||||
|
Start9 cert) or trust the Root CA. **No auth on the LAN.**
|
||||||
|
- **Per chunk:** call `label-merge` with that chunk's audio + rebased timeline +
|
||||||
|
the **accumulated** voiceprints; offset returned timestamps back to global and
|
||||||
|
append. Names unify across chunks because the same names/voiceprints are passed
|
||||||
|
each time; new voiceprints accumulate into the store.
|
||||||
|
- Retry on `503` after `Retry-After`; on hard failure keep the session folder and
|
||||||
|
surface "Resend" in the UI.
|
||||||
|
- Limits to respect: **200 MB/request** (`413`), transcription timeout ~300 s,
|
||||||
|
diarization ~600 s. Chunking keeps requests well under these.
|
||||||
|
- See `03_DATA_CONTRACTS.md §4` for exact fields and a real response.
|
||||||
|
|
||||||
|
### 2.8 result assembly → `speakers.json`
|
||||||
|
Concatenate the per-chunk `label-merge` results into one global, named,
|
||||||
|
speaker-attributed transcript (timestamps offset to session time). This is the
|
||||||
|
seam to the user's existing summarizer. The app does not analyze past this.
|
||||||
|
|
||||||
|
### 2.9 `VoiceprintStore`
|
||||||
|
Local persistence of named voiceprints — the compounding-identity layer.
|
||||||
|
- File: `~/Ten31Transcripts/voiceprints.json` →
|
||||||
|
`{ "<name>": { "vector": [192 floats], "updated": <iso>, "calls": <int> } }`.
|
||||||
|
- **On send:** load all entries → pass as `known_voiceprints` to `label-merge`.
|
||||||
|
- **On response:** for each speaker resolved by **visual** (or a high-similarity
|
||||||
|
**voiceprint** match), store/refresh that name's vector. **Never** store
|
||||||
|
`Unknown_N`.
|
||||||
|
- **Update policy (open, start simple):** overwrite with the latest
|
||||||
|
high-confidence vector, or keep a running mean per name. v1 = store latest with
|
||||||
|
`overlap_confidence ≥ ~0.8`; refine to averaging later (`01 §7.4`).
|
||||||
|
- Editable/clearable from the menu-bar UI (rename, delete a person, reset).
|
||||||
|
|
||||||
|
### 2.10 `MenuBarUI` (SwiftUI, `LSUIElement`)
|
||||||
|
Status (idle / detected / recording / uploading), manual start/stop, recent
|
||||||
|
sessions (open folder, resend, delete), adapter toggles, **backend host + a
|
||||||
|
health check** (`GET /api/status`), output folder, voiceprint manager, and a
|
||||||
|
permissions checklist (Screen Recording, Microphone, Accessibility).
|
||||||
|
|
||||||
|
## 3. macOS frameworks & permissions
|
||||||
|
|
||||||
|
| Need | Framework | Permission |
|
||||||
|
|---|---|---|
|
||||||
|
| System audio + window frames | ScreenCaptureKit | Screen Recording |
|
||||||
|
| Microphone | AVFoundation / CoreAudio | Microphone |
|
||||||
|
| Meet/Signal names, tab URL | Accessibility (AXUIElement) / AppleScript | Accessibility + Automation |
|
||||||
|
| OCR + cue analysis | Vision (`VNRecognizeTextRequest`) | none |
|
||||||
|
| App/tab detection | AppKit (`NSWorkspace`) | none |
|
||||||
|
|
||||||
|
Stable signing identity avoids permission re-prompts on rebuild.
|
||||||
|
|
||||||
|
## 4. Performance
|
||||||
|
Window-scoped capture + 3 fps + Vision-on-Neural-Engine is light; audio is cheap;
|
||||||
|
frames are released immediately so memory stays flat. The app idles near-zero
|
||||||
|
until a call starts. Backend requests are sequential and chunked, so they never
|
||||||
|
saturate the GPU.
|
||||||
|
|
||||||
|
## 5. The merge — now done by the backend
|
||||||
|
The app no longer implements the overlap vote. `label-merge` resolves each
|
||||||
|
anonymous cluster in order:
|
||||||
|
1. **visual** — timeline name with the most temporal overlap (`source: "visual"`,
|
||||||
|
`overlap_confidence`);
|
||||||
|
2. **voiceprint** — closest `known_voiceprints` match above `voiceprint_threshold`
|
||||||
|
(`source: "voiceprint"`, `match_similarity`);
|
||||||
|
3. **`Unknown_N`** (`source: "unmatched"`) — never guessed/mislabeled.
|
||||||
|
The app's contribution is a good timeline (incl. mic-VAD self spans) and an
|
||||||
|
ever-growing voiceprint library. `min_overlap` and `voiceprint_threshold` are
|
||||||
|
tunable request params if precision needs adjusting.
|
||||||
|
```
|
||||||
|
```
|
||||||
@@ -0,0 +1,214 @@
|
|||||||
|
# Data Contracts — Ten31 Transcripts
|
||||||
|
|
||||||
|
Companion to docs 01/02. Defines the files the app produces/stores and the **real
|
||||||
|
SparkControl contract** (source of truth: `AUDIO_API.md`). The `label-merge`
|
||||||
|
endpoint is the app's primary integration point.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 1. `visual_timeline.json` (the app's primary output)
|
||||||
|
|
||||||
|
Rich, app-native record of who-was-speaking-when, inferred from the screen.
|
||||||
|
Times are **seconds relative to session `t0`** (the shared audio clock).
|
||||||
|
|
||||||
|
```jsonc
|
||||||
|
{
|
||||||
|
"schema_version": "1.1",
|
||||||
|
"session_id": "2026-06-05T14-03_zoom",
|
||||||
|
"app": "zoom", // zoom | teams | signal | meet
|
||||||
|
"adapter_version": "zoom-0.3.1",
|
||||||
|
"t0_unix": 1749135780.123,
|
||||||
|
"duration_sec": 2841.7,
|
||||||
|
"fps_sampled": 3,
|
||||||
|
"self_name": "Grant", // user's name; segments from mic VAD
|
||||||
|
"participants": [
|
||||||
|
{ "name": "Grant", "is_self": true },
|
||||||
|
{ "name": "Sarah Jones", "aliases": ["Sarah J"] },
|
||||||
|
{ "name": "Dmitri" }
|
||||||
|
],
|
||||||
|
"segments": [
|
||||||
|
{ "start": 0.0, "end": 4.5, "name": "Grant", "confidence": 0.97, "source": "mic_vad" },
|
||||||
|
{ "start": 4.8, "end": 9.3, "name": "Sarah Jones", "confidence": 0.82, "source": "vision" }
|
||||||
|
],
|
||||||
|
"visual_gaps": [
|
||||||
|
{ "start": 120.4, "end": 138.9, "reason": "minimized" } // reason: minimized | tab_switched
|
||||||
|
]
|
||||||
|
}
|
||||||
|
```
|
||||||
|
- Segments **may overlap** (crosstalk) — do not flatten.
|
||||||
|
- `source ∈ {vision, accessibility, fused, mic_vad}`.
|
||||||
|
- The user (mic-VAD `is_self`) is pre-seeded so the backend names them and
|
||||||
|
enrolls their voiceprint on call one.
|
||||||
|
|
||||||
|
### 1.1 Mapping to the `label-merge` `timeline` field
|
||||||
|
`label-merge` wants a **flat JSON array** of `{start, end, name, confidence}`
|
||||||
|
(seconds). Build it by taking `segments` and dropping `source`:
|
||||||
|
```json
|
||||||
|
[{"start":0.0,"end":4.5,"name":"Grant","confidence":0.97},
|
||||||
|
{"start":4.8,"end":9.3,"name":"Sarah Jones","confidence":0.82}]
|
||||||
|
```
|
||||||
|
When chunking, **slice to the chunk window and rebase to chunk-local seconds**
|
||||||
|
(subtract chunk start) before sending.
|
||||||
|
|
||||||
|
## 2. `manifest.json`
|
||||||
|
```jsonc
|
||||||
|
{
|
||||||
|
"session_id": "2026-06-05T14-03_zoom",
|
||||||
|
"app": "zoom",
|
||||||
|
"t0_unix": 1749135780.123,
|
||||||
|
"duration_sec": 2841.7,
|
||||||
|
"audio": {
|
||||||
|
"mixed": { "file": "mixed_mono_16k.wav", "sr": 16000, "channels": 1, "sha256": "..." },
|
||||||
|
"mic": { "file": "mic.wav", "sr": 16000, "channels": 1, "sha256": "..." },
|
||||||
|
"system": { "file": "system.wav", "sr": 16000, "channels": 1, "sha256": "..." }
|
||||||
|
},
|
||||||
|
"chunking": { "enabled": true, "chunk_sec": 150, "chunks": 19 },
|
||||||
|
"visual_timeline": "visual_timeline.json",
|
||||||
|
"backend_result": "speakers.json",
|
||||||
|
"adapter_versions": { "zoom": "0.3.1" },
|
||||||
|
"app_version": "0.1.0"
|
||||||
|
}
|
||||||
|
```
|
||||||
|
(`mixed_mono_16k.wav` is the one the backend gets; the separate tracks are kept
|
||||||
|
locally — the mic track is the user's known identity / VAD source.)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 3. SparkControl — connection (real)
|
||||||
|
|
||||||
|
- **Base URL (confirmed):** `https://your-spark-backend.local:62419` — also reachable at
|
||||||
|
`https://your-spark-backend.local:62419` (the `.local` form survives IP changes;
|
||||||
|
**prefer it as the default**). Service-discovery JSON is at
|
||||||
|
`GET /api/endpoints` (returns current vLLM / Parakeet / Kokoro URLs). All audio
|
||||||
|
endpoints in §4–§5 hang off this base. Still **make it a setting** so the host
|
||||||
|
can change, but ship `https://your-spark-backend.local:62419` as the default.
|
||||||
|
- **TLS:** Start9 self-signed Root CA. Either skip verification (`URLSession`
|
||||||
|
delegate trusting the cert; curl `-k`; `rejectUnauthorized:false`) **or** install
|
||||||
|
the Start9 Root CA into the trust store.
|
||||||
|
- **Auth:** **none on the LAN.** No token/key today.
|
||||||
|
- **Limits:** **200 MB/request** (`413` over); timeouts ~300 s (transcription),
|
||||||
|
~600 s (diarization). **Send audio requests SEQUENTIALLY** — concurrent audio
|
||||||
|
trips a GPU FFT race → `503 + Retry-After`.
|
||||||
|
- **Transport:** `multipart/form-data`, audio file field name **`file`** (bytes,
|
||||||
|
not base64/path).
|
||||||
|
- **All endpoints are synchronous** (no job IDs / polling).
|
||||||
|
- **Errors:** JSON `{"detail": "..."}`; `400` malformed, `413` too large, `503 +
|
||||||
|
Retry-After` transient (retry after the interval).
|
||||||
|
- **Health/discovery:** `GET /api/status`, `GET /api/endpoints`, `GET /v1/models`.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 4. ★ `POST /api/audio/label-merge` — the app's primary call
|
||||||
|
|
||||||
|
Diarize + name clusters from the visual timeline (majority temporal overlap),
|
||||||
|
with voiceprint fallback, optionally transcribed. Synchronous. **Stateless** —
|
||||||
|
the app owns the timeline and the voiceprint library.
|
||||||
|
|
||||||
|
**Multipart fields:**
|
||||||
|
| field | required | notes |
|
||||||
|
|---|---|---|
|
||||||
|
| `file` | **yes** | mixed-mono WAV (the chunk, when chunking) |
|
||||||
|
| `timeline` | **yes** | flat JSON array `[{"start","end","name","confidence"}]`, chunk-local seconds (§1.1) |
|
||||||
|
| `known_voiceprints` | no | JSON `{"<name>":[192 floats], ...}` from `VoiceprintStore` |
|
||||||
|
| `transcribe` | no | `"true"` to also return per-segment text (default false) |
|
||||||
|
| `min_overlap` | no | min fraction of a cluster's time overlapping the winning name (default `0.0`) |
|
||||||
|
| `voiceprint_threshold` | no | cosine sim to accept a voiceprint match (default `0.5`) |
|
||||||
|
|
||||||
|
```bash
|
||||||
|
curl -k -X POST https://<host>/api/audio/label-merge \
|
||||||
|
-F "file=@chunk_003.wav" \
|
||||||
|
-F 'timeline=[{"start":0,"end":4.5,"name":"Grant"},{"start":4.8,"end":9.3,"name":"Sarah Jones"}]' \
|
||||||
|
-F 'known_voiceprints={"Grant":[/*192*/],"Sarah Jones":[/*192*/]}' \
|
||||||
|
-F "transcribe=true"
|
||||||
|
```
|
||||||
|
|
||||||
|
**Response (`transcribe=true`):**
|
||||||
|
```jsonc
|
||||||
|
{
|
||||||
|
"duration": 9.259,
|
||||||
|
"speakers": [
|
||||||
|
{ "cluster": "Speaker_0", "name": "Grant", "source": "visual",
|
||||||
|
"overlap_confidence": 0.9866, "fingerprint": [/*192 floats*/] },
|
||||||
|
{ "cluster": "Speaker_1", "name": "Sarah Jones", "source": "voiceprint",
|
||||||
|
"match_similarity": 0.71, "fingerprint": [/*192 floats*/] }
|
||||||
|
],
|
||||||
|
"segments": [
|
||||||
|
{ "start_ms": 0, "end_ms": 4480, "speaker": "Grant",
|
||||||
|
"text": "Good morning everyone. I think the energy thesis is strong this quarter." },
|
||||||
|
{ "start_ms": 4800, "end_ms": 9040, "speaker": "Sarah Jones",
|
||||||
|
"text": "I agree, but I am worried about the lockup terms and the fee load this time." }
|
||||||
|
],
|
||||||
|
"fingerprints": { "Grant": [/*192*/], "Sarah Jones": [/*192*/] },
|
||||||
|
"models": { "diarization": "nvidia/diar_sortformer_4spk-v1",
|
||||||
|
"embedding": "nvidia/speakerverification_en_titanet_large" }
|
||||||
|
}
|
||||||
|
```
|
||||||
|
- `transcribe=false` → segments are `{start_s, end_s, speaker, confidence}` (no text).
|
||||||
|
- **Resolution order per cluster:** `visual` → `voiceprint` (with
|
||||||
|
`match_similarity`) → `Unknown_N` (`source:"unmatched"`, never mislabeled).
|
||||||
|
- **Persist `fingerprints`** keyed by name into `VoiceprintStore` (skip `Unknown_N`).
|
||||||
|
- **Diarizer caps at 4 speakers/chunk** and takes **no `num_speakers` hint** — chunk
|
||||||
|
for larger/longer calls; names + voiceprints unify speakers across chunks.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 5. Other SparkControl endpoints (reference / fallback)
|
||||||
|
|
||||||
|
### 5.1 `POST /v1/audio/transcriptions` (OpenAI-compatible, sync)
|
||||||
|
Plain STT. Fields: `file` (req), `model` (default `parakeet-tdt-0.6b-v3`),
|
||||||
|
`response_format` (`json` | `verbose_json` | `text`), `language`, `temperature`,
|
||||||
|
`prompt`. `verbose_json` returns word- **and** segment-level timestamps + full
|
||||||
|
`text`. Use only if you ever want transcript without speaker labels.
|
||||||
|
|
||||||
|
### 5.2 `POST /api/audio/diarize-chunk` (sync)
|
||||||
|
Field: `file`. Returns anonymous clusters `{start_s, end_s, speaker, confidence}`,
|
||||||
|
`speakers_detected`, and a 192-dim `fingerprints` map per local speaker. Use if
|
||||||
|
you ever want to do the merge client-side instead of via `label-merge` (not the
|
||||||
|
default path).
|
||||||
|
|
||||||
|
### 5.3 `POST /api/audio/transcribe-with-speakers` (sync)
|
||||||
|
Field: `file`. ASR + diarization merged into anonymous speaker-attributed blocks
|
||||||
|
`{start_ms, end_ms, speaker, text}`. (Anonymous — no naming. `label-merge` is the
|
||||||
|
named version.)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 6. `speakers.json` — final stored output (target end state)
|
||||||
|
|
||||||
|
Per-chunk `label-merge` results concatenated, **timestamps offset back to session
|
||||||
|
(global) seconds**, names unified across chunks. This is the hand-off to the
|
||||||
|
user's downstream summarizer; the app stops here.
|
||||||
|
|
||||||
|
```jsonc
|
||||||
|
{
|
||||||
|
"session_id": "2026-06-05T14-03_zoom",
|
||||||
|
"app": "zoom",
|
||||||
|
"duration_sec": 2841.7,
|
||||||
|
"speakers": [
|
||||||
|
{ "name": "Grant", "source": "visual", "overlap_confidence": 0.99 },
|
||||||
|
{ "name": "Sarah Jones", "source": "voiceprint", "match_similarity": 0.71 },
|
||||||
|
{ "name": "Unknown_0", "source": "unmatched" }
|
||||||
|
],
|
||||||
|
"segments": [
|
||||||
|
{ "start": 0.0, "end": 4.48, "speaker": "Grant", "text": "..." },
|
||||||
|
{ "start": 4.80, "end": 9.04, "speaker": "Sarah Jones", "text": "..." },
|
||||||
|
{ "start": 152.3, "end": 158.1, "speaker": "Unknown_0", "text": "..." }
|
||||||
|
],
|
||||||
|
"models": { "diarization": "nvidia/diar_sortformer_4spk-v1",
|
||||||
|
"embedding": "nvidia/speakerverification_en_titanet_large",
|
||||||
|
"transcription": "parakeet-tdt-0.6b-v3" }
|
||||||
|
}
|
||||||
|
```
|
||||||
|
(Convert backend `start_ms`/`end_ms` → seconds; add the chunk offset.)
|
||||||
|
|
||||||
|
## 7. `voiceprints.json` — local voiceprint library
|
||||||
|
```jsonc
|
||||||
|
{
|
||||||
|
"Grant": { "vector": [/*192 floats*/], "updated": "2026-06-05T14:51Z", "calls": 12 },
|
||||||
|
"Sarah Jones": { "vector": [/*192 floats*/], "updated": "2026-06-02T09:10Z", "calls": 3 }
|
||||||
|
}
|
||||||
|
```
|
||||||
|
Loaded → `known_voiceprints` on every `label-merge` call. Updated from response
|
||||||
|
`fingerprints` for `visual`/high-confidence `voiceprint` speakers only. Never
|
||||||
|
stores `Unknown_N`. Update policy (`02 §2.9`): start = store latest with
|
||||||
|
`overlap_confidence ≥ ~0.8`; consider per-name running mean later.
|
||||||
@@ -0,0 +1,123 @@
|
|||||||
|
# Build Plan — Ten31 Transcripts
|
||||||
|
|
||||||
|
Companion to docs 01–03. Phased plan for the Claude Code session, each phase with
|
||||||
|
a demoable milestone. Build in order; the risky/novel work (visual adapters) is
|
||||||
|
isolated for independent tuning. The SparkControl contract is now known
|
||||||
|
(`03_DATA_CONTRACTS.md`), so Phase 5 wires the real endpoints.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Ground rules
|
||||||
|
- **Native Swift + SwiftUI**, menu-bar app (`LSUIElement = true`). macOS 13.0+
|
||||||
|
(ScreenCaptureKit system audio).
|
||||||
|
- **Never write video frames to disk.** Process in-memory, release immediately;
|
||||||
|
assert this in `VisualObserver` review.
|
||||||
|
- **One shared monotonic `t0`** for audio + visual timestamps — wire first.
|
||||||
|
- **Backend deliverable = `mixed_mono_16k.wav`.** Keep `mic.wav`/`system.wav`
|
||||||
|
locally (mic = the user, and the VAD source for self-labeling).
|
||||||
|
- **SparkControl calls are SEQUENTIAL** (one audio request in flight) over
|
||||||
|
self-signed TLS, **no auth**, primary endpoint `POST /api/audio/label-merge`.
|
||||||
|
- Every adapter must be testable offline against image fixtures.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Phase 0 — Scaffold, permissions, backend ping (milestone: launches + green)
|
||||||
|
- Menu-bar shell; permissions onboarding (Screen Recording, Microphone,
|
||||||
|
Accessibility) with status + deep links.
|
||||||
|
- Settings: output folder, adapter toggles, **SparkControl base URL** + TLS-skip
|
||||||
|
toggle.
|
||||||
|
- `GET /api/status` health check shown in the menu.
|
||||||
|
- **Milestone:** app sits in the menu bar; permissions green; backend reachable.
|
||||||
|
|
||||||
|
## Phase 1 — Audio capture + mix (milestone: clean mixed-mono WAV)
|
||||||
|
- `AudioRecorder`: SCK system audio → `system.wav`; AVAudioEngine mic → `mic.wav`;
|
||||||
|
shared `t0`; produce **`mixed_mono_16k.wav`**.
|
||||||
|
- Mic VAD → user "self" speaking spans (held for the timeline).
|
||||||
|
- Manual start/stop for now.
|
||||||
|
- **Milestone:** record any call manually → tracks line up, mixed-mono plays back
|
||||||
|
clean, mic-VAD spans look right.
|
||||||
|
|
||||||
|
## Phase 2 — Call detection (milestone: hands-free start/stop)
|
||||||
|
- `CallDetector`: CoreAudio mic-running listener + `NSWorkspace` app/frontmost +
|
||||||
|
Meet active-tab-URL check; debounce.
|
||||||
|
- Wire detector → `SessionController` → `AudioRecorder`.
|
||||||
|
- **Milestone:** joining a call in any of the four apps auto-starts; leaving
|
||||||
|
auto-stops; a quick mic test does not false-trigger.
|
||||||
|
|
||||||
|
## Phase 3 — VisualObserver + Zoom adapter (milestone: visual_timeline.json)
|
||||||
|
- `VisualObserver`: window-scoped `SCStream` @3 fps → frames → adapter → release.
|
||||||
|
Background/occluded windows capture fine; detect **minimized** (and, for Meet,
|
||||||
|
**tab-switched**) → pause analysis + record a `visual_gap`, never emit stale
|
||||||
|
frames. Audio keeps recording through gaps.
|
||||||
|
- `AppAdapter` protocol + `ZoomAdapter` (tile-border cue + OCR names; speaker &
|
||||||
|
gallery layouts).
|
||||||
|
- `TimelineBuilder`: hysteresis, overlap-allowed, merge mic-VAD self spans, alias
|
||||||
|
normalization → `visual_timeline.json` (+ the flat `timeline` array form).
|
||||||
|
- **Offline fixtures:** ~30 sample Zoom frames; unit-test/tune thresholds without
|
||||||
|
a live call.
|
||||||
|
- **Milestone:** a real Zoom call yields a `visual_timeline.json` whose segments
|
||||||
|
visibly match who spoke (incl. the user from mic VAD).
|
||||||
|
|
||||||
|
## Phase 4 — Remaining adapters (milestone: all four produce timelines)
|
||||||
|
- `TeamsAdapter`, `SignalAdapter` (a11y names first), `MeetAdapter` (hybrid: a11y
|
||||||
|
names + Vision cue). Fixtures per app; per-adapter fps/threshold config.
|
||||||
|
- **Milestone:** each app yields a sensible timeline; adapters independently
|
||||||
|
toggleable.
|
||||||
|
|
||||||
|
## Phase 5 — SparkControl hand-off via label-merge (milestone: end-to-end named transcript)
|
||||||
|
- `SessionPackager`: write session folder; if call > ~3 min, build the chunk plan
|
||||||
|
(~2–3 min windows) with **timeline sliced + rebased to chunk-local seconds**.
|
||||||
|
- `SparkControlClient`:
|
||||||
|
- Self-signed TLS handling; multipart `file`; **sequential** requests.
|
||||||
|
- Per chunk → `POST /api/audio/label-merge` with `file`, chunk `timeline`,
|
||||||
|
`known_voiceprints` (from store), `transcribe=true`.
|
||||||
|
- Handle `503 + Retry-After` (wait, retry), `413`, `{"detail":...}` errors.
|
||||||
|
- Offset returned `start_ms/end_ms` back to global seconds; concatenate.
|
||||||
|
- Assemble `speakers.json` (named, global, speaker-attributed transcript).
|
||||||
|
- Failed sessions stay on disk with a "Resend" affordance.
|
||||||
|
- **Milestone:** finish a call → `mixed_mono_16k.wav` + timeline delivered → a
|
||||||
|
`speakers.json` with real names + accurate boundaries lands locally. Test a
|
||||||
|
>3-min call to exercise chunking + cross-chunk name unification.
|
||||||
|
|
||||||
|
## Phase 6 — Voiceprint library (milestone: identity that compounds)
|
||||||
|
- `VoiceprintStore` (`voiceprints.json`): load → `known_voiceprints`; on response,
|
||||||
|
store/refresh `fingerprint` for `visual`/high-confidence speakers; skip
|
||||||
|
`Unknown_N`.
|
||||||
|
- Verify the three resolution paths against the backend: (a) **visual** name wins;
|
||||||
|
(b) **voiceprint** recovery for a camera-off / OCR-missed speaker; (c)
|
||||||
|
**`Unknown_N`** when neither matches.
|
||||||
|
- Voiceprint manager in the UI (rename, delete, reset).
|
||||||
|
- **Milestone:** a regular contact gets named on a later call **with their camera
|
||||||
|
off**, purely from a stored voiceprint.
|
||||||
|
|
||||||
|
## Phase 7 — Polish (milestone: daily-driver quality)
|
||||||
|
- Recent-sessions list (open/resend/delete); alias-table editor.
|
||||||
|
- Optional on-screen "recording" indicator toggle.
|
||||||
|
- Adapter health: warn when detection confidence drops sharply (catches UI
|
||||||
|
redesigns early); keep fixtures for quick re-tuning.
|
||||||
|
- Login-item / launch-at-login.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Risk register
|
||||||
|
1. **UI fragility (highest ongoing cost):** apps redesign speaking cues / name
|
||||||
|
placement. Isolate all pixel/color/label logic in adapters; Phase-7
|
||||||
|
confidence-drop warning + fixtures keep re-tuning fast.
|
||||||
|
2. **Active-speaker lag / crosstalk:** UI cue trails speech and flickers. Mitigated
|
||||||
|
by treating visual as a *prior* (the backend votes), hysteresis, and the
|
||||||
|
voiceprint fallback.
|
||||||
|
3. **Meet-in-browser:** canvas video + DOM names; the hybrid adapter needs the most
|
||||||
|
iteration — budget extra time.
|
||||||
|
4. **4-speaker-per-chunk cap (Sortformer):** within a single chunk, >4 simultaneous
|
||||||
|
speakers can't all be separated. Chunking + voiceprints + the visual timeline
|
||||||
|
mitigate across chunks; for rare large calls, note it as a known limit.
|
||||||
|
5. **Sequential GPU constraint:** never fire parallel audio requests; queue them.
|
||||||
|
6. **Permission re-prompts on rebuild:** stable signing identity.
|
||||||
|
|
||||||
|
## Definition of done (v1)
|
||||||
|
Join a call in any of the four apps → app auto-records dual-track audio, mixes to
|
||||||
|
mono, and builds a visual speaker timeline (incl. mic-VAD self spans) → on
|
||||||
|
hang-up the audio + timeline go to `label-merge` (chunked + sequential as needed)
|
||||||
|
→ a `speakers.json` with **real names and accurate boundaries** is produced, the
|
||||||
|
**voiceprint library is updated**, and **no video is ever written to disk** and
|
||||||
|
**no transcription/diarization/merge logic lives in this app.**
|
||||||
+49
@@ -0,0 +1,49 @@
|
|||||||
|
name: Ten31Transcripts
|
||||||
|
|
||||||
|
options:
|
||||||
|
bundleIdPrefix: xyz.ten31
|
||||||
|
deploymentTarget:
|
||||||
|
macOS: "13.0"
|
||||||
|
createIntermediateGroups: true
|
||||||
|
groupSortPosition: top
|
||||||
|
|
||||||
|
settings:
|
||||||
|
base:
|
||||||
|
MARKETING_VERSION: "0.1.0"
|
||||||
|
CURRENT_PROJECT_VERSION: "1"
|
||||||
|
SWIFT_VERSION: "5.0"
|
||||||
|
CODE_SIGN_STYLE: Automatic
|
||||||
|
# Leave the team empty; pick your free personal team in Xcode's
|
||||||
|
# Signing & Capabilities tab on first open (see README).
|
||||||
|
DEVELOPMENT_TEAM: ""
|
||||||
|
|
||||||
|
targets:
|
||||||
|
Ten31Transcripts:
|
||||||
|
type: application
|
||||||
|
platform: macOS
|
||||||
|
sources:
|
||||||
|
- path: Ten31Transcripts
|
||||||
|
settings:
|
||||||
|
base:
|
||||||
|
PRODUCT_NAME: Ten31Transcripts
|
||||||
|
PRODUCT_BUNDLE_IDENTIFIER: xyz.ten31.transcripts
|
||||||
|
INFOPLIST_FILE: Ten31Transcripts/Support/Info.plist
|
||||||
|
CODE_SIGN_ENTITLEMENTS: Ten31Transcripts/Support/Ten31Transcripts.entitlements
|
||||||
|
GENERATE_INFOPLIST_FILE: NO
|
||||||
|
ENABLE_HARDENED_RUNTIME: NO
|
||||||
|
MACOSX_DEPLOYMENT_TARGET: "13.0"
|
||||||
|
scheme:
|
||||||
|
testTargets:
|
||||||
|
- Ten31TranscriptsTests
|
||||||
|
|
||||||
|
Ten31TranscriptsTests:
|
||||||
|
type: bundle.unit-test
|
||||||
|
platform: macOS
|
||||||
|
sources:
|
||||||
|
- path: Ten31TranscriptsTests
|
||||||
|
settings:
|
||||||
|
base:
|
||||||
|
PRODUCT_BUNDLE_IDENTIFIER: xyz.ten31.transcripts.tests
|
||||||
|
GENERATE_INFOPLIST_FILE: YES
|
||||||
|
dependencies:
|
||||||
|
- target: Ten31Transcripts
|
||||||
Reference in New Issue
Block a user