39beccf7f4
Root cause of the "4 people → 2 speakers" Meet call: the colored-border detector
read solid camera-off avatar tiles (orange "J", magenta "G") as active speakers
for the ENTIRE call. Those whole-call phantom spans dominated backend name
attribution, collapsing every remote voice onto one name — and the giant filled
bbox also swallowed screen-share text (WERUNBTC.COM ×49) as a speaker.
Validated against 9 real fixtures (harness over the real MeetAdapter):

Detection:
- FrameSampler.thinColoredPoints: coloured counterpart of thinWhitePoints — keeps
thin border/ring/pill edges, drops solid colour fills.
- GridCallAnalyzer.isHollow: rejects a highlight component whose interior is filled
(a solid tile) and accepts only a hollow ring (a real border). The cutoff is
Config.maxInteriorFill (0.2 default).
- MeetAdapter: detect thin BLUE edges only (hue 180–240°, measured from the
fixtures), maxInteriorFill 0.3 (real Meet rings ≈0.2–0.3, solid tiles ≈0.36).
- Result on fixtures: John Arnold/Grant Gilliam (solid tiles) now NEVER detected;
Matt Odell/Mark detected when their blue cue is present. Sparse but never wrong —
correct for a naming hint over audio diarization.
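
The hollow-ring check described above can be sketched as follows. This is a minimal illustration, not the code in GridCallAnalyzer: the `ComponentMask` type, the `inset` parameter, and the way interior fill is measured (share of highlighted pixels inside the inset bounding box) are all assumptions, since the commit does not show how `isHollow` computes its fill ratio. The fixture values quoted above (≈0.2–0.3 vs ≈0.36) come from the real measurement, not from this sketch.

```swift
/// Hypothetical stand-in for a detected highlight component: a row-major
/// boolean mask over its bounding box (true = pixel matched the highlight).
struct ComponentMask {
    let width: Int
    let height: Int
    let pixels: [Bool]

    /// Fraction of matching pixels inside the bounding box after insetting it
    /// by `inset` pixels on every side. Returns 0 when no interior remains.
    func interiorFill(inset: Int) -> Double {
        let x0 = inset, y0 = inset
        let x1 = width - inset, y1 = height - inset
        guard x1 > x0, y1 > y0 else { return 0 }
        var hits = 0
        for y in y0..<y1 {
            for x in x0..<x1 where pixels[y * width + x] {
                hits += 1
            }
        }
        return Double(hits) / Double((x1 - x0) * (y1 - y0))
    }
}

/// A component counts as a hollow ring when its interior is mostly empty.
func isHollow(_ mask: ComponentMask, inset: Int, maxInteriorFill: Double) -> Bool {
    mask.interiorFill(inset: inset) <= maxInteriorFill
}

/// Test helper: builds a mask from a per-pixel predicate.
func makeMask(width: Int, height: Int, filled: (Int, Int) -> Bool) -> ComponentMask {
    var pixels = [Bool](repeating: false, count: width * height)
    for y in 0..<height {
        for x in 0..<width {
            pixels[y * width + x] = filled(x, y)
        }
    }
    return ComponentMask(width: width, height: height, pixels: pixels)
}

// A 2-pixel border on a 20x20 box (a speaking ring) vs a fully filled tile.
let ring = makeMask(width: 20, height: 20) { x, y in x < 2 || y < 2 || x >= 18 || y >= 18 }
let solid = makeMask(width: 20, height: 20) { _, _ in true }
```

With a 2-pixel inset the synthetic ring has an empty interior and passes at the 0.2 default, while the solid tile has an interior fill of 1.0 and is rejected.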
OCR name hygiene:
- isLikelyName rejects domain-like screen-share text ("WERUNBTC.COM", OCR'd ".GOM").
- cleaned() strips trailing punctuation ("Mark." → "Mark").
- TimelineBuilder.canonicalizeByFrequency folds rare OCR misspellings into a
dominant near-twin name ("Matt Odel"/"MattOdell" → "Matt Odell", "Mare" → "Mark").
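
A rough sketch of the three hygiene steps above, under stated assumptions. The function names mirror the ones in this commit, but the bodies are illustrative guesses: the domain heuristic (no spaces, a dot followed by 2 to 4 letters), the edit-distance limit of 1, and the 3x dominance ratio are all hypothetical, and the sample counts are invented rather than taken from the fixtures.

```swift
/// Strips trailing punctuation and whitespace ("Mark." -> "Mark").
func cleaned(_ raw: String) -> String {
    var s = raw
    while let last = s.last, last.isPunctuation || last.isWhitespace {
        s.removeLast()
    }
    return s
}

/// Assumed heuristic: a single token ending in ".<2-4 letters>" looks like a domain.
func looksLikeDomain(_ text: String) -> Bool {
    guard !text.contains(" "), let dot = text.lastIndex(of: ".") else { return false }
    let suffix = text[text.index(after: dot)...]
    return dot != text.startIndex
        && (2...4).contains(suffix.count)
        && suffix.allSatisfy({ $0.isLetter })
}

/// Only the domain rule is shown; the real filter presumably has more rules.
func isLikelyName(_ text: String) -> Bool {
    !text.isEmpty && !looksLikeDomain(text)
}

/// Case-insensitive Levenshtein distance (two-row dynamic programming).
func editDistance(_ a: String, _ b: String) -> Int {
    let s = Array(a.lowercased()), t = Array(b.lowercased())
    if s.isEmpty { return t.count }
    if t.isEmpty { return s.count }
    var prev = Array(0...t.count)
    for i in 1...s.count {
        var curr = [i] + [Int](repeating: 0, count: t.count)
        for j in 1...t.count {
            let cost = s[i - 1] == t[j - 1] ? 0 : 1
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
        }
        prev = curr
    }
    return prev[t.count]
}

/// Maps each name to its most frequent near-twin, if one clearly dominates it.
func canonicalizeByFrequency(_ counts: [String: Int],
                             maxDistance: Int = 1,
                             dominance: Int = 3) -> [String: String] {
    var mapping: [String: String] = [:]
    for (name, count) in counts {
        let target = counts
            .filter { $0.key != name
                && $0.value >= dominance * count
                && editDistance($0.key, name) <= maxDistance }
            .max { $0.value < $1.value }
        mapping[name] = target?.key ?? name
    }
    return mapping
}

// Invented observation counts, for illustration only.
let sampleCounts = ["Matt Odell": 40, "Matt Odel": 2, "MattOdell": 1, "Mark": 30, "Mare": 1]
let canonical = canonicalizeByFrequency(sampleCounts)
```

Requiring both a small edit distance and a clear frequency gap keeps two genuinely different participants with similar names from being merged when both appear often.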
Tests: hollow-ring, extended OCR filter, fuzzy-merge. 65 pass.
59 lines
2.8 KiB
Swift
import Foundation
import CoreVideo

/// Google Meet adapter (browser tab; capture is at the browser-window level).
///
/// Meet's active-speaker cue is a **coloured (Google-blue) ring/glow** around the
/// speaking participant's tile, plus animated audio bars in the tile's mic chip.
/// The participant **name sits in the tile's bottom-LEFT corner**, so the tile is
/// estimated extending up and to the right of the name.
///
/// Detection *logic* is validated on synthetic frames; the geometry constants are a
/// first pass and will be calibrated against real Meet screenshots. Meet runs in a
/// browser, so there's no Accessibility name source we rely on: OCR only.
struct MeetAdapter: AppAdapter {
    // Browsers that can host a Meet tab. The window, not the app, is what we capture;
    // CallDetector decides a browser window is a Meet call by its title.
    static let bundleIDs = [
        "com.google.Chrome", "org.mozilla.firefox", "com.apple.Safari",
        "company.thebrowser.Browser", "com.brave.Browser", "com.microsoft.edgemac",
        "com.google.Chrome.canary", "org.chromium.Chromium",
    ]
    let adapterVersion = "meet-0.1.0"
    let preferredFPS = 3

    private let analyzer: GridCallAnalyzer

    init() {
        var config = GridCallAnalyzer.Config()
        config.nameAnchor = .bottomLeft
        config.detectColoredBorder = true   // Google-blue speaking ring/glow
        config.detectWhiteBorder = false
        // The bright ring (#1a73e8) is ~0.89 sat but the lighter glow (#8ab4f8) is
        // ~0.44, below the 0.5 default; lower the threshold so the glow registers.
        config.colorSaturation = 0.35
        // Meet's active cue is a thin BLUE (≈210°) ring + audio pill. Detect thin blue
        // EDGES only, gated to blue: this rejects solid camera-off avatar tiles (orange
        // ≈30°, magenta ≈340°), which otherwise read as "speaking" for the whole call
        // and collapse every remote voice onto one name. Validated on real fixtures.
        config.coloredBorderThinOnly = true
        config.colorHueRange = 180...240
        // Meet's blue border is faint; real rings measure ≈0.20–0.30 interior fill while
        // solid tiles measure ≈0.36, so allow a higher fill here than the 0.2 default to
        // recover real borders without readmitting the solid-tile false positives.
        config.maxInteriorFill = 0.3
        config.tileExpandX = 3.0
        config.tileExpandY = 5.0
        self.analyzer = GridCallAnalyzer(config: config)
    }

    func analyze(frame: CVPixelBuffer, at t: TimeInterval) -> [SpeakerObservation] {
        analyzer.analyze(pixelBuffer: frame, at: t)
    }

    // Exposed for fixture/synthetic tests.
    func analyze(cgImage: CGImage, at t: TimeInterval) -> [SpeakerObservation] {
        analyzer.analyze(cgImage: cgImage, at: t)
    }
}
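
The saturation and hue figures quoted in the `init()` comments can be checked with a standard RGB-to-HSV conversion. The sketch below is a standalone worked example, not part of the adapter: `hueAndSaturation(hex:)` and `passesBlueGate` are hypothetical helpers, the `>=` comparison against the saturation threshold is an assumption about how `colorSaturation` is applied, and the orange value `0xF57C00` is an arbitrary stand-in for a camera-off avatar tile.

```swift
/// Standard RGB -> HSV hue (degrees, 0..<360) and saturation (0...1) for a 0xRRGGBB value.
func hueAndSaturation(hex: UInt32) -> (hue: Double, saturation: Double) {
    let r = Double((hex >> 16) & 0xFF) / 255
    let g = Double((hex >> 8) & 0xFF) / 255
    let b = Double(hex & 0xFF) / 255
    let maxC = max(r, g, b), minC = min(r, g, b)
    let delta = maxC - minC
    guard delta > 0 else { return (0, 0) }   // grey: hue undefined, saturation 0
    var hue: Double
    if maxC == r {
        hue = 60 * ((g - b) / delta)
    } else if maxC == g {
        hue = 60 * ((b - r) / delta + 2)
    } else {
        hue = 60 * ((r - g) / delta + 4)
    }
    if hue < 0 { hue += 360 }
    return (hue, delta / maxC)
}

/// Assumed form of the gate: hue inside the configured range and saturation
/// at or above the configured threshold.
func passesBlueGate(hex: UInt32,
                    hueRange: ClosedRange<Double> = 180...240,
                    minSaturation: Double = 0.35) -> Bool {
    let (hue, saturation) = hueAndSaturation(hex: hex)
    return hueRange.contains(hue) && saturation >= minSaturation
}

let ring = hueAndSaturation(hex: 0x1A73E8)    // bright ring: hue ≈ 214°, sat ≈ 0.89
let glow = hueAndSaturation(hex: 0x8AB4F8)    // lighter glow: hue ≈ 217°, sat ≈ 0.44
let orange = hueAndSaturation(hex: 0xF57C00)  // hypothetical avatar tile: hue ≈ 30°
```

Both Meet blues fall inside 180...240, and the glow's saturation of about 0.44 clears the 0.35 threshold but would miss the 0.5 default, which matches the reasoning in the comments. The orange sample is fully saturated yet fails on hue alone.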