From Black Box to Glass Box: Cross-Model ASR Disagreement to Prioritize Review in Ambient AI Scribe Documentation
Abdolamir Karbalaie, Fernando Seoane, Farhad Abtahi

TL;DR
This study explores using disagreement among multiple ASR systems to identify unreliable segments in medical transcription, aiming to facilitate targeted human review without needing reference transcripts.
Contribution
It demonstrates that cross-model disagreement can serve as an effective, reference-free uncertainty signal to prioritize review in clinical ASR workflows.
Findings
72.1% of tokens showed near-unanimous agreement among models.
High-risk disagreement regions ranged from 0.7% to 11.4% across accent groups.
Content disagreements increased in low-agreement regions, indicating potential unreliability.
Abstract
Ambient AI "scribe" systems promise to reduce clinical documentation burden, but automatic speech recognition (ASR) errors can remain unnoticed without careful review, and high-quality human reference transcripts are often unavailable for calibrating uncertainty. We investigate whether cross-model disagreement among heterogeneous ASR systems can act as a reference-free uncertainty signal to prioritize human verification in medical transcription workflows. Using 50 publicly available medical education audio clips (8 h 14 min), we transcribed each clip with eight ASR systems spanning commercial APIs and open-source engines. We aligned multi-model outputs, built consensus pseudo-references, and quantified token-level agreement using a majority-strength metric; we further characterized disagreements by type (content vs. punctuation/formatting) and assessed per-model agreement via…
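To make the idea of token-level agreement concrete, here is a minimal Python sketch, not the authors' implementation. It assumes the transcripts have already been aligned into columns (one token per model per position) and that "majority strength" means the share of models that emitted the most common token at a position; the abstract does not spell out the exact definition. The function names, the "-" gap marker, and the 0.75 threshold are all hypothetical choices for illustration.

```python
from collections import Counter

GAP = "-"  # hypothetical marker for "this model emitted nothing here"

def majority_strength(column):
    """Return (most common token, share of models that emitted it)
    for one aligned position. Assumed definition, see lead-in."""
    counts = Counter(column)
    token, votes = counts.most_common(1)[0]
    return token, votes / len(column)

def flag_low_agreement(columns, threshold=0.75):
    """Return indices of aligned positions whose majority strength
    falls below the (assumed) review threshold."""
    flagged = []
    for i, column in enumerate(columns):
        _, strength = majority_strength(column)
        if strength < threshold:
            flagged.append(i)
    return flagged

# Toy example: eight models, four aligned positions (invented data).
columns = [
    ["patient"] * 8,
    ["has"] * 8,
    ["hypertension"] * 5 + ["hypotension"] * 3,  # content disagreement
    ["today"] * 7 + [GAP],                       # one model dropped the word
]

consensus = [majority_strength(c)[0] for c in columns]
to_review = flag_low_agreement(columns)
```

In this toy input only the third position (index 2) falls below the threshold, so a reviewer would be pointed at the "hypertension" vs. "hypotension" split while the unanimous and near-unanimous positions are skipped. No reference transcript is needed at any step; the consensus list plays the role of a pseudo-reference.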
