"Sorry, I Didn't Catch That": How Speech Models Miss What Matters Most
Kaitlyn Zhou, Martijn Bartelds, Federico Bianchi, James Zou

TL;DR
This paper investigates the failure of speech recognition systems in transcribing high-stakes, real-world utterances like street names, revealing significant error rates and proposing a synthetic data augmentation method to improve accuracy for diverse speakers.
Contribution
The study identifies a critical gap between benchmark performance and real-world reliability and introduces a synthetic data generation approach that improves street name transcription accuracy for non-English primary speakers.
Findings
Average transcription error rate of 44% on street names.
Synthetic data augmentation improves accuracy by nearly 60% for non-English primary speakers.
Routing distance errors are twice as large for non-English primary speakers as for English primary speakers.
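
The two quantities in the findings above can be illustrated with a minimal sketch. It rests on assumptions not confirmed by this page: a transcription counts as an error when it fails an exact match after simple normalization, and routing distance error is approximated by great-circle (haversine) distance between the intended and the transcribed location. The paper's actual scoring rules and distance measure may differ, and the function names and street names below are hypothetical.

```python
import math
import re


def normalize(name: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace (assumed normalization)."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", name.lower())
    return " ".join(cleaned.split())


def error_rate(references, hypotheses):
    """Fraction of utterances whose normalized transcription differs from the reference."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    if not references:
        raise ValueError("at least one utterance is required")
    misses = sum(normalize(r) != normalize(h) for r, h in zip(references, hypotheses))
    return misses / len(references)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres, used here as a stand-in for routing distance."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


# Illustrative, made-up data: one correct and one incorrect transcription.
refs = ["Main St.", "Geary Blvd"]
hyps = ["main st", "Gary Blvd"]
rate = error_rate(refs, hyps)
```

Under these assumptions, averaging `haversine_km` separately over each speaker group's mis-transcribed utterances would give the kind of group-level comparison the findings describe.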
Abstract
Despite speech recognition systems achieving low word error rates on standard benchmarks, they often fail on short, high-stakes utterances in real-world deployments. Here, we study this failure mode in a high-stakes task: the transcription of U.S. street names as spoken by U.S. participants. We evaluate 15 models from OpenAI, Deepgram, Google, and Microsoft on recordings from linguistically diverse U.S. speakers and find an average transcription error rate of 44%. We quantify the downstream impact of failed transcriptions by geographic locations and show that mis-transcriptions systematically cause errors for all speakers, but that routing distance errors are twice as large for non-English primary speakers compared to English primary speakers. To mitigate this harm, we introduce a synthetic data generation approach that produces diverse pronunciations of named entities using open-source…
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems
