Investigating the Sensitivity of Automatic Speech Recognition Systems to Phonetic Variation in L2 Englishes
Emma O'Neill, Julie Carson-Berndsen

TL;DR
This paper investigates how automatic speech recognition systems respond to phonetic variations in non-native English speech, revealing systematic errors and suggesting ways to improve robustness through targeted training.
Contribution
It introduces a method to probe ASR systems for phonetic variation sensitivity and demonstrates systematic error patterns linked to specific L1 backgrounds.
Findings
ASR errors are systematic across speakers with similar L1 backgrounds.
Phoneme substitution errors often align with those identified by human annotators.
Identifying problematic pronunciations can guide targeted system improvements.
Abstract
Automatic Speech Recognition (ASR) systems perform best on speech that is similar to the speech on which they were trained. As such, underrepresented varieties, including regional dialects, minority speakers, and low-resource languages, see much higher word error rates (WERs) than those varieties seen as 'prestigious', 'mainstream', or 'standard'. This can act as a barrier to incorporating ASR technology into the annotation process for large-scale linguistic research, since manually correcting erroneous automated transcripts can be just as time- and resource-consuming as manual transcription. A deeper understanding of the behaviour of an ASR system is thus beneficial from a speech technology standpoint, in terms of improving ASR accuracy, and from an annotation standpoint, where knowing the likely errors made by an ASR system can aid in this manual correction. This work…
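For readers unfamiliar with the WER metric mentioned in the abstract, the following is a minimal sketch of the standard definition: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and an ASR hypothesis, divided by the number of reference words. The function name `word_error_rate` and the example sentences are illustrative assumptions and are not taken from the paper, whose own evaluation setup may differ (for example in text normalisation).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level Levenshtein distance / number of reference words.

    Hypothetical helper for illustration; tokenisation is plain whitespace
    splitting, with no case folding or punctuation handling.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # aligning i reference words with an empty hypothesis costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Made-up example: a single substituted word in a three-word reference.
    print(word_error_rate("think about it", "sink about it"))
```

Because WER counts whole-word mismatches, a single phoneme-level substitution that changes the recognised word contributes a full word error, which is one reason a phonetic-level analysis can be more informative than WER alone.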
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and Dialogue Systems
