ASR Benchmarking: Need for a More Representative Conversational Dataset
Gaurav Maheshwari, Dmitry Ivanov, Théo Johannet, Kevin El Haddad

TL;DR
This paper highlights the inadequacy of current ASR benchmarks for real-world conversations and introduces a new multilingual dataset to better evaluate ASR performance in realistic, disfluent speech scenarios.
Contribution
The study presents a new multilingual conversational dataset from TalkBank, emphasizing the need for more representative benchmarks for ASR systems.
Findings
Significant performance drops of ASR models in conversational settings
Correlation between disfluencies and increased Word Error Rate
Current benchmarks do not reflect real-world conversational complexities
Abstract
Automatic Speech Recognition (ASR) systems have achieved remarkable performance on widely used benchmarks such as LibriSpeech and Fleurs. However, these benchmarks do not adequately reflect the complexities of real-world conversational environments, where speech is often unstructured and contains disfluencies such as pauses, interruptions, and diverse accents. In this study, we introduce a multilingual conversational dataset, derived from TalkBank, consisting of unstructured phone conversations between adults. Our results show a significant performance drop across various state-of-the-art ASR models when tested in conversational settings. Furthermore, we observe a correlation between Word Error Rate and the presence of speech disfluencies, highlighting the critical need for more realistic, conversational ASR benchmarks.
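For readers unfamiliar with the metric, Word Error Rate is conventionally defined as the number of word-level substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the authors' evaluation code; the function name `word_error_rate` is hypothetical, and it assumes plain whitespace tokenization with no text normalization, which may differ from what the paper uses.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: tokens are split on whitespace and compared exactly
    (no casing or punctuation normalization).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# A filler word in the transcript plus one dropped word gives 2 errors
# over 6 reference words.
wer = word_error_rate("the cat sat on the mat", "the cat um sat on mat")
```

Because insertions count as errors, a transcript that differs from the reference in how it handles fillers or repetitions is penalized even when every content word is recognized correctly, which is one way disfluent speech can raise the score.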
Taxonomy
Topics: Speech and dialogue systems
