TL;DR
This paper questions whether pushing numbers on a single benchmark is a sound way to evaluate ASR. It shows that reverberation and additive noise augmentation improve generalization across domains, and that average WER over a large enough set of benchmarks is a good proxy for performance on real-world noisy data.
Contribution
The study demonstrates the importance of diverse training data and multiple benchmarks for improving and assessing ASR model robustness and generalization.
Findings
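Reverberation and additive noise augmentation generally improve generalization across domains.
When a large enough set of benchmarks is used, average WER over them is a good proxy for performance on real-world noisy data.
A single acoustic model trained on the most widely used datasets, combined, reaches competitive results.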
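The averaged-WER finding can be made concrete with a small sketch. The Python below is an illustration only, not the paper's evaluation code: the function names and the toy benchmark data are hypothetical. It also assumes an unweighted mean of per-benchmark WERs, which the summary does not confirm as the paper's exact averaging scheme.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, 1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def corpus_wer(pairs):
    """WER over (reference, hypothesis) pairs: total errors / total reference words."""
    errors = sum(edit_distance(ref.split(), hyp.split()) for ref, hyp in pairs)
    words = sum(len(ref.split()) for ref, _ in pairs)
    if words == 0:
        raise ValueError("references contain no words")
    return errors / words


def average_wer(benchmarks):
    """Per-benchmark WER plus their unweighted mean (assumed averaging scheme)."""
    per_benchmark = {name: corpus_wer(pairs) for name, pairs in benchmarks.items()}
    return per_benchmark, sum(per_benchmark.values()) / len(per_benchmark)


# Toy data with hypothetical benchmark names.
benchmarks = {
    "clean": [("the cat sat", "the cat sat")],
    "noisy": [("the cat sat down", "the hat sat")],
}
per_benchmark, mean_wer = average_wer(benchmarks)
```

An unweighted mean keeps a large benchmark from dominating the score; pooling all utterances before computing WER would instead weight each benchmark by its size.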
Abstract
Is pushing numbers on a single benchmark valuable in automatic speech recognition? Research results in acoustic modeling are typically evaluated based on performance on a single dataset. While the research community has coalesced around various benchmarks, we set out to understand generalization performance in acoustic modeling across datasets - in particular, if models trained on a single dataset transfer to other (possibly out-of-domain) datasets. We show that, in general, reverberative and additive noise augmentation improves generalization performance across domains. Further, we demonstrate that when a large enough set of benchmarks is used, average word error rate (WER) performance over them provides a good proxy for performance on real-world noisy data. Finally, we show that training a single acoustic model on the most widely-used datasets - combined - reaches competitive…
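As a rough illustration of the additive noise augmentation mentioned in the abstract, the sketch below mixes a noise signal into speech at a chosen signal-to-noise ratio. It is a generic textbook formulation in plain Python, not the authors' pipeline: the function name, the SNR value, and the synthetic signals are assumptions. Reverberation, which is typically simulated by convolving speech with a room impulse response, is not covered here.

```python
import math
import random


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so the mixture has the requested SNR in dB."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    speech_power = sum(x * x for x in speech) / len(speech)
    noise_power = sum(x * x for x in noise) / len(noise)
    if noise_power == 0:
        raise ValueError("noise signal is silent")
    # Solve speech_power / (scale**2 * noise_power) == 10 ** (snr_db / 10) for scale.
    scale = math.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]


# Synthetic stand-ins: a 440 Hz tone for speech, Gaussian noise for the noise source.
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]
noisy = mix_at_snr(speech, noise, snr_db=10.0)
```

In a training setup the SNR would usually be sampled per utterance from a range rather than fixed, so the model sees a spread of noise conditions.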
