Towards Pretraining Robust ASR Foundation Model with Acoustic-Aware Data Augmentation
Dancheng Liu, Amir Nassereldine, Chenhui Xu, Jinjun Xiong

TL;DR
This paper shows that acoustic variation in training data, increased through targeted augmentation, makes ASR models markedly more robust and offers a promising alternative to training on massive datasets.
Contribution
The paper shows that ASR robustness is driven primarily by acoustic diversity rather than linguistic richness, and it proposes acoustic-aware data augmentation as a strategy for improving that robustness.
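To make "acoustic-aware augmentation" concrete, here is a minimal sketch of two generic acoustic perturbations: mixing noise at a chosen signal-to-noise ratio, and changing playback speed. This summary does not list the specific augmentations the authors used, so both functions (`add_noise_at_snr`, `change_speed`) and their parameters are illustrative assumptions, not the paper's method. The code uses only the Python standard library and treats audio as a list of float samples.

```python
import math


def rms(samples):
    """Root-mean-square amplitude of a sequence of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def add_noise_at_snr(speech, noise, snr_db):
    """Mix `noise` into `speech` at the requested SNR in dB.

    Hypothetical helper for illustration; not taken from the paper.
    """
    # Repeat and trim the noise so it covers the whole utterance.
    repeats = len(speech) // len(noise) + 1
    noise = (list(noise) * repeats)[: len(speech)]

    noise_rms = rms(noise)
    if noise_rms == 0.0:
        return list(speech)  # silent noise: nothing to mix in

    # SNR_dB = 20 * log10(rms_speech / rms_noise), solved for the noise gain.
    target_noise_rms = rms(speech) / (10 ** (snr_db / 20))
    gain = target_noise_rms / noise_rms
    return [s + gain * n for s, n in zip(speech, noise)]


def change_speed(samples, factor):
    """Resample by linear interpolation; factor > 1 speeds the audio up.

    Hypothetical helper for illustration; this simple resampler also
    shifts pitch, as plain speed perturbation does.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    n_out = int(len(samples) / factor)
    out = []
    for i in range(n_out):
        pos = i * factor                      # position in the input signal
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)    # clamp at the last sample
        frac = pos - lo
        out.append((1 - frac) * samples[lo] + frac * samples[hi])
    return out
```

For example, calling `add_noise_at_snr(speech, noise, 10.0)` on a 16 kHz utterance returns a signal of the same length whose added noise sits 10 dB below the speech, and `change_speed(speech, 2.0)` returns roughly half as many samples. Both change only how an utterance sounds and leave its transcript untouched, which is the sense in which such augmentation targets acoustic rather than linguistic diversity.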
Findings
Targeted acoustic augmentation reduced word-error rates by up to 19.24% on unseen datasets when training on the 960-hour LibriSpeech dataset.
Acoustic variation impacts transcription generalization more than linguistic diversity.
Targeted augmentation improves robustness on unseen datasets.
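The findings above are stated in terms of word-error rate (WER). As a reminder of what that metric measures, the sketch below computes WER as the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. This summary does not say whether the 19.24% figure is a relative or an absolute reduction, so `relative_reduction` is included only to show how a relative figure would be computed. Both function names and all example numbers are illustrative assumptions, not values from the paper.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer, improved_wer):
    """Fraction by which WER dropped relative to the baseline."""
    return (baseline_wer - improved_wer) / baseline_wer


# One substitution (sat -> sit) and one deletion (the) over 6 reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER = {wer:.3f}")
```

With made-up numbers, a baseline WER of 0.10 that improves to 0.08 is a 2-point absolute reduction but a 20% relative one, which is why it matters which of the two a reported percentage refers to.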
Abstract
Whisper's robust performance in automatic speech recognition (ASR) is often attributed to its massive 680k-hour training set, an impractical scale for most researchers. In this work, we examine how linguistic and acoustic diversity in training data affect the robustness of ASR models and reveal that transcription generalization is primarily driven by acoustic variation rather than linguistic richness. We find that targeted acoustic augmentation methods can significantly improve the generalization ability of ASR models, reducing word-error rates by up to 19.24 percent on unseen datasets when training on the 960-hour LibriSpeech dataset. These findings highlight strategic acoustically focused data augmentation as a promising alternative to massive datasets for building robust ASR models, offering a potential solution to future foundation ASR models when massive human speech data is…
Taxonomy
Topics: Underwater Acoustics Research · Geophysical Methods and Applications · Speech and Audio Processing
