Toward domain-invariant speech recognition via large scale training
Arun Narayanan, Ananya Misra, Khe Chai Sim, Golan Pundak, Anshuman Tripathi, Mohamed Elfeky, Parisa Haghani, Trevor Strohman, Michiel Bacchiani

TL;DR
This paper demonstrates that training a large-scale, domain-invariant speech recognition model on 162,000 hours of data with simulated distortions results in a system that generalizes well across multiple domains and adapts efficiently to new conditions with minimal data.
Contribution
The authors propose and validate a large-scale training approach for domain-invariant speech recognition, enabling robust performance across diverse conditions and rapid adaptation with limited data.
Findings
A single model performs nearly as well as domain-specific models.
The model generalizes better to unseen conditions than domain-specific models do.
Minimal data (10 hours) suffices for effective adaptation.
Abstract
Current state-of-the-art automatic speech recognition systems are trained to work in specific 'domains', defined based on factors like application, sampling rate and codec. When such recognizers are used in conditions that do not match the training domain, performance significantly drops. This work explores the idea of building a single domain-invariant model for varied use-cases by combining large scale training data from multiple application domains. Our final system is trained using 162,000 hours of speech. Additionally, each utterance is artificially distorted during training to simulate effects like background noise, codec distortion, and sampling rates. Our results show that, even at such a scale, a model thus trained works almost as well as those fine-tuned to specific subsets: A single model can be robust to multiple application domains, and variations like codecs and noise.…
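The abstract says each utterance is artificially distorted during training. The sketch below illustrates that general idea; it is not the authors' pipeline. It mixes in noise at a chosen signal-to-noise ratio and crudely mimics a lower sampling rate. Codec distortion is left out. All function names, the SNR range, and the 0.5 probability are assumptions made for illustration only.

```python
import math
import random


def power(signal):
    """Mean squared amplitude of a list of samples."""
    return sum(x * x for x in signal) / len(signal)


def add_noise_at_snr(speech, noise, snr_db):
    """Mix noise into speech so the mixture has the requested SNR (in dB)."""
    # Loop the noise clip so it covers the whole utterance.
    noise = [noise[i % len(noise)] for i in range(len(speech))]
    # Pick a gain so that power(speech) / power(gain * noise) == 10^(snr_db / 10).
    gain = math.sqrt(power(speech) / (power(noise) * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


def simulate_low_sample_rate(speech, factor):
    """Crude stand-in for resampling: average each block of `factor`
    samples, then hold that average for the whole block.
    Trailing samples that do not fill a block are dropped."""
    out = []
    for start in range(0, len(speech) - factor + 1, factor):
        mean = sum(speech[start:start + factor]) / factor
        out.extend([mean] * factor)
    return out


def distort(speech, noise, rng):
    """Apply one randomly drawn distortion recipe to a single utterance."""
    snr_db = rng.uniform(5.0, 25.0)  # hypothetical SNR range, not from the paper
    distorted = add_noise_at_snr(speech, noise, snr_db)
    if rng.random() < 0.5:  # hypothetical probability, not from the paper
        distorted = simulate_low_sample_rate(distorted, 2)
    return distorted
```

Calling `distort(speech, noise, random.Random(0))` once per utterance per training step would give the model a different distorted view of the same audio each time it is seen, which is the effect the abstract describes at a high level.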
