Training Neural Speech Recognition Systems with Synthetic Speech Augmentation
Jason Li, Ravi Gadde, Boris Ginsburg, Vitaly Lavrukhin

TL;DR
This paper augments a natural speech dataset with synthetic speech to train large end-to-end neural speech recognition models, reaching state-of-the-art Word Error Rate (WER) among character-level models without an external language model.
Contribution
It proposes augmenting a natural speech dataset (LibriSpeech) with synthetic speech and shows that very large end-to-end neural ASR models trained on the combined data achieve improved accuracy.
Findings
Achieved state-of-the-art Word Error Rate (WER) among character-level models trained on LibriSpeech
Synthetic speech augmentation enhances model accuracy
No external language model needed for top performance
Abstract
Building an accurate automatic speech recognition (ASR) system requires a large dataset that contains many hours of labeled speech samples produced by a diverse set of speakers. The lack of such free, open datasets is one of the main issues preventing advancements in ASR research. To address this problem, we propose to augment a natural speech dataset with synthetic speech. We train very large end-to-end neural speech recognition models using the LibriSpeech dataset augmented with synthetic speech. These new models achieve state-of-the-art Word Error Rate (WER) for character-level models without an external language model.
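The abstract reports results as Word Error Rate. As a reminder of what that metric measures, here is a minimal sketch of the standard definition: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; this is not the authors' evaluation script, which may normalize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration; tokenization is a plain
    whitespace split with no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,                      # deletion
                    curr[j - 1] + 1,                  # insertion
                    prev[j - 1] + substitution_cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.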
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
