SynthASR: Unlocking Synthetic Data for Speech Recognition
Amin Fazel, Wei Yang, Yulan Liu, Roberto Barra-Chicote, Yixiong Meng, Roland Maas, Jasha Droppo

TL;DR
SynthASR demonstrates that synthetic speech can effectively train end-to-end speech recognition models, especially for new applications with limited data, reducing costs and dependency on real data.
Contribution
The paper introduces a novel multi-stage training strategy utilizing synthetic speech for E2E ASR models, addressing catastrophic forgetting and improving performance on new applications.
Findings
Over 65% relative improvement in recognizing medication names.
Effective training of large-scale E2E ASR models with synthetic data.
Reduced dependency on real, costly data for new application development.
Abstract
End-to-end (E2E) automatic speech recognition (ASR) models have recently demonstrated superior performance over the traditional hybrid ASR models. Training an E2E ASR model requires a large amount of data which is not only expensive but may also raise dependency on production data. At the same time, synthetic speech generated by the state-of-the-art text-to-speech (TTS) engines has advanced to near-human naturalness. In this work, we propose to utilize synthetic speech for ASR training (SynthASR) in applications where data is sparse or hard to get for ASR model training. In addition, we apply continual learning with a novel multi-stage training strategy to address catastrophic forgetting, achieved by a mix of weighted multi-style training, data augmentation, encoder freezing, and parameter regularization. In our experiments conducted on in-house datasets for a new application of…
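The abstract names three ingredients for limiting catastrophic forgetting: weighted mixing of data sources, encoder freezing, and parameter regularization. The sketch below illustrates each idea in plain Python. It is a toy illustration under stated assumptions, not the authors' implementation: the function names, the `encoder.` naming prefix, the mixing probability, and the choice of an L2 penalty toward the pretrained weights are all hypothetical and do not come from the paper.

```python
import random


def sample_mixed_batch(real, synthetic, synth_weight, batch_size, rng):
    """Draw a batch in which each item is synthetic with probability synth_weight.

    Assumption: weighted multi-style training is approximated here by
    per-item sampling between a real pool and a synthetic pool.
    """
    batch = []
    for _ in range(batch_size):
        pool = synthetic if rng.random() < synth_weight else real
        batch.append(rng.choice(pool))
    return batch


def l2_to_pretrained(params, pretrained, strength):
    """Penalty that grows as parameters drift away from their pretrained values.

    Assumption: parameter regularization is shown as a plain L2 distance;
    the paper may use a different regularizer.
    """
    return strength * sum((p - q) ** 2 for p, q in zip(params, pretrained))


def trainable_names(param_names, frozen_prefix="encoder."):
    """Return the parameter names that stay trainable when the encoder is frozen.

    Assumption: encoder parameters are identified by a hypothetical name prefix.
    """
    return [name for name in param_names if not name.startswith(frozen_prefix)]


if __name__ == "__main__":
    rng = random.Random(0)
    real = ["real_utt_1", "real_utt_2"]
    synthetic = ["tts_utt_1", "tts_utt_2", "tts_utt_3"]
    print(sample_mixed_batch(real, synthetic, synth_weight=0.3, batch_size=4, rng=rng))
    print(l2_to_pretrained([1.0, 3.0], [0.0, 1.0], strength=0.5))
    print(trainable_names(["encoder.layer0.w", "decoder.w", "joint.b"]))
```

In a real training loop the penalty would be added to the ASR loss, and the frozen parameters would be excluded from the optimizer; how the three pieces are scheduled across stages is specific to the paper and is not reproduced here.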
