Speech Recognition with Augmented Synthesized Speech
Andrew Rosenberg, Yu Zhang, Bhuvana Ramabhadran, Ye Jia, Pedro Moreno, Yonghui Wu, Zelin Wu

TL;DR
This paper investigates using synthesized speech generated by Tacotron for data augmentation to improve speech recognition across domains, showing potential but highlighting existing performance gaps.
Contribution
It evaluates the effectiveness of synthesized speech for data augmentation in speech recognition and explores algorithms to enhance acoustic and lexical diversity.
Findings
Synthesized speech can improve recognition accuracy when used for data augmentation.
There is a significant performance gap between recognizers trained on human versus synthesized speech.
Augmentation with synthesized speech shows promise for domain transfer in speech recognition.
Abstract
Recent success of the Tacotron speech synthesis architecture and its variants in producing natural-sounding multi-speaker synthesized speech has raised the exciting possibility of replacing the expensive, manually transcribed, domain-specific human speech that is used to train speech recognizers. The multi-speaker speech synthesis architecture can learn latent embedding spaces of prosody, speaker, and style variations derived from input acoustic representations, thereby allowing for manipulation of the synthesized speech. In this paper, we evaluate the feasibility of enhancing speech recognition performance with speech synthesis, using two corpora from different domains. We explore algorithms to provide the acoustic and lexical diversity needed for robust speech recognition. Finally, we demonstrate the feasibility of this approach as a data augmentation strategy for…
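As a rough illustration of the augmentation idea in the abstract, the sketch below mixes a pool of human-recorded utterances with a sample of synthesized ones to form a single training list. It is not the authors' pipeline: the `Utterance` tuple layout, the `mix_training_set` function, and the `synth_ratio` parameter are all hypothetical names introduced here, and the summary does not state what mixing proportion the paper uses.

```python
import random
from typing import List, Tuple

# Hypothetical representation: (audio_path, transcript).
Utterance = Tuple[str, str]


def mix_training_set(
    human: List[Utterance],
    synthesized: List[Utterance],
    synth_ratio: float,
    seed: int = 0,
) -> List[Utterance]:
    """Combine all human utterances with a random sample of synthesized ones.

    synth_ratio is an assumed knob: the number of synthesized utterances
    to add per human utterance, capped by the size of the synthesized pool.
    """
    if synth_ratio < 0:
        raise ValueError("synth_ratio must be non-negative")
    rng = random.Random(seed)
    n_synth = min(len(synthesized), int(round(synth_ratio * len(human))))
    sampled = rng.sample(synthesized, n_synth)  # sample without replacement
    mixed = list(human) + sampled
    rng.shuffle(mixed)  # interleave human and synthesized examples
    return mixed


if __name__ == "__main__":
    human = [(f"human_{i}.wav", f"human text {i}") for i in range(4)]
    synth = [(f"tts_{i}.wav", f"synthesized text {i}") for i in range(10)]
    train = mix_training_set(human, synth, synth_ratio=0.5)
    print(len(train), "utterances in the augmented training list")
```

Keeping every human utterance and varying only the synthesized share makes it straightforward to compare a human-only baseline against augmented training sets of different sizes.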
Taxonomy
Methods: Griffin-Lim Algorithm · Sigmoid Activation · Highway Layer · Residual Connection · Convolution · Batch Normalization · Max Pooling · Residual GRU · Bidirectional GRU · Highway Network
