On the Problem of Text-To-Speech Model Selection for Synthetic Data Generation in Automatic Speech Recognition
Nick Rossenbach, Ralf Schlüter, Sakriani Sakti

TL;DR
This paper compares five TTS decoder architectures for synthetic data generation in ASR, analyzing their impact on recognition performance and proposing a method to assess TTS generalization.
Contribution
It introduces a comparative analysis of TTS decoder architectures for synthetic data in ASR and proposes a new approach to quantify TTS generalization capabilities.
Findings
Auto-regressive decoding outperforms non-autoregressive decoding for synthetic data generation.
Neither NISQA MOS nor intelligibility shows a clear relation to ASR performance.
Different TTS architectures significantly affect synthetic data quality and recognition results.
Abstract
The rapid development of neural text-to-speech (TTS) systems has enabled their use in other areas of natural language processing such as automatic speech recognition (ASR) or spoken language translation (SLT). Due to the large number of different TTS architectures and their extensions, selecting which TTS systems to use for synthetic data creation is not an easy task. We compare five different TTS decoder architectures in the scope of synthetic data generation to show their impact on CTC-based speech recognition training. We compare the recognition results to computable metrics like NISQA MOS and intelligibility, finding that there are no clear relations to the ASR performance. We also observe that for data generation auto-regressive decoding performs better than non-autoregressive decoding, and propose an approach to quantify TTS generalization capabilities.
Taxonomy
Topics: Speech Recognition and Synthesis
