Beyond Oversmoothing: Evaluating DDPM and MSE for Scalable Speech Synthesis in ASR
Christoph Minixhofer, Ondrej Klejch, Peter Bell

TL;DR
This paper compares DDPM and MSE models for TTS in ASR, showing DDPM's superior scalability with more data and speakers, yet highlighting persistent gaps in real speech recognition performance.
Contribution
The paper provides a systematic comparison of DDPM and MSE models for TTS in ASR, emphasizing DDPM's better scalability and reporting the best real-to-synthetic speech WER ratio to date.
Findings
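For a given model size, DDPM makes better use of additional training data and greater speaker diversity than MSE models
The paper achieves the best reported real-to-synthetic speech WER ratio, 1.46
A significant gap remains between ASR models trained on synthetic speech and those trained on real speech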
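To make the WER ratio concrete, here is a minimal Python sketch of how such a ratio could be computed. It assumes the ratio is the WER of an ASR model trained on synthetic speech divided by the WER of one trained on real speech, both evaluated on real speech, so that values above 1 mean the synthetic-trained model is worse. The function names `word_error_rate` and `wer_ratio` and all input values are hypothetical illustrations, not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far
    # and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


def wer_ratio(wer_synthetic: float, wer_real: float) -> float:
    """Assumed definition: synthetic-trained WER over real-trained WER."""
    if wer_real <= 0:
        raise ValueError("wer_real must be positive")
    return wer_synthetic / wer_real


# Made-up transcripts and WER values, for illustration only.
example_wer = word_error_rate("the cat sat down", "the dog sat down")
example_ratio = wer_ratio(wer_synthetic=0.3, wer_real=0.2)
```

Under this assumed definition, a ratio of 1.0 would mean that training on synthetic speech is as good as training on real speech.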
Abstract
Synthetically generated speech has rapidly approached human levels of naturalness. However, the paradox remains that ASR systems, when trained on TTS output that is judged as natural by humans, continue to perform badly on real speech. In this work, we explore whether this phenomenon is due to the oversmoothing behaviour of models commonly used in TTS, with a particular focus on the behaviour of TTS-for-ASR as the amount of TTS training data is scaled up. We systematically compare Denoising Diffusion Probabilistic Models (DDPM) to Mean Squared Error (MSE) based models for TTS, when used for ASR model training. We test the scalability of the two approaches, varying both the number of hours and the number of different speakers. We find that for a given model size, DDPM can make better use of more data, and a more diverse set of speakers, than MSE models. We achieve the best reported ratio…
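The oversmoothing behaviour mentioned in the abstract can be illustrated with a toy example. The sketch below is not the paper's setup and does not implement a DDPM; it only assumes a one-dimensional target with two equally likely modes, and shows that the prediction minimizing MSE sits at the mean between the modes, whereas drawing samples from the target distribution (the behaviour a generative model aims for) keeps the original spread. All names and values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy bimodal targets: each value is either -1 or +1 (assumed for illustration).
targets = rng.choice([-1.0, 1.0], size=10_000)

# Grid search for the single constant prediction that minimizes MSE.
candidates = np.linspace(-1.5, 1.5, 301)
mse = ((targets[None, :] - candidates[:, None]) ** 2).mean(axis=1)
mse_optimal = candidates[mse.argmin()]

# Stand-in for a sampling-based model: draw from the empirical distribution.
samples = rng.choice(targets, size=10_000)

print(f"MSE-optimal prediction: {mse_optimal:.2f}")
print(f"Spread of sampled outputs (std): {samples.std():.2f}")
```

The MSE-optimal prediction lands near zero, a value that never occurs in the data, which is the averaging effect usually described as oversmoothing.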
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Speech and Audio Processing
Methods: Focus · Diffusion · Sparse Evolutionary Training
