Beyond Oversmoothing: Evaluating DDPM and MSE for Scalable Speech Synthesis in ASR

Christoph Minixhofer; Ondrej Klejch; Peter Bell

arXiv:2410.12279·eess.AS·October 17, 2024

TL;DR

This paper compares DDPM-based and MSE-based TTS models as sources of training data for ASR. DDPM scales better with more training data and more speakers, yet ASR trained on synthetic speech still recognizes real speech worse than ASR trained on real speech.

Contribution

It provides a systematic comparison of DDPM and MSE models for TTS when used to train ASR models, shows that DDPM scales better, and reports the best ratio between real and synthetic speech WER to date.

Findings

01

For a given model size, DDPM makes better use of more training data and greater speaker diversity than MSE models

02

Achieved the best real-to-synthetic speech WER ratio of 1.46

03

Significant gap remains between synthetic and real speech recognition
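
The WER ratio in the findings above can be illustrated with a minimal sketch. It assumes the ratio is the WER of an ASR model trained on synthetic speech divided by the WER of an ASR model trained on real speech, both evaluated on real speech, so that values above 1 mean synthetic training data does worse. That direction, the function names, and the example WER values are assumptions for illustration and are not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def wer_ratio(wer_synthetic_trained: float, wer_real_trained: float) -> float:
    """Assumed definition: synthetic-trained WER over real-trained WER."""
    if wer_real_trained <= 0:
        raise ValueError("real-trained WER must be positive")
    return wer_synthetic_trained / wer_real_trained


# Hypothetical WERs chosen only so that the ratio matches the reported 1.46.
example_ratio = wer_ratio(0.146, 0.10)
print(f"WER ratio: {example_ratio:.2f}")
```

A ratio of 1.0 would mean that training on synthetic speech is as good as training on real speech, so a ratio of 1.46 still reflects the remaining gap.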

Abstract

Synthetically generated speech has rapidly approached human levels of naturalness. However, the paradox remains that ASR systems, when trained on TTS output that is judged as natural by humans, continue to perform badly on real speech. In this work, we explore whether this phenomenon is due to the oversmoothing behaviour of models commonly used in TTS, with a particular focus on the behaviour of TTS-for-ASR as the amount of TTS training data is scaled up. We systematically compare Denoising Diffusion Probabilistic Models (DDPM) to Mean Squared Error (MSE) based models for TTS, when used for ASR model training. We test the scalability of the two approaches, varying both the number of hours and the number of different speakers. We find that for a given model size, DDPM can make better use of more data, and a more diverse set of speakers, than MSE models. We achieve the best reported ratio…
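
The oversmoothing behaviour mentioned in the abstract can be illustrated with a toy sketch. It is not the paper's models: it only shows that, when several targets are plausible for the same input, the single prediction that minimizes MSE is their mean, which has none of the variability of the data, whereas drawing samples (as a generative model such as a DDPM aims to do) preserves it. The bimodal target distribution and all names here are assumptions for illustration.

```python
import random
import statistics


def bimodal_targets(n: int, seed: int = 0) -> list[float]:
    """Toy targets clustered around -1 and +1 (a stand-in for speech variability)."""
    rng = random.Random(seed)
    return [rng.choice((-1.0, 1.0)) + rng.gauss(0.0, 0.1) for _ in range(n)]


def mse(prediction: float, targets: list[float]) -> float:
    """Mean squared error of one constant prediction against all targets."""
    return statistics.fmean((y - prediction) ** 2 for y in targets)


targets = bimodal_targets(10_000)

# The constant that minimizes MSE is the mean, which lies between the two
# modes and resembles neither of them.
mse_prediction = statistics.fmean(targets)

# Stand-in for a sampling-based model: draw from the empirical distribution.
sampler = random.Random(1)
samples = sampler.choices(targets, k=10_000)

print(f"MSE-optimal prediction: {mse_prediction:+.3f}")
print(f"variance of targets:    {statistics.pvariance(targets):.3f}")
print(f"variance of samples:    {statistics.pvariance(samples):.3f}")
```

The MSE-optimal output is a single averaged value with zero spread, while the sampled outputs keep roughly the variance of the data.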

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Speech and Audio Processing

Methods: Focus · Diffusion · Sparse Evolutionary Training