FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis

Rongjie Huang; Max W. Y. Lam; Jun Wang; Dan Su; Dong Yu; Yi Ren; Zhou Zhao

arXiv:2204.09934 · eess.AS · April 22, 2022 · 28 cites

TL;DR

FastDiff is a conditional diffusion model for speech synthesis that cuts the number of sampling steps without sacrificing quality. It samples 58x faster than real time on GPU, reaches a MOS of 4.28, and outperforms existing methods in speech quality and speaker generalization.

Contribution

The paper presents FastDiff, a fast conditional diffusion model that combines time-aware location-variable convolutions with a noise schedule predictor to improve both the efficiency and the quality of speech synthesis.

Findings

1. Achieves state-of-the-art speech quality, with a mean opinion score (MOS) of 4.28.

2. Samples 58x faster than real time on GPU.

3. FastDiff-TTS outperforms competing methods in end-to-end text-to-speech synthesis.
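
To make the 58x figure concrete, the sketch below converts a faster-than-real-time speedup into a real-time factor (RTF), the ratio of synthesis time to the duration of the audio produced. The function names and the 10-second clip are illustrative assumptions, not taken from the paper.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def speedup_over_real_time(rtf: float) -> float:
    """A speedup of k means the audio is generated k times faster than it plays."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


# A 58x speedup corresponds to an RTF of 1/58.
rtf_at_58x = 1.0 / 58.0

# Hypothetical example: time needed to synthesize a 10-second clip at that RTF.
seconds_for_ten_second_clip = 10.0 * rtf_at_58x
```

At this rate a 10-second utterance would take roughly 0.17 seconds to generate, which is why the speedup matters for interactive use.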

Abstract

Denoising diffusion probabilistic models (DDPMs) have recently achieved leading performance in many generative tasks. However, the cost of their inherent iterative sampling process has hindered their application to speech synthesis. This paper proposes FastDiff, a fast conditional diffusion model for high-quality speech synthesis. FastDiff employs a stack of time-aware location-variable convolutions with diverse receptive field patterns to efficiently model long-term time dependencies with adaptive conditions. A noise schedule predictor is also adopted to reduce the number of sampling steps without sacrificing generation quality. Based on FastDiff, we design an end-to-end text-to-speech synthesizer, FastDiff-TTS, which generates high-fidelity speech waveforms without any intermediate feature (e.g., Mel-spectrogram). Our evaluation of FastDiff demonstrates state-of-the-art results with higher-quality…
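
As a rough illustration of why the number of sampling steps drives cost, here is a minimal sketch of generic DDPM ancestral sampling over a short noise schedule. It is not FastDiff's implementation: the four beta values are hand-picked stand-ins for a predicted schedule, `zero_denoiser` is a hypothetical placeholder for a trained noise-prediction network, and the conditioning input is omitted.

```python
import math
import random
from typing import Callable, List, Sequence

# Hypothetical signature: (noisy signal, step index) -> predicted noise.
Denoiser = Callable[[List[float], int], List[float]]


def cumulative_alphas(betas: Sequence[float]) -> List[float]:
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out


def reverse_sample(denoiser: Denoiser, betas: Sequence[float],
                   length: int, rng: random.Random) -> List[float]:
    """Run one denoiser call per schedule entry, from pure noise to a sample."""
    alpha_bars = cumulative_alphas(betas)
    x = [rng.gauss(0.0, 1.0) for _ in range(length)]  # start from Gaussian noise
    for t in reversed(range(len(betas))):
        beta, alpha_bar = betas[t], alpha_bars[t]
        eps = denoiser(x, t)
        scale = beta / math.sqrt(1.0 - alpha_bar)
        # Posterior mean of the standard DDPM reverse step.
        mean = [(xi - scale * ei) / math.sqrt(1.0 - beta)
                for xi, ei in zip(x, eps)]
        if t == 0:
            x = mean  # no noise is added on the final step
        else:
            sigma = math.sqrt(beta)  # simple variance choice sigma_t^2 = beta_t
            x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
    return x


def zero_denoiser(x: List[float], t: int) -> List[float]:
    """Placeholder that predicts zero noise; a real model would be learned."""
    return [0.0] * len(x)


# Assumed four-step schedule; each entry must lie strictly between 0 and 1.
SHORT_BETAS = [1e-4, 1e-3, 1e-2, 1e-1]
sample = reverse_sample(zero_denoiser, SHORT_BETAS, 8, random.Random(0))
```

The loop calls the denoiser once per schedule entry, so a schedule with a handful of steps costs a handful of network evaluations, whereas a schedule with hundreds or thousands of entries costs proportionally more.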

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Diffusion