ProDiff: Progressive Fast Diffusion Model For High-Quality Text-to-Speech

Rongjie Huang; Zhou Zhao; Huadai Liu; Jinglin Liu; Chenye Cui; Yi Ren

arXiv:2207.06389 · eess.AS · July 14, 2022 · 21 cites

TL;DR

ProDiff introduces a progressive diffusion model for text-to-speech that dramatically reduces sampling steps to just 2, achieving high-quality speech synthesis at speeds 24 times faster than real-time.

Contribution

ProDiff is the first diffusion TTS model to use direct data prediction and knowledge distillation to enable high-quality synthesis with only 2 sampling steps.

Findings

01. Requires only 2 iterations for high-fidelity mel-spectrograms
02. Achieves 24x faster-than-real-time synthesis speed
03. Maintains competitive quality and diversity with state-of-the-art models
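
A "faster than real time" figure is commonly expressed as a real-time factor (RTF), the ratio of synthesis time to the duration of the audio produced. The sketch below only illustrates that arithmetic under this assumed definition; the function names and the 10-second clip are hypothetical and are not taken from the paper.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def speedup_over_real_time(rtf: float) -> float:
    """How many times faster than real time a given RTF corresponds to."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


# Hypothetical example: at a 24x speedup, 10 s of audio takes 10/24 s to synthesize.
audio_seconds = 10.0
synthesis_seconds = audio_seconds / 24.0
rtf = real_time_factor(synthesis_seconds, audio_seconds)
speedup = speedup_over_real_time(rtf)
```

Under this definition, an RTF below 1 means synthesis finishes before the audio would finish playing, and a 24x speedup corresponds to an RTF of 1/24.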

Abstract

Denoising diffusion probabilistic models (DDPMs) have recently achieved leading performance in many generative tasks. However, the inherent cost of their iterative sampling process hinders their application to text-to-speech deployment. Through a preliminary study on diffusion model parameterization, we find that previous gradient-based TTS models require hundreds or thousands of iterations to guarantee high sample quality, which poses a challenge for accelerating sampling. In this work, we propose ProDiff, a progressive fast diffusion model for high-quality text-to-speech. Unlike previous work that estimates the gradient of the data density, ProDiff parameterizes the denoising model by directly predicting clean data to avoid distinct quality degradation when accelerating sampling. To tackle the model convergence challenge with decreased diffusion iterations, ProDiff reduces the data variance in…
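
To make "directly predicting clean data" concrete, the sketch below runs a generic DDPM reverse process in which the denoiser outputs an estimate of the clean sample x0 rather than a noise or gradient estimate. It is a minimal illustration of the standard DDPM posterior, not ProDiff's implementation: the linear beta schedule, its endpoint values, the function names, and the oracle denoiser are all assumptions, and the distillation stage is not modelled.

```python
import numpy as np


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.5):
    """Assumed linear beta schedule; the values are illustrative only."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars


def posterior_step(x_t, x0_hat, t, betas, alphas, alpha_bars, rng):
    """Sample x_{t-1} from the DDPM posterior q(x_{t-1} | x_t, x0_hat)."""
    abar_t = alpha_bars[t]
    abar_prev = alpha_bars[t - 1] if t > 0 else 1.0
    coef_x0 = np.sqrt(abar_prev) * betas[t] / (1.0 - abar_t)
    coef_xt = np.sqrt(alphas[t]) * (1.0 - abar_prev) / (1.0 - abar_t)
    mean = coef_x0 * x0_hat + coef_xt * x_t
    var = betas[t] * (1.0 - abar_prev) / (1.0 - abar_t)
    # No noise is added on the final step (t == 0), where var is 0 anyway.
    noise = rng.standard_normal(x_t.shape) if t > 0 else 0.0
    return mean + np.sqrt(var) * noise


def sample(denoiser, shape, num_steps=2, seed=0):
    """Reverse process where denoiser(x_t, t) returns a clean-data estimate."""
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    x_t = rng.standard_normal(shape)  # start from Gaussian noise
    for t in reversed(range(num_steps)):
        x0_hat = denoiser(x_t, t)
        x_t = posterior_step(x_t, x0_hat, t, betas, alphas, alpha_bars, rng)
    return x_t


# Stand-in for a trained network: an oracle that always returns a fixed target.
target_mel = np.ones((80, 16))
mel = sample(lambda x_t, t: target_mel, target_mel.shape, num_steps=2)
```

With this parameterization the final step returns the denoiser's clean estimate itself, so sample quality with very few steps depends directly on how accurate that estimate is.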

Code & Models

Datasets

purdueviperlab/diffssd
dataset · 37 downloads

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Diffusion