ResGrad: Residual Denoising Diffusion Probabilistic Models for Text to Speech

Zehua Chen; Yihan Wu; Yichong Leng; Jiawei Chen; Haohe Liu; Xu Tan; Yang Cui; Ke Wang; Lei He; Sheng Zhao; Jiang Bian; Danilo Mandic

arXiv:2212.14518 · eess.AS · January 2, 2023 · 5 cites

TL;DR

ResGrad is a lightweight diffusion model that refines the output spectrogram of an existing TTS model by predicting the residual between that output and the ground-truth speech, which speeds up inference without sacrificing sample quality.

Contribution

ResGrad introduces a residual learning approach for DDPM-based TTS, enabling faster inference and compatibility with existing models without retraining.

Findings

1. ResGrad outperforms other speed-up methods in sample quality at the same speed.

2. ResGrad reduces synthesis time by more than 10 times compared to baseline methods.

3. ResGrad maintains high speech quality across multiple datasets.

Abstract

Denoising Diffusion Probabilistic Models (DDPMs) are emerging in text-to-speech (TTS) synthesis because of their strong capability of generating high-fidelity samples. However, their iterative refinement process in high-dimensional data space results in slow inference speed, which restricts their application in real-time systems. Previous works have explored speeding up inference by minimizing the number of inference steps, but at the cost of sample quality. In this work, to improve the inference speed of DDPM-based TTS models while achieving high sample quality, we propose ResGrad, a lightweight diffusion model which learns to refine the output spectrogram of an existing TTS model (e.g., FastSpeech 2) by predicting the residual between the model output and the corresponding ground-truth speech. ResGrad has several advantages: 1) Compared with other acceleration methods for DDPM which need to…
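
The residual idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the 80-bin mel-spectrogram shape, the noise level `alpha_bar_t`, and the use of the standard DDPM forward-noising formula on the residual are all assumptions made for illustration. The abstract only states that ResGrad predicts the residual between an existing TTS model's output and the ground-truth speech.

```python
import numpy as np

def residual_target(mel_pred, mel_gt):
    """Target for the diffusion model: what the existing TTS model got wrong."""
    return mel_gt - mel_pred

def diffuse(x0, alpha_bar_t, rng):
    """Standard DDPM forward noising of a clean sample x0 (assumed here to be
    applied to the residual): x_t = sqrt(a) * x0 + sqrt(1 - a) * noise."""
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * noise
    return x_t, noise

def refine(mel_pred, sampled_residual):
    """Inference: add the residual sampled by the diffusion model to the
    unchanged output of the existing TTS model."""
    return mel_pred + sampled_residual

# Toy data standing in for spectrograms (hypothetical shapes and values).
rng = np.random.default_rng(0)
mel_gt = rng.standard_normal((80, 120))                      # ground truth
mel_pred = mel_gt + 0.1 * rng.standard_normal((80, 120))     # e.g. a FastSpeech 2 output

residual = residual_target(mel_pred, mel_gt)
x_t, noise = diffuse(residual, alpha_bar_t=0.5, rng=rng)     # a noised training input
refined = refine(mel_pred, residual)                         # a perfect residual recovers mel_gt
```

In this toy setup the residual has a much smaller magnitude than the spectrogram itself, which gives some intuition for why a lightweight model might suffice to model it; the sketch makes no claim about the actual network or sampler used in the paper.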

Code & Models

Repositories

majidAdibian77/ResGrad
pytorch

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Diffusion