InferGrad: Improving Diffusion Models for Vocoder by Considering Inference in Training
Zehua Chen, Xu Tan, Ke Wang, Shifeng Pan, Danilo Mandic, Lei He, Sheng Zhao

TL;DR
InferGrad is a diffusion-based vocoder that incorporates the inference process into training, so it needs fewer inference iterations while maintaining high voice quality and outperforms baseline models such as WaveGrad.
Contribution
It introduces a training method that jointly optimizes for inference efficiency and quality, reducing inference iterations without sacrificing output fidelity.
Findings
InferGrad achieves a 3x inference speedup over WaveGrad.
It maintains comparable voice quality with fewer inference iterations.
Experimental results on LJSpeech demonstrate superior performance over baseline models.
Abstract
Denoising diffusion probabilistic models (diffusion models for short) require a large number of iterations in inference to achieve generation quality that matches or surpasses state-of-the-art generative models, which invariably results in slow inference speed. Previous approaches aim to optimize the choice of inference schedule over a few iterations to speed up inference. However, this results in reduced generation quality, mainly because the inference process is optimized separately, without jointly optimizing with the training process. In this paper, we propose InferGrad, a diffusion model for vocoders that incorporates the inference process into training, to reduce the number of inference iterations while maintaining high generation quality. More specifically, during training, we generate data from random noise through a reverse process under inference schedules with a few iterations, and…
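The generation step described in the abstract (starting from random noise and running a reverse process under an inference schedule with only a few iterations) can be illustrated with a minimal NumPy sketch. It uses the standard DDPM ancestral sampling update as an assumption; the paper's exact sampler, schedules, and network are not given in this excerpt. The names `reverse_process` and `make_oracle_denoiser` and the 3-step schedule `few_step_betas` are hypothetical, and the "denoiser" is a toy oracle that knows the clean signal, standing in for a trained network. How the generated sample enters the training loss is cut off in the excerpt, so the sketch stops at generation.

```python
import numpy as np


def reverse_process(denoiser, betas, shape, rng):
    """Standard DDPM ancestral sampling under a (short) noise schedule.

    Assumption: this is the textbook DDPM update, not necessarily the
    exact sampler used in the paper.
    """
    betas = np.asarray(betas, dtype=np.float64)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    # Start from pure Gaussian noise.
    x = rng.standard_normal(shape)

    # Walk the schedule from the noisiest step down to step 0.
    for n in reversed(range(len(betas))):
        noise_level = np.sqrt(alpha_bars[n])
        eps_hat = denoiser(x, noise_level)  # predicted noise
        x = (x - betas[n] / np.sqrt(1.0 - alpha_bars[n]) * eps_hat) / np.sqrt(alphas[n])
        if n > 0:
            # Add posterior noise on every step except the last one.
            sigma = np.sqrt((1.0 - alpha_bars[n - 1]) / (1.0 - alpha_bars[n]) * betas[n])
            x = x + sigma * rng.standard_normal(shape)
    return x


def make_oracle_denoiser(x0):
    """Hypothetical stand-in for a trained network: returns the exact noise
    that separates x from the known clean signal x0 at this noise level."""
    def denoiser(x, noise_level):
        return (x - noise_level * x0) / np.sqrt(1.0 - noise_level ** 2)
    return denoiser


rng = np.random.default_rng(0)
x0 = np.sin(np.linspace(0.0, 2.0 * np.pi, 16))  # toy "waveform"
few_step_betas = [1e-4, 1e-2, 0.5]              # hypothetical 3-iteration schedule
sample = reverse_process(make_oracle_denoiser(x0), few_step_betas, x0.shape, rng)
```

With the oracle denoiser the final update recovers `x0` up to floating-point error, which makes the sketch easy to sanity-check. A real network predicts the noise imperfectly, and with so few iterations that error shows up directly in the generated sample, which is the gap the abstract says separate schedule optimization leaves open.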
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing
Methods: Diffusion · Residual Connection · 1x1 Convolution · WaveGrad DBlock · WaveGrad UBlock · Convolution · FiLM Module · WaveGrad
