WaveFit: An Iterative and Non-autoregressive Neural Vocoder based on Fixed-Point Iteration
Yuma Koizumi, Kohei Yatabe, Heiga Zen, Michiel Bacchiani

TL;DR
WaveFit is a neural vocoder that combines iterative fixed-point denoising with adversarial training, achieving high-quality speech synthesis with inference more than 240 times faster than WaveRNN.
Contribution
This paper introduces WaveFit, a neural vocoder that integrates GAN-like adversarial training into a DDPM-inspired iterative framework for improved speed and quality.
Findings
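In side-by-side listening tests, speech synthesized by WaveFit with five iterations showed no statistically significant difference in naturalness from human natural speech.
WaveFit's inference is more than 240 times faster than WaveRNN's.
WaveFit denoises an input signal by fixed-point iteration and is trained with an adversarial loss computed from the intermediate outputs at all iterations.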
Abstract
Denoising diffusion probabilistic models (DDPMs) and generative adversarial networks (GANs) are popular generative models for neural vocoders. DDPMs and GANs can be characterized by the iterative denoising framework and adversarial training, respectively. This study proposes a fast and high-quality neural vocoder called WaveFit, which integrates the essence of GANs into a DDPM-like iterative framework based on fixed-point iteration. WaveFit iteratively denoises an input signal, and trains a deep neural network (DNN) to minimize an adversarial loss calculated from intermediate outputs at all iterations. Subjective (side-by-side) listening tests showed no statistically significant differences in naturalness between human natural speech and speech synthesized by WaveFit with five iterations. Furthermore, the inference speed of WaveFit was more than 240 times faster than…
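The iterative scheme described in the abstract can be illustrated with a small sketch. This is not the WaveFit model: `toy_denoiser` is a hypothetical stand-in for the DNN (a simple contraction whose fixed point is a known target), and the mean squared error is a placeholder for the adversarial loss. All names, the step size, and the toy signal are assumptions made for illustration only. The sketch shows just two ideas: the same denoising map is applied repeatedly, and the training loss is accumulated over every intermediate output rather than only the last one.

```python
import random


def toy_denoiser(y, target, step=0.5):
    """Hypothetical stand-in for the DNN: move y part of the way toward target.

    The map is a contraction, and target is its fixed point.
    """
    return [yi + step * (ti - yi) for yi, ti in zip(y, target)]


def iterate(y0, target, num_iterations=5):
    """Apply the same denoising map repeatedly, keeping every intermediate output."""
    outputs = []
    y = y0
    for _ in range(num_iterations):
        y = toy_denoiser(y, target)
        outputs.append(y)
    return outputs


def mean_squared_error(y, target):
    """Placeholder per-iteration loss (the paper uses an adversarial loss instead)."""
    return sum((yi - ti) ** 2 for yi, ti in zip(y, target)) / len(target)


def loss_over_all_iterations(outputs, target):
    """Sum the per-iteration loss over all intermediate outputs."""
    return sum(mean_squared_error(y, target) for y in outputs)


random.seed(0)
target = [0.0, 1.0, 0.0, -1.0]                   # toy "clean" signal (assumed)
y0 = [random.gauss(0.0, 1.0) for _ in target]    # noisy starting signal (assumed)

outputs = iterate(y0, target, num_iterations=5)
errors = [mean_squared_error(y, target) for y in outputs]
total_loss = loss_over_all_iterations(outputs, target)

for t, err in enumerate(errors, start=1):
    print(f"iteration {t}: error = {err:.6f}")
print(f"loss summed over all iterations = {total_loss:.6f}")
```

In this toy setting the error shrinks at every step because the map is a contraction. Summing the loss over all intermediate outputs means each iteration is penalized directly, which matches the abstract's description of a loss calculated from intermediate outputs at all iterations.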
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing
Methods: Sigmoid Activation · Softmax · Tanh Activation · WaveRNN · Diffusion
