WaveFit: An Iterative and Non-autoregressive Neural Vocoder based on Fixed-Point Iteration

Yuma Koizumi; Kohei Yatabe; Heiga Zen; Michiel Bacchiani

arXiv:2210.01029 · eess.AS · October 4, 2022 · 1 citation

Open Access

TL;DR

WaveFit is a neural vocoder that combines DDPM-like iterative denoising, formulated as fixed-point iteration, with GAN-style adversarial training. In listening tests its naturalness was on par with human speech, and its inference was more than 240 times faster than WaveRNN.

Contribution

This paper introduces WaveFit, a neural vocoder that integrates GAN-like adversarial training into a DDPM-inspired iterative framework for improved speed and quality.

Findings

01

In side-by-side listening tests, speech synthesized by WaveFit with five iterations showed no statistically significant difference in naturalness from human natural speech.

02

WaveFit's inference is more than 240 times faster than WaveRNN's.

03

WaveFit iteratively denoises an input signal via fixed-point iteration, and its DNN is trained to minimize an adversarial loss computed from the intermediate outputs of all iterations.

Abstract

Denoising diffusion probabilistic models (DDPMs) and generative adversarial networks (GANs) are popular generative models for neural vocoders. The DDPMs and GANs can be characterized by the iterative denoising framework and adversarial training, respectively. This study proposes a fast and high-quality neural vocoder called WaveFit, which integrates the essence of GANs into a DDPM-like iterative framework based on fixed-point iteration. WaveFit iteratively denoises an input signal, and trains a deep neural network (DNN) for minimizing an adversarial loss calculated from intermediate outputs at all iterations. Subjective (side-by-side) listening tests showed no statistically significant differences in naturalness between human natural speech and those synthesized by WaveFit with five iterations. Furthermore, the inference speed of WaveFit was more than 240 times faster than…
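
The iterative scheme described in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the update rule `y <- y - denoiser(y)`, the function names, and the `strength` parameter are all assumptions made for illustration. The "denoiser" here is a hand-written stand-in that is given the clean target directly, whereas a real vocoder would use a trained DNN conditioned on acoustic features. The adversarial loss is likewise replaced by a plain mean squared error, kept only to show a loss being summed over the intermediate outputs of every iteration.

```python
import math
import random


def toy_denoiser(y, target, strength=0.5):
    """Hypothetical stand-in for the trained DNN: estimates the noise in y.

    It cheats by looking at the clean target; a real model would not.
    """
    return [strength * (a - b) for a, b in zip(y, target)]


def mse(y, target):
    """Mean squared error, used here in place of the adversarial loss."""
    return sum((a - b) ** 2 for a, b in zip(y, target)) / len(y)


def iterative_refine(y0, target, num_iters=5):
    """Repeat y <- y - denoiser(y) and keep every intermediate output."""
    outputs = []
    y = list(y0)
    for _ in range(num_iters):
        noise = toy_denoiser(y, target)
        y = [a - n for a, n in zip(y, noise)]
        outputs.append(y)
    return outputs


def summed_loss(outputs, target):
    """Sum the stand-in loss over the outputs of all iterations."""
    return sum(mse(y, target) for y in outputs)


random.seed(0)
target = [math.sin(2.0 * math.pi * i / 64) for i in range(64)]  # toy clean signal
y0 = [random.gauss(0.0, 1.0) for _ in range(64)]                # noise input

outputs = iterative_refine(y0, target, num_iters=5)
errors = [mse(y, target) for y in outputs]
print(errors)                        # error after each of the five iterations
print(summed_loss(outputs, target))  # loss accumulated over all iterations
```

In this toy setting the clean signal is a fixed point of the update, since the stand-in denoiser returns zero when its input already equals the target, so each iteration moves the signal closer to it. Summing the loss over all intermediate outputs, rather than only the last one, penalizes every iteration for the error that remains.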

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing

Methods: Sigmoid Activation · Softmax · Tanh Activation · WaveRNN · Diffusion