FloWaveNet : A Generative Flow for Raw Audio
Sungwon Kim, Sang-gil Lee, Jongyoon Song, Jaehyeon Kim, and Sungroh Yoon

TL;DR
FloWaveNet is a flow-based generative model for raw audio that enables real-time synthesis with a simple, single-stage training process, with audio quality comparable to that of two-stage models.
Contribution
The paper introduces FloWaveNet, a flow-based model for raw audio synthesis that trains in a single stage with a single maximum likelihood loss, without auxiliary loss terms or a teacher network.
Findings
Real-time raw audio synthesis achieved with FloWaveNet
Single-stage training with maximum likelihood loss
Comparable audio quality to two-stage models
Abstract
Most modern text-to-speech architectures use a WaveNet vocoder for synthesizing high-fidelity waveform audio, but there have been limitations, such as high inference time, in its practical application due to its ancestral sampling scheme. The recently suggested Parallel WaveNet and ClariNet have achieved real-time audio synthesis capability by incorporating inverse autoregressive flow for parallel sampling. However, these approaches require a two-stage training pipeline with a well-trained teacher network and can only produce natural sound by using probability distillation along with auxiliary loss terms. We propose FloWaveNet, a flow-based generative model for raw audio synthesis. FloWaveNet requires only a single-stage training procedure and a single maximum likelihood loss, without any additional auxiliary terms, and it is inherently parallel due to the characteristics of generative…
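To make the "single maximum likelihood loss" and "inherently parallel" claims concrete, here is a minimal sketch of one invertible flow step and its exact log-likelihood. It assumes an affine coupling layer, a common building block of flow-based models; the text shown here does not specify the layer type. The names `coupling_forward`, `coupling_inverse`, and `toy_scale_shift` are hypothetical, and `toy_scale_shift` is a fixed stand-in for what would be a learned network.

```python
import math

def toy_scale_shift(xa):
    # Hypothetical stand-in for a learned network: maps the untouched half
    # of the signal to a log-scale and a shift for the other half.
    log_s = [0.5 * math.tanh(a) for a in xa]
    t = [0.1 * a for a in xa]
    return log_s, t

def coupling_forward(x, scale_shift):
    """Audio -> latent. Returns (z, log|det Jacobian|). Assumes even len(x)."""
    half = len(x) // 2
    xa, xb = x[:half], x[half:]
    log_s, t = scale_shift(xa)
    zb = [(b - ti) * math.exp(-ls) for b, ti, ls in zip(xb, t, log_s)]
    # The Jacobian is triangular, so its log-determinant is a plain sum.
    return xa + zb, -sum(log_s)

def coupling_inverse(z, scale_shift):
    """Latent -> audio. Every output sample is computed independently."""
    half = len(z) // 2
    za, zb = z[:half], z[half:]
    log_s, t = scale_shift(za)  # za equals xa, so the same scale/shift is recovered
    xb = [b * math.exp(ls) + ti for b, ti, ls in zip(zb, t, log_s)]
    return za + xb

def log_likelihood(x, scale_shift):
    """Exact log p(x) via change of variables with a standard normal prior."""
    z, log_det = coupling_forward(x, scale_shift)
    log_pz = sum(-0.5 * (v * v + math.log(2 * math.pi)) for v in z)
    return log_pz + log_det

x = [0.3, -0.2, 0.5, 0.1, -0.4, 0.25, 0.0, -0.1]  # toy "waveform"
nll = -log_likelihood(x, toy_scale_shift)  # the only training loss in this sketch
print("negative log-likelihood:", nll)
```

Training such a model would minimize the negative log-likelihood directly, with no teacher network or auxiliary terms. Sampling would draw the whole latent vector from the prior at once and apply `coupling_inverse` in a single pass, rather than generating one sample at a time.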
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
