Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search
Jaehyeon Kim, Sungwon Kim, Jungil Kong, Sungroh Yoon

TL;DR
Glow-TTS introduces a flow-based, non-autoregressive TTS model that efficiently generates high-quality speech, learns alignments internally, and supports fast, diverse, and multi-speaker synthesis without external aligners.
Contribution
It presents Glow-TTS, a novel flow-based model that learns monotonic alignments internally, enabling fast, robust, and controllable speech synthesis without an external aligner.
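As a rough illustration of what "controllable" can mean at inference time, the sketch below assumes each text token has a diagonal Gaussian prior (means `mu`, log standard deviations `log_sigma`) and a predicted duration in frames. It expands the prior frame by frame and samples a latent sequence. Two knobs are exposed: `length_scale` multiplies every duration (speaking rate) and `temperature` scales the sampling noise (diversity). The function names, the `math.ceil` rounding, and the defaults are assumptions for illustration, not the authors' code, and the trained flow decoder that would map `z` back to a mel-spectrogram is omitted.

```python
import math
import random


def expand_by_durations(per_token, durations):
    """Repeat token i's statistics durations[i] times: a hard, monotonic alignment."""
    frames = []
    for stats, d in zip(per_token, durations):
        frames.extend([stats] * d)
    return frames


def sample_latent(mu, log_sigma, pred_durations, length_scale=1.0, temperature=1.0, seed=None):
    """Draw frame-level latents z ~ N(mu, (temperature * sigma)^2) (hypothetical helper).

    length_scale > 1 stretches every duration (slower speech), < 1 compresses it;
    temperature scales the sampling noise, and 0 returns the expanded prior mean.
    """
    rng = random.Random(seed)
    # Assumption: round scaled durations up and keep at least one frame per token.
    durations = [max(1, math.ceil(d * length_scale)) for d in pred_durations]
    mu_frames = expand_by_durations(mu, durations)
    log_sigma_frames = expand_by_durations(log_sigma, durations)
    z = [
        [m + temperature * math.exp(ls) * rng.gauss(0.0, 1.0) for m, ls in zip(m_vec, ls_vec)]
        for m_vec, ls_vec in zip(mu_frames, log_sigma_frames)
    ]
    # A trained flow decoder would now be run in reverse on z to produce the mel-spectrogram.
    return z, durations


if __name__ == "__main__":
    mu, log_sigma = [[0.0], [1.0]], [[0.0], [0.0]]  # two tokens, one channel each
    z, durations = sample_latent(mu, log_sigma, [2, 3], length_scale=2.0, temperature=0.5, seed=0)
    print(durations, len(z))  # [4, 6] 10
```

Keeping the alignment hard (each frame copies exactly one token's statistics) is what lets a single scalar change the speaking rate without retraining, while the noise scale changes only the sampled detail.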
Findings
Synthesizes speech an order of magnitude faster than Tacotron 2.
Maintains speech quality comparable to that of autoregressive models.
Extensible to multi-speaker TTS settings.
Abstract
Recently, text-to-speech (TTS) models such as FastSpeech and ParaNet have been proposed to generate mel-spectrograms from text in parallel. Despite the advantage, the parallel TTS models cannot be trained without guidance from autoregressive TTS models as their external aligners. In this work, we propose Glow-TTS, a flow-based generative model for parallel TTS that does not require any external aligner. By combining the properties of flows and dynamic programming, the proposed model searches for the most probable monotonic alignment between text and the latent representation of speech on its own. We demonstrate that enforcing hard monotonic alignments enables robust TTS, which generalizes to long utterances, and employing generative flows enables fast, diverse, and controllable speech synthesis. Glow-TTS obtains an order-of-magnitude speed-up over the autoregressive model, Tacotron 2,…
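The alignment search described in the abstract can be sketched as a Viterbi-style dynamic program. Assuming each text token i has a diagonal Gaussian prior over latent frames, `gaussian_log_lik` builds the table L[i][j] of log-likelihoods of frame j under token i, and `monotonic_alignment_search` finds the monotonic, non-skipping path (every frame assigned to exactly one token, tokens visited in order, none skipped) with the highest total log-likelihood via Q[i][j] = max(Q[i-1][j-1], Q[i][j-1]) + L[i][j]. This is a minimal pure-Python sketch; the function names, including the helper `durations_from_path`, are hypothetical.

```python
import math

NEG_INF = float("-inf")


def gaussian_log_lik(z, mu, log_sigma):
    """L[i][j] = log N(z[j]; mu[i], diag(exp(log_sigma[i]))^2), summed over channels."""
    table = []
    for mean_i, log_s_i in zip(mu, log_sigma):
        row = []
        for frame in z:
            total = 0.0
            for x, m, ls in zip(frame, mean_i, log_s_i):
                total += -0.5 * math.log(2 * math.pi) - ls - 0.5 * ((x - m) / math.exp(ls)) ** 2
            row.append(total)
        table.append(row)
    return table


def monotonic_alignment_search(log_lik):
    """Most probable monotonic, non-skipping alignment; returns a token index per frame."""
    n_tokens, n_frames = len(log_lik), len(log_lik[0])
    if n_frames < n_tokens:
        raise ValueError("each text token needs at least one frame")
    # Q[i][j]: best total log-likelihood of frames 0..j, with frame j assigned to token i.
    Q = [[NEG_INF] * n_frames for _ in range(n_tokens)]
    Q[0][0] = log_lik[0][0]
    for j in range(1, n_frames):
        for i in range(min(j + 1, n_tokens)):
            stay = Q[i][j - 1]                               # frame j-1 also belonged to token i
            advance = Q[i - 1][j - 1] if i > 0 else NEG_INF  # frame j-1 belonged to token i-1
            Q[i][j] = max(stay, advance) + log_lik[i][j]
    # Backtrack from (last token, last frame) to (first token, first frame).
    path = [0] * n_frames
    i = n_tokens - 1
    for j in range(n_frames - 1, -1, -1):
        path[j] = i
        # Step back a token when forced (i == j) or when advancing scored strictly higher.
        if j > 0 and i > 0 and (i == j or Q[i - 1][j - 1] > Q[i][j - 1]):
            i -= 1
    return path


def durations_from_path(path, n_tokens):
    """Number of frames aligned to each token."""
    return [path.count(i) for i in range(n_tokens)]


if __name__ == "__main__":
    z = [[0.0], [0.1], [2.0], [1.9]]                # four 1-channel latent frames
    mu, log_sigma = [[0.0], [2.0]], [[0.0], [0.0]]  # two tokens
    path = monotonic_alignment_search(gaussian_log_lik(z, mu, log_sigma))
    print(path, durations_from_path(path, 2))       # [0, 0, 1, 1] [2, 2]
```

Ties are broken toward staying on the current token, and in this form the table costs O(T_text · T_mel) time and memory.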
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques
Methods: Invertible 1x1 Convolution · Affine Coupling · Activation Normalization · Normalizing Flows · GLOW · Glow-TTS
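Of the methods listed, affine coupling is the piece that keeps a flow cheap to invert while still giving an exact likelihood. The sketch below is a minimal illustration, not the Glow-TTS layer. It leaves half of the input unchanged, uses that half to compute a scale and shift for the other half, and returns the log-determinant of the Jacobian (the sum of the log-scales), which enters the change-of-variables likelihood log p(x) = log p(z) + log|det ∂z/∂x|. As simplifying assumptions, `x` is a single flat vector with an even number of entries, and `toy_scale_shift` is a hypothetical hand-written function standing in for a learned network.

```python
import math


def toy_scale_shift(xa):
    """Hypothetical stand-in for the coupling network: any function of xa works."""
    log_s = [0.5 * math.tanh(v) for v in xa]
    t = [0.3 * v - 1.0 for v in xa]
    return log_s, t


def coupling_forward(x, net=toy_scale_shift):
    """y_a = x_a, y_b = x_b * exp(log_s(x_a)) + t(x_a); returns (y, log|det J|)."""
    if len(x) % 2:
        raise ValueError("this sketch expects an even number of entries")
    half = len(x) // 2
    xa, xb = x[:half], x[half:]
    log_s, t = net(xa)
    yb = [b * math.exp(ls) + shift for b, ls, shift in zip(xb, log_s, t)]
    return xa + yb, sum(log_s)


def coupling_inverse(y, net=toy_scale_shift):
    """Exact inverse: y_a equals x_a, so the same scale and shift can be recomputed."""
    if len(y) % 2:
        raise ValueError("this sketch expects an even number of entries")
    half = len(y) // 2
    ya, yb = y[:half], y[half:]
    log_s, t = net(ya)
    xb = [(b - shift) * math.exp(-ls) for b, ls, shift in zip(yb, log_s, t)]
    return ya + xb


if __name__ == "__main__":
    x = [0.2, -1.0, 3.0, 0.5]
    y, log_det = coupling_forward(x)
    x_rec = coupling_inverse(y)
    print(log_det, max(abs(a - b) for a, b in zip(x, x_rec)))  # log-det and reconstruction error
```

Because the inverse only re-evaluates `toy_scale_shift` on the untouched half, the network itself never has to be inverted, which is why it can be an arbitrary function.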
