OverFlow: Putting flows on top of neural transducers for better TTS

Shivam Mehta; Ambika Kirkland; Harm Lameris; Jonas Beskow; Éva Székely; Gustav Eje Henter

arXiv:2211.06892 · eess.AS · September 15, 2023 · 1 citation

Open Access 2 Repos 3 Models 1 Datasets

TL;DR

This paper introduces OverFlow, a neural TTS model combining neural HMMs with normalising flows, resulting in a fully probabilistic, efficient, and high-quality speech synthesis system trained via maximum likelihood.

Contribution

The paper proposes integrating normalising flows with neural HMMs in TTS, enabling better modelling of speech acoustics and durations with fewer training updates.
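As a rough illustration of why adding a flow keeps the likelihood exact, the change-of-variables identity can be written out as below. The notation is an assumption made here for illustration and is not taken from the paper: $x_{1:T}$ denotes the acoustic frames, $f$ an invertible flow, $z_{1:T} = f(x_{1:T})$ the latent frames modelled by the HMM, $s_{1:T}$ the hidden state sequence, and $c$ the text conditioning.

```latex
% Exact log-likelihood of the observed frames under an invertible flow f
\log p_X(x_{1:T} \mid c)
  = \log p_Z\bigl(f(x_{1:T}) \mid c\bigr)
  + \log \left| \det \frac{\partial f(x_{1:T})}{\partial x_{1:T}} \right|

% The latent likelihood is an HMM marginal over state sequences,
% which the forward algorithm evaluates exactly
p_Z(z_{1:T} \mid c)
  = \sum_{s_{1:T}} \prod_{t=1}^{T}
    p(s_t \mid s_{t-1}, c)\, p(z_t \mid s_t, c)
```

Both terms on the right of the first line can be computed exactly, so their sum can be maximised directly, with no variational bound or adversarial loss involved.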

Findings

01 Needs fewer training updates than comparable methods to produce accurate pronunciations

02 Achieves subjective speech quality close to natural speech

03 Is a fully probabilistic model of durations and acoustics, trained with exact maximum likelihood

Abstract

Neural HMMs are a type of neural transducer recently proposed for sequence-to-sequence modelling in text-to-speech. They combine the best features of classic statistical speech synthesis and modern neural TTS, requiring less data and fewer training updates, and are less prone to gibberish output caused by neural attention failures. In this paper, we combine neural HMM TTS with normalising flows for describing the highly non-Gaussian distribution of speech acoustics. The result is a powerful, fully probabilistic model of durations and acoustics that can be trained using exact maximum likelihood. Experiments show that a system based on our proposal needs fewer updates than comparable methods to produce accurate pronunciations and a subjective speech quality close to natural speech. Please see https://shivammehta25.github.io/OverFlow/ for audio examples and code.
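The idea of exact maximum likelihood for an HMM sitting underneath a flow can be sketched in a few lines. This is a toy under loud assumptions, not the authors' implementation: frames are scalars, the flow is a single fixed affine map, emissions are fixed Gaussians, and the HMM is left-to-right with no skips and may end in any state. All function and parameter names (`affine_flow`, `sequence_log_likelihood`, `stay_probs`, and so on) are hypothetical. In the actual system these quantities would be produced by neural networks.

```python
import math
from functools import reduce

NEG_INF = -math.inf


def safe_log(p):
    """log(p), returning -inf for p == 0 instead of raising."""
    return math.log(p) if p > 0.0 else NEG_INF


def log_add(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def log_gauss(z, mean, std):
    """Log-density of a one-dimensional Gaussian."""
    return -0.5 * math.log(2 * math.pi) - math.log(std) - 0.5 * ((z - mean) / std) ** 2


def affine_flow(x, scale, shift):
    """Toy invertible flow z = (x - shift) / scale; returns z and log|dz/dx|."""
    return (x - shift) / scale, -math.log(abs(scale))


def sequence_log_likelihood(xs, means, stds, stay_probs, scale, shift):
    """Exact log p(x_1..x_T): HMM forward pass on z = f(x) plus the log-determinant."""
    n = len(means)
    alpha = []           # alpha[s] = log p(z_1..z_t, state_t = s)
    log_det_total = 0.0  # accumulated change-of-variables correction
    for t, x in enumerate(xs):
        z, log_det = affine_flow(x, scale, shift)
        log_det_total += log_det
        emit = [log_gauss(z, means[s], stds[s]) for s in range(n)]
        if t == 0:
            # Assumption: every sequence starts in the first state.
            alpha = [emit[0]] + [NEG_INF] * (n - 1)
            continue
        prev = alpha
        alpha = []
        for s in range(n):
            stay = prev[s] + safe_log(stay_probs[s])
            move = prev[s - 1] + safe_log(1.0 - stay_probs[s - 1]) if s > 0 else NEG_INF
            alpha.append(log_add(stay, move) + emit[s])
    # Simplification: marginalise over whichever state the sequence ends in.
    return reduce(log_add, alpha, NEG_INF) + log_det_total
```

With a single state that never transitions, the toy model collapses to a plain Gaussian with mean `shift` and standard deviation `scale`, which gives a quick sanity check of the log-determinant term.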

Code & Models

Datasets

Pendrokar/open_tts_tracker
dataset · 396 downloads

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing