Neural HMMs are all you need (for high-quality attention-free TTS)

Shivam Mehta; Éva Székely; Jonas Beskow; Gustav Eje Henter

arXiv:2108.13320 · eess.AS · May 3, 2022

TL;DR

This paper introduces a neural HMM-based TTS model that replaces attention with a monotonic, fully probabilistic left-right no-skip HMM, yielding a smaller, simpler model that trains in fewer iterations while keeping speech quality high.

Contribution

The paper combines classical HMMs with neural networks to create an attention-free, monotonic TTS system trained by maximising the full sequence likelihood without approximation, which improves training efficiency and control over the output.

Findings

01 Achieves comparable naturalness to Tacotron 2
02 Requires fewer training iterations and less data
03 Enables easy control over speaking rate
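The summary above does not say how speaking rate is controlled. As an illustration only, the sketch below assumes one plausible mechanism for a left-right no-skip HMM: stay in a state until the cumulative probability of having left it reaches a threshold, so that a lower threshold gives shorter durations and faster speech. The function name `frames_per_state`, the `threshold` parameter, and the example probabilities are hypothetical.

```python
def frames_per_state(trans_probs, threshold=0.5):
    """Count the frames spent in one state (hypothetical helper).

    trans_probs: per-frame probabilities of leaving the state.
    threshold:   leave as soon as P(already left) >= threshold.
    """
    stay_prob = 1.0  # probability of still being in the state
    frames = 0
    for tau in trans_probs:
        frames += 1
        stay_prob *= 1.0 - tau
        if 1.0 - stay_prob >= threshold:
            break
    return frames


# Assumed constant leave probability of 0.2 per frame.
probs = [0.2] * 50
slow = frames_per_state(probs, threshold=0.8)
default = frames_per_state(probs, threshold=0.5)
fast = frames_per_state(probs, threshold=0.3)
```

Under this assumption a single scalar changes every state's duration at once, which is one way a probabilistic duration model can make rate control simple.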

Abstract

Neural sequence-to-sequence TTS has achieved significantly better output quality than statistical speech synthesis using HMMs. However, neural TTS is generally not probabilistic and uses non-monotonic attention. Attention failures increase training time and can make synthesis babble incoherently. This paper describes how the old and new paradigms can be combined to obtain the advantages of both worlds, by replacing attention in neural TTS with an autoregressive left-right no-skip hidden Markov model defined by a neural network. Based on this proposal, we modify Tacotron 2 to obtain an HMM-based neural TTS model with monotonic alignment, trained to maximise the full sequence likelihood without approximation. We also describe how to combine ideas from classical and contemporary TTS for best results. The resulting example system is smaller and simpler than Tacotron 2, and learns to speak…
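To make "maximise the full sequence likelihood without approximation" concrete, the sketch below computes the exact log-likelihood of a left-right no-skip HMM with the forward algorithm in the log domain. It is an illustration under stated assumptions, not the authors' implementation: the name `forward_log_likelihood`, the table inputs `log_emit` and `trans_prob`, and the convention that a sequence ends by leaving the last state are all hypothetical choices. In a neural HMM a network would produce these quantities frame by frame; here they are plain lists.

```python
import math

NEG_INF = float("-inf")


def safe_log(p):
    """log(p) that maps p == 0 to -inf instead of raising."""
    return math.log(p) if p > 0.0 else NEG_INF


def logaddexp(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def forward_log_likelihood(log_emit, trans_prob):
    """Exact log-likelihood under a left-right no-skip HMM.

    log_emit[t][n]:   log p(frame t | state n)
    trans_prob[t][n]: probability of moving from state n to n + 1
                      after frame t (otherwise the model stays in n)
    Assumed convention: start in state 0, end by leaving the last state.
    """
    T, N = len(log_emit), len(log_emit[0])
    alpha = [NEG_INF] * N
    alpha[0] = log_emit[0][0]
    for t in range(1, T):
        new_alpha = [NEG_INF] * N
        for n in range(N):
            # Either stay in state n, or arrive from state n - 1 (no skips).
            total = alpha[n] + safe_log(1.0 - trans_prob[t - 1][n])
            if n > 0:
                move = alpha[n - 1] + safe_log(trans_prob[t - 1][n - 1])
                total = logaddexp(total, move)
            new_alpha[n] = log_emit[t][n] + total
        alpha = new_alpha
    return alpha[N - 1] + safe_log(trans_prob[T - 1][N - 1])


# Toy example: 3 frames, 2 states, all probabilities set to 0.5.
log_emit = [[math.log(0.5)] * 2 for _ in range(3)]
trans_prob = [[0.5] * 2 for _ in range(3)]
log_lik = forward_log_likelihood(log_emit, trans_prob)
```

Because the recursion sums over every monotonic alignment, no attention weights are needed; training would minimise the negative of this value with automatic differentiation.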

Code & Models

Datasets

Pendrokar/open_tts_tracker
dataset · 396 dl

Taxonomy

Methods: Highway Layer · Max Pooling · Bidirectional GRU · Highway Network · Sigmoid Activation · CBHG · Dilated Causal Convolution · Convolution · Tanh Activation