Neural HMMs are all you need (for high-quality attention-free TTS)
Shivam Mehta, Éva Székely, Jonas Beskow, Gustav Eje Henter

TL;DR
This paper introduces a neural HMM-based TTS model that replaces attention with a monotonic, probabilistic alignment mechanism, yielding a smaller and simpler system than Tacotron 2 that still synthesises high-quality speech.
Contribution
It combines classical HMMs with neural networks to create an attention-free TTS system with monotonic alignment, trained to maximise the full sequence likelihood without approximation. This reduces the training required and gives control over speaking rate.
Findings
Achieves naturalness comparable to Tacotron 2
Requires fewer training iterations and less data
Enables easy control over speaking rate
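The summary does not say how speaking rate is controlled. One plausible mechanism in a left-right no-skip HMM, sketched here purely as an assumption, is to leave a state as soon as the cumulative probability of having left it reaches a chosen threshold. The function name `frames_in_state` and the parameter `quantile` are hypothetical, and the per-frame transition probabilities are supplied as a plain list rather than produced by a network.

```python
def frames_in_state(trans_probs, quantile=0.5):
    """Return how many frames are spent in one state before leaving it.

    trans_probs[i] is the (assumed) probability of leaving the state after
    its (i+1)-th frame. The state is left at the first frame where the
    cumulative probability of having left reaches `quantile`.
    """
    stay = 1.0  # probability of still being in the state
    for frames, tau in enumerate(trans_probs, start=1):
        stay *= 1.0 - tau
        if 1.0 - stay >= quantile:
            return frames
    # Threshold never reached: use every frame that was offered.
    return len(trans_probs)


if __name__ == "__main__":
    probs = [0.2] * 20
    for q in (0.3, 0.5, 0.8):
        print(q, frames_in_state(probs, q))
```

In this sketch a lower threshold gives shorter state durations and therefore faster speech, while a higher threshold slows it down, so a single scalar acts as the rate control.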
Abstract
Neural sequence-to-sequence TTS has achieved significantly better output quality than statistical speech synthesis using HMMs. However, neural TTS is generally not probabilistic and uses non-monotonic attention. Attention failures increase training time and can make synthesis babble incoherently. This paper describes how the old and new paradigms can be combined to obtain the advantages of both worlds, by replacing attention in neural TTS with an autoregressive left-right no-skip hidden Markov model defined by a neural network. Based on this proposal, we modify Tacotron 2 to obtain an HMM-based neural TTS model with monotonic alignment, trained to maximise the full sequence likelihood without approximation. We also describe how to combine ideas from classical and contemporary TTS for best results. The resulting example system is smaller and simpler than Tacotron 2, and learns to speak…
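To make "trained to maximise the full sequence likelihood without approximation" concrete, the minimal Python sketch below computes the exact log-likelihood of a left-right no-skip HMM with the forward algorithm. It is an illustration under stated assumptions, not the authors' implementation. The function name and the inputs `log_emit` and `trans_prob` are hypothetical, and in the model described above these quantities would come from a neural network rather than fixed tables. The sketch also assumes that the sequence starts in the first state and ends by leaving the last state after the final frame.

```python
import math

NEG_INF = float("-inf")


def log_add(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def forward_log_likelihood(log_emit, trans_prob):
    """Exact sequence log-likelihood of a left-right no-skip HMM.

    log_emit[t][n]:   log-probability of frame t under state n.
    trans_prob[t][n]: probability (strictly between 0 and 1) of moving
                      from state n to state n + 1 after frame t;
                      otherwise the model stays in state n.
    """
    num_frames = len(log_emit)
    num_states = len(log_emit[0])

    # Assumption: the model always starts in state 0.
    alpha = [NEG_INF] * num_states
    alpha[0] = log_emit[0][0]

    for t in range(1, num_frames):
        new_alpha = [NEG_INF] * num_states
        for n in range(num_states):
            # Stay in state n, or arrive from state n - 1 (no skips).
            stay = alpha[n] + math.log(1.0 - trans_prob[t - 1][n])
            move = NEG_INF
            if n > 0:
                move = alpha[n - 1] + math.log(trans_prob[t - 1][n - 1])
            new_alpha[n] = log_emit[t][n] + log_add(stay, move)
        alpha = new_alpha

    # Assumption: the utterance ends by leaving the last state.
    return alpha[-1] + math.log(trans_prob[-1][-1])


if __name__ == "__main__":
    # Two states, two frames, every emission has probability 1.
    log_emit = [[0.0, 0.0], [0.0, 0.0]]
    trans_prob = [[0.5, 0.5], [0.5, 0.5]]
    print(math.exp(forward_log_likelihood(log_emit, trans_prob)))
```

Because each state can only be reached from itself or from its immediate predecessor, the recursion sums over all monotonic alignments at a cost linear in the number of frames times the number of states, and no attention weights are needed.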
Taxonomy
Methods: Highway Layer · Max Pooling · Bidirectional GRU · Highway Network · Sigmoid Activation · CBHG · Dilated Causal Convolution · Convolution · Tanh Activation
