Initial investigation of an encoder-decoder end-to-end TTS framework using marginalization of monotonic hard latent alignments
Yusuke Yasuda, Xin Wang, Junichi Yamagishi

TL;DR
This paper introduces a novel end-to-end TTS framework that uses explicit, monotonic hard alignments with marginalization, eliminating the need for soft attention mechanisms and simplifying training.
Contribution
It proposes the first end-to-end TTS system that employs monotonic hard latent alignments marginalized during training, improving robustness and simplicity.
Findings
Achieved more robust alignment learning.
Simplified the TTS model architecture.
Demonstrated effective speech synthesis without soft attention.
Abstract
End-to-end text-to-speech (TTS) synthesis is a method that directly converts input text to output acoustic features using a single network. A recent advance of end-to-end TTS is due to a key technique called attention mechanisms, and all successful methods proposed so far have been based on soft attention mechanisms. However, although network structures are becoming increasingly complex, end-to-end TTS systems with soft attention mechanisms may still fail to learn and to predict accurate alignment between the input and output. This may be because the soft attention mechanisms are too flexible. Therefore, we propose an approach that has more explicit but natural constraints suitable for speech signals to make alignment learning and prediction of end-to-end TTS systems more robust. The proposed system, with the constrained alignment scheme borrowed from segment-to-segment neural…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSpeech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
