PAMA-TTS: Progression-Aware Monotonic Attention for Stable Seq2Seq TTS With Accurate Phoneme Duration Control
Yunchao He, Jian Luan, Yujun Wang

TL;DR
PAMA-TTS introduces a progression-aware monotonic attention mechanism that combines the strengths of attention-based and duration-informed methods to improve stability, naturalness, and duration control in sequence-to-sequence TTS.
Contribution
It proposes a novel attention mechanism leveraging token duration and countdown info to enhance stability and naturalness in TTS.
Findings
Achieves highest naturalness among compared methods.
Maintains comparable or better duration controllability.
Demonstrates stable attention with reduced phoneme errors.
Abstract
Sequence expansion between encoder and decoder is a critical challenge in sequence-to-sequence TTS. Attention-based methods achieve great naturalness but suffer from unstable issues like missing and repeating phonemes, not to mention accurate duration control. Duration-informed methods, on the contrary, seem to easily adjust phoneme duration but show obvious degradation in speech naturalness. This paper proposes PAMA-TTS to address the problem. It takes the advantage of both flexible attention and explicit duration models. Based on the monotonic attention mechanism, PAMA-TTS also leverages token duration and relative position of a frame, especially countdown information, i.e. in how many future frames the present phoneme will end. They help the attention to move forward along the token sequence in a soft but reliable control. Experimental results prove that PAMA-TTS achieves the highest…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSpeech and Audio Processing · Speech Recognition and Synthesis · Advanced Data Compression Techniques
