PAMA-TTS: Progression-Aware Monotonic Attention for Stable Seq2Seq TTS With Accurate Phoneme Duration Control

Yunchao He; Jian Luan; Yujun Wang

arXiv:2110.04486·cs.SD·March 21, 2022

TL;DR

PAMA-TTS introduces a progression-aware monotonic attention mechanism that combines the strengths of attention-based and duration-informed methods to improve stability, naturalness, and duration control in sequence-to-sequence TTS.

Contribution

The paper proposes a novel attention mechanism that uses token duration and countdown information to improve stability and naturalness in TTS.

Findings

01. Achieves highest naturalness among compared methods.

02. Maintains comparable or better duration controllability.

03. Demonstrates stable attention with reduced phoneme errors.

Abstract

Sequence expansion between encoder and decoder is a critical challenge in sequence-to-sequence TTS. Attention-based methods achieve high naturalness but suffer from instability, such as missing or repeated phonemes, and offer no accurate duration control. Duration-informed methods, by contrast, make phoneme duration easy to adjust but show obvious degradation in speech naturalness. This paper proposes PAMA-TTS to address the problem by taking advantage of both flexible attention and explicit duration models. Built on a monotonic attention mechanism, PAMA-TTS also leverages token duration and the relative position of a frame, especially countdown information, i.e. the number of future frames after which the current phoneme will end. These cues help the attention move forward along the token sequence under soft but reliable control. Experimental results show that PAMA-TTS achieves the highest…
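To make the countdown idea in the abstract concrete, the sketch below derives per-frame progression cues from a list of phoneme durations. It is an illustration only, not the authors' implementation: the function name `progression_features`, the use of NumPy, and the convention that the countdown reaches 0 on the last frame of each phoneme are all assumptions that the abstract does not specify.

```python
import numpy as np

def progression_features(durations):
    """Hypothetical helper: per-frame progression cues from token durations.

    durations: number of frames assigned to each token (phoneme).
    Returns three integer arrays, one entry per frame:
      token_index - which token the frame belongs to
      elapsed     - frames already spent in the current token
      countdown   - frames left in the current token after this one
                    (assumed convention: 0 on the token's last frame)
    """
    durations = np.asarray(durations, dtype=np.int64)
    if durations.ndim != 1 or np.any(durations < 0):
        raise ValueError("durations must be a 1-D array of non-negative ints")

    total_frames = int(durations.sum())
    # Expand token ids so each token appears once per frame it covers.
    token_index = np.repeat(np.arange(len(durations)), durations)

    ends = np.cumsum(durations)   # exclusive end frame of each token
    starts = ends - durations     # first frame of each token
    frame = np.arange(total_frames)

    elapsed = frame - starts[token_index]
    countdown = ends[token_index] - 1 - frame
    return token_index, elapsed, countdown

if __name__ == "__main__":
    # Three phonemes lasting 2, 3, and 1 frames.
    token_index, elapsed, countdown = progression_features([2, 3, 1])
    print(token_index)  # [0 0 1 1 1 2]
    print(elapsed)      # [0 1 0 1 2 0]
    print(countdown)    # [1 0 2 1 0 0]
```

In a model of this kind, such cues would presumably be fed to the attention module as extra inputs so that it advances to the next token when the countdown runs out. How PAMA-TTS actually encodes and consumes them is not described in this excerpt.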

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Advanced Data Compression Techniques