Hierarchical Context-Aware Transformers for Non-Autoregressive Text to Speech
Jae-Sung Bae, Tae-Jun Bak, Young-Sun Joo, Hoon-Young Cho

TL;DR
This paper introduces hierarchical, context-aware Transformer structures for non-autoregressive text-to-speech synthesis. The text encoder and audio decoder are each designed around the type and length of the data they handle (text versus audio), and pitch modeling accuracy is improved as well.
Contribution
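The paper proposes a hierarchical Transformer-based text encoder whose self-attention layers are constrained to move from local to global scope, and an audio decoder whose layers are constrained in the reverse direction (global to local), together with improved pitch modeling for TTS.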
Findings
Outperforms baseline TNA-TTS in objective evaluations
Achieves better pitch modeling accuracy
Enhances overall speech synthesis quality
Abstract
In this paper, we propose methods for improving the modeling performance of a Transformer-based non-autoregressive text-to-speech (TNA-TTS) model. Although the text encoder and audio decoder handle different types and lengths of data (i.e., text and audio), the TNA-TTS models are not designed considering these variations. Therefore, to improve the modeling performance of the TNA-TTS model, we propose a hierarchical Transformer structure-based text encoder and audio decoder that are designed to accommodate the characteristics of each module. For the text encoder, we constrain each self-attention layer so the encoder focuses on a text sequence from the local to the global scope. Conversely, the audio decoder constrains its self-attention layers to focus in the reverse direction, i.e., from global to local scope. Additionally, we further improve the pitch modeling accuracy of the audio…
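One way to picture the local-to-global constraint described in the abstract is a per-layer attention window that widens with depth. The sketch below is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say how the constraint is realized, so the banded masks, the window schedule `[1, 2, 4, None]`, and the helper names `local_attention_mask` and `masked_self_attention` are all hypothetical. The attention here is single-head with no learned projections, purely to show the masking.

```python
import numpy as np

def local_attention_mask(seq_len, window):
    """Boolean mask: True where query i may attend to key j, i.e. |i - j| <= window.
    window=None means unrestricted (global) attention. Hypothetical helper."""
    if window is None:
        return np.ones((seq_len, seq_len), dtype=bool)
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window

def masked_self_attention(x, mask):
    """Single-head scaled dot-product self-attention without learned projections,
    restricted to the positions allowed by `mask`."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)      # block disallowed positions
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

# Assumed window schedule (not from the paper): narrow first, global last.
encoder_windows = [1, 2, 4, None]        # text encoder: local -> global
decoder_windows = encoder_windows[::-1]  # audio decoder: global -> local

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))         # 8 positions, 16-dim features

h = x
for w in encoder_windows:
    h = masked_self_attention(h, local_attention_mask(len(x), w))
```

Reversing the schedule, as in `decoder_windows`, gives the global-to-local ordering the abstract describes for the audio decoder. A real model would apply such masks inside multi-head attention layers with learned projections.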
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam · Byte Pair Encoding · Layer Normalization · Dropout · Multi-Head Attention · Label Smoothing
