Hierarchical Context-Aware Transformers for Non-Autoregressive Text to Speech

Jae-Sung Bae; Tae-Jun Bak; Young-Sun Joo; Hoon-Young Cho

arXiv:2106.15144 · eess.AS · June 30, 2021 · Interspeech

TL;DR

This paper introduces hierarchical, context-aware Transformer structures for non-autoregressive text-to-speech synthesis. The text encoder and audio decoder are each designed around the characteristics of the data they handle (text and audio, respectively), and pitch modeling accuracy is improved.

Contribution

It proposes a novel hierarchical Transformer-based encoder and decoder tailored to text and audio data characteristics, with improved pitch modeling for TTS.

Findings

01 Outperforms baseline TNA-TTS in objective evaluations
02 Achieves better pitch modeling accuracy
03 Enhances overall speech synthesis quality

Abstract

In this paper, we propose methods for improving the modeling performance of a Transformer-based non-autoregressive text-to-speech (TNA-TTS) model. Although the text encoder and audio decoder handle different types and lengths of data (i.e., text and audio), TNA-TTS models are not designed with these differences in mind. Therefore, to improve the modeling performance of the TNA-TTS model, we propose a hierarchical Transformer structure-based text encoder and audio decoder that are designed to accommodate the characteristics of each module. For the text encoder, we constrain each self-attention layer so the encoder focuses on a text sequence from the local to the global scope. Conversely, the audio decoder constrains its self-attention layers to focus in the reverse direction, i.e., from global to local scope. Additionally, we further improve the pitch modeling accuracy of the audio…
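The local-to-global and global-to-local constraints described in the abstract can be illustrated with per-layer attention masks. The following is a minimal sketch, not the authors' implementation: it assumes the constraint is a symmetric band (windowed) mask whose width changes from layer to layer. The window schedule `[1, 2, 4, None]` and the helper names `band_mask` and `hierarchical_masks` are hypothetical and chosen only for illustration; the abstract does not state the actual window sizes or masking mechanism.

```python
from typing import List, Optional


def band_mask(seq_len: int, window: Optional[int]) -> List[List[bool]]:
    """Return a mask where mask[i][j] is True if position i may attend to j.

    A position may attend to neighbours at distance <= window.
    window=None means unrestricted (global) attention.
    """
    return [
        [window is None or abs(i - j) <= window for j in range(seq_len)]
        for i in range(seq_len)
    ]


def hierarchical_masks(
    seq_len: int, windows: List[Optional[int]]
) -> List[List[List[bool]]]:
    """Build one attention mask per self-attention layer."""
    return [band_mask(seq_len, w) for w in windows]


# Hypothetical schedules (assumed, not taken from the paper).
ENCODER_WINDOWS: List[Optional[int]] = [1, 2, 4, None]  # local -> global
DECODER_WINDOWS = list(reversed(ENCODER_WINDOWS))       # global -> local

encoder_masks = hierarchical_masks(6, ENCODER_WINDOWS)
decoder_masks = hierarchical_masks(6, DECODER_WINDOWS)


def visible_pairs(mask: List[List[bool]]) -> int:
    """Count how many (query, key) pairs the mask allows."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    for name, masks in (("encoder", encoder_masks), ("decoder", decoder_masks)):
        print(name, [visible_pairs(m) for m in masks])
```

In a real model, each mask would be passed to the corresponding self-attention layer so that disallowed positions receive zero attention weight. The encoder's first layer then sees only adjacent tokens while its last layer sees the whole sequence, and the decoder follows the reverse order.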

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam · Byte Pair Encoding · Layer Normalization · Dropout · Multi-Head Attention · Label Smoothing