Nana-HDR: A Non-attentive Non-autoregressive Hybrid Model for TTS

Shilun Lin; Wenchao Su; Li Meng; Fenglong Xie; Xinhui Li; Li Lu

arXiv:2109.13673·cs.CL·September 29, 2021

TL;DR

Nana-HDR introduces a hybrid non-attentive, non-autoregressive TTS model combining Transformer and RNN components, achieving high naturalness and robustness in Mandarin speech synthesis.

Contribution

The paper proposes a novel Dense-fuse encoder and uses a duration predictor, in place of an attention model, to connect the encoder and decoder of a non-autoregressive TTS model.

Findings

01 Achieves competitive naturalness in Mandarin TTS

02 Demonstrates robustness across different datasets

03 Utilizes hybrid Transformer-RNN architecture effectively

Abstract

This paper presents Nana-HDR, a new non-attentive, non-autoregressive TTS model with a hybrid design: a Transformer-based Dense-fuse encoder and an RNN-based decoder. It consists of three main parts. First, a novel Dense-fuse encoder that uses dense connections between basic Transformer blocks for coarse feature fusion and a multi-head attention layer for fine feature fusion. Second, a single-layer non-autoregressive RNN-based decoder. Third, a duration predictor, used in place of an attention model, that connects the hybrid encoder and decoder. Experiments indicate that Nana-HDR makes full use of the advantages of each component: the strong text encoding ability of the Transformer-based encoder, stateful decoding that is not affected by exposure bias or local information preference, and the stable alignment provided by the duration predictor. Due to these advantages, Nana-HDR achieves competitive…
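To illustrate how a duration predictor can stand in for attention between an encoder and a decoder, here is a minimal sketch of duration-based expansion. The abstract does not say how Nana-HDR applies its predicted durations, so the repeat-based upsampling below is an assumption borrowed from common duration-based TTS designs, and the function name `expand_by_duration` and the toy values are hypothetical.

```python
from typing import List, Sequence


def expand_by_duration(
    encoder_states: Sequence[Sequence[float]],
    durations: Sequence[int],
) -> List[List[float]]:
    """Repeat each token-level state durations[i] times.

    Assumption: simple repetition; the paper may use a different
    upsampling scheme. The result has one state per output frame,
    which a non-autoregressive decoder could consume in one pass.
    """
    if len(encoder_states) != len(durations):
        raise ValueError("one duration per encoder state is required")
    frames: List[List[float]] = []
    for state, d in zip(encoder_states, durations):
        if d < 0:
            raise ValueError("durations must be non-negative")
        # Copy the state so frames do not share list objects.
        frames.extend(list(state) for _ in range(d))
    return frames


# Toy example: three token states with hypothetical durations in frames.
states = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
durations = [2, 0, 3]
frames = expand_by_duration(states, durations)
```

Because the alignment here is fixed by the durations rather than learned by attention at synthesis time, each input token maps to a contiguous run of frames, which is the property the abstract refers to as stable alignment.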

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling

Methods: Attention Is All You Need · Linear Layer · Dropout · Layer Normalization · Position-Wise Feed-Forward Layer · Adam · Dense Connections · Byte Pair Encoding · Label Smoothing · Residual Connection