Nana-HDR: A Non-attentive Non-autoregressive Hybrid Model for TTS
Shilun Lin, Wenchao Su, Li Meng, Fenglong Xie, Xinhui Li, Li Lu

TL;DR
Nana-HDR introduces a hybrid non-attentive, non-autoregressive TTS model combining Transformer and RNN components, achieving high naturalness and robustness in Mandarin speech synthesis.
Contribution
It proposes a novel Dense-fuse encoder and a duration predictor to enhance non-autoregressive TTS without attention mechanisms.
Findings
Achieves competitive naturalness in Mandarin TTS
Demonstrates robustness across different datasets
Utilizes hybrid Transformer-RNN architecture effectively
Abstract
This paper presents Nana-HDR, a new non-attentive non-autoregressive model with hybrid Transformer-based Dense-fuse encoder and RNN-based decoder for TTS. It mainly consists of three parts: Firstly, a novel Dense-fuse encoder with dense connections between basic Transformer blocks for coarse feature fusion and a multi-head attention layer for fine feature fusion. Secondly, a single-layer non-autoregressive RNN-based decoder. Thirdly, a duration predictor instead of an attention model that connects the above hybrid encoder and decoder. Experiments indicate that Nana-HDR gives full play to the advantages of each component, such as strong text encoding ability of Transformer-based encoder, stateful decoding without being bothered by exposure bias and local information preference, and stable alignment provided by duration predictor. Due to these advantages, Nana-HDR achieves competitive…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSpeech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
MethodsAttention Is All You Need · Linear Layer · Dropout · Layer Normalization · Position-Wise Feed-Forward Layer · Adam · Dense Connections · Byte Pair Encoding · Label Smoothing · Residual Connection
