Modeling Prosodic Phrasing with Multi-Task Learning in Tacotron-based TTS
Rui Liu, Berrak Sisman, Feilong Bao, Guanglai Gao, Haizhou Li

TL;DR
This paper extends Tacotron-based speech synthesis to model prosodic phrasing explicitly through multi-task learning, improving voice quality, especially for long sentences, in both Chinese and Mongolian.
Contribution
To the authors' knowledge, it introduces the first multi-task learning approach for Tacotron-based TTS that jointly predicts the Mel spectrum and prosodic phrase breaks.
Findings
Improved voice quality in synthesized speech.
Effective prosodic phrasing modeling for long sentences.
Consistent results across Chinese and Mongolian languages.
Abstract
Tacotron-based end-to-end speech synthesis has shown remarkable voice quality. However, the rendering of prosody in the synthesized speech remains to be improved, especially for long sentences, where prosodic phrasing errors can occur frequently. In this paper, we extend the Tacotron-based speech synthesis framework to explicitly model the prosodic phrase breaks. We propose a multi-task learning scheme for Tacotron training that optimizes the system to predict both the Mel spectrum and phrase breaks. To the best of our knowledge, this is the first implementation of multi-task learning for Tacotron-based TTS with a prosodic phrasing model. Experiments show that our proposed training scheme consistently improves the voice quality for both Chinese and Mongolian systems.
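The joint objective described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of mean squared error for the Mel spectrum, cross-entropy for phrase breaks, and the `break_weight` hyperparameter are all assumptions chosen for illustration. It only shows the general shape of a multi-task loss that combines a spectrum reconstruction term with a phrase-break classification term.

```python
import math


def mel_loss(pred, target):
    """Mean squared error over all frames and Mel bins (assumed loss choice)."""
    total, count = 0.0, 0
    for pred_frame, target_frame in zip(pred, target):
        for p, t in zip(pred_frame, target_frame):
            total += (p - t) ** 2
            count += 1
    return total / count


def break_loss(probs, labels):
    """Mean cross-entropy over tokens.

    probs[i] is a probability distribution over break classes for token i;
    labels[i] is the index of the reference class (e.g. 0 = no break, 1 = break).
    """
    eps = 1e-12  # avoids log(0)
    return -sum(math.log(p[y] + eps) for p, y in zip(probs, labels)) / len(labels)


def multitask_loss(mel_pred, mel_target, break_probs, break_labels, break_weight=0.5):
    """Weighted sum of the two task losses; break_weight is a hypothetical hyperparameter."""
    return mel_loss(mel_pred, mel_target) + break_weight * break_loss(break_probs, break_labels)


# Toy example: two frames with two Mel bins each, and three input tokens.
mel_pred = [[0.1, 0.2], [0.3, 0.4]]
mel_target = [[0.0, 0.2], [0.3, 0.5]]
break_probs = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]
break_labels = [0, 1, 0]
print(multitask_loss(mel_pred, mel_target, break_probs, break_labels))
```

In a real system both terms would be computed on network outputs and backpropagated together, so that the shared encoder receives gradients from the phrase-break task as well as from spectrum prediction.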
Taxonomy
Methods: Sigmoid Activation · Highway Layer · Max Pooling · Highway Network · Griffin-Lim Algorithm · Gated Recurrent Unit · Convolution · Dense Connections · Residual GRU · Tanh Activation
