Modeling Prosodic Phrasing with Multi-Task Learning in Tacotron-based TTS

Rui Liu; Berrak Sisman; Feilong Bao; Guanglai Gao; Haizhou Li

arXiv:2008.05284·eess.AS·February 9, 2021

TL;DR

This paper extends Tacotron-based speech synthesis to model prosodic phrase breaks explicitly through multi-task learning, improving voice quality for both Chinese and Mongolian systems, with long sentences as the main motivation.

Contribution

It introduces a multi-task learning scheme for Tacotron TTS that jointly predicts the Mel spectrum and prosodic phrase breaks; to the authors' knowledge, it is the first such implementation with a prosodic phrasing model.

Findings

01

Improved voice quality in synthesized speech.

02

Effective prosodic phrasing modeling for long sentences.

03

Consistent results across Chinese and Mongolian languages.

Abstract

Tacotron-based end-to-end speech synthesis has shown remarkable voice quality. However, the rendering of prosody in the synthesized speech remains to be improved, especially for long sentences, where prosodic phrasing errors can occur frequently. In this paper, we extend the Tacotron-based speech synthesis framework to explicitly model prosodic phrase breaks. We propose a multi-task learning scheme for Tacotron training that optimizes the system to predict both the Mel spectrum and phrase breaks. To the best of our knowledge, this is the first implementation of multi-task learning for Tacotron-based TTS with a prosodic phrasing model. Experiments show that our proposed training scheme consistently improves the voice quality for both Chinese and Mongolian systems.
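
The abstract describes training one system on two objectives at once: Mel spectrum prediction and phrase break prediction. The sketch below shows one common way to combine two such objectives into a single training loss. It is an illustration only, not the authors' implementation: the abstract does not state the loss functions or how they are weighted, so the mean squared error for the Mel term, the cross-entropy for the break term, the break_weight parameter, and all function names are assumptions.

```python
import math


def mel_loss(pred, target):
    """Mean squared error over all frames and Mel bins (assumed loss form)."""
    total = 0.0
    count = 0
    for pred_frame, target_frame in zip(pred, target):
        for p, t in zip(pred_frame, target_frame):
            total += (p - t) ** 2
            count += 1
    return total / count


def break_loss(probs, labels):
    """Mean cross-entropy over input tokens (assumed loss form).

    probs[i] is a predicted distribution over break classes for token i,
    labels[i] is the index of the reference class for that token.
    """
    eps = 1e-12  # avoids log(0) when a class gets zero probability
    total = sum(-math.log(p[y] + eps) for p, y in zip(probs, labels))
    return total / len(labels)


def multitask_loss(mel_pred, mel_target, break_probs, break_labels,
                   break_weight=1.0):
    """Weighted sum of the two task losses.

    break_weight is a hypothetical hyperparameter that trades off the
    phrase break task against the Mel spectrum task.
    """
    return (mel_loss(mel_pred, mel_target)
            + break_weight * break_loss(break_probs, break_labels))


if __name__ == "__main__":
    # Toy data: 2 frames x 2 Mel bins, and 2 tokens with 2 break classes
    # (0 = no break, 1 = break).
    mel_pred = [[0.0, 1.0], [2.0, 3.0]]
    mel_target = [[0.0, 1.0], [2.0, 5.0]]
    break_probs = [[0.9, 0.1], [0.2, 0.8]]
    break_labels = [0, 1]
    print(multitask_loss(mel_pred, mel_target, break_probs, break_labels))
```

With break_weight set to 0 the objective reduces to the Mel term alone, which gives a convenient single-task baseline for comparison in this sketch.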

Taxonomy

Methods: Sigmoid Activation · Highway Layer · Max Pooling · Highway Network · Griffin-Lim Algorithm · Gated Recurrent Unit · Convolution · Dense Connections · Residual GRU · Tanh Activation