Expressive TTS Training with Frame and Style Reconstruction Loss
Rui Liu, Berrak Sisman, Guanglai Gao, Haizhou Li

TL;DR
This paper introduces a Tacotron training method that improves speech expressiveness by combining a frame-level spectral reconstruction loss with an utterance-level style reconstruction loss. It needs no prosody annotations and is reported to outperform a state-of-the-art baseline.
Contribution
It proposes a new training strategy that uses an utterance-level perceptual style loss, departing from style token methods, to improve TTS expressiveness without explicit prosody modeling.
Findings
Outperforms a state-of-the-art baseline in naturalness.
Achieves higher expressiveness in synthesized speech.
First to incorporate utterance-level perceptual loss in Tacotron training.
Abstract
We propose a novel training strategy for a Tacotron-based text-to-speech (TTS) system to improve the expressiveness of speech. One of the key challenges in prosody modeling is the lack of reference that makes explicit modeling difficult. The proposed technique doesn't require prosody annotations from training data. It doesn't attempt to model prosody explicitly either, but rather encodes the association between input text and its prosody styles using a Tacotron-based TTS framework. Our proposed idea marks a departure from the style token paradigm where prosody is explicitly modeled by a bank of prosody embeddings. The proposed training strategy adopts a combination of two objective functions: 1) frame-level reconstruction loss, which is calculated between the synthesized and target spectral features; 2) utterance-level style reconstruction loss, which is calculated between the deep style…
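The two-part objective described in the abstract can be illustrated with a minimal sketch. The abstract is truncated before it says how the deep style features are obtained, so everything specific here is an assumption made for illustration: `mean_pool_style` is a hypothetical stand-in for a style extractor, both losses use mean squared error, and `style_weight` is an invented parameter for balancing the two terms. This is not the authors' implementation.

```python
from typing import Callable, List, Sequence

Frames = Sequence[Sequence[float]]  # one inner sequence of spectral bins per frame


def frame_loss(pred: Frames, target: Frames) -> float:
    """Frame-level reconstruction loss: MSE over every bin of every frame (assumed metric)."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same number of frames")
    total, count = 0.0, 0
    for p, t in zip(pred, target):
        for a, b in zip(p, t):
            total += (a - b) ** 2
            count += 1
    return total / count


def mean_pool_style(frames: Frames) -> List[float]:
    """Hypothetical style extractor: averages each bin over time to get one
    utterance-level vector. A real system would use a learned network here."""
    n = len(frames)
    dims = len(frames[0])
    return [sum(f[d] for f in frames) / n for d in range(dims)]


def style_loss(pred: Frames, target: Frames,
               extractor: Callable[[Frames], List[float]] = mean_pool_style) -> float:
    """Utterance-level style loss: MSE between the two utterance-level vectors (assumed metric)."""
    sp, st = extractor(pred), extractor(target)
    return sum((a - b) ** 2 for a, b in zip(sp, st)) / len(sp)


def total_loss(pred: Frames, target: Frames, style_weight: float = 1.0) -> float:
    """Weighted sum of the two terms; the weighting scheme is an assumption."""
    return frame_loss(pred, target) + style_weight * style_loss(pred, target)


if __name__ == "__main__":
    synthesized = [[1.0, 2.0], [3.0, 4.0]]
    reference = [[1.0, 2.0], [3.0, 6.0]]
    print(frame_loss(synthesized, reference))
    print(style_loss(synthesized, reference))
    print(total_loss(synthesized, reference))
```

The point of the split is that the frame-level term compares the two utterances frame by frame, while the style term compares a single summary vector per utterance, so it can penalize a mismatch in overall character even when individual frame errors are small.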
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
