Information Sieve: Content Leakage Reduction in End-to-End Prosody For Expressive Speech Synthesis

Xudong Dai; Cheng Gong; Longbiao Wang; Kaili Zhang

arXiv:2108.01831 · cs.SD · August 5, 2021

Open Access

TL;DR

This paper introduces an unsupervised 'information sieve' method that reduces content leakage in expressive speech synthesis by forcing the style encoder to capture style rather than textual content, while preserving prosody transfer.

Contribution

The proposed method employs a downsample-upsample filter and instance normalization to effectively mitigate content leakage without auxiliary supervision, enhancing style transfer in TTS systems.
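As an illustration of the instance normalization mentioned above, the following is a minimal sketch of the standard operation: each channel of a single utterance is normalized to zero mean and unit variance along the time axis. The function name `instance_norm`, the channels-by-time list layout, and the `eps` value are assumptions made for this example. The summary here does not say where in the convolution layers the paper applies the operation or whether learned scale and shift parameters are used, so none are shown.

```python
import math
from typing import List, Sequence


def instance_norm(features: Sequence[Sequence[float]], eps: float = 1e-5) -> List[List[float]]:
    """Normalize each channel of one utterance over time (hypothetical helper).

    `features` is laid out as channels x time. No learned affine parameters
    are applied; this is the plain, textbook form of instance normalization.
    """
    normalized = []
    for channel in features:
        if len(channel) == 0:
            raise ValueError("each channel needs at least one time step")
        mean = sum(channel) / len(channel)
        variance = sum((x - mean) ** 2 for x in channel) / len(channel)
        scale = 1.0 / math.sqrt(variance + eps)
        normalized.append([(x - mean) * scale for x in channel])
    return normalized


# Two channels, three time steps; the second channel is constant.
normalized = instance_norm([[1.0, 2.0, 3.0], [10.0, 10.0, 10.0]])
```

Because the statistics are computed per utterance and per channel, utterance-level offsets in the features are removed before later layers see them.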

Findings

01 A lower word error rate (WER) indicates reduced content leakage.

02 Listening tests show that prosody transferability is preserved.

03 Outperforms baseline models such as GST-Tacotron and ASR-guided Tacotron.
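For readers unfamiliar with the metric in the first finding, here is a small sketch of how word error rate is conventionally computed: the word-level edit distance between a reference text and a transcript, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions for this example. The summary does not describe the paper's evaluation pipeline; a common setup, assumed here, is to transcribe the synthesized speech with an ASR system and compare the transcript with the input text, so that content leaked from the reference utterance shows up as extra errors.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current_row.append(min(
                previous_row[j] + 1,                      # deletion
                current_row[j - 1] + 1,                   # insertion
                previous_row[j - 1] + substitution_cost,  # match or substitution
            ))
        previous_row = current_row
    return previous_row[-1] / len(ref_words)


# Hypothetical input text and ASR transcript of the synthesized audio.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

In this toy case one of six reference words is missing from the transcript, so the score is one sixth.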

Abstract

Expressive neural text-to-speech (TTS) systems incorporate a style encoder to learn a latent embedding as the style information. However, this embedding process may encode redundant textual information. This phenomenon is called content leakage. Researchers have attempted to resolve this problem by adding an ASR or other auxiliary supervision loss functions. In this study, we propose an unsupervised method called the "information sieve" to reduce the effect of content leakage in prosody transfer. The rationale of this approach is that the style encoder can be forced to focus on style information rather than on textual information contained in the reference speech by a well-designed downsample-upsample filter, i.e., the extracted style embeddings can be downsampled at a certain interval and then upsampled by duplication. Furthermore, we used instance normalization in convolution layers…
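The downsample-upsample step described in the abstract can be sketched as follows: keep one style frame per fixed interval, then duplicate each kept frame to restore the original sequence length. This is a minimal illustration in plain Python, not the authors' implementation. The function name `information_sieve`, the time-by-features list layout, and the interval of 2 are assumptions for this example; the summary does not give the interval the paper uses or where in the style encoder the filter sits.

```python
from typing import List, Sequence


def information_sieve(frames: Sequence[Sequence[float]], interval: int) -> List[List[float]]:
    """Downsample at a fixed interval, then upsample by duplication.

    `frames` is laid out as time x features. Frame t in the output is a copy
    of the most recent kept frame, i.e. the one at index (t // interval) * interval.
    """
    if interval < 1:
        raise ValueError("interval must be at least 1")
    sieved = []
    for t in range(len(frames)):
        kept_index = (t // interval) * interval
        sieved.append(list(frames[kept_index]))
    return sieved


# Five hypothetical style frames with two features each.
frames = [[0.1, 1.0], [0.2, 1.1], [0.3, 1.2], [0.4, 1.3], [0.5, 1.4]]
sieved = information_sieve(frames, interval=2)
# Frames 0, 2, and 4 are kept; frames 1 and 3 become copies of 0 and 2.
```

The output has the same length as the input but varies only once per interval, which limits how much fine-grained, frame-level detail the embedding sequence can carry.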

Taxonomy

Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Speech and dialogue systems