Information Sieve: Content Leakage Reduction in End-to-End Prosody For Expressive Speech Synthesis
Xudong Dai, Cheng Gong, Longbiao Wang, Kaili Zhang

TL;DR
This paper introduces an unsupervised "information sieve" method that reduces content leakage in expressive speech synthesis by forcing the style encoder to capture style rather than textual content, improving prosody transfer.
Contribution
The proposed method employs a downsample-upsample filter and instance normalization to effectively mitigate content leakage without auxiliary supervision, enhancing style transfer in TTS systems.
Findings
A lower word error rate (WER) indicates reduced content leakage.
Listening tests show that prosody transferability is preserved.
The method outperforms baseline models such as GST-Tacotron and ASR-guided Tacotron.
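For readers unfamiliar with the metric in the first finding, the sketch below computes WER in the standard way: the word-level edit distance between a reference and a hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are illustrative assumptions. How the paper obtains its hypotheses (presumably by transcribing the synthesized speech with an ASR system) is not stated on this page.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration; tokenization is plain
    whitespace splitting, which is an assumption.
    """
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference
    # words and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Under this reading, if the style embedding leaks content from the reference utterance, the synthesized words drift away from the input text and the WER rises, so a lower value suggests less leakage.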
Abstract
Expressive neural text-to-speech (TTS) systems incorporate a style encoder to learn a latent embedding as the style information. However, this embedding process may encode redundant textual information. This phenomenon is called content leakage. Researchers have attempted to resolve this problem by adding an ASR or other auxiliary supervision loss functions. In this study, we propose an unsupervised method called the "information sieve" to reduce the effect of content leakage in prosody transfer. The rationale of this approach is that the style encoder can be forced to focus on style information rather than on textual information contained in the reference speech by a well-designed downsample-upsample filter, i.e., the extracted style embeddings can be downsampled at a certain interval and then upsampled by duplication. Furthermore, we used instance normalization in convolution layers…
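The downsample-upsample filter described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `information_sieve`, the `interval` parameter, and the representation of style embeddings as a list of per-frame vectors are all hypothetical, and the point in the style encoder where the filter is applied is not specified on this page.

```python
from typing import List, Sequence


def information_sieve(
    frames: Sequence[Sequence[float]], interval: int
) -> List[List[float]]:
    """Keep every `interval`-th frame, then restore the original length
    by duplicating each kept frame.

    Hypothetical sketch: `frames` is a time-ordered sequence of
    embedding vectors; `interval` is the downsampling step.
    """
    if interval < 1:
        raise ValueError("interval must be >= 1")

    # Downsample: keep frames 0, interval, 2 * interval, ...
    kept = frames[::interval]

    # Upsample by duplication: repeat each kept frame `interval` times.
    upsampled: List[List[float]] = []
    for frame in kept:
        upsampled.extend(list(frame) for _ in range(interval))

    # Trim the overshoot when the length is not a multiple of `interval`.
    return upsampled[: len(frames)]


if __name__ == "__main__":
    frames = [[0.0], [1.0], [2.0], [3.0], [4.0]]
    print(information_sieve(frames, interval=2))
```

The intuition, as the abstract puts it, is that the filter forces the encoder toward style rather than text: fine-grained frame-to-frame variation is discarded by the downsampling step, while information that persists across the interval survives the duplication.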
Taxonomy
Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Speech and dialogue systems
