Applying the Information Bottleneck Principle to Prosodic Representation Learning
Guangyan Zhang, Ying Qin, Daxin Tan, Tan Lee

TL;DR
This paper introduces a neural speech generation model that applies the information bottleneck principle to learn controllable, word-level prosodic representations capable of speech reconstruction and prosody transfer.
Contribution
It proposes a novel IB-based neural network with a modified VQ-VAE layer for learning and controlling prosodic representations in speech generation.
Findings
Effective prosody transfer demonstrated
IB capacity tuning improves representation quality
Model achieves high speech reconstruction fidelity
Abstract
This paper describes a novel design of a neural network-based speech generation model for learning prosodic representation. The problem of representation learning is formulated according to the information bottleneck (IB) principle. A modified VQ-VAE quantized layer is incorporated in the speech generation model to control the IB capacity and adjust the balance between the reconstruction power and the disentanglement capability of the learned representation. The proposed model is able to learn word-level prosodic representations from speech data. With an optimized IB capacity, the learned representations are not only adequate to reconstruct the original speech but can also be used to transfer the prosody onto different textual content. Extensive results of the objective and subjective evaluation are presented to demonstrate the effect of IB capacity control, the effectiveness, and potential usage of…
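To illustrate how a quantized layer can limit bottleneck capacity, the following is a minimal sketch of standard vector quantization, not the paper's modified VQ-VAE layer, whose details are not given in this abstract. It assumes a plain nearest-neighbor codebook lookup; the function names, the 2-D feature values, and the codebook entries are all hypothetical. With a codebook of K entries, each quantized word-level vector can carry at most log2(K) bits, so codebook size is one possible capacity knob.

```python
import math

def quantize(vectors, codebook):
    """Return, for each vector, the index of its nearest codebook entry
    (squared Euclidean distance). Standard VQ lookup, assumed for illustration."""
    indices = []
    for v in vectors:
        dists = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in codebook]
        indices.append(dists.index(min(dists)))
    return indices

def capacity_bits(codebook_size):
    """Upper bound, in bits, on the information one quantized vector can carry."""
    return math.log2(codebook_size)

# Hypothetical 2-D word-level prosody features (values are made up).
word_features = [(0.1, 0.2), (0.9, 0.8), (0.4, 0.6)]
# Hypothetical codebook with K = 4 entries.
codebook = [(0.0, 0.0), (1.0, 1.0), (0.5, 0.5), (0.0, 1.0)]

codes = quantize(word_features, codebook)       # discrete code per word
quantized = [codebook[i] for i in codes]        # vectors passed downstream
bits_per_word = capacity_bits(len(codebook))    # capacity bound for K = 4
```

In this sketch, shrinking the codebook lowers the bound and forces the representation to discard more detail, while enlarging it favors reconstruction; this mirrors the trade-off the abstract describes, under the stated assumptions.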
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing
Methods: VQ-VAE
