Speaker and Style Disentanglement of Speech Based on Contrastive Predictive Coding Supported Factorized Variational Autoencoder
Yuying Xie, Michael Kuhlmann, Frederik Rautenberg, Zheng-Hua Tan, Reinhold Haeb-Umbach

TL;DR
This paper introduces a method for disentangling speaker and style information in speech signals. It builds on a contrastive predictive coding supported factorized variational autoencoder and uses speaker labels, but no style labels, with voice conversion as the target application.
Contribution
It proposes a new approach to further disentangle speaker and style features leveraging speaker labels, improving speech representation for voice conversion.
Findings
Effective extraction of disentangled speaker and style features
Facilitates improved speaker and style conversion in speech
Validates the method's effectiveness through experiments
Abstract
Speech signals encompass various information across multiple levels, including content, speaker, and style. Disentangling this information, although challenging, is important for applications such as voice conversion. The contrastive predictive coding supported factorized variational autoencoder achieves unsupervised disentanglement of a speech signal into speaker and content embeddings by assuming that speaker information is temporally more stable than content-induced variations. However, this assumption may introduce other temporally stable information, such as environment or emotion, into the speaker embeddings; we call this information style. In this work, we propose a method to further disentangle non-content features into distinct speaker and style features, notably by leveraging readily accessible and well-defined speaker labels without the need for style labels. Experimental results validate…
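The abstract builds on contrastive predictive coding, which is typically trained with the InfoNCE objective. The following is a minimal NumPy sketch of a generic InfoNCE loss, not the authors' implementation: the function name `info_nce`, the cosine-similarity scoring, and the `temperature` value are all assumptions made for illustration. Each anchor embedding is paired with one positive (for instance, an embedding from another segment of the same utterance), and the other rows in the batch act as negatives.

```python
import numpy as np


def info_nce(anchors, positives, temperature=0.1):
    """Generic InfoNCE loss (illustrative sketch, not the paper's code).

    Row i of `positives` is the positive sample for row i of `anchors`;
    every other row of `positives` is treated as a negative for anchor i.
    """
    # L2-normalise so that the dot product is a cosine similarity (assumption).
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)

    # Pairwise similarity scores, shape (N, N), scaled by the temperature.
    logits = a @ p.T / temperature

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # The correct "class" for anchor i is column i (its own positive).
    return float(-np.mean(np.diag(log_probs)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(8, 16))
    # Matched pairs should score a lower loss than deliberately mismatched ones.
    print("matched:   ", info_nce(x, x))
    print("mismatched:", info_nce(x, np.roll(x, 1, axis=0)))
```

When all embeddings are identical, the positive cannot be told apart from the negatives and the loss equals log N for a batch of N pairs, which is a convenient sanity check for this kind of objective.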
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Advanced Data Compression Techniques
Methods: INFO: An Efficient Optimization Algorithm based on Weighted Mean of Vectors · InfoNCE · Contrastive Predictive Coding
