Speaker and Style Disentanglement of Speech Based on Contrastive Predictive Coding Supported Factorized Variational Autoencoder

Yuying Xie; Michael Kuhlmann; Frederik Rautenberg; Zheng-Hua Tan; Reinhold Haeb-Umbach

arXiv:2409.03520 · eess.AS · September 6, 2024 · EUSIPCO

TL;DR

This paper extends the contrastive predictive coding supported factorized variational autoencoder to disentangle speaker and style information in speech, using speaker labels but no style labels, for applications such as voice conversion.

Contribution

The paper proposes a method that further separates the non-content features of speech into distinct speaker and style features by leveraging speaker labels, without requiring style labels.

Findings

01. Effective extraction of disentangled speaker and style features

02. Facilitates improved speaker and style conversion in speech

03. Validates the method's effectiveness through experiments

Abstract

Speech signals carry information at multiple levels, including content, speaker, and style. Disentangling this information, although challenging, is important for applications such as voice conversion. The contrastive predictive coding supported factorized variational autoencoder achieves unsupervised disentanglement of a speech signal into speaker and content embeddings by assuming that speaker information is temporally more stable than content-induced variations. However, this assumption may introduce other temporally stable information into the speaker embeddings, such as environment or emotion, which we call style. In this work, we propose a method to further disentangle non-content features into distinct speaker and style features, notably by leveraging readily accessible and well-defined speaker labels without the necessity for style labels. Experimental results validate…
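
The temporal-stability assumption in the abstract is commonly turned into a training signal with an InfoNCE-style contrastive loss: embeddings from two segments of the same utterance are treated as a positive pair, and segments from other utterances act as negatives. The following is a minimal generic sketch of that loss in NumPy, not the authors' implementation. The function name `info_nce`, the `temperature` value, and the toy data are all assumptions made for illustration.

```python
import numpy as np


def info_nce(anchors, positives, temperature=0.1):
    """Generic InfoNCE loss over a batch (illustrative, not the paper's code).

    anchors, positives: arrays of shape (batch, dim). Row i of `positives`
    is the positive for row i of `anchors`; all other rows act as negatives.
    """
    # L2-normalize so the dot product is a cosine similarity.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature  # shape (batch, batch)
    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The correct "class" for row i is column i.
    return float(-np.mean(np.diag(log_probs)))


# Toy data (hypothetical): each utterance has one temporally stable vector,
# and two segments of the same utterance are noisy copies of it.
rng = np.random.default_rng(0)
utterance_means = rng.normal(size=(8, 16))
seg_a = utterance_means + 0.05 * rng.normal(size=(8, 16))
seg_b = utterance_means + 0.05 * rng.normal(size=(8, 16))

aligned = info_nce(seg_a, seg_b)         # positives from the same utterance
shuffled = info_nce(seg_a, seg_b[::-1])  # positives from the wrong utterance
print(aligned, shuffled)
```

In this toy setup the loss is low when the paired segments share the same stable vector and high when the pairing is broken, which is the property a contrastive objective of this kind relies on. Any temporally stable factor would satisfy such an objective equally well, which is consistent with the abstract's observation that style can leak into the speaker embedding.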

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Advanced Data Compression Techniques

Methods: INFO: An Efficient Optimization Algorithm based on Weighted Mean of Vectors · InfoNCE · Contrastive Predictive Coding