Improved disentangled speech representations using contrastive learning in factorized hierarchical variational autoencoder

Yuying Xie; Thomas Arildsen; Zheng-Hua Tan

arXiv:2211.08191·eess.AS·June 16, 2023

Open Access

TL;DR

This paper improves disentangled speech representations by adding contrastive learning to the training of a factorized hierarchical variational autoencoder (FHVAE), improving speaker and content feature extraction for voice conversion at no additional test-time cost.

Contribution

It introduces contrastive learning into FHVAE training to better disentangle speaker and content attributes, leading to improved speech representation quality.

Findings

1. Enhanced speaker verification and identification accuracy.
2. Improved speech recognition performance.
3. Better voice conversion results, as assessed with fake speech detection.

Abstract

Leveraging the fact that speaker identity and content vary on different time scales, the factorized hierarchical variational autoencoder (FHVAE) uses different latent variables to represent these two attributes. Disentanglement of these attributes is achieved through different prior settings for the corresponding latent variables. For the prior of the speaker identity variable, FHVAE assumes a Gaussian distribution with a mean that varies at the utterance scale and a fixed variance. With a small fixed variance, the training process encourages the identity variables within one utterance to gather close to the mean of their prior. However, this constraint is relatively weak, as the mean of the prior changes between utterances. Therefore, we introduce contrastive learning into the FHVAE framework, to make the speaker identity variables gather together when representing the same speaker, while distancing…
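The abstract is cut off before it states the loss that is used, so the following is only a minimal sketch of the general idea, assuming a supervised-contrastive-style objective over speaker identity latents. The function name `speaker_contrastive_loss`, the cosine similarity, and the `temperature` value are illustrative assumptions and are not taken from the paper.

```python
import numpy as np


def speaker_contrastive_loss(z, speaker_ids, temperature=0.1):
    """Illustrative contrastive loss over speaker identity latents.

    Assumption: a supervised-contrastive-style objective, not the
    paper's exact formulation.
    z           : (N, D) array of identity latent vectors (hypothetical input).
    speaker_ids : (N,) array of speaker labels.
    """
    z = np.asarray(z, dtype=np.float64)
    speaker_ids = np.asarray(speaker_ids)
    n = z.shape[0]

    # L2-normalise so that dot products are cosine similarities.
    z = z / (np.linalg.norm(z, axis=1, keepdims=True) + 1e-12)
    sim = (z @ z.T) / temperature

    # Exclude each sample's similarity with itself from the softmax.
    not_self = ~np.eye(n, dtype=bool)
    sim = np.where(not_self, sim, -np.inf)

    # Numerically stable log-softmax over the other samples in the batch.
    sim_max = sim.max(axis=1, keepdims=True)
    log_norm = np.log(np.exp(sim - sim_max).sum(axis=1, keepdims=True))
    log_prob = sim - sim_max - log_norm

    # Positives: other samples that share the anchor's speaker label.
    positives = (speaker_ids[:, None] == speaker_ids[None, :]) & not_self
    pos_count = positives.sum(axis=1)
    has_pos = pos_count > 0
    if not has_pos.any():
        return 0.0  # no anchor has a positive pair in this batch

    # Average log-probability of the positives for each anchor.
    pos_log_prob = np.where(positives, log_prob, 0.0).sum(axis=1)
    return float(-(pos_log_prob[has_pos] / pos_count[has_pos]).mean())
```

Under these assumptions, the loss is small when latents of the same speaker lie close together and far from those of other speakers, and large otherwise. Because such a term would only be added to the training objective, it is consistent with the TL;DR's statement that testing cost does not increase.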

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Contrastive Learning