Contrastive Speaker Embedding With Sequential Disentanglement

Youzhi Tu, Man-Wai Mak, and Jen-Tzung Chien

arXiv:2309.13253 · eess.AS · September 26, 2023

Open Access

TL;DR

This paper introduces a contrastive learning framework with sequential disentanglement to improve speaker embeddings by removing linguistic content, leading to more speaker-discriminative representations.

Contribution

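It integrates a disentangled sequential variational autoencoder (DSVAE) into the SimCLR framework to separate speaker factors from content factors in speech embeddings, so that only the speaker factors enter the contrastive loss, enhancing speaker verification performance.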

Findings

01. Outperforms standard SimCLR on VoxCeleb1-test

02. Content-invariant speaker embeddings are achieved

03. Sequential disentanglement improves speaker discrimination

Abstract

Contrastive speaker embedding assumes that the contrast between the positive and negative pairs of speech segments is attributed to speaker identity only. However, this assumption is incorrect because speech signals contain not only speaker identity but also linguistic content. In this paper, we propose a contrastive learning framework with sequential disentanglement to remove linguistic content by incorporating a disentangled sequential variational autoencoder (DSVAE) into the conventional SimCLR framework. The DSVAE aims to disentangle speaker factors from content factors in an embedding space so that only the speaker factors are used for constructing a contrastive loss objective. Because content factors have been removed from the contrastive learning, the resulting speaker embeddings will be content-invariant. Experimental results on VoxCeleb1-test show that the proposed method…
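
The idea of building the contrastive objective from speaker factors only can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the DSVAE encoder itself is not modeled, and the names `nt_xent_loss` and `speaker_only_contrastive_loss`, the temperature value, and the array shapes are assumptions made for illustration. The sketch assumes an encoder has already produced, for each of two views of an utterance, a speaker vector and a separate set of content factors, and it computes a SimCLR-style NT-Xent loss while leaving the content factors out.

```python
import numpy as np


def nt_xent_loss(view_a, view_b, temperature=0.1):
    """SimCLR-style NT-Xent loss for two (N, D) batches of embeddings.

    Row i of view_a and row i of view_b form a positive pair; every
    other row in the combined batch acts as a negative.
    """
    n = view_a.shape[0]
    z = np.concatenate([view_a, view_b], axis=0)          # (2N, D)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)      # unit length
    sim = (z @ z.T) / temperature                         # scaled cosine similarity
    np.fill_diagonal(sim, -np.inf)                        # a sample is not its own negative

    # Index of the positive partner for each of the 2N rows.
    positives = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])

    # Numerically stable row-wise log-softmax.
    row_max = sim.max(axis=1, keepdims=True)
    log_norm = np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True))
    log_prob = sim - row_max - log_norm
    return float(-log_prob[np.arange(2 * n), positives].mean())


def speaker_only_contrastive_loss(speaker_a, content_a, speaker_b, content_b,
                                  temperature=0.1):
    """Contrast speaker factors only (hypothetical helper).

    content_a and content_b are accepted to make the split explicit,
    but they are deliberately excluded from the contrastive objective.
    """
    del content_a, content_b
    return nt_xent_loss(speaker_a, speaker_b, temperature)
```

In this toy setting, two views whose speaker vectors agree give a low loss regardless of what the content factors hold, which is the content-invariance property the abstract describes. In the full method the DSVAE would also be trained with its own objectives, which this sketch omits.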

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Average Pooling · Batch Normalization · Dense Connections · Residual Block · Global Average Pooling · Residual Connection · 1x1 Convolution · Bottleneck Residual Block · Feedforward Network