Contrastive Speaker Embedding With Sequential Disentanglement
Youzhi Tu, Man-Wai Mak, and Jen-Tzung Chien

TL;DR
This paper introduces a contrastive learning framework with sequential disentanglement to improve speaker embeddings by removing linguistic content, leading to more speaker-discriminative representations.
Contribution
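It incorporates a disentangled sequential variational autoencoder (DSVAE) into the SimCLR framework to separate speaker factors from content factors in the embedding space, so that only the speaker factors enter the contrastive loss, enhancing speaker verification performance.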
Findings
Outperforms standard SimCLR on VoxCeleb1-test
Content-invariant speaker embeddings are achieved
Sequential disentanglement improves speaker discrimination
Abstract
Contrastive speaker embedding assumes that the contrast between the positive and negative pairs of speech segments is attributed to speaker identity only. However, this assumption is incorrect because speech signals contain not only speaker identity but also linguistic content. In this paper, we propose a contrastive learning framework with sequential disentanglement to remove linguistic content by incorporating a disentangled sequential variational autoencoder (DSVAE) into the conventional SimCLR framework. The DSVAE aims to disentangle speaker factors from content factors in an embedding space so that only the speaker factors are used for constructing a contrastive loss objective. Because content factors have been removed from the contrastive learning, the resulting speaker embeddings will be content-invariant. Experimental results on VoxCeleb1-test show that the proposed method…
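The idea of building the contrastive objective from speaker factors only can be illustrated with a minimal sketch. It assumes each speech segment has already been reduced to a single speaker-factor vector (the DSVAE itself is not modelled here) and uses the NT-Xent loss commonly associated with SimCLR. The function name nt_xent_loss, the temperature value, and the toy data are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def nt_xent_loss(z_a, z_b, temperature=0.1):
    """NT-Xent loss for N positive pairs (z_a[i], z_b[i]).

    z_a, z_b: arrays of shape (N, D) holding hypothetical speaker-factor
    vectors for two segments of the same utterance/speaker.
    """
    n = z_a.shape[0]
    z = np.concatenate([z_a, z_b], axis=0)             # (2N, D)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit length -> cosine similarity
    sim = (z @ z.T) / temperature                      # (2N, 2N) scaled similarities
    np.fill_diagonal(sim, -np.inf)                     # a vector is never its own negative

    # Row i's positive is the other segment of the same pair.
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])

    # Numerically stable log-softmax over each row.
    row_max = sim.max(axis=1, keepdims=True)
    log_prob = sim - row_max - np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos].mean()


# Toy data (assumption): 8 speakers, each with two noisy views of one speaker vector.
rng = np.random.default_rng(0)
n_speakers, dim = 8, 16
speakers = rng.normal(size=(n_speakers, dim))
seg_a = speakers + 0.05 * rng.normal(size=(n_speakers, dim))
seg_b = speakers + 0.05 * rng.normal(size=(n_speakers, dim))

# Correctly paired segments versus deliberately mismatched pairs.
loss_speaker = nt_xent_loss(seg_a, seg_b)
loss_shuffled = nt_xent_loss(seg_a, np.roll(seg_b, 1, axis=0))
print(loss_speaker, loss_shuffled)
```

In this toy setting the loss is low when the paired vectors share a speaker and high when the pairing is broken, which is the behaviour the contrastive objective relies on. Any content information left in the vectors would add similarity that is unrelated to speaker identity, which is the motivation the abstract gives for disentangling it first.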
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Average Pooling · Batch Normalization · Dense Connections · Residual Block · Global Average Pooling · Residual Connection · 1x1 Convolution · Bottleneck Residual Block · Feedforward Network
