Contrastive Separative Coding for Self-supervised Representation Learning

Jun Wang; Max W. Y. Lam; Dan Su; Dong Yu

arXiv:2103.00816·eess.AS·March 2, 2021

TL;DR

This paper introduces Contrastive Separative Coding (CSC), a self-supervised learning method that extracts robust speech representations by separating target signals from interfering noise using a multi-task encoder, cross-attention, and a novel contrastive loss, improving speaker verification in noisy conditions.

Contribution

The paper proposes a novel self-supervised learning framework that focuses on separating signals from interference, with a new contrastive loss that does not require negative sampling, enhancing robustness in speech representation learning.

Findings

01 Achieves strong speaker verification performance in adverse conditions.

02 Introduces a negative-sampling-free contrastive loss.

03 Demonstrates the effectiveness of cross-attention in separating signals.

Abstract

To extract robust deep representations from long sequential modeling of speech data, we propose a self-supervised learning approach, namely Contrastive Separative Coding (CSC). Our key finding is to learn such representations by separating the target signal from contrastive interfering signals. First, a multi-task separative encoder is built to extract shared separable and discriminative embedding; secondly, we propose a powerful cross-attention mechanism performed over speaker representations across various interfering conditions, allowing the model to focus on and globally aggregate the most critical information to answer the "query" (current bottom-up embedding) while paying less attention to interfering, noisy, or irrelevant parts; lastly, we form a new probabilistic contrastive loss which estimates and maximizes the mutual information between the representations and the global…
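The cross-attention step described in the abstract, where a "query" (the current bottom-up embedding) attends over speaker representations from several interfering conditions, can be illustrated with generic scaled dot-product attention. This is only a minimal sketch under stated assumptions, not the authors' implementation: the function name `cross_attention`, the use of a single attention head, the reuse of the speaker representations as both keys and values, and all dimensions are hypothetical choices made for illustration.

```python
import numpy as np


def cross_attention(query, keys, values):
    """Generic scaled dot-product attention of one query over n key/value rows.

    query:  shape (d,)     assumed to play the role of the current embedding
    keys:   shape (n, d)   assumed to be speaker representations, one per condition
    values: shape (n, d_v) vectors that get aggregated by the attention weights
    """
    d = query.shape[-1]
    # Similarity of the query to every key, scaled by sqrt(d).
    scores = keys @ query / np.sqrt(d)
    # Softmax over the n conditions (shifted by the max for numerical stability).
    scores = scores - scores.max()
    weights = np.exp(scores)
    weights = weights / weights.sum()
    # Weighted sum: rows that match the query dominate, others are down-weighted.
    context = weights @ values
    return context, weights


# Hypothetical toy data: 5 interfering conditions, 8-dimensional embeddings.
rng = np.random.default_rng(0)
query = rng.normal(size=8)
speaker_reps = rng.normal(size=(5, 8))
context, weights = cross_attention(query, speaker_reps, speaker_reps)
```

The weights are non-negative and sum to one, so `context` is a convex combination of the value rows; rows with low similarity to the query contribute little, which matches the abstract's description of "paying less attention to interfering, noisy, or irrelevant parts".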


Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing