Contrastive Separative Coding for Self-supervised Representation Learning
Jun Wang, Max W. Y. Lam, Dan Su, Dong Yu

TL;DR
This paper introduces Contrastive Separative Coding (CSC), a self-supervised method that learns robust speech representations by separating the target signal from interfering signals. It combines a multi-task separative encoder, cross-attention over speaker representations, and a new contrastive loss, and it improves speaker verification in noisy conditions.
Contribution
The paper proposes a novel self-supervised learning framework that focuses on separating signals from interference, with a new contrastive loss that does not require negative sampling, enhancing robustness in speech representation learning.
Findings
Achieves strong speaker verification performance in adverse conditions.
Introduces a negative-sampling-free contrastive loss.
Demonstrates effectiveness of cross-attention in separating signals.
Abstract
To extract robust deep representations from long sequential modeling of speech data, we propose a self-supervised learning approach, namely Contrastive Separative Coding (CSC). Our key finding is to learn such representations by separating the target signal from contrastive interfering signals. First, a multi-task separative encoder is built to extract shared separable and discriminative embedding; secondly, we propose a powerful cross-attention mechanism performed over speaker representations across various interfering conditions, allowing the model to focus on and globally aggregate the most critical information to answer the "query" (current bottom-up embedding) while paying less attention to interfering, noisy, or irrelevant parts; lastly, we form a new probabilistic contrastive loss which estimates and maximizes the mutual information between the representations and the global…
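The cross-attention step described in the abstract can be illustrated with a minimal sketch. It assumes standard scaled dot-product attention, which may differ from the paper's exact formulation; the function names, the toy vectors, and the choice to reuse the same vectors as keys and values are all hypothetical and chosen only for illustration. The "query" stands for the current bottom-up embedding, and the "memory" stands for speaker representations gathered under different interfering conditions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query, keys, values):
    """Scaled dot-product cross-attention for a single query vector.

    query  : list[float], the current bottom-up embedding (assumed shape)
    keys   : list[list[float]], one key per interfering condition
    values : list[list[float]], one value per interfering condition
    Returns the attention-pooled vector and the attention weights.
    """
    d = len(query)
    # Similarity between the query and each key, scaled by sqrt(d).
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in keys
    ]
    weights = softmax(scores)
    # Weighted sum of the values: entries that match the query dominate,
    # while dissimilar (interfering or irrelevant) entries are down-weighted.
    dim = len(values[0])
    pooled = [
        sum(w * v[i] for w, v in zip(weights, values))
        for i in range(dim)
    ]
    return pooled, weights


# Toy example: the first memory entry matches the query, the second is
# orthogonal to it, and the third points in the opposite direction.
query = [1.0, 0.0]
memory = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
pooled, weights = cross_attention(query, memory, memory)
```

In this toy setup the attention weights decrease from the matching entry to the opposing one, so the pooled vector leans toward the representation that agrees with the query. This mirrors the abstract's description of aggregating the most critical information while paying less attention to interfering parts.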
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
