C3-DINO: Joint Contrastive and Non-contrastive Self-Supervised Learning for Speaker Verification
Chunlei Zhang, Dong Yu

TL;DR
This paper introduces C3-DINO, an SSL framework for speaker verification that combines contrastive and non-contrastive methods. It improves accuracy by correcting false negative pairs in contrastive training and by adding negative-sample-free learning.
Contribution
It proposes a multi-stage class-collision correction method and employs a negative-sample-free SSL objective, achieving state-of-the-art results among SSL speaker verification systems.
Findings
C3-DINO achieves a 2.5% equal error rate (EER) on VoxCeleb1, outperforming previous SSL systems.
The proposed methods significantly narrow the performance gap between SSL and supervised approaches.
Experimental results validate the effectiveness of combining contrastive and non-contrastive SSL techniques.
Abstract
Self-supervised learning (SSL) has drawn increasing attention in the field of speech processing. Recent studies have demonstrated that contrastive learning is able to learn discriminative speaker embeddings in a self-supervised manner. However, base contrastive self-supervised learning (CSSL) assumes that the pairs generated from a view of the anchor instance and any view of other instances are all negative, which introduces many false negative pairs in constructing the loss function. This problem is referred to as class-collision, and it remains a major issue that impedes CSSL-based speaker verification (SV) systems from achieving better performance. Meanwhile, studies reveal that negative-sample-free SSL frameworks perform well in learning speaker or image representations. In this study, we investigate SSL techniques that lead to improved SV performance. We first…
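The class-collision problem described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes an InfoNCE-style loss (a common contrastive objective, not confirmed by this excerpt as the one the authors use), and the names `info_nce`, `false_negative_mask`, and `speaker_ids` are hypothetical. The speaker labels are used here only to count false negatives; they would not be available during self-supervised training.

```python
import numpy as np

def info_nce(anchor, positive, temperature=0.1):
    """InfoNCE-style loss (assumed objective): row i of `anchor` is pulled
    toward row i of `positive`; every other row is treated as a negative."""
    a = anchor / np.linalg.norm(anchor, axis=1, keepdims=True)
    p = positive / np.linalg.norm(positive, axis=1, keepdims=True)
    logits = a @ p.T / temperature                       # (N, N) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

def false_negative_mask(speaker_ids):
    """Mark pairs (i, j), i != j, that share a speaker yet count as negatives."""
    ids = np.asarray(speaker_ids)
    same_speaker = ids[:, None] == ids[None, :]
    return same_speaker & ~np.eye(len(ids), dtype=bool)

# Hypothetical batch: utterances 0 and 2 come from the same (hidden) speaker.
speaker_ids = [0, 1, 0, 2]
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(4, 8))
view_a = embeddings + 0.01 * rng.normal(size=(4, 8))  # first augmented view
view_b = embeddings + 0.01 * rng.normal(size=(4, 8))  # second augmented view

loss = info_nce(view_a, view_b)
mask = false_negative_mask(speaker_ids)
num_false_negatives = int(mask.sum())
```

In this toy batch the loss pushes utterances 0 and 2 apart even though they share a speaker. That is the kind of pair a class-collision correction step would aim to stop treating as negative.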
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Test · Contrastive Learning · Balanced Selection
