C3-DINO: Joint Contrastive and Non-contrastive Self-Supervised Learning for Speaker Verification

Chunlei Zhang; Dong Yu

arXiv:2208.07446·eess.AS·November 23, 2022

TL;DR

This paper introduces C3-DINO, an SSL framework that combines contrastive and non-contrastive methods to improve speaker verification accuracy. It addresses false negatives in contrastive training and leverages negative-sample-free learning.

Contribution

The paper proposes a multi-stage class-collision correction method and employs a negative-sample-free SSL objective, achieving state-of-the-art results among SSL systems for speaker verification.

Findings

1. C3-DINO achieves a 2.5% equal error rate (EER) on VoxCeleb1, outperforming previous SSL systems.

2. The proposed methods significantly narrow the performance gap between SSL and supervised approaches.

3. Experimental results validate the effectiveness of combining contrastive and non-contrastive SSL techniques.

Abstract

Self-supervised learning (SSL) has drawn increasing attention in the field of speech processing. Recent studies have demonstrated that contrastive learning is able to learn discriminative speaker embeddings in a self-supervised manner. However, base contrastive self-supervised learning (CSSL) assumes that the pairs generated from a view of an anchor instance and any view of other instances are all negative, which introduces many false negative pairs when constructing the loss function. This problem is referred to as class-collision, and it remains a major issue that impedes CSSL-based speaker verification (SV) systems from achieving better performance. Meanwhile, studies reveal that negative-sample-free SSL frameworks perform well in learning speaker or image representations. In this study, we investigate SSL techniques that lead to improved SV performance. We first…
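
The class-collision issue described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it uses a generic InfoNCE-style contrastive loss in NumPy, and the function names `info_nce_loss` and `count_false_negatives`, the temperature value, and the toy speaker labels are all assumptions made for illustration. The point is that the loss treats every other utterance in the batch as a negative, even when two utterances share the same (hidden) speaker.

```python
import numpy as np


def info_nce_loss(anchors, positives, temperature=0.1):
    """Generic InfoNCE-style loss (illustrative, not the paper's exact loss).

    Row i of `anchors` is paired with row i of `positives`; every other
    row in the batch is treated as a negative, regardless of speaker.
    """
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                # (N, N) scaled cosine similarities
    logits -= logits.max(axis=1, keepdims=True)   # subtract row max for stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))            # positives sit on the diagonal


def count_false_negatives(speaker_ids):
    """Count ordered off-diagonal pairs that actually share a speaker."""
    ids = np.asarray(speaker_ids)
    same_speaker = ids[:, None] == ids[None, :]
    return int(same_speaker.sum() - len(ids))     # drop the diagonal (true positives)


# Toy batch: utterances 0 and 2 come from the same speaker (hypothetical labels
# that an SSL system would not have access to during training).
speaker_ids = [0, 1, 0, 2]
rng = np.random.default_rng(0)
anchors = rng.normal(size=(4, 8))
positives = anchors + 0.05 * rng.normal(size=(4, 8))  # second "view" of each utterance

loss = info_nce_loss(anchors, positives)
false_neg = count_false_negatives(speaker_ids)
print(f"loss={loss:.4f}, false-negative pairs={false_neg}")
```

In this toy batch the loss pushes utterances 0 and 2 apart even though they belong to the same speaker. A correction step would need to detect such pairs and stop treating them as negatives, while a negative-sample-free objective avoids the issue by not using negatives at all.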

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Test · Contrastive Learning · Balanced Selection