Pushing the limits of self-supervised speaker verification using regularized distillation framework

Yafeng Chen; Siqi Zheng; Hui Wang; Luyao Cheng; Qian Chen

arXiv:2211.04168 · eess.AS · August 4, 2023

TL;DR

This paper introduces a regularized self-supervised learning framework based on DINO for speaker verification, significantly improving performance and achieving state-of-the-art results without using speaker labels.

Contribution

It proposes two novel regularization techniques for DINO embeddings and explores data augmentation, advancing self-supervised speaker verification methods.

Findings

01. Achieves state-of-the-art results on VoxCeleb datasets.
02. Regularization improves embedding diversity and decorrelation.
03. Effective data augmentation enhances verification performance.

Abstract

Training robust speaker verification systems without speaker labels has long been a challenging task. Previous studies observed a large performance gap between self-supervised and fully supervised methods. In this paper, we apply a non-contrastive self-supervised learning framework called DIstillation with NO labels (DINO) and propose two regularization terms applied to embeddings in DINO. One regularization term guarantees the diversity of the embeddings, while the other decorrelates the variables of each embedding. The effectiveness of various data augmentation techniques is explored in both the time and frequency domains. A range of experiments conducted on the VoxCeleb datasets demonstrates the superiority of the regularized DINO framework in speaker verification. Our method achieves state-of-the-art speaker verification performance under a single-stage…
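
The two regularization terms described in the abstract can be illustrated with a small sketch. The abstract does not give the formulas, so the code below assumes a common variance-hinge and off-diagonal-covariance form; the function names, `target_std`, and `eps` are hypothetical choices for illustration and may differ from the paper's actual formulation.

```python
import numpy as np


def diversity_penalty(emb, target_std=1.0, eps=1e-4):
    """Hinge penalty that grows when a dimension's batch std falls below target_std.

    Assumed form for illustration only; not taken from the paper.
    """
    std = np.sqrt(emb.var(axis=0) + eps)          # per-dimension std over the batch
    return float(np.mean(np.maximum(0.0, target_std - std)))


def decorrelation_penalty(emb):
    """Sum of squared off-diagonal covariances, scaled by the embedding dimension.

    Assumed form for illustration only; not taken from the paper.
    """
    n, d = emb.shape
    centered = emb - emb.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (n - 1)         # (d, d) sample covariance
    off_diag = cov - np.diag(np.diag(cov))        # zero out the variances
    return float(np.sum(off_diag ** 2) / d)


rng = np.random.default_rng(0)
independent = rng.standard_normal((512, 8))            # roughly uncorrelated dimensions
redundant = np.repeat(independent[:, :1], 8, axis=1)   # every dimension is a copy
collapsed = np.ones((512, 8))                          # all embeddings identical

print(diversity_penalty(collapsed), diversity_penalty(independent))
print(decorrelation_penalty(redundant), decorrelation_penalty(independent))
```

In this sketch, a collapsed batch is penalized heavily by the diversity term, while a batch whose dimensions duplicate each other is penalized by the decorrelation term. In a training loop, both would presumably be added to the DINO loss with small weights.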

Code & Models

Repositories

alibaba-damo-academy/3D-Speaker
PyTorch · Official

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Multi-Head Attention · Attention Is All You Need · Softmax · Layer Normalization · Linear Layer · Dense Connections · Residual Connection · Vision Transformer