Self-Distillation Prototypes Network: Learning Robust Speaker Representations without Supervision

Yafeng Chen; Siqi Zheng; Hui Wang; Luyao Cheng; Qian Chen; Shiliang Zhang; Wen Wang

arXiv:2406.11169·eess.AS·December 28, 2024·1 cites

Open Access · 2 Repos

TL;DR

This paper introduces SDPN, a self-supervised learning method for speaker verification that uses prototypes and diversity regularization to learn robust speaker representations without labels, achieving state-of-the-art results.

Contribution

The paper proposes a novel self-distillation prototypes network with diversity regularization for label-free speaker verification, surpassing previous methods on VoxCeleb benchmarks.

Findings

01 SDPN achieves state-of-the-art EERs on VoxCeleb1 benchmarks.

02 The diversity regularization effectively prevents model collapse.

03 SDPN does not require speaker labels during training.

Abstract

Training speaker-discriminative and robust speaker verification systems without explicit speaker labels remains a persistent challenge. In this paper, we propose a new self-supervised speaker verification approach, Self-Distillation Prototypes Network (SDPN), which effectively facilitates self-supervised speaker representation learning. SDPN assigns the representation of the augmented views of an utterance to the same prototypes as the representation of the original view, thereby enabling effective knowledge transfer between the views. Because the SDPN training process lacks negative pairs, the network tends to align positive pairs very closely in the embedding space, a phenomenon known as model collapse. To alleviate this problem, we introduce a diversity regularization term on the embeddings in SDPN. Comprehensive experiments on the VoxCeleb datasets demonstrate the…
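The two ideas in the abstract, prototype assignment shared across views and a diversity term on embeddings, can be illustrated with a small sketch. This is not the authors' implementation: the function names, the temperature values, the use of cosine similarity, and the choice of mean pairwise cosine similarity as the diversity penalty are all assumptions made for illustration, since the abstract does not specify them.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def softmax(logits, temperature):
    """Numerically stable softmax with a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def prototype_assignment(embedding, prototypes, temperature):
    """Soft assignment of one embedding over the prototypes."""
    return softmax([cosine(embedding, p) for p in prototypes], temperature)

def distillation_loss(original_emb, augmented_emb, prototypes,
                      t_teacher=0.04, t_student=0.1):
    """Cross-entropy pulling the augmented view's assignment toward the
    original view's assignment. Temperatures are assumed values."""
    target = prototype_assignment(original_emb, prototypes, t_teacher)
    pred = prototype_assignment(augmented_emb, prototypes, t_student)
    return -sum(t * math.log(p) for t, p in zip(target, pred))

def diversity_penalty(embeddings):
    """Assumed form of a diversity term: mean pairwise cosine similarity
    within a batch. It is largest when all embeddings coincide (collapse)."""
    n = len(embeddings)
    if n < 2:
        return 0.0
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(cosine(embeddings[i], embeddings[j]) for i, j in pairs) / len(pairs)
```

In this sketch a training objective would add a weighted `diversity_penalty` over a batch to the averaged `distillation_loss`; the weighting is likewise an assumption and would need to be taken from the paper.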

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing

Methods: ALIGN