Self-Distillation Prototypes Network: Learning Robust Speaker Representations without Supervision
Yafeng Chen, Siqi Zheng, Hui Wang, Luyao Cheng, Qian Chen, Shiliang Zhang, Wen Wang

TL;DR
This paper introduces SDPN, a self-supervised learning method for speaker verification that uses prototypes and diversity regularization to learn robust speaker representations without labels, achieving state-of-the-art equal error rates (EERs) on the VoxCeleb1 benchmarks.
Contribution
The paper proposes a novel self-distillation prototypes network with diversity regularization for label-free speaker verification, surpassing previous methods on VoxCeleb benchmarks.
Findings
SDPN achieves state-of-the-art EERs on VoxCeleb1 benchmarks.
The diversity regularization effectively prevents model collapse.
SDPN does not require speaker labels during training.
Abstract
Training speaker-discriminative and robust speaker verification systems without explicit speaker labels remains a persisting challenge. In this paper, we propose a new self-supervised speaker verification approach, Self-Distillation Prototypes Network (SDPN), which effectively facilitates self-supervised speaker representation learning. SDPN assigns the representation of the augmented views of an utterance to the same prototypes as the representation of the original view, thereby enabling effective knowledge transfer between the views. Originally, due to the lack of negative pairs in the SDPN training process, the network tends to align positive pairs very closely in the embedding space, a phenomenon known as model collapse. To alleviate this problem, we introduce a diversity regularization term to embeddings in SDPN. Comprehensive experiments on the VoxCeleb datasets demonstrate the…
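The two ingredients the abstract describes, prototype assignment shared across views and a diversity term on the embeddings, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names, the temperature values, the cross-entropy form of the assignment loss, and the use of mean pairwise cosine similarity as the diversity penalty are all assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def softmax(x, temp):
    # Temperature-scaled, numerically stable softmax over the last axis.
    z = x / temp
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def prototype_assignment_loss(orig_emb, aug_emb, prototypes,
                              t_target=0.04, t_pred=0.1):
    # Hypothetical loss: cross-entropy between the prototype distribution of
    # the original view (target) and that of the augmented view (prediction).
    # The temperatures are assumed values, not taken from the paper.
    protos = l2_normalize(prototypes)
    p_target = softmax(l2_normalize(orig_emb) @ protos.T, t_target)
    p_pred = softmax(l2_normalize(aug_emb) @ protos.T, t_pred)
    return float(-(p_target * np.log(p_pred + 1e-12)).sum(axis=1).mean())

def diversity_penalty(emb):
    # Assumed form of the regularizer: mean cosine similarity between
    # different embeddings in the batch. It approaches 1.0 when all
    # embeddings coincide, which is the collapsed case.
    e = l2_normalize(emb)
    sim = e @ e.T
    n = sim.shape[0]
    return float(sim[~np.eye(n, dtype=bool)].mean())
```

In a training loop the two terms would typically be summed with a weighting factor, so that minimizing the total pulls the views of one utterance toward the same prototypes while pushing different utterances in a batch apart. The weight itself is another hyperparameter this sketch does not fix.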
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
Methods: ALIGN
