A comprehensive study on self-supervised distillation for speaker representation learning
Zhengyang Chen, Yao Qian, Bing Han, Yanmin Qian, Michael Zeng

TL;DR
This paper investigates self-supervised distillation methods for speaker representation learning, with an emphasis on data augmentation, and achieves state-of-the-art results on the VoxCeleb1 benchmark without using speaker labels.
Contribution
It introduces a novel audio perturbation augmentation strategy that enhances self-distilled self-supervised speaker representation learning.
Findings
Achieved a new state-of-the-art EER on the VoxCeleb1 speaker verification benchmark
Demonstrated that data augmentation is critical to self-distilled self-supervised learning
Trained the model without any speaker labels
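The EER named in the findings is the operating point at which the false acceptance rate equals the false rejection rate on a list of verification trials. A minimal sketch of how it can be computed is shown here, assuming higher scores mean "same speaker" and labels are 1 for target trials and 0 for non-target trials. The function name and the threshold sweep are illustrative assumptions, not the paper's evaluation code.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Return (eer, threshold) for a set of verification trials.

    scores: similarity per trial, higher means more likely the same speaker.
    labels: 1 for target (same-speaker) trials, 0 for non-target trials.
    Hypothetical helper for illustration only.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    n_target = np.sum(labels == 1)
    n_nontarget = np.sum(labels == 0)

    best = None
    # Sweep every observed score as a candidate decision threshold.
    for t in np.unique(scores):
        accept = scores >= t
        far = np.sum(accept & (labels == 0)) / n_nontarget  # false acceptance rate
        frr = np.sum(~accept & (labels == 1)) / n_target    # false rejection rate
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, t)
    return float(best[1]), float(best[2])

scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 0, 1, 0, 0, 0]
eer, threshold = equal_error_rate(scores, labels)
```

On real trial lists the two error rates rarely cross exactly at an observed score, so evaluation toolkits usually interpolate between neighbouring thresholds; this sketch simply averages the two rates at the closest point.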
Abstract
In real application scenarios, it is often challenging to obtain a large amount of labeled data for speaker representation learning due to speaker privacy concerns. Self-supervised learning, which requires no labels, has become an increasingly promising way to address this. Compared with contrastive learning, self-distilled approaches use only positive samples in the loss function and are therefore more attractive. In this paper, we present a comprehensive study on self-distilled self-supervised speaker representation learning, especially on critical data augmentation. Our proposed strategy of audio perturbation augmentation has pushed the performance of the speaker representation to a new limit. The experimental results show that our model can achieve a new SoTA on the VoxCeleb1 speaker verification evaluation benchmark (i.e., equal error rate (EER) 2.505%, 2.473%, and 4.791% for trial Vox1-O, Vox1-E and…
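To illustrate the abstract's point that self-distilled approaches use only positive samples, the sketch below shows one common teacher-student formulation: the student is trained to match the teacher's output distribution for another augmented view of the same utterance, with no negative pairs in the loss. The abstract does not specify the exact loss, so the centering term, the temperatures, the momentum value, and all function names here are assumptions for illustration.

```python
import numpy as np

def log_softmax(x, temperature):
    """Numerically stable log-softmax over the last axis."""
    z = x / temperature
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def self_distillation_loss(student_logits, teacher_logits, center,
                           t_student=0.1, t_teacher=0.04):
    """Cross-entropy between teacher and student outputs for two views
    of the same utterance (a positive pair). No negatives are used.
    Temperatures and centering are assumed values, not from the source."""
    teacher_probs = np.exp(log_softmax(teacher_logits - center, t_teacher))
    student_log_probs = log_softmax(student_logits, t_student)
    return float(-(teacher_probs * student_log_probs).sum(axis=-1).mean())

def ema_update(teacher_weights, student_weights, momentum=0.99):
    """Teacher weights track the student as an exponential moving average."""
    return momentum * teacher_weights + (1.0 - momentum) * student_weights

# Toy outputs for one positive pair (batch of 1, 3 output dimensions).
teacher_out = np.array([[2.0, 0.0, 0.0]])
center = np.zeros(3)
loss_matched = self_distillation_loss(np.array([[2.0, 0.0, 0.0]]), teacher_out, center)
loss_mismatched = self_distillation_loss(np.array([[0.0, 2.0, 0.0]]), teacher_out, center)
```

In this formulation no gradient flows through the teacher; it is updated only through `ema_update`. A contrastive loss would additionally need embeddings from other utterances as negatives, which risks treating same-speaker utterances as negatives when labels are unavailable.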
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
