Leveraging In-the-Wild Data for Effective Self-Supervised Pretraining in Speaker Recognition
Shuai Wang, Qibing Bai, Qi Liu, Jianwei Yu, Zhengyang Chen, Bing Han, Yanmin Qian, Haizhou Li

TL;DR
This paper demonstrates that self-supervised DINO training on large-scale in-the-wild speech data improves speaker recognition performance and introduces a confidence-based data filtering method for better pretraining data quality.
Contribution
It shows the effectiveness of DINO self-supervised learning on large in-the-wild datasets and proposes a confidence-based filtering algorithm to enhance pretraining data quality.
Findings
DINO pretraining on the large-scale WenetSpeech dataset transfers to supervised speaker recognition and improves performance on CNCeleb.
Confidence-based data filtering enhances pretraining efficiency.
Pretrained models and tools will be publicly available.
Abstract
Current speaker recognition systems primarily rely on supervised approaches, constrained by the scale of labeled datasets. To boost the system performance, researchers leverage large pretrained models such as WavLM to transfer learned high-level features to the downstream speaker recognition task. However, this approach introduces extra parameters as the pretrained model remains in the inference stage. Another group of researchers directly apply self-supervised methods such as DINO to speaker embedding learning, yet they have not explored its potential on large-scale in-the-wild datasets. In this paper, we present the effectiveness of DINO training on the large-scale WenetSpeech dataset and its transferability in enhancing the supervised system performance on the CNCeleb dataset. Additionally, we introduce a confidence-based data filtering algorithm to remove unreliable data from the…
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing
Methods: self-DIstillation with NO labels (DINO)
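
For readers unfamiliar with DINO, the core objective can be sketched in a few lines. This is a minimal illustration of the general DINO loss (a student network trained to match a centered, temperature-sharpened teacher distribution, with the teacher updated as an exponential moving average of the student), not the authors' implementation. The function names, the temperature values, and the toy inputs are assumptions chosen for illustration and are not taken from this paper.

```python
import math


def log_softmax(logits, temperature):
    """Temperature-scaled log-softmax over a list of floats (numerically stable)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_total = math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - peak - log_total for s in scaled]


def softmax(logits, temperature):
    """Temperature-scaled softmax, computed from the log-softmax."""
    return [math.exp(v) for v in log_softmax(logits, temperature)]


def dino_loss(student_logits, teacher_logits, center,
              student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy H(teacher, student) for one pair of views.

    The teacher output is centered (to discourage collapse onto one
    dimension) and sharpened with a lower temperature than the student.
    Temperatures here are illustrative defaults, not values from the paper.
    """
    centered = [t - c for t, c in zip(teacher_logits, center)]
    teacher_probs = softmax(centered, teacher_temp)
    student_log_probs = log_softmax(student_logits, student_temp)
    return -sum(p * lp for p, lp in zip(teacher_probs, student_log_probs))


def ema_update(old, new, momentum):
    """Exponential moving average, used for teacher weights and for the center."""
    return [momentum * o + (1.0 - momentum) * n for o, n in zip(old, new)]


if __name__ == "__main__":
    teacher = [2.0, 0.5, -1.0]
    center = [0.0, 0.0, 0.0]
    # A student that agrees with the teacher gets a much lower loss
    # than one that ranks the output dimensions in the opposite order.
    print(dino_loss([2.0, 0.5, -1.0], teacher, center))
    print(dino_loss([-1.0, 0.5, 2.0], teacher, center))
```

In a full training loop the loss would be averaged over pairs of augmented segments cut from the same utterance, gradients would flow only through the student, and `ema_update` would be applied to the teacher parameters and to the center after each step.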
