Leveraging In-the-Wild Data for Effective Self-Supervised Pretraining in Speaker Recognition

Shuai Wang; Qibing Bai; Qi Liu; Jianwei Yu; Zhengyang Chen; Bing Han; Yanmin Qian; Haizhou Li

arXiv:2309.11730 · eess.AS · September 28, 2023

TL;DR

This paper demonstrates that self-supervised DINO training on large-scale in-the-wild speech data improves speaker recognition performance and introduces a confidence-based data filtering method for better pretraining data quality.

Contribution

It shows the effectiveness of DINO self-supervised learning on large in-the-wild datasets and proposes a confidence-based filtering algorithm to enhance pretraining data quality.

Findings

01

DINO pretraining on the large-scale WenetSpeech dataset improves supervised speaker recognition performance on CNCeleb.

02

Confidence-based data filtering removes unreliable data and improves pretraining data quality.

03

Pretrained models and tools will be publicly available.

Abstract

Current speaker recognition systems primarily rely on supervised approaches, constrained by the scale of labeled datasets. To boost the system performance, researchers leverage large pretrained models such as WavLM to transfer learned high-level features to the downstream speaker recognition task. However, this approach introduces extra parameters as the pretrained model remains in the inference stage. Another group of researchers directly apply self-supervised methods such as DINO to speaker embedding learning, yet they have not explored its potential on large-scale in-the-wild datasets. In this paper, we present the effectiveness of DINO training on the large-scale WenetSpeech dataset and its transferability in enhancing the supervised system performance on the CNCeleb dataset. Additionally, we introduce a confidence-based data filtering algorithm to remove unreliable data from the…
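
For readers unfamiliar with DINO, the self-distillation objective mentioned in the abstract can be sketched as follows. This is a minimal illustration of the generic DINO loss for a single pair of views, not the authors' implementation: the function names (`softmax`, `dino_loss`, `ema_update`), the toy logits, and the temperature defaults are all assumptions chosen for illustration, and the confidence-based filtering step is not shown because the abstract does not describe it in detail.

```python
import math


def softmax(logits, temperature):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def dino_loss(student_logits, teacher_logits, center,
              student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between the teacher and student output distributions.

    The teacher output is centered (to discourage collapse) and sharpened
    with a low temperature. The temperature defaults are illustrative
    assumptions, not values reported for this paper.
    """
    centered = [t - c for t, c in zip(teacher_logits, center)]
    p_teacher = softmax(centered, teacher_temp)
    p_student = softmax(student_logits, student_temp)
    # Small epsilon guards against log(0) when a probability underflows.
    return -sum(pt * math.log(ps + 1e-12)
                for pt, ps in zip(p_teacher, p_student))


def ema_update(old, new, momentum):
    """Exponential moving average, used for teacher weights and the center."""
    return [momentum * o + (1.0 - momentum) * n for o, n in zip(old, new)]


# Toy example: two augmented views of the same utterance (hypothetical logits).
student_out = [2.0, 0.5, -1.0]
teacher_out = [1.5, 0.2, -0.7]
center = [0.0, 0.0, 0.0]

loss = dino_loss(student_out, teacher_out, center)
center = ema_update(center, teacher_out, momentum=0.9)
```

In a full training loop, only the student receives gradients from this loss; the teacher weights follow the student through `ema_update`, and the center tracks a running mean of teacher outputs across the batch.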

Code & Models

Repositories

wenet-e2e/wespeaker
PyTorch · Official

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing

Methods: self-DIstillation with NO labels (DINO)