Curriculum learning for self-supervised speaker verification
Hee-Soo Heo, Jee-weon Jung, Jingu Kang, Youngki Kwon, You Jin Kim, Bong-Jin Lee, Joon Son Chung

TL;DR
This paper introduces two curriculum learning strategies for self-supervised speaker verification that progressively increase the number of training speakers and the share of augmented utterances, reaching a competitive 4.47% EER on the VoxCeleb1 evaluation protocol without identity labels.
Contribution
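The paper proposes two curriculum learning strategies within the DINO self-supervised framework to improve speaker verification without identity labels: gradually enlarging the portion of the training set in use (and with it the number of speakers), and applying data augmentation to a growing share of the utterances in each mini-batch.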
Findings
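Achieved 4.47% EER on the VoxCeleb1 evaluation protocol with single-phase self-supervised training
Improved further to 1.84% EER after fine-tuning on a small labelled dataset
Showed in a range of experiments with the DINO framework that both curriculum strategies are effective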
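For readers unfamiliar with the metric, the equal error rate (EER) is the operating point at which the false acceptance rate equals the false rejection rate. The sketch below shows one simple way it could be estimated from trial scores. It assumes that higher scores mean "same speaker"; the function name and the threshold sweep are illustrative and are not taken from the paper's evaluation code.

```python
def equal_error_rate(scores, labels):
    """Estimate the EER by sweeping a threshold over the observed scores.

    scores: one similarity score per trial (assumed: higher = more likely same speaker).
    labels: 1 for target (same-speaker) trials, 0 for non-target trials.
    """
    n_target = sum(labels)
    n_nontarget = len(labels) - n_target
    if n_target == 0 or n_nontarget == 0:
        raise ValueError("need both target and non-target trials")

    best_gap, eer = float("inf"), None
    for threshold in sorted(set(scores)):
        # A trial is accepted when its score is at or above the threshold.
        false_accepts = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
        false_rejects = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
        far = false_accepts / n_nontarget
        frr = false_rejects / n_target
        # Keep the threshold where the two error rates are closest to each other.
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer
```

This brute-force sweep is quadratic in the number of trials, which is fine for illustration; evaluation toolkits typically sort the scores once and interpolate between thresholds instead.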
Abstract
The goal of this paper is to train effective self-supervised speaker representations without identity labels. We propose two curriculum learning strategies within a self-supervised learning framework. The first strategy aims to gradually increase the number of speakers in the training phase by enlarging the used portion of the training dataset. The second strategy applies various data augmentations to more utterances within a mini-batch as the training proceeds. A range of experiments conducted using the DINO self-supervised framework on the VoxCeleb1 evaluation protocol demonstrates the effectiveness of our proposed curriculum learning strategies. We report a competitive equal error rate of 4.47% with single-phase training, and we also demonstrate that the performance further improves to 1.84% by fine-tuning on a small labelled dataset.
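The two strategies described in the abstract can be pictured as schedules over training progress. The following is a minimal sketch under stated assumptions: the linear ramp, the starting fractions, and all function names are hypothetical choices made for illustration, and the abstract does not specify the schedules the authors actually used.

```python
import random


def linear_ramp(epoch, total_epochs, start, end=1.0):
    """Linearly interpolate from `start` to `end` over training (assumed schedule shape)."""
    if total_epochs <= 1:
        return end
    progress = min(max(epoch / (total_epochs - 1), 0.0), 1.0)
    return start + (end - start) * progress


def curriculum_subset(utterances, epoch, total_epochs, start_fraction=0.2):
    """Strategy 1: train on a growing prefix of the dataset, so later epochs cover more speakers."""
    fraction = linear_ramp(epoch, total_epochs, start_fraction)
    n_used = max(1, round(fraction * len(utterances)))
    return utterances[:n_used]


def augmentation_mask(batch_size, epoch, total_epochs, start_ratio=0.0, rng=None):
    """Strategy 2: flag a growing share of the utterances in a mini-batch for augmentation."""
    rng = rng or random.Random()
    ratio = linear_ramp(epoch, total_epochs, start_ratio)
    n_aug = round(ratio * batch_size)
    chosen = set(rng.sample(range(batch_size), n_aug))
    return [i in chosen for i in range(batch_size)]
```

With these assumed defaults, a 100-utterance list yields 20 utterances at epoch 0 and all 100 at the final epoch, while the augmentation mask goes from no flagged utterances to every utterance in the batch. In a real pipeline the list passed to `curriculum_subset` would need to be ordered so that a longer prefix actually introduces new speakers.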
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Multi-Head Attention · Attention Is All You Need · Softmax · Layer Normalization · Linear Layer · Dense Connections · Residual Connection · Vision Transformer
