Curriculum learning for self-supervised speaker verification

Hee-Soo Heo; Jee-weon Jung; Jingu Kang; Youngki Kwon; You Jin Kim; Bong-Jin Lee; Joon Son Chung

arXiv:2203.14525·eess.AS·February 15, 2024·Interspeech·1 cites

TL;DR

This paper introduces two curriculum learning strategies for self-supervised speaker verification that progressively increase training difficulty, reaching a competitive 4.47% equal error rate (EER) on the VoxCeleb1 evaluation protocol without identity labels.

Contribution

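The paper proposes two curriculum learning strategies within the DINO self-supervised framework: gradually enlarging the portion of the training set in use (and with it the number of speakers), and applying data augmentation to a growing share of the utterances in each mini-batch.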

Findings

01 Achieved a 4.47% EER with single-phase self-supervised training

02 Improved further to 1.84% EER after fine-tuning on a small labelled dataset

03 Demonstrated the effectiveness of both curriculum strategies on the VoxCeleb1 evaluation protocol
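The findings above are reported as equal error rate (EER), the operating point at which the false-acceptance rate equals the false-rejection rate. The sketch below shows one simple way to approximate it from verification trial scores. The function name, the threshold sweep, and the toy scores are illustrative assumptions, not the authors' evaluation code.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping every observed score as a threshold.

    target_scores: scores of same-speaker trials (higher means more similar).
    nontarget_scores: scores of different-speaker trials.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # A trial is accepted when its score is at least the threshold.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        frr = sum(s < t for s in target_scores) / len(target_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores (hypothetical): one target trial scores below one non-target trial.
toy_targets = [0.9, 0.8, 0.7, 0.4]
toy_nontargets = [0.6, 0.3, 0.2, 0.1]
toy_eer = equal_error_rate(toy_targets, toy_nontargets)
```

On a real trial list the sweep would usually be replaced by an interpolated ROC curve, but the idea is the same: a lower EER means fewer errors at the balanced operating point.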

Abstract

The goal of this paper is to train effective self-supervised speaker representations without identity labels. We propose two curriculum learning strategies within a self-supervised learning framework. The first strategy aims to gradually increase the number of speakers in the training phase by enlarging the portion of the training dataset that is used. The second strategy applies various data augmentations to more utterances within a mini-batch as training proceeds. A range of experiments conducted using the DINO self-supervised framework on the VoxCeleb1 evaluation protocol demonstrates the effectiveness of our proposed curriculum learning strategies. We report a competitive equal error rate of 4.47% with single-phase training, and we also demonstrate that the performance further improves to 1.84% by fine-tuning on a small labelled dataset.
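The two curriculum strategies in the abstract can be sketched as simple schedules. This is a minimal illustration under stated assumptions: the abstract does not specify the schedule shape, so the linear ramp, the starting fraction of 0.25, the use of a list prefix as the "used portion", and all function names here are hypothetical.

```python
import random


def linear_ramp(step, total_steps, start, end):
    """Interpolate linearly from start to end, clamped to [start, end]."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac


def curriculum_subset(utterances, step, total_steps, start_frac=0.25):
    """Strategy 1 (assumed form): expose a growing prefix of the training set.

    With unlabelled data, enlarging the used portion is what increases
    the number of speakers the model sees.
    """
    frac = linear_ramp(step, total_steps, start_frac, 1.0)
    n = max(1, int(frac * len(utterances)))
    return utterances[:n]


def augmentation_mask(batch_size, step, total_steps, rng=None):
    """Strategy 2 (assumed form): mark a growing share of a mini-batch for augmentation."""
    rng = rng or random.Random(0)
    ratio = linear_ramp(step, total_steps, 0.0, 1.0)
    n_aug = int(ratio * batch_size)
    chosen = set(rng.sample(range(batch_size), n_aug))
    # True means "apply augmentation to this utterance".
    return [i in chosen for i in range(batch_size)]


# Hypothetical usage: 100 utterance IDs, 100 training steps, batch size 8.
utterance_ids = list(range(100))
early_subset = curriculum_subset(utterance_ids, step=0, total_steps=100)
late_subset = curriculum_subset(utterance_ids, step=100, total_steps=100)
mid_mask = augmentation_mask(batch_size=8, step=50, total_steps=100)
```

In a training loop, `curriculum_subset` would decide which utterances the sampler may draw from at a given step, and `augmentation_mask` would decide which items in the drawn mini-batch are augmented before being passed to the self-supervised framework.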

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Multi-Head Attention · Attention Is All You Need · Softmax · Layer Normalization · Linear Layer · Dense Connections · Residual Connection · Vision Transformer