Self-Supervised Learning for Speaker Recognition: A study and review
Theo Lepage, Reda Dehak

TL;DR
This paper reviews and analyzes how self-supervised learning techniques, originally developed for computer vision, are applied to speaker recognition, covering their effectiveness, their challenges, and recent advances in the field.
Contribution
It provides a comprehensive review of SSL frameworks adapted for speaker recognition, comparing their performance and analyzing key hyperparameters and components.
Findings
DINO achieves the best downstream performance.
SimCLR and MoCo are robust and less prone to collapse.
SSL performance in SR is sensitive to hyperparameter choices.
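To make the collapse finding more concrete, the following is a minimal sketch of the NT-Xent contrastive objective used by SimCLR-style frameworks, written in pure Python. It is an illustration under stated assumptions, not the paper's implementation: the function names `cosine` and `nt_xent`, the temperature of 0.1, and the toy 2-dimensional embeddings are all hypothetical choices made for this example. Each pair `view_a[i]`, `view_b[i]` stands for embeddings of two augmented segments of the same utterance.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent(view_a, view_b, temperature=0.1):
    """Average NT-Xent loss over 2N embeddings (N positive pairs).

    view_a[i] and view_b[i] are assumed to come from the same utterance;
    every other embedding in the batch acts as a negative.
    The temperature value is an illustrative assumption.
    """
    n = len(view_a)
    z = view_a + view_b  # concatenate both views: 2N embeddings
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)  # index of the other view of the same utterance
        # Similarities to every other embedding (the positive is included).
        logits = [cosine(z[i], z[k]) / temperature
                  for k in range(2 * n) if k != i]
        pos_logit = cosine(z[i], z[pos]) / temperature
        # Numerically stable log-sum-exp for the denominator.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - pos_logit
    return total / (2 * n)


# Toy embeddings: two utterances, two views each.
view_a = [[1.0, 0.0], [0.0, 1.0]]
view_b = [[0.9, 0.1], [0.1, 0.9]]            # views agree with view_a
mismatched = [[0.1, 0.9], [0.9, 0.1]]        # views paired with the wrong utterance
collapsed = [[1.0, 1.0], [1.0, 1.0]]         # every embedding identical

aligned_loss = nt_xent(view_a, view_b)
mismatched_loss = nt_xent(view_a, mismatched)
collapsed_loss = nt_xent(collapsed, collapsed)
```

In this sketch, well-aligned views give a lower loss than mismatched ones, and fully collapsed embeddings yield a loss of log(2N - 1) rather than the minimum. That penalty on identical outputs is one commonly cited reason contrastive objectives tend to resist collapse.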
Abstract
Deep learning models trained in a supervised setting have revolutionized audio and speech processing. However, their performance inherently depends on the quantity of human-annotated data, making them costly to scale and prone to poor generalization under unseen conditions. To address these challenges, Self-Supervised Learning (SSL) has emerged as a promising paradigm, leveraging vast amounts of unlabeled data to learn relevant representations. The application of SSL for Automatic Speech Recognition (ASR) has been extensively studied, but research on other downstream tasks, notably Speaker Recognition (SR), remains in its early stages. This work describes major SSL instance-invariance frameworks (e.g., SimCLR, MoCo, and DINO), initially developed for computer vision, along with their adaptation to SR. Various SSL methods for SR, proposed in the literature and built upon these…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
