Self-Supervised Learning for Speaker Recognition: A study and review

Theo Lepage; Reda Dehak

arXiv:2602.10829·eess.AS·February 12, 2026

Open Access

TL;DR

This paper reviews self-supervised learning frameworks originally developed for computer vision (e.g., SimCLR, MoCo, and DINO) and analyzes their adaptation to speaker recognition, covering their effectiveness, challenges, and recent advances.

Contribution

The paper provides a comprehensive review of SSL frameworks adapted for speaker recognition, compares their performance, and analyzes key hyperparameters and components.

Findings

01. DINO achieves the best downstream performance.

02. SimCLR and MoCo are robust and less prone to collapse.

03. Hyperparameter sensitivity affects SSL performance in SR.

Abstract

Deep learning models trained in a supervised setting have revolutionized audio and speech processing. However, their performance inherently depends on the quantity of human-annotated data, making them costly to scale and prone to poor generalization under unseen conditions. To address these challenges, Self-Supervised Learning (SSL) has emerged as a promising paradigm, leveraging vast amounts of unlabeled data to learn relevant representations. The application of SSL for Automatic Speech Recognition (ASR) has been extensively studied, but research on other downstream tasks, notably Speaker Recognition (SR), remains in its early stages. This work describes major SSL instance-invariance frameworks (e.g., SimCLR, MoCo, and DINO), initially developed for computer vision, along with their adaptation to SR. Various SSL methods for SR, proposed in the literature and built upon these…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing