On the Use of Self-Supervised Representation Learning for Speaker Diarization and Separation

Séverin Baroudi; Hervé Bredin; Joseph Razik; Ricard Marxer

arXiv:2512.15224 · eess.AS · December 18, 2025

TL;DR

This paper evaluates recent self-supervised speech models such as wav2vec2.0 and WavLM, which are known to help in low-resource settings, on speaker diarization and speech separation, and highlights gaps in the existing benchmarks for these two tasks.

Contribution

It provides a comprehensive assessment of self-supervised speech representations on diarization and separation, revealing current limitations and proposing directions for more diverse evaluations.

Findings

01. Self-supervised models improve speaker diarization and separation performance.

02. Current benchmarks lack diversity in evaluation datasets.

03. There are significant gaps in evaluating these models on real-world tasks.

Abstract

Over the past few years, self-supervised speech models such as wav2vec2.0 and WavLM have been shown to significantly improve the performance of many downstream speech tasks, especially in low-resource settings. Despite this, evaluations on tasks such as speaker diarization and speech separation remain limited. This paper investigates the quality of recent self-supervised speech representations on these two speaker identity-related tasks, highlighting gaps in the current literature that stem from limitations in existing benchmarks, particularly the lack of diversity in evaluation datasets and of variety in the downstream systems associated with both diarization and separation.

Taxonomy

Topics: Speech Recognition and Synthesis · ICT in Developing Communities · Face recognition and analysis