Pretraining Multi-Speaker Identification for Neural Speaker Diarization

Shota Horiguchi; Atsushi Ando; Marc Delcroix; Naohiro Tawara

arXiv:2505.24545 · eess.AS · June 2, 2025

TL;DR

This paper introduces a pretraining method for multi-speaker identification that reduces reliance on large-scale simulated conversational data, enabling accurate and lightweight neural speaker diarization.

Contribution

It proposes pretraining a model to identify multiple speakers in overlapped mixtures, bypassing the need for extensive simulated conversational datasets.

Findings

01 Achieves high accuracy with a lightweight model

02 Eliminates the need for large-scale simulated data

03 Leverages large-scale speaker recognition datasets effectively

Abstract

End-to-end speaker diarization enables accurate overlap-aware diarization by jointly estimating multiple speakers' speech activities in parallel. This approach is data-hungry, requiring a large amount of labeled conversational data, which cannot be fully obtained from real datasets alone. To address this issue, large-scale simulated data is often used for pretraining, but it requires enormous storage and I/O capacity, and simulating data that closely resembles real conversations remains challenging. In this paper, we propose pretraining a model to identify multiple speakers from an input fully overlapped mixture as an alternative to pretraining a diarization model. This method eliminates the need to prepare a large-scale simulated dataset while leveraging large-scale speaker recognition datasets for training. Through comprehensive experiments, we demonstrate that the proposed method…
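
The pretraining idea in the abstract, identifying every speaker present in a fully overlapped mixture, can be illustrated with a small data-preparation sketch. This is not the authors' implementation: the abstract does not specify how mixtures are formed or what the training target looks like. The function name `mix_fully_overlapped`, the truncate-to-shortest rule, and the multi-hot target over a closed speaker set are all assumptions made for illustration.

```python
def mix_fully_overlapped(utterances, num_known_speakers):
    """Build one pretraining example from single-speaker utterances.

    utterances: list of (speaker_id, samples) pairs, where samples is a
        list of floats and speaker_id indexes a closed set of known
        speakers (hypothetical layout, not taken from the paper).
    Returns (mixture, target), where target is a multi-hot vector.
    """
    if not utterances:
        raise ValueError("need at least one utterance")
    speaker_ids = [spk for spk, _ in utterances]
    if len(set(speaker_ids)) != len(speaker_ids):
        raise ValueError("each utterance must come from a distinct speaker")

    # Assumption: truncate to the shortest utterance so that every speaker
    # is active for the whole mixture, i.e. the overlap is total.
    length = min(len(samples) for _, samples in utterances)
    mixture = [
        sum(samples[t] for _, samples in utterances) for t in range(length)
    ]

    # Assumption: a multi-hot target marks which speakers are in the mixture.
    target = [0.0] * num_known_speakers
    for spk in speaker_ids:
        target[spk] = 1.0
    return mixture, target


# Toy usage with two "utterances" from speakers 0 and 2 out of 4 known speakers.
mixture, target = mix_fully_overlapped(
    [(0, [1.0, 2.0, 3.0]), (2, [0.5, 0.5])], num_known_speakers=4
)
```

Because such mixtures can be assembled on the fly from a speaker recognition corpus, a setup along these lines would not require storing a simulated conversational dataset, which is the storage and I/O advantage the abstract points to.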

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing