Pretraining Multi-Speaker Identification for Neural Speaker Diarization
Shota Horiguchi, Atsushi Ando, Marc Delcroix, Naohiro Tawara

TL;DR
This paper introduces a pretraining method for multi-speaker identification that reduces reliance on large-scale simulated conversational data, enabling accurate and lightweight neural speaker diarization.
Contribution
It proposes pretraining a model to identify multiple speakers in overlapped mixtures, bypassing the need for extensive simulated conversational datasets.
Findings
Achieves high accuracy with a lightweight model
Eliminates the need for large-scale simulated data
Leverages large-scale speaker recognition datasets effectively
Abstract
End-to-end speaker diarization enables accurate overlap-aware diarization by jointly estimating multiple speakers' speech activities in parallel. This approach is data-hungry, requiring a large amount of labeled conversational data, which cannot be fully obtained from real datasets alone. To address this issue, large-scale simulated data is often used for pretraining, but it requires enormous storage and I/O capacity, and simulating data that closely resembles real conversations remains challenging. In this paper, we propose pretraining a model to identify multiple speakers from an input fully overlapped mixture as an alternative to pretraining a diarization model. This method eliminates the need to prepare a large-scale simulated dataset while leveraging large-scale speaker recognition datasets for training. Through comprehensive experiments, we demonstrate that the proposed method…
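The pretraining task the abstract describes can be illustrated with a minimal sketch: sum utterances from distinct speakers so that they overlap completely, then train a model to predict which speakers are present. Everything below is an assumption made for illustration and is not taken from the paper: the function names, the dictionary data layout, the two-speaker default, and the use of a multi-label binary cross-entropy objective are all hypothetical.

```python
import math
import random


def make_overlapped_mixture(utterances, num_speakers, num_mix=2, length=8, rng=None):
    """Build one fully overlapped mixture and its multi-hot speaker target.

    utterances: dict mapping speaker index -> list of waveforms (lists of
    floats), each at least `length` samples long. This layout is a
    hypothetical stand-in for a speaker recognition dataset.
    """
    if rng is None:
        rng = random.Random()
    # Pick `num_mix` distinct speakers.
    chosen = rng.sample(sorted(utterances), num_mix)
    mixture = [0.0] * length
    target = [0.0] * num_speakers
    for spk in chosen:
        wav = rng.choice(utterances[spk])
        # Random crop; randint bounds are inclusive.
        start = rng.randint(0, len(wav) - length)
        for i in range(length):
            # Every sample is a sum of all chosen speakers: no turn-taking.
            mixture[i] += wav[start + i]
        target[spk] = 1.0
    return mixture, target


def multi_label_bce(logits, target):
    """Mean binary cross-entropy over speakers, computed stably from logits.

    Assumed objective for illustration; the paper's loss may differ.
    """
    total = 0.0
    for x, t in zip(logits, target):
        total += max(x, 0.0) - x * t + math.log1p(math.exp(-abs(x)))
    return total / len(logits)


# Toy data: 4 speakers, 2 random "utterances" of 12 samples each.
rng = random.Random(0)
utterances = {
    spk: [[rng.uniform(-1.0, 1.0) for _ in range(12)] for _ in range(2)]
    for spk in range(4)
}
mixture, target = make_overlapped_mixture(utterances, num_speakers=4, rng=rng)
# An untrained model with all-zero logits scores log(2) per speaker.
loss = multi_label_bce([0.0] * 4, target)
```

Because mixtures like this can be generated on the fly from single-speaker recordings, a sketch of this kind needs no stored simulated conversations, which is the storage and I/O saving the abstract points to.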
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing
