Can We Really Repurpose Multi-Speaker ASR Corpus for Speaker Diarization?

Shota Horiguchi; Naohiro Tawara; Takanori Ashihara; Atsushi Ando; Marc Delcroix

arXiv:2507.09226 · eess.AS · August 26, 2025

TL;DR

This paper investigates the impact of boundary looseness in multi-speaker ASR datasets on neural speaker diarization performance, highlighting issues with dataset consistency and proposing standardized boundary alignment to improve results.

Contribution

The paper demonstrates that loose segment boundaries in multi-speaker ASR datasets degrade diarization accuracy, and that training with standardized tight boundaries, obtained via forced alignment, improves both diarization and ASR performance.

Findings

01 Looseness of segment boundaries reduces diarization accuracy.

02 Models trained on loose boundaries do not generalize well to other datasets.

03 Standardized boundary alignment improves diarization and ASR performance.

Abstract

Neural speaker diarization is widely used for overlap-aware speaker diarization, but it requires large multi-speaker datasets for training. To meet this data requirement, large datasets are often constructed by combining multiple corpora, including those originally designed for multi-speaker automatic speech recognition (ASR). However, ASR datasets often feature loosely defined segment boundaries that do not align with the stricter conventions of diarization benchmarks. In this work, we show that such boundary looseness significantly impacts the diarization error rate, reducing evaluation reliability. We also reveal that models trained on data with varying boundary precision tend to learn dataset-specific looseness, leading to poor generalization across out-of-domain datasets. Training with standardized tight boundaries via forced alignment improves not only diarization performance,…
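The effect of boundary looseness on the diarization error rate can be illustrated with a toy calculation. The sketch below is not the paper's evaluation code. It assumes a simplified frame-level DER with 10 ms frames, no collar, and speaker labels that are already matched between reference and hypothesis. The helper names (`to_frames`, `der`, `loosen`) and the segment times are hypothetical, chosen only for illustration. A hypothesis that matches the tight boundaries exactly is scored once against the tight reference and once against a reference whose segments are padded by 0.25 s on each side.

```python
import numpy as np

FRAME = 0.01  # seconds per frame (assumed resolution for this sketch)


def to_frames(segments, speakers, duration):
    """Rasterize (speaker, start, end) segments into a boolean [speaker, frame] matrix."""
    n_frames = int(round(duration / FRAME))
    mat = np.zeros((len(speakers), n_frames), dtype=bool)
    for spk, start, end in segments:
        lo = max(0, int(round(start / FRAME)))
        hi = min(n_frames, int(round(end / FRAME)))
        mat[speakers.index(spk), lo:hi] = True
    return mat


def der(ref, hyp):
    """Frame-level DER, assuming speaker labels are already matched (no permutation search)."""
    n_ref = ref.sum(axis=0)              # active reference speakers per frame
    n_hyp = hyp.sum(axis=0)              # active hypothesis speakers per frame
    n_correct = (ref & hyp).sum(axis=0)  # speakers active in both
    miss = np.maximum(n_ref - n_hyp, 0).sum()
    false_alarm = np.maximum(n_hyp - n_ref, 0).sum()
    confusion = (np.minimum(n_ref, n_hyp) - n_correct).sum()
    return (miss + false_alarm + confusion) / n_ref.sum()


def loosen(segments, pad):
    """Simulate loose annotation by padding every segment by `pad` seconds on both sides."""
    return [(spk, start - pad, end + pad) for spk, start, end in segments]


speakers = ["A", "B"]
duration = 10.0
# Hypothetical tight (forced-alignment-like) segments.
tight = [("A", 1.0, 3.0), ("B", 3.5, 5.0), ("A", 6.0, 8.0)]

# A hypothesis that reproduces the tight boundaries exactly.
hyp = to_frames(tight, speakers, duration)

der_tight = der(to_frames(tight, speakers, duration), hyp)
der_loose = der(to_frames(loosen(tight, 0.25), speakers, duration), hyp)

print(f"DER vs tight reference: {der_tight:.3f}")
print(f"DER vs loose reference: {der_loose:.3f}")
```

In this toy setting the same hypothesis is error-free against the tight reference but is charged missed speech against the padded one, which is the kind of evaluation unreliability the abstract describes. Real evaluations typically also involve optimal speaker mapping and, in some benchmarks, a forgiveness collar, both of which this sketch omits.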

Code & Models

Models

nvidia/diar_streaming_sortformer_4spk-v2.1
Hugging Face model · 6.5k downloads · 59 likes

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques