Mind the Gap: Impact of Synthetic Conversational Data on Multi-Talker ASR and Speaker Diarization

Alexander Polok; Ivan Medennikov; Jan Černocký; Shinji Watanabe; Lukáš Burget; Samuele Cornell

arXiv:2605.15442·eess.AS·May 18, 2026

TL;DR

This paper investigates how different synthetic data generation strategies impact multi-talker ASR and speaker diarization, revealing task-dependent effects and the benefits of combining synthetic with real data.

Contribution

The paper introduces FastMSS, an efficient open-source simulator, and analyzes how simulation choices (turn-taking dynamics, source domain, acoustic augmentation, and data mixing) affect MT-ASR and SD performance.

Findings

01. Increasing speech overlap benefits ASR but degrades diarization.

02. Broad source diversity outperforms exact domain matching.

03. Synthetic data combined with real recordings improves performance.

Abstract

Recent breakthroughs in multi-talker ASR (MT-ASR) and speaker diarization (SD) rely on synthetic data to mitigate the scarcity of large-scale conversational recordings, yet the impact of specific simulation choices remains poorly understood. To mind the gap between simulated mixtures and real-world interactions, we present a study of synthetic data generation for leading MT-ASR (DiCoW) and SD (Sortformer) systems. By introducing FastMSS, a highly efficient open-source simulator, we analyze turn-taking dynamics, source domain, acoustic augmentation, and data mixing strategies. Our findings reveal that optimal simulation recipes are highly task-dependent: increasing speech overlap benefits ASR but degrades diarization. Furthermore, broad source diversity consistently outperforms exact domain matching. Ultimately, synthetic-only training approaches real-data baselines, and combining…
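To make the overlap knob mentioned in the abstract concrete, here is a minimal sketch of how a conversation simulator might place utterances on a timeline and measure the resulting overlap. It is not FastMSS's actual API or algorithm: the function names `place_utterances` and `overlap_ratio`, and the parameters `overlap_prob`, `max_overlap`, and `gap`, are hypothetical and chosen only for illustration. The sketch also assumes consecutive utterances come from different speakers.

```python
import random


def place_utterances(durations, overlap_prob, max_overlap, gap, seed=0):
    """Lay out utterances on a timeline and return their (start, end) times.

    Hypothetical illustration, not FastMSS: with probability `overlap_prob`
    an utterance starts up to `max_overlap` seconds before the current end
    of the timeline; otherwise it starts after `gap` seconds of silence.
    """
    rng = random.Random(seed)
    segments = []
    cursor = 0.0  # latest end time seen so far
    for i, dur in enumerate(durations):
        if i == 0:
            start = 0.0
        elif rng.random() < overlap_prob:
            shift = rng.uniform(0.0, max_overlap)
            start = max(0.0, cursor - shift)
        else:
            start = cursor + gap
        end = start + dur
        segments.append((start, end))
        cursor = max(cursor, end)
    return segments


def overlap_ratio(segments):
    """Fraction of speech-active time during which two or more segments overlap."""
    events = []
    for start, end in segments:
        events.append((start, 1))
        events.append((end, -1))
    events.sort()  # at equal times, ends (-1) sort before starts (+1)

    active = 0
    speech = 0.0
    overlapped = 0.0
    prev_t = 0.0
    for t, delta in events:
        span = t - prev_t
        if active >= 1:
            speech += span
        if active >= 2:
            overlapped += span
        active += delta
        prev_t = t
    return overlapped / speech if speech > 0 else 0.0


if __name__ == "__main__":
    durations = [2.0, 3.0, 1.5, 2.5]
    for p in (0.0, 0.5, 1.0):
        segs = place_utterances(durations, overlap_prob=p, max_overlap=1.0, gap=0.3)
        print(f"overlap_prob={p}: overlap ratio = {overlap_ratio(segs):.3f}")
```

Sweeping `overlap_prob` in a sketch like this is one way to generate training mixtures with different overlap levels, which is the kind of setting the abstract reports as helping ASR while hurting diarization.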
