Mind the Gap: Impact of Synthetic Conversational Data on Multi-Talker ASR and Speaker Diarization
Alexander Polok, Ivan Medennikov, Jan Černocký, Shinji Watanabe, Lukáš Burget, Samuele Cornell

TL;DR
This paper investigates how different synthetic data generation strategies impact multi-talker ASR and speaker diarization, revealing task-dependent effects and the benefits of combining synthetic with real data.
Contribution
It introduces FastMSS, an efficient open-source simulator, and provides a comprehensive analysis of simulation choices on system performance.
Findings
Increasing speech overlap benefits ASR but degrades diarization.
Broad source diversity outperforms exact domain matching.
Synthetic data combined with real recordings improves performance.
Abstract
Recent breakthroughs in multi-talker ASR (MT-ASR) and speaker diarization (SD) rely on synthetic data to mitigate the scarcity of large-scale conversational recordings, yet the impact of specific simulation choices remains poorly understood. To mind the gap between simulated mixtures and real-world interactions, we present a study of synthetic data generation for leading MT-ASR (DiCoW) and SD (Sortformer) systems. By introducing FastMSS, a highly efficient open-source simulator, we analyze turn-taking dynamics, source domain, acoustic augmentation, and data mixing strategies. Our findings reveal that optimal simulation recipes are highly task-dependent: increasing speech overlap benefits ASR but degrades diarization. Furthermore, broad source diversity consistently outperforms exact domain matching. Ultimately, synthetic-only training approaches real-data baselines, and combining…
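To make the overlap knob discussed above concrete, here is a minimal sketch of how a simulator might place utterances on a timeline with a controllable amount of overlap and then sum them into a single-channel mixture. It is not FastMSS code and does not reflect its API: the function names (`place_utterances`, `mix`, `overlap_fraction`) and the `overlap_ratio` parameter are hypothetical, and the fixed-ratio placement rule is an assumption chosen for simplicity rather than a turn-taking model from the paper.

```python
import numpy as np


def place_utterances(durations, overlap_ratio):
    """Return (start, end) sample offsets for consecutive utterances.

    Hypothetical rule: each utterance starts before the previous one ends,
    overlapping it by `overlap_ratio` times the shorter of the two durations.
    overlap_ratio = 0.0 places utterances back to back.
    """
    if not 0.0 <= overlap_ratio < 1.0:
        raise ValueError("overlap_ratio must be in [0, 1)")
    segments = []
    prev_end, prev_dur = 0, None
    for dur in durations:
        if prev_dur is None:
            start = 0
        else:
            overlap = int(overlap_ratio * min(prev_dur, dur))
            start = prev_end - overlap
        end = start + dur
        segments.append((start, end))
        prev_end, prev_dur = end, dur
    return segments


def mix(utterances, segments):
    """Sum the waveforms into one mixture at the given offsets."""
    total = max(end for _, end in segments)
    mixture = np.zeros(total, dtype=np.float32)
    for wav, (start, end) in zip(utterances, segments):
        mixture[start:end] += wav
    return mixture


def overlap_fraction(segments):
    """Fraction of speech samples where more than one utterance is active."""
    total = max(end for _, end in segments)
    active = np.zeros(total, dtype=np.int32)
    for start, end in segments:
        active[start:end] += 1
    speech = np.count_nonzero(active)
    return np.count_nonzero(active > 1) / speech if speech else 0.0


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    durations = [16000, 24000, 8000]  # utterance lengths in samples
    utterances = [rng.standard_normal(d).astype(np.float32) for d in durations]
    for ratio in (0.0, 0.25, 0.5):
        segments = place_utterances(durations, ratio)
        mixture = mix(utterances, segments)
        print(ratio, len(mixture), round(overlap_fraction(segments), 3))
```

Sweeping `overlap_ratio` in a sketch like this is one way to generate training sets with different overlap levels, which is the kind of simulation choice the paper reports as helping ASR while hurting diarization. A realistic simulator would also assign speakers, sample pauses as well as overlaps, and apply acoustic augmentation.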
