LibriConvo: Simulating Conversations from Read Literature for ASR and Diarization
Máté Gedeon, Péter Mihajlik

TL;DR
LibriConvo is a realistic, simulated multi-speaker conversational dataset designed to improve training and evaluation of speech recognition and diarization systems, featuring semantic coherence and natural timing.
Contribution
The paper introduces a pipeline for building realistic multi-speaker conversations from read literature, improving acoustic realism and contextual consistency for speech processing research.
Findings
Sortformer outperforms pyannote in diarization.
Fast Conformer-CTC achieves 7.29% WER on LibriConvo.
Dataset contains 240.1 hours of dialogues with 830 speakers.
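For readers unfamiliar with the metric behind the 7.29% figure, word error rate (WER) is the word-level edit distance between a reference transcript and the ASR hypothesis, divided by the number of reference words. The following is a minimal sketch of that standard definition. The function name `word_error_rate` is illustrative, and the whitespace tokenization is an assumption: the text normalization used in the paper's evaluation is not described here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()  # assumption: plain whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over 6 reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

A WER of 7.29% therefore corresponds to roughly 7 word-level errors per 100 reference words.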
Abstract
We introduce LibriConvo, a simulated multi-speaker conversational dataset based on speaker-aware conversation simulation (SASC), designed to support training and evaluation of speaker diarization and automatic speech recognition (ASR) systems. Unlike prior resources that mostly rely on semantically disconnected utterances and implausible temporal gaps, LibriConvo ensures semantic coherence and realistic conversational timing. Our pipeline leverages CallHome with external VAD for reliable boundaries, applies compression to reduce unnaturally long silences, and organizes LibriTTS utterances by book to maintain contextual consistency. Acoustic realism is enhanced via a novel room impulse response selection procedure that ranks speaker-microphone configurations by spatial plausibility, balancing realism and diversity. The dataset comprises 240.1 hours across 1,496 dialogues with 830 unique…
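The abstract mentions compressing unnaturally long silences but does not give the rule. As a purely illustrative sketch, one way such a step could work is to leave short gaps and overlaps untouched and shrink only the part of a gap that exceeds a threshold. The function name `compress_gap`, the 1-second threshold, and the logarithmic mapping are all assumptions made for illustration, not the paper's method.

```python
import math


def compress_gap(gap_s: float, threshold_s: float = 1.0) -> float:
    """Shrink inter-turn gaps longer than threshold_s (hypothetical rule).

    Gaps at or below the threshold, including negative gaps (overlapping
    speech), are returned unchanged. Longer gaps keep the threshold portion
    and have the excess compressed logarithmically, so the mapping stays
    continuous and monotonic.
    """
    if gap_s <= threshold_s:
        return gap_s
    return threshold_s + math.log1p(gap_s - threshold_s)


if __name__ == "__main__":
    for gap in (-0.3, 0.4, 1.0, 5.0, 30.0):
        print(f"{gap:6.2f} s -> {compress_gap(gap):.2f} s")
```

A mapping of this shape preserves the relative ordering of gaps while preventing multi-second pauses from dominating a simulated dialogue.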
