Speaker-Aware Simulation Improves Conversational Speech Recognition
Máté Gedeon, Péter Mihajlik

TL;DR
This paper demonstrates that speaker-aware simulated conversations significantly enhance Hungarian conversational speech recognition, and that the extended C-SASC variant further improves local temporal modeling and lowers error rates.
Contribution
The study adapts the speaker-aware simulation framework to Hungarian, introduces C-SASC with pause modeling, and evaluates its effectiveness across diverse conversational datasets.
Findings
Speaker-aware simulation improves ASR performance.
C-SASC yields systematic gains in character error rates.
Effectiveness depends on the match between source and target conversational statistics.
Abstract
Automatic speech recognition (ASR) for conversational speech remains challenging due to the limited availability of large-scale, well-annotated multi-speaker dialogue data and the complex temporal dynamics of natural interactions. Speaker-aware simulated conversations (SASC) offer an effective data augmentation strategy by transforming single-speaker recordings into realistic multi-speaker dialogues. However, prior work has primarily focused on English data, leaving questions about the applicability to lower-resource languages. In this paper, we adapt and implement the SASC framework for Hungarian conversational ASR. We further propose C-SASC, an extended variant that incorporates pause modeling conditioned on utterance duration, enabling a more faithful representation of local temporal dependencies observed in human conversation while retaining the simplicity and efficiency of the…
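The idea of conditioning pauses on utterance duration can be illustrated with a small sketch. This is not the authors' implementation: the binning scheme, the function names (`fit_pause_model`, `sample_pause`, `simulate_timeline`), and the toy numbers are all assumptions chosen for illustration. It groups observed pauses by the length of the preceding utterance and samples from the matching group when laying out a simulated conversation.

```python
import random
from bisect import bisect_right


def fit_pause_model(pairs, edges):
    """Group observed (utterance_duration, following_pause) pairs by duration bin.

    `edges` are hypothetical bin boundaries in seconds; they are an
    illustrative choice, not values taken from the paper.
    """
    bins = [[] for _ in range(len(edges) + 1)]
    for duration, pause in pairs:
        bins[bisect_right(edges, duration)].append(pause)
    return bins


def sample_pause(bins, edges, duration, rng):
    """Draw a pause from those observed after utterances of similar length."""
    candidates = bins[bisect_right(edges, duration)]
    if not candidates:
        # Fall back to the pooled pauses when a bin has no observations.
        candidates = [p for b in bins for p in b]
    return rng.choice(candidates)


def simulate_timeline(durations, bins, edges, seed=0):
    """Place utterances one after another, separated by sampled pauses."""
    rng = random.Random(seed)
    timeline, t = [], 0.0
    for d in durations:
        timeline.append((t, t + d))
        t += d + sample_pause(bins, edges, d, rng)
    return timeline


# Toy statistics (assumed): short utterances are followed by short pauses.
pairs = [(1.0, 0.2), (1.5, 0.3), (6.0, 0.8), (7.0, 1.0)]
edges = [3.0]
bins = fit_pause_model(pairs, edges)
timeline = simulate_timeline([1.0, 5.0, 2.0], bins, edges)
```

An unconditioned simulator would draw every pause from one pooled distribution; the only change here is the lookup by duration bin, which keeps the procedure simple while letting the gap depend on the preceding utterance.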
Taxonomy
Topics: Speech and dialogue systems · Speech Recognition and Synthesis · Topic Modeling
