DiCoW: Diarization-Conditioned Whisper for Target Speaker Automatic Speech Recognition
Alexander Polok, Dominik Klement, Martin Kocour, Jiangyu Han, Federico Landini, Bolaji Yusuf, Matthew Wiesner, Sanjeev Khudanpur, Jan Černocký, Lukáš Burget

TL;DR
DiCoW introduces a diarization-conditioned approach to target-speaker ASR that enhances multi-speaker transcription accuracy by integrating diarization outputs directly into pre-trained models, improving generalization to unseen speakers and overlapping speech.
Contribution
This work presents DiCoW, a novel method that conditions Whisper on diarization outputs, eliminating the need for speaker embeddings and enabling better multi-speaker ASR performance.
Findings
Improves target-speaker ASR accuracy in multi-speaker environments.
Enhances generalization to unseen speakers and overlapping speech.
Maintains robustness on single-speaker data.
Abstract
Speaker-attributed automatic speech recognition (ASR) in multi-speaker environments remains a significant challenge, particularly when systems conditioned on speaker embeddings fail to generalize to unseen speakers. In this work, we propose Diarization-Conditioned Whisper (DiCoW), a novel approach to target-speaker ASR that leverages speaker diarization outputs as conditioning information. DiCoW extends the pre-trained Whisper model by integrating diarization labels directly, eliminating reliance on speaker embeddings and reducing the need for extensive speaker-specific training data. Our method introduces frame-level diarization-dependent transformations (FDDT) and query-key biasing (QKb) techniques to refine the model's focus on target speakers while effectively handling overlapping speech. By leveraging diarization outputs as conditioning signals, DiCoW simplifies the workflow for…
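The abstract names frame-level diarization-dependent transformations (FDDT) without defining them in the text available here. As a rough illustration only, one plausible reading is that each encoder frame is passed through a mixture of per-class affine transforms, weighted by that frame's diarization probabilities. Everything in the sketch below is an assumption made for illustration: the four class names, the elementwise (diagonal) scales, the toy parameter values, and the function names `fddt_frame` and `fddt`. It is not the authors' implementation.

```python
from typing import Dict, List, Sequence

# Hypothetical per-frame diarization classes (an assumption, not taken from the text).
CLASSES = ("silence", "target", "non_target", "overlap")


def fddt_frame(
    frame: Sequence[float],
    probs: Dict[str, float],
    scales: Dict[str, Sequence[float]],
    biases: Dict[str, Sequence[float]],
) -> List[float]:
    """Mix per-class affine transforms of one frame, weighted by diarization probabilities."""
    out = [0.0] * len(frame)
    for name in CLASSES:
        p = probs[name]
        for i, x in enumerate(frame):
            # Elementwise affine transform for this class, scaled by its probability.
            out[i] += p * (scales[name][i] * x + biases[name][i])
    return out


def fddt(
    frames: Sequence[Sequence[float]],
    frame_probs: Sequence[Dict[str, float]],
    scales: Dict[str, Sequence[float]],
    biases: Dict[str, Sequence[float]],
) -> List[List[float]]:
    """Apply the frame-level transform to every frame in a sequence."""
    return [fddt_frame(f, p, scales, biases) for f, p in zip(frames, frame_probs)]


# Toy parameters: keep target-speaker content unchanged, suppress everything else.
DIM = 2
scales = {name: [1.0 if name == "target" else 0.0] * DIM for name in CLASSES}
biases = {name: [0.0] * DIM for name in CLASSES}

frames = [[1.0, 2.0], [3.0, 4.0]]
frame_probs = [
    {"silence": 0.0, "target": 1.0, "non_target": 0.0, "overlap": 0.0},
    {"silence": 0.0, "target": 0.5, "non_target": 0.5, "overlap": 0.0},
]

conditioned = fddt(frames, frame_probs, scales, biases)
print(conditioned)  # [[1.0, 2.0], [1.5, 2.0]]
```

With these toy values, a frame attributed entirely to the target speaker passes through unchanged, while a frame split evenly between target and non-target is attenuated by half. In a trained model the scales and biases would be learned, and the transform would presumably be a full matrix rather than a diagonal one.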
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing
MethodsFocus
