SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper
Alexander Polok, Dominik Klement, Samuele Cornell, Matthew Wiesner, Jan Černocký, Sanjeev Khudanpur, Lukáš Burget

TL;DR
SE-DiCoW enhances speaker-attributed ASR by using self-enrolled speaker segments for better conditioning, significantly improving performance across multi-speaker, multi-domain datasets.
Contribution
The paper introduces SE-DiCoW, a novel method that uses self-enrollment segments for improved speaker conditioning in diarization-conditioned ASR.
Findings
Reduces macro-averaged tcpWER by 52.4% on the EMMA MT-ASR benchmark.
Addresses ambiguity in STNO masks with self-enrollment segments.
Improves performance across multi-speaker and multi-domain datasets.
Abstract
Speaker-attributed automatic speech recognition (ASR) in multi-speaker environments remains a major challenge. While some approaches achieve strong performance when fine-tuned on specific domains, few systems generalize well across out-of-domain datasets. Our prior work, Diarization-Conditioned Whisper (DiCoW), leverages speaker diarization outputs as conditioning information and, with minimal fine-tuning, demonstrated strong multilingual and multi-domain performance. In this paper, we address a key limitation of DiCoW: ambiguity in Silence-Target-Non-target-Overlap (STNO) masks, where two or more fully overlapping speakers may have nearly identical conditioning despite differing transcriptions. We introduce SE-DiCoW (Self-Enrolled Diarization-Conditioned Whisper), which uses diarization output to locate an enrollment segment anywhere in the conversation where the target speaker is most…
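The STNO ambiguity described in the abstract can be illustrated with a minimal sketch. It assumes hard, frame-level binary speaker activities as the diarization output and assigns each frame one of four labels relative to a target speaker; the function name `stno_labels` and the toy activity matrices are hypothetical, and an actual system may use soft probabilities rather than hard labels.

```python
from typing import List


def stno_labels(activity: List[List[int]], target: int) -> List[str]:
    """Label each frame for `target` as S(ilence), T(arget), N(on-target), or O(verlap).

    `activity[s][t]` is 1 if speaker `s` is active in frame `t` (assumed hard decisions).
    """
    labels = []
    n_speakers = len(activity)
    n_frames = len(activity[0])
    for t in range(n_frames):
        target_on = activity[target][t] == 1
        others_on = any(activity[s][t] == 1 for s in range(n_speakers) if s != target)
        if target_on and others_on:
            labels.append("O")  # target speaks together with someone else
        elif target_on:
            labels.append("T")  # only the target speaks
        elif others_on:
            labels.append("N")  # only other speakers speak
        else:
            labels.append("S")  # nobody speaks
    return labels


# Hypothetical toy diarization outputs: rows are speakers, columns are frames.
partial_overlap = [
    [1, 1, 1, 0, 0],
    [0, 0, 1, 1, 0],
]
full_overlap = [
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
]

# With partial overlap the two speakers receive different label sequences.
print(stno_labels(partial_overlap, 0), stno_labels(partial_overlap, 1))
# With full overlap both speakers receive the same sequence, so the
# conditioning alone cannot tell them apart.
print(stno_labels(full_overlap, 0) == stno_labels(full_overlap, 1))
```

In the fully overlapping case the labels carry no information about which of the two speakers is the target, which is the gap that an additional enrollment segment for the target speaker is meant to close.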
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Face recognition and analysis
