SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper

Alexander Polok; Dominik Klement; Samuele Cornell; Matthew Wiesner; Jan Černocký; Sanjeev Khudanpur; Lukáš Burget

arXiv:2601.19194·eess.AS·January 28, 2026

TL;DR

SE-DiCoW enhances speaker-attributed ASR by using self-enrolled speaker segments for better conditioning, significantly improving performance across multi-speaker, multi-domain datasets.

Contribution

The paper introduces SE-DiCoW, a novel method that uses self-enrollment segments for improved speaker conditioning in diarization-conditioned ASR.

Findings

1. Reduces macro-averaged tcpWER by 52.4% on the EMMA MT-ASR benchmark.
2. Addresses ambiguity in STNO masks with self-enrollment segments.
3. Improves performance across multi-speaker and multi-domain datasets.
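
As a rough illustration of how a macro-averaged error reduction is computed, the sketch below averages a per-dataset error rate with equal weight per dataset and then takes the relative change. It assumes the reported 52.4% is a relative reduction of the macro-average, which this summary does not state explicitly. The dataset names and error rates are hypothetical placeholders, not results from the paper.

```python
def macro_average(per_dataset: dict) -> float:
    """Unweighted mean over datasets: every dataset counts equally,
    regardless of how many utterances it contains."""
    return sum(per_dataset.values()) / len(per_dataset)


def relative_reduction(baseline: float, improved: float) -> float:
    """Relative reduction in percent: (baseline - improved) / baseline."""
    return (baseline - improved) / baseline * 100.0


# Hypothetical per-dataset tcpWER values in percent (NOT from the paper).
baseline = {"dataset_a": 40.0, "dataset_b": 20.0}
improved = {"dataset_a": 20.0, "dataset_b": 10.0}

base_macro = macro_average(baseline)
new_macro = macro_average(improved)
reduction = relative_reduction(base_macro, new_macro)
print(f"macro tcpWER {base_macro:.1f} -> {new_macro:.1f}, relative reduction {reduction:.1f}%")
```

Macro-averaging keeps a single large dataset from dominating the headline number, which matters for a multi-domain benchmark.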

Abstract

Speaker-attributed automatic speech recognition (ASR) in multi-speaker environments remains a major challenge. While some approaches achieve strong performance when fine-tuned on specific domains, few systems generalize well across out-of-domain datasets. Our prior work, Diarization-Conditioned Whisper (DiCoW), leverages speaker diarization outputs as conditioning information and, with minimal fine-tuning, demonstrated strong multilingual and multi-domain performance. In this paper, we address a key limitation of DiCoW: ambiguity in Silence-Target-Non-target-Overlap (STNO) masks, where two or more fully overlapping speakers may have nearly identical conditioning despite differing transcriptions. We introduce SE-DiCoW (Self-Enrolled Diarization-Conditioned Whisper), which uses diarization output to locate an enrollment segment anywhere in the conversation where the target speaker is most…
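
The STNO ambiguity described in the abstract can be made concrete with a small sketch. It assumes hard (binary) per-frame diarization decisions and a one-hot four-class mask per target speaker. The function name `stno_mask` and the toy activity matrix are hypothetical and are not taken from the DiCoW code base, which may use soft probabilities instead.

```python
import numpy as np


def stno_mask(activity: np.ndarray, target: int) -> np.ndarray:
    """Build a (4, T) one-hot Silence/Target/Non-target/Overlap mask for one
    target speaker from a binary (num_speakers, T) diarization activity matrix."""
    activity = activity.astype(bool)
    tgt = activity[target]
    # True wherever at least one speaker other than the target is active.
    others = np.delete(activity, target, axis=0).any(axis=0)
    silence = ~tgt & ~others
    target_only = tgt & ~others
    non_target = ~tgt & others
    overlap = tgt & others
    return np.stack([silence, target_only, non_target, overlap]).astype(np.int8)


# Toy example: two speakers who talk over exactly the same frames.
activity = np.array([
    [0, 1, 1, 1, 0],  # speaker 0
    [0, 1, 1, 1, 0],  # speaker 1
])

mask_a = stno_mask(activity, target=0)
mask_b = stno_mask(activity, target=1)

# Both targets receive the same conditioning, although their transcripts differ.
identical = bool(np.array_equal(mask_a, mask_b))
print("STNO masks identical for both speakers:", identical)
```

In this toy case every speech frame is labelled overlap for both targets, so the mask alone cannot tell the model which speaker to transcribe. This is the gap an additional enrollment segment is meant to close.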

Code & Models

Models

BUT-FIT/DiCoW_v3_3 (Hugging Face model, 445 downloads)

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Face recognition and analysis