DiCoW: Diarization-Conditioned Whisper for Target Speaker Automatic Speech Recognition

Alexander Polok; Dominik Klement; Martin Kocour; Jiangyu Han; Federico Landini; Bolaji Yusuf; Matthew Wiesner; Sanjeev Khudanpur; Jan Černocký; Lukáš Burget

arXiv:2501.00114 · eess.AS · January 3, 2025

TL;DR

DiCoW is a diarization-conditioned approach to target-speaker ASR. It feeds diarization outputs directly into the pre-trained Whisper model instead of relying on speaker embeddings, which improves multi-speaker transcription accuracy and generalization to unseen speakers and overlapping speech.

Contribution

This work presents DiCoW, a novel method that conditions Whisper on diarization outputs, eliminating the need for speaker embeddings and enabling better multi-speaker ASR performance.

Findings

01. Improves target-speaker ASR accuracy in multi-speaker environments.

02. Enhances generalization to unseen speakers and overlapping speech.

03. Maintains robustness on single-speaker data.

Abstract

Speaker-attributed automatic speech recognition (ASR) in multi-speaker environments remains a significant challenge, particularly when systems conditioned on speaker embeddings fail to generalize to unseen speakers. In this work, we propose Diarization-Conditioned Whisper (DiCoW), a novel approach to target-speaker ASR that leverages speaker diarization outputs as conditioning information. DiCoW extends the pre-trained Whisper model by integrating diarization labels directly, eliminating reliance on speaker embeddings and reducing the need for extensive speaker-specific training data. Our method introduces frame-level diarization-dependent transformations (FDDT) and query-key biasing (QKb) techniques to refine the model's focus on target speakers while effectively handling overlapping speech. By leveraging diarization outputs as conditioning signals, DiCoW simplifies the workflow for…
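
The idea of a frame-level, diarization-dependent transformation can be illustrated with a rough sketch. This is not the authors' implementation: it assumes that each frame's hidden state passes through one affine transform per diarization class and that the results are mixed by the frame's class probabilities. The function name, the number of classes, and all parameter shapes are hypothetical choices made for illustration only.

```python
import numpy as np

def frame_conditioned_transform(frames, class_probs, weights, biases):
    """Mix per-class affine transforms of each frame by diarization probabilities.

    frames:      (T, D) frame-level hidden states
    class_probs: (T, C) per-frame probabilities over C diarization classes
    weights:     (C, D, D) one matrix per class (hypothetical parameters)
    biases:      (C, D) one bias vector per class (hypothetical parameters)
    """
    # transformed[t, c] = weights[c] @ frames[t] + biases[c]
    transformed = np.einsum("cij,tj->tci", weights, frames) + biases[None, :, :]
    # Weighted sum over classes using each frame's class probabilities.
    return np.einsum("tc,tci->ti", class_probs, transformed)

# Toy usage: 4 frames, 3 features, 4 assumed classes.
T, D, C = 4, 3, 4
rng = np.random.default_rng(0)
frames = rng.normal(size=(T, D))
weights = np.stack([np.eye(D)] * C)   # identity transforms for every class
biases = np.zeros((C, D))
class_probs = np.full((T, C), 1.0 / C)
out = frame_conditioned_transform(frames, class_probs, weights, biases)
```

With identity matrices and zero biases the transform leaves the frames unchanged, which would be one natural way to start from an unmodified pre-trained model; whether DiCoW initializes its parameters this way is not stated in the text above.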

Code & Models

Repositories

BUTSpeechFIT/TS-ASR-Whisper (PyTorch, official)

Models

BUT-FIT/DiCoW_v1 (Hugging Face model · 6 downloads · 1 like)

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing

Methods: Focus