Target Speaker ASR with Whisper
Alexander Polok, Dominik Klement, Matthew Wiesner, Sanjeev Khudanpur, Jan Černocký, Lukáš Burget

TL;DR
This paper introduces a simple method for adapting large single-speaker ASR models such as Whisper to target-speaker ASR by conditioning them on frame-level diarization outputs. On NOTSOFAR-1 it outperforms a baseline speech separation and diarization cascade by 12.9% absolute ORC-WER.
Contribution
The authors demonstrate that adding a bias term per diarization output type before the first transformer block enables existing single-speaker ASR models to perform target-speaker ASR without extensive retraining.
Findings
Outperforms a baseline speech separation and diarization cascade by 12.9% absolute ORC-WER on the NOTSOFAR-1 dataset
Supports speaker-attributed ASR by generating a transcript for each speaker in the diarization output in turn
Enables target speaker ASR with minimal modifications to existing models
Abstract
We propose a novel approach to enable the use of large, single-speaker ASR models, such as Whisper, for target speaker ASR. The key claim of this method is that it is much easier to model relative differences among speakers by learning to condition on frame-level diarization outputs than to learn the space of all speaker embeddings. We find that adding even a single bias term per diarization output type before the first transformer block can transform single-speaker ASR models into target-speaker ASR models. Our approach also supports speaker-attributed ASR by sequentially generating transcripts for each speaker in a diarization output. This simplified method outperforms a baseline speech separation and diarization cascade by 12.9% absolute ORC-WER on the NOTSOFAR-1 dataset.
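The conditioning idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the "diarization output types" are four per-frame classes relative to the chosen target speaker (silence, target only, non-target only, overlap), and that the bias is one vector per class added to each frame's features before the first transformer block. The names `frame_classes` and `add_class_bias`, the class labels, and the toy values are all hypothetical.

```python
from typing import List, Sequence

# Hypothetical per-frame classes, defined relative to one target speaker
# (an assumption of this sketch, not a detail confirmed by the summary).
SILENCE, TARGET, NON_TARGET, OVERLAP = 0, 1, 2, 3


def frame_classes(diar: Sequence[Sequence[int]], target: int) -> List[int]:
    """Map a (speakers x frames) 0/1 diarization matrix to one class per frame."""
    classes = []
    for t in range(len(diar[0])):
        tgt = bool(diar[target][t])
        other = any(row[t] for s, row in enumerate(diar) if s != target)
        if tgt and other:
            classes.append(OVERLAP)
        elif tgt:
            classes.append(TARGET)
        elif other:
            classes.append(NON_TARGET)
        else:
            classes.append(SILENCE)
    return classes


def add_class_bias(
    features: Sequence[Sequence[float]],
    classes: Sequence[int],
    bias: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Add the bias vector of each frame's class to that frame's feature vector.

    features: (frames x dim) inputs to the first transformer block
    bias:     (4 x dim) one vector per class; learned in a real model, fixed here
    """
    return [
        [x + b for x, b in zip(frame, bias[c])]
        for frame, c in zip(features, classes)
    ]


# Toy example: two speakers, four frames, two feature dimensions.
diar = [[1, 1, 0, 0],
        [0, 1, 1, 0]]
features = [[0.0, 0.0] for _ in range(4)]
bias = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]]

# Speaker-attributed ASR: condition on each diarized speaker in turn.
conditioned_per_speaker = {
    spk: add_class_bias(features, frame_classes(diar, spk), bias)
    for spk in range(len(diar))
}
```

In a real system each entry of `conditioned_per_speaker` would be passed through the ASR model to decode that speaker's transcript; the sketch stops at the conditioning step because the decoding details are not given here.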
