Target Speaker ASR with Whisper

Alexander Polok; Dominik Klement; Matthew Wiesner; Sanjeev Khudanpur; Jan Černocký; Lukáš Burget

arXiv:2409.09543·eess.AS·January 17, 2025

TL;DR

This paper introduces a simple method for adapting large single-speaker ASR models such as Whisper to target-speaker ASR by conditioning on frame-level diarization outputs. On the NOTSOFAR-1 dataset it outperforms a baseline speech separation and diarization cascade by 12.9% absolute ORC-WER.

Contribution

The authors show that adding a single bias term per diarization output type before the first transformer block can turn an existing single-speaker ASR model into a target-speaker ASR model.

Findings

01

Outperforms a baseline speech separation and diarization cascade by 12.9% absolute ORC-WER on the NOTSOFAR-1 dataset

02

Supports speaker-attributed ASR by sequential transcript generation

03

Enables target speaker ASR with minimal modifications to existing models
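
The sequential generation in finding 02 can be pictured as one recognizer pass per diarized speaker. The sketch below is only an illustration under stated assumptions: the four frame classes (silence, target, non_target, overlap), the 0/1 activity matrix, and the `transcribe_target` callback are hypothetical names chosen here, not the interface of the authors' repository.

```python
def frame_labels_for(target, activity):
    """Label each frame from the point of view of one target speaker.

    activity: list of frames, each a list of 0/1 flags, one flag per speaker.
    The class names are assumptions made for this sketch.
    """
    labels = []
    for frame in activity:
        target_on = bool(frame[target])
        others_on = any(flag for i, flag in enumerate(frame) if i != target)
        if target_on and others_on:
            labels.append("overlap")
        elif target_on:
            labels.append("target")
        elif others_on:
            labels.append("non_target")
        else:
            labels.append("silence")
    return labels


def speaker_attributed_transcripts(audio, activity, transcribe_target):
    """Call a target-speaker recognizer once per diarized speaker."""
    num_speakers = len(activity[0]) if activity else 0
    return {
        spk: transcribe_target(audio, frame_labels_for(spk, activity))
        for spk in range(num_speakers)
    }


# Toy diarization output: 4 frames, 2 speakers.
activity = [[0, 0], [1, 0], [1, 1], [0, 1]]


def fake_transcribe(audio, labels):
    # Hypothetical stand-in for a real model: counts frames where the target speaks.
    active = labels.count("target") + labels.count("overlap")
    return f"{active} active frames"


transcripts = speaker_attributed_transcripts(None, activity, fake_transcribe)
```

The same audio is passed every time; only the per-frame labels change with the choice of target speaker, which is what lets one model produce a transcript for each speaker in turn.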

Abstract

We propose a novel approach to enable the use of large, single-speaker ASR models, such as Whisper, for target speaker ASR. The key claim of this method is that it is much easier to model relative differences among speakers by learning to condition on frame-level diarization outputs than to learn the space of all speaker embeddings. We find that adding even a single bias term per diarization output type before the first transformer block can transform single-speaker ASR models into target-speaker ASR models. Our approach also supports speaker-attributed ASR by sequentially generating transcripts for each speaker in a diarization output. This simplified method outperforms baseline speech separation and diarization cascade by 12.9% absolute ORC-WER on the NOTSOFAR-1 dataset.
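
As a rough illustration of the "single bias term per diarization output type" idea, the sketch below adds a class-specific bias vector to each frame before it would enter the first transformer block. It is a minimal sketch, not the authors' implementation: plain Python lists stand in for tensors, and the four class names, `FRAME_CLASSES`, and `add_diarization_bias` are assumptions made for this example.

```python
# Hypothetical frame classes; the abstract only says "diarization output type".
FRAME_CLASSES = ("silence", "target", "non_target", "overlap")


def add_diarization_bias(features, frame_labels, biases):
    """Add to each frame the bias vector selected by that frame's label.

    features:     list of frames, each a list of floats (input to the first block)
    frame_labels: one class name per frame
    biases:       dict mapping class name -> bias vector of the same width
    """
    if len(features) != len(frame_labels):
        raise ValueError("need exactly one label per frame")
    return [
        [x + b for x, b in zip(frame, biases[label])]
        for frame, label in zip(features, frame_labels)
    ]


dim = 3
features = [[0.5, -1.0, 2.0], [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]]
frame_labels = ["silence", "target", "overlap"]

# Zero biases add nothing, so the input passes through unchanged.
biases = {name: [0.0] * dim for name in FRAME_CLASSES}
unchanged = add_diarization_bias(features, frame_labels, biases)

# Stand-in for a value that training would learn.
biases["target"] = [0.25, 0.25, 0.25]
conditioned = add_diarization_bias(features, frame_labels, biases)
```

In this sketch, starting every bias at zero means the conditioned model initially behaves exactly like the unmodified one, and the only new parameters are one vector per frame class.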

Code & Models

Repositories

BUTSpeechFIT/TS-ASR-Whisper
PyTorch · Official
