SQ-Whisper: Speaker-Querying based Whisper Model for Target-Speaker ASR

Pengcheng Guo; Xuankai Chang; Hang Lv; Shinji Watanabe; Lei Xie

arXiv:2412.05589 · eess.AS · December 10, 2024 · 2 citations

TL;DR

This paper introduces SQ-Whisper, a target-speaker ASR model that adapts the Whisper foundation model with trainable speaker queries to improve recognition accuracy on overlapping multi-speaker speech.

Contribution

The paper proposes SQ-Whisper, a method that uses trainable queries for target-speaker ASR. It extends Whisper, which otherwise handles only single-speaker input, to the overlapping speech common in real-world scenarios.
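As a rough illustration of what "trainable queries" can mean, the sketch below runs generic single-head dot-product attention in which a small, fixed set of query vectors pools information from a sequence of frame features. This is not the paper's architecture, which the summary here does not spell out. The function names, the sizes, and the random values standing in for learned queries and encoder features are all assumptions made for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_with_queries(queries, frames):
    """Pool frame features with a fixed set of query vectors.

    queries: list of num_queries vectors, each of length dim
    frames:  list of num_frames vectors, each of length dim
    Returns (pooled, weights): one pooled vector and one attention
    distribution over frames per query.
    """
    dim = len(queries[0])
    pooled, weights = [], []
    for query in queries:
        # Scaled dot-product score between this query and every frame.
        scores = [
            sum(q * f for q, f in zip(query, frame)) / math.sqrt(dim)
            for frame in frames
        ]
        w = softmax(scores)
        # Weighted sum of frames gives the query's summary vector.
        pooled.append(
            [sum(w_t * frame[d] for w_t, frame in zip(w, frames)) for d in range(dim)]
        )
        weights.append(w)
    return pooled, weights


# Hypothetical sizes; random values stand in for learned queries and
# for encoder features (assumption: in training the queries would be
# parameters updated by gradient descent).
random.seed(0)
NUM_QUERIES, NUM_FRAMES, DIM = 4, 20, 8
queries = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_QUERIES)]
frames = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_FRAMES)]

pooled, weights = attend_with_queries(queries, frames)
print(len(pooled), len(pooled[0]))  # 4 8
```

The number of queries fixes the size of the pooled summary regardless of how many frames the input has, which is the property that makes a "set number" of queries convenient as a compact speaker representation.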

Findings

01  Achieved relative WER reductions of up to 15% on Libri2Mix and up to 10% on WSJ0-2Mix.

02  With data augmentation, established new state-of-the-art WERs of 14.6% on Libri2Mix and 4.4% on WSJ0-2Mix.

03  Demonstrated effective adaptation to real-world AMI meeting data.
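The findings mix relative reductions (15%, 10%) with absolute WERs (14.6%, 4.4%), and the two are easy to confuse. The short sketch below shows how they relate. The baseline of 20.0% is a made-up number chosen only to keep the arithmetic easy to follow, not a result from the paper, and both function names are hypothetical.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Fraction by which new_wer improves on baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


def wer_after_relative_reduction(baseline_wer, relative_reduction):
    """WER that results from applying a relative reduction to a baseline."""
    return baseline_wer * (1.0 - relative_reduction)


# Hypothetical baseline WER in percent (NOT taken from the paper).
hypothetical_baseline = 20.0

# A 15% relative reduction removes 15% of the baseline's errors,
# not 15 absolute percentage points.
reduced = wer_after_relative_reduction(hypothetical_baseline, 0.15)
print(round(reduced, 2))  # 17.0
```

So a 15% relative reduction from a 20.0% baseline lands at 17.0%, a drop of 3 absolute points.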

Abstract

Benefiting from massive and diverse data sources, speech foundation models exhibit strong generalization and knowledge transfer capabilities to a wide range of downstream tasks. However, a limitation arises from their exclusive handling of single-speaker speech input, making them ineffective in recognizing multi-speaker overlapped speech, a common occurrence in real-world scenarios. In this study, we delve into the adaptation of speech foundation models to eliminate interfering speakers from overlapping speech and perform target-speaker automatic speech recognition (TS-ASR). Initially, we utilize the Whisper model as the foundation for adaptation and conduct a thorough comparison of its integration with existing target-speaker adaptation techniques. We then propose an innovative model termed Speaker-Querying Whisper (SQ-Whisper), which employs a set number of trainable queries to…

Code & Models

Repositories

pengchengguo/espnet (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems

Methods: Sparse Evolutionary Training