SQ-Whisper: Speaker-Querying based Whisper Model for Target-Speaker ASR
Pengcheng Guo, Xuankai Chang, Hang Lv, Shinji Watanabe, Lei Xie

TL;DR
This paper introduces SQ-Whisper, a target-speaker ASR model that adapts the Whisper foundation model with trainable speaker queries, improving recognition accuracy on overlapped multi-speaker speech.
Contribution
The paper proposes SQ-Whisper, a new method using trainable queries for target-speaker recognition, enhancing Whisper's ability to handle overlapping speech in real-world scenarios.
Findings
Achieved relative WER reductions of up to 15% on Libri2Mix and up to 10% on WSJ0-2Mix.
Established new state-of-the-art WERs of 14.6% on Libri2Mix and 4.4% on WSJ0-2Mix with data augmentation.
Demonstrated effective adaptation on real-world AMI meeting data.
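The relative reductions quoted in the findings are ratios against a baseline WER, not absolute percentage-point differences. The sketch below shows that arithmetic under stated assumptions: the function name `relative_wer_reduction` and the baseline and improved WER values are hypothetical illustrations, not figures from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical WERs in percent, chosen only to illustrate the formula.
baseline = 20.0
improved = 17.0

reduction = relative_wer_reduction(baseline, improved)
print(f"absolute reduction: {baseline - improved:.1f} points")
print(f"relative reduction: {reduction:.1%}")
```

With these made-up inputs, a 3-point absolute drop corresponds to a 15% relative reduction, which is why a relative figure cannot be compared directly with the absolute WERs reported alongside it.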
Abstract
Benefiting from massive and diverse data sources, speech foundation models exhibit strong generalization and knowledge transfer capabilities to a wide range of downstream tasks. However, a limitation arises from their exclusive handling of single-speaker speech input, making them ineffective in recognizing multi-speaker overlapped speech, a common occurrence in real-world scenarios. In this study, we delve into the adaptation of speech foundation models to eliminate interfering speakers from overlapping speech and perform target-speaker automatic speech recognition (TS-ASR). Initially, we utilize the Whisper model as the foundation for adaptation and conduct a thorough comparison of its integration with existing target-speaker adaptation techniques. We then propose an innovative model termed Speaker-Querying Whisper (SQ-Whisper), which employs a set number of trainable queries to…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and dialogue systems
Methods: Sparse Evolutionary Training
