Empowering Whisper as a Joint Multi-Talker and Target-Talker Speech Recognition System
Lingwei Meng, Jiawen Kang, Yuejiao Wang, Zengrui Jin, Xixin Wu, Xunying Liu, Helen Meng

TL;DR
This paper adapts the Whisper speech foundation model to perform multi-talker and target-talker speech recognition jointly. It combines a Sidecar separator, a Target Talker Identifier, and soft prompt tuning, and outperforms previous methods on LibriMix and LibriSpeechMix.
Contribution
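It enables a frozen Whisper model to perform joint multi-talker and target-talker recognition by combining a Sidecar separator plugged into the encoder, a Target Talker Identifier that needs only three seconds of enrollment speech, and soft prompt tuning of the decoder.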
Findings
Outperforms previous methods on the two- and three-talker LibriMix and LibriSpeechMix datasets.
Achieves acceptable zero-shot performance on the AishellMix Mandarin dataset.
Demonstrates the effectiveness of the combined approach for complex speech recognition tasks.
Abstract
Multi-talker speech recognition and target-talker speech recognition, both of which involve transcription in multi-talker contexts, remain significant challenges. However, existing methods rarely attempt to address both tasks simultaneously. In this study, we propose a pioneering approach to empower Whisper, a speech foundation model, to tackle joint multi-talker and target-talker speech recognition tasks. Specifically, (i) we freeze Whisper and plug a Sidecar separator into its encoder to separate the mixed embedding for multiple talkers; (ii) a Target Talker Identifier is introduced to identify the embedding flow of the target talker on the fly, requiring only three-second enrollment speech as a cue; (iii) soft prompt tuning for the decoder is explored for better task adaptation. Our method outperforms previous methods on two- and three-talker LibriMix and LibriSpeechMix datasets for both…
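The three components in the abstract can be pictured with a toy, dependency-free sketch. This is not the authors' implementation. It assumes, purely for illustration, that the separator produces one multiplicative mask per talker, that the target flow is chosen by cosine similarity against a mean-pooled enrollment embedding, and that the soft prompt is a short list of learnable vectors prepended to the decoder's input embeddings. All function names (apply_masks, pick_target, prepend_soft_prompt) are hypothetical.

```python
import math

def apply_masks(mixed, masks):
    """Split a mixed embedding sequence into one flow per talker.

    mixed: list of frames, each frame a list of floats.
    masks: one mask per talker, same shape as `mixed` (assumed form).
    """
    return [
        [[m * x for m, x in zip(mask_frame, frame)]
         for mask_frame, frame in zip(mask, mixed)]
        for mask in masks
    ]

def mean_pool(flow):
    """Average a sequence of frames into a single vector."""
    n = len(flow)
    return [sum(frame[d] for frame in flow) / n for d in range(len(flow[0]))]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def pick_target(flows, enrollment):
    """Return the index of the flow most similar to the enrollment vector."""
    scores = [cosine(mean_pool(flow), enrollment) for flow in flows]
    return max(range(len(flows)), key=scores.__getitem__)

def prepend_soft_prompt(prompt, token_embeddings):
    """Prepend prompt vectors (the only trainable part) to decoder inputs."""
    return prompt + token_embeddings

# Toy data: two frames, two dimensions, two talkers.
mixed = [[1.0, 1.0], [2.0, 2.0]]
masks = [
    [[1.0, 0.0], [1.0, 0.0]],  # talker 0 keeps dimension 0
    [[0.0, 1.0], [0.0, 1.0]],  # talker 1 keeps dimension 1
]
flows = apply_masks(mixed, masks)
target = pick_target(flows, enrollment=[0.1, 0.9])  # selects flow 1
decoder_inputs = prepend_soft_prompt([[0.0, 0.0]], flows[target])
```

In the paper's setting Whisper itself stays frozen, so only the added modules and the prompt vectors would be trained. The sketch shows the data flow only, not the training procedure.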
