Empowering Whisper as a Joint Multi-Talker and Target-Talker Speech Recognition System

Lingwei Meng; Jiawen Kang; Yuejiao Wang; Zengrui Jin; Xixin Wu; Xunying Liu; Helen Meng

arXiv:2407.09817·cs.SD·August 27, 2024

TL;DR

This paper adapts the Whisper speech model to perform multi-talker and target-talker speech recognition jointly. It integrates a separator, a target-talker identifier, and soft prompt tuning, and outperforms previous methods on the LibriMix and LibriSpeechMix datasets.

Contribution

It presents an approach that enables Whisper to perform joint multi-talker and target-talker recognition by keeping Whisper frozen and combining a Sidecar separator in the encoder, a Target Talker Identifier, and soft prompt tuning for the decoder.

Findings

01 Outperforms previous methods on the two- and three-talker LibriMix and LibriSpeechMix datasets.

02 Achieves acceptable zero-shot performance on the AishellMix Mandarin dataset.

03 Demonstrates that combining the separator, target identifier, and soft prompt tuning is effective for recognition in multi-talker contexts.

Abstract

Multi-talker speech recognition and target-talker speech recognition, both of which involve transcription in multi-talker contexts, remain significant challenges. However, existing methods rarely attempt to address both tasks simultaneously. In this study, we propose a pioneering approach to empower Whisper, a speech foundation model, to tackle joint multi-talker and target-talker speech recognition tasks. Specifically, (i) we freeze Whisper and plug a Sidecar separator into its encoder to separate the mixed embedding for multiple talkers; (ii) a Target Talker Identifier is introduced to identify the embedding flow of the target talker on the fly, requiring only three seconds of enrollment speech as a cue; (iii) soft prompt tuning for the decoder is explored for better task adaptation. Our method outperforms previous methods on two- and three-talker LibriMix and LibriSpeechMix datasets for both…
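The three components listed in the abstract can be illustrated with a toy data-flow sketch. This is not the authors' implementation: the function names, the use of element-wise masks, mean pooling, and cosine similarity are all assumptions made for illustration, and the masks are supplied by hand where a trained separator would predict them. The sketch only shows the order of operations: split a mixed embedding into per-talker branches, pick the branch that best matches an enrollment cue, and prepend prompt vectors to the decoder's input embeddings.

```python
import math

def apply_masks(mixed, masks):
    """Split a mixed embedding (frames x dims) into one branch per talker.

    Hypothetical stand-in for a separator: each mask is frames x dims and is
    multiplied element-wise with the mixed embedding.
    """
    return [
        [[m * x for m, x in zip(mask_row, frame)]
         for mask_row, frame in zip(mask, mixed)]
        for mask in masks
    ]

def mean_pool(embedding):
    """Average a frames x dims embedding over time into a single vector."""
    n = len(embedding)
    return [sum(col) / n for col in zip(*embedding)]

def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def pick_target_branch(branches, enrollment):
    """Return the index of the branch most similar to the enrollment cue.

    Assumption: similarity is cosine over mean-pooled vectors.
    """
    cue = mean_pool(enrollment)
    scores = [cosine(mean_pool(branch), cue) for branch in branches]
    return max(range(len(scores)), key=scores.__getitem__)

def prepend_soft_prompt(prompt, token_embeddings):
    """Place prompt vectors (the only trainable part) before token embeddings."""
    return prompt + token_embeddings

# Toy example: 2 frames, 2 dims, 2 talkers occupying one dimension each.
mixed = [[1.0, 1.0], [2.0, 2.0]]
masks = [
    [[1.0, 0.0], [1.0, 0.0]],  # talker 0 keeps dimension 0
    [[0.0, 1.0], [0.0, 1.0]],  # talker 1 keeps dimension 1
]
branches = apply_masks(mixed, masks)
enrollment = [[0.0, 3.0]]  # hypothetical cue resembling talker 1
target = pick_target_branch(branches, enrollment)
decoder_input = prepend_soft_prompt([[0.1, 0.1]], [[1.0, 2.0], [3.0, 4.0]])
```

In this toy setup the enrollment cue lies along the second dimension, so the second branch is selected. In a frozen-backbone setting, only the separator, the identifier, and the prompt vectors would carry trainable parameters.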

Code & Models

Repositories

LingweiMeng/Whisper-Sidecar (PyTorch, official)

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Educational Reforms and Innovations