A Sidecar Separator Can Convert a Single-Talker Speech Recognition System to a Multi-Talker One

Lingwei Meng; Jiawen Kang; Mingyu Cui; Yuejiao Wang; Xixin Wu; Helen Meng

arXiv:2302.09908·cs.SD·March 7, 2023·1 cites

TL;DR

This paper introduces a Sidecar separator that enhances a single-talker ASR system to effectively recognize overlapping speech from multiple talkers by leveraging layer-specific speech embeddings.

Contribution

The paper proposes a novel Sidecar separator approach that, when added to a pre-trained ASR model, significantly improves multi-talker speech recognition performance.

Findings

01

Achieves 10.36% WER on LibriMix 2-speaker dataset

02

Outperforms the previous state of the art on 2-speaker LibriMix by a large margin while training only the Sidecar

03

Maintains comparable performance with limited training data
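The headline finding is reported as a word error rate (WER). As a reminder of what that number measures, here is a minimal sketch of the standard single-utterance WER: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` and the example sentences are illustrative assumptions; this page does not say how the paper scores multi-talker output (for example, how hypotheses are matched to speakers), so this is not a reproduction of the paper's evaluation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Illustrative sentences only, not data from the paper.
    wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
    print(f"WER = {wer:.2%}")
```

A WER of 10.36% therefore means roughly one word-level error for every ten reference words, under whatever matching scheme the paper's evaluation uses.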

Abstract

Although automatic speech recognition (ASR) can perform well in common non-overlapping environments, sustaining performance in multi-talker overlapping speech recognition remains challenging. Recent research revealed that an ASR model's encoder captures different levels of information at different layers: the lower layers tend to carry more acoustic information, and the upper layers more linguistic information. This inspires us to develop a Sidecar separator to empower a well-trained ASR model for multi-talker scenarios by separating the mixed speech embedding between two suitable layers. We experimented with a wav2vec 2.0-based ASR model with a Sidecar mounted. By freezing the parameters of the original model and training only the Sidecar (8.7 M, 8.4% of all parameters), the proposed approach outperforms the previous state-of-the-art by a large margin for the 2-speaker mixed LibriMix dataset,…
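The abstract gives the trainable share as 8.7 M parameters, or 8.4% of the total. The short calculation below works backward from those two rounded figures to the implied size of the whole system and of the frozen backbone. The results are approximations derived here, not numbers stated in the abstract, and the function name `implied_totals` is an illustrative assumption.

```python
# Figures quoted in the abstract (both are rounded in the source).
SIDECAR_PARAMS_M = 8.7    # trainable Sidecar parameters, in millions
SIDECAR_FRACTION = 0.084  # Sidecar share of all parameters


def implied_totals(sidecar_m: float, fraction: float) -> tuple[float, float]:
    """Return (total, frozen) parameter counts in millions."""
    total = sidecar_m / fraction  # all parameters, Sidecar included
    frozen = total - sidecar_m    # parameters of the frozen original model
    return total, frozen


total_m, frozen_m = implied_totals(SIDECAR_PARAMS_M, SIDECAR_FRACTION)

if __name__ == "__main__":
    print(f"Implied total parameters:  ~{total_m:.1f} M")
    print(f"Implied frozen parameters: ~{frozen_m:.1f} M")
```

Because both inputs are rounded, the implied totals are only accurate to within a million parameters or so.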

Code & Models

Repositories

LingweiMeng/Whisper-Sidecar
pytorch

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing