Unified Modeling of Multi-Talker Overlapped Speech Recognition and Diarization with a Sidecar Separator

Lingwei Meng; Jiawen Kang; Mingyu Cui; Haibin Wu; Xixin Wu; Helen Meng

arXiv:2305.16263 · cs.SD · May 26, 2023 · 1 citation

Open Access

TL;DR

This paper introduces a unified model of speech recognition and diarization for overlapped multi-talker speech. It adds a diarization branch to a Sidecar separator at a cost of only 768 parameters and improves ASR results over the baseline on LibriMix and LibriSpeechMix.

Contribution

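The study extends the Sidecar separator approach, which converts a frozen single-talker ASR model into a multi-talker one, by adding a diarization branch. This enables joint modeling of ASR and diarization with an overhead of only 768 parameters and better ASR results than the baseline.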

Findings

01. Better ASR results than the baseline on the LibriMix and LibriSpeechMix datasets.

02. Acceptable diarization results on CALLHOME with minimal adaptation.

03. Joint modeling of ASR and diarization adds only 768 parameters.

Abstract

Multi-talker overlapped speech poses a significant challenge for speech recognition and diarization. Recent research indicated that these two tasks are inter-dependent and complementary, motivating us to explore a unified modeling method to address them in the context of overlapped speech. A recent study proposed a cost-effective method to convert a single-talker automatic speech recognition (ASR) system into a multi-talker one by inserting a Sidecar separator into the frozen, well-trained ASR model. Building on this, we incorporate a diarization branch into the Sidecar, allowing for unified modeling of both ASR and diarization with a negligible overhead of only 768 parameters. The proposed method yields better ASR results than the baseline on the LibriMix and LibriSpeechMix datasets. Moreover, without sophisticated customization on the diarization task, our method achieves…
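To give a feel for how small a 768-parameter add-on is, here is a minimal Python sketch of one possible diarization head. It is not the paper's confirmed design. It assumes, purely for illustration, that each separated speaker stream is a sequence of 768-dimensional frame features and that the head is a single bias-free linear projection shared across streams. The names `FEATURE_DIM`, `diarization_head`, and `count_params` are hypothetical, and the input features are random stand-ins.

```python
import math
import random

# Assumption: 768-dimensional frame features. The source states only the
# 768-parameter total, not how those parameters are arranged.
FEATURE_DIM = 768


def count_params(weights):
    """Return the number of scalar weights in a flat weight vector."""
    return len(weights)


def diarization_head(frames, weights, threshold=0.5):
    """Map per-frame feature vectors of one speaker stream to 0/1 activity.

    Hypothetical head: one bias-free linear projection followed by a sigmoid
    and a fixed threshold.
    """
    activity = []
    for frame in frames:
        logit = sum(w * x for w, x in zip(weights, frame))
        prob = 1.0 / (1.0 + math.exp(-logit))
        activity.append(1 if prob >= threshold else 0)
    return activity


random.seed(0)
head_weights = [random.gauss(0.0, 0.02) for _ in range(FEATURE_DIM)]

# Two separated speaker streams of 5 frames each (random placeholder features).
streams = [
    [[random.gauss(0.0, 1.0) for _ in range(FEATURE_DIM)] for _ in range(5)]
    for _ in range(2)
]

# The same head is reused for every stream, so the overhead stays at 768 weights.
speaker_activity = [diarization_head(s, head_weights) for s in streams]
extra_params = count_params(head_weights)
```

Under these assumptions, sharing one projection across speaker streams keeps the added parameter count independent of the number of speakers, which is one way an overhead this small could arise.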

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research