Adapting Multi-Lingual ASR Models for Handling Multiple Talkers

Chenda Li; Yao Qian; Zhuo Chen; Naoyuki Kanda; Dongmei Wang; Takuya Yoshioka; Yanmin Qian; and Michael Zeng

arXiv:2305.18747·eess.AS·May 31, 2023·1 cites

TL;DR

This paper presents a method to adapt large-scale multilingual speech models for multi-talker ASR, enabling recognition of overlapped speech with timestamp prediction, while maintaining multilingual capabilities even with limited language data.

Contribution

The authors develop an enhanced serialized output training method and combine it with a lightweight adapter to adapt universal speech models (USMs) for multi-talker recognition with timestamp prediction while preserving their multilingual capability.
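To make the "lightweight adapter" idea concrete, here is a minimal sketch of a generic residual bottleneck adapter in plain Python. It is not the paper's exact module: the class name `BottleneckAdapter`, the dimensions, the ReLU nonlinearity, and the zero-initialized up-projection are all assumptions chosen for illustration. The point it shows is that a small trainable branch added to a frozen model's hidden vector can start out as an exact identity, so the pretrained multilingual behavior is untouched before adaptation begins.

```python
import random


def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def relu(vector):
    """Element-wise ReLU."""
    return [max(0.0, x) for x in vector]


class BottleneckAdapter:
    """Hypothetical residual adapter: x + W_up * relu(W_down * x).

    Only W_down and W_up would be trained; the host model stays frozen.
    """

    def __init__(self, dim, bottleneck, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        self.bottleneck = bottleneck
        # Down-projection: small random values (assumed initialization).
        self.down = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                     for _ in range(bottleneck)]
        # Up-projection: zeros, so the adapter is an identity map at start.
        self.up = [[0.0] * bottleneck for _ in range(dim)]

    def __call__(self, hidden):
        delta = matvec(self.up, relu(matvec(self.down, hidden)))
        return [h + d for h, d in zip(hidden, delta)]

    def num_parameters(self):
        # Two projection matrices, no biases in this sketch.
        return 2 * self.dim * self.bottleneck


adapter = BottleneckAdapter(dim=8, bottleneck=2)
hidden = [float(i) for i in range(8)]
print(adapter(hidden) == hidden)   # identity before any training
print(adapter.num_parameters())    # far fewer than a full dim x dim layer
```

With a bottleneck much smaller than the hidden size, the adapter adds only `2 * dim * bottleneck` weights per insertion point, which is what makes single-language adaptation cheap relative to fine-tuning the whole model.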

Findings

01. Effective transfer of USMs to multi-talker ASR with timestamps

02. Maintains multilingual performance with limited adaptation data

03. Improves recognition accuracy on meeting conversation datasets

Abstract

State-of-the-art large-scale universal speech models (USMs) show a decent automatic speech recognition (ASR) performance across multiple domains and languages. However, it remains a challenge for these models to recognize overlapped speech, which is often seen in meeting conversations. We propose an approach to adapt USMs for multi-talker ASR. We first develop an enhanced version of serialized output training to jointly perform multi-talker ASR and utterance timestamp prediction. That is, we predict the ASR hypotheses for all speakers, count the speakers, and estimate the utterance timestamps at the same time. We further introduce a lightweight adapter module to maintain the multilingual property of the USMs even when we perform the adaptation with only a single language. Experimental results obtained using the AMI and AliMeeting corpora show that our proposed approach effectively…
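The joint prediction described in the abstract can be illustrated with a small sketch of how a serialized training target might be built. Serialized output training orders the utterances of overlapping speakers by start time and joins them with a speaker-change token. The timestamp token format, the 0.02 s resolution, and the names `Utterance`, `timestamp_token`, `serialize`, and `count_segments` below are assumptions for illustration, not details taken from the paper.

```python
from dataclasses import dataclass

SPEAKER_CHANGE = "<sc>"  # separator between serialized utterances


@dataclass
class Utterance:
    speaker: str
    start: float  # seconds
    end: float    # seconds
    text: str


def timestamp_token(seconds, resolution=0.02):
    """Quantize a time to a discrete token (hypothetical token format)."""
    quantized = round(seconds / resolution) * resolution
    return f"<|{quantized:.2f}|>"


def serialize(utterances, with_timestamps=True):
    """Build one target string: utterances sorted by start time,
    joined by the speaker-change token."""
    parts = []
    for utt in sorted(utterances, key=lambda u: u.start):
        if with_timestamps:
            parts.append(f"{timestamp_token(utt.start)} {utt.text} "
                         f"{timestamp_token(utt.end)}")
        else:
            parts.append(utt.text)
    return f" {SPEAKER_CHANGE} ".join(parts)


def count_segments(serialized):
    """Number of serialized utterances; this equals the speaker count
    when each speaker contributes exactly one utterance."""
    return serialized.count(SPEAKER_CHANGE) + 1


mixture = [
    Utterance("B", 1.2, 3.0, "hi there"),
    Utterance("A", 0.0, 1.5, "hello"),
]
target = serialize(mixture)
print(target)
print(count_segments(target))
```

Because transcripts, separators, and timestamps all live in one token sequence, a single decoder can emit the hypotheses, the implied speaker count, and the utterance boundaries in one pass.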

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems

Methods: Adapter