Adapting Multi-Lingual ASR Models for Handling Multiple Talkers
Chenda Li, Yao Qian, Zhuo Chen, Naoyuki Kanda, Dongmei Wang, Takuya Yoshioka, Yanmin Qian, and Michael Zeng

TL;DR
This paper presents a method to adapt large-scale multilingual speech models for multi-talker ASR, enabling recognition of overlapped speech with timestamp prediction, while maintaining multilingual capabilities even with limited language data.
Contribution
The authors develop an enhanced serialized output training method combined with a lightweight adapter to adapt USMs for multi-talker recognition with timestamp prediction, preserving multilinguality.
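To make the "lightweight adapter" idea concrete, here is a minimal sketch of a residual bottleneck adapter, a common adapter design. This summary does not specify the adapter architecture used in the paper, so the layer layout, the class name `BottleneckAdapter`, the helper `matvec`, and the zero-initialized up-projection are all assumptions made for illustration. The sketch shows why such a module is cheap (it adds only `2 * dim * bottleneck` weights) and why it can start out as an identity mapping that leaves the frozen base model's behavior untouched.

```python
import random


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


class BottleneckAdapter:
    """Hypothetical residual bottleneck adapter: x + up(relu(down(x))).

    This is a generic adapter pattern, not the paper's confirmed design.
    """

    def __init__(self, dim, bottleneck, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        self.bottleneck = bottleneck
        # Down-projection: small random weights, shape (bottleneck, dim).
        self.down = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                     for _ in range(bottleneck)]
        # Up-projection: zeros, shape (dim, bottleneck), so the adapter
        # is an identity mapping before any training (an assumption).
        self.up = [[0.0] * bottleneck for _ in range(dim)]

    def __call__(self, x):
        hidden = [max(0.0, v) for v in matvec(self.down, x)]  # ReLU
        delta = matvec(self.up, hidden)
        # Residual connection: the frozen layer's output passes through.
        return [xi + di for xi, di in zip(x, delta)]

    def num_params(self):
        """Count trainable weights (biases omitted in this sketch)."""
        return 2 * self.dim * self.bottleneck


adapter = BottleneckAdapter(dim=8, bottleneck=2)
features = [float(i) for i in range(8)]
print(adapter(features) == features)  # identity at initialization
print(adapter.num_params())
```

In a setup like this, only the adapter weights would be updated during adaptation while the base model stays frozen, which is one plausible way to keep the original multilingual behavior when adapting on a single language.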
Findings
Effective transfer of USMs to multi-talker ASR with timestamps
Maintains multilingual performance with limited adaptation data
Improves recognition accuracy on meeting conversation datasets
Abstract
State-of-the-art large-scale universal speech models (USMs) show a decent automatic speech recognition (ASR) performance across multiple domains and languages. However, it remains a challenge for these models to recognize overlapped speech, which is often seen in meeting conversations. We propose an approach to adapt USMs for multi-talker ASR. We first develop an enhanced version of serialized output training to jointly perform multi-talker ASR and utterance timestamp prediction. That is, we predict the ASR hypotheses for all speakers, count the speakers, and estimate the utterance timestamps at the same time. We further introduce a lightweight adapter module to maintain the multilingual property of the USMs even when we perform the adaptation with only a single language. Experimental results obtained using the AMI and AliMeeting corpora show that our proposed approach effectively…
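The joint formulation in the abstract can be illustrated with a small sketch of how a serialized training target might be built. In serialized output training, the transcriptions of overlapping utterances are concatenated in order of start time and separated by a speaker-change token, conventionally written `<sc>`. The timestamp token format `<|t.tt|>`, the 0.02 s quantization step, and the function names below are assumptions for illustration only; the exact format used in the paper is not given in this summary.

```python
import re

SPEAKER_CHANGE = "<sc>"
TIMESTAMP_PATTERN = re.compile(r"<\|(\d+\.\d+)\|> (.*) <\|(\d+\.\d+)\|>")


def format_timestamp(seconds, resolution=0.02):
    """Quantize a time to a fixed grid and render it as a token.

    Both the grid step and the token syntax are hypothetical.
    """
    ticks = round(seconds / resolution)
    return f"<|{ticks * resolution:.2f}|>"


def build_sot_target(utterances):
    """Serialize utterances (dicts with start, end, text) into one string.

    Utterances are ordered by start time, each wrapped in start/end
    timestamp tokens, and joined with the speaker-change token.
    """
    ordered = sorted(utterances, key=lambda u: u["start"])
    segments = [
        f"{format_timestamp(u['start'])} {u['text']} {format_timestamp(u['end'])}"
        for u in ordered
    ]
    return f" {SPEAKER_CHANGE} ".join(segments)


def parse_sot_output(serialized):
    """Recover (start, end, text) tuples from a serialized string."""
    results = []
    for segment in serialized.split(f" {SPEAKER_CHANGE} "):
        match = TIMESTAMP_PATTERN.fullmatch(segment)
        if match is None:
            raise ValueError(f"malformed segment: {segment!r}")
        start, text, end = match.groups()
        results.append((float(start), float(end), text))
    return results


# Two overlapping utterances: the second starts before the first ends.
utterances = [
    {"start": 1.5, "end": 3.1, "text": "how are you"},
    {"start": 0.0, "end": 2.4, "text": "hello there"},
]
target = build_sot_target(utterances)
print(target)
print(parse_sot_output(target))
```

With a target of this shape, a single decoder pass yields the transcripts, the number of serialized segments (one more than the number of `<sc>` tokens), and the utterance boundaries together, which is the kind of joint prediction the abstract describes.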
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems
Methods: Adapter
