Advancing Multi-talker ASR Performance with Large Language Models
Mohan Shi, Zengrui Jin, Yaoxun Xu, Yong Xu, Shi-Xiong Zhang, Kun Wei, Yiwen Shao, Chunlei Zhang, Dong Yu

TL;DR
This paper introduces an LLM-based serialized output training method for multi-talker ASR, leveraging pre-trained speech encoders and language models to improve recognition accuracy in overlapping speech scenarios.
Contribution
The paper proposes an LLM-based SOT approach that fine-tunes pre-trained models for multi-talker ASR, outperforming traditional AED-based methods on LibriMix and achieving state-of-the-art results on the AMI dataset.
Findings
Outperforms traditional AED-based methods on LibriMix
Achieves state-of-the-art results on the AMI dataset
Surpasses models trained with significantly more data
Abstract
Recognizing overlapping speech from multiple speakers in conversational scenarios is one of the most challenging problems for automatic speech recognition (ASR). Serialized output training (SOT) is a classic method to address multi-talker ASR, with the idea of concatenating transcriptions from multiple speakers according to the emission times of their speech for training. However, SOT-style transcriptions, derived from concatenating multiple related utterances in a conversation, depend significantly on modeling long contexts. Therefore, compared to traditional methods that primarily emphasize encoder performance in attention-based encoder-decoder (AED) architectures, a novel approach utilizing large language models (LLMs) that leverages the capabilities of pre-trained decoders may be better suited for such complex and challenging scenarios. In this paper, we propose an LLM-based SOT…
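The SOT idea described in the abstract, concatenating each speaker's transcription in order of emission time, can be illustrated with a minimal sketch. This is not the authors' implementation: the `Utterance` class, the `build_sot_target` helper, and the `<sc>` speaker-change separator token are all assumptions introduced here for illustration, and the abstract does not specify which separator the paper uses.

```python
from dataclasses import dataclass

# Hypothetical separator marking a change of speaker in the serialized
# target; the actual token used in the paper is not given in the abstract.
SPEAKER_CHANGE = "<sc>"


@dataclass
class Utterance:
    speaker: str   # speaker label (not used in the target, kept for clarity)
    start: float   # emission (start) time in seconds
    text: str      # reference transcription


def build_sot_target(utterances, sep=SPEAKER_CHANGE):
    """Serialize overlapping utterances into one training target.

    Utterances are ordered by start time, then their transcriptions are
    joined with the separator token.
    """
    ordered = sorted(utterances, key=lambda u: u.start)
    return f" {sep} ".join(u.text for u in ordered)


# Two overlapping utterances, listed out of order on purpose.
mixture = [
    Utterance("B", 1.2, "i am fine thanks"),
    Utterance("A", 0.0, "hello how are you"),
]
sot_target = build_sot_target(mixture)
print(sot_target)  # hello how are you <sc> i am fine thanks
```

Because the serialized target strings together several related utterances from one conversation, it tends to be longer and more context-dependent than a single-speaker transcript, which is the motivation the abstract gives for using an LLM decoder.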
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and dialogue systems
Methods: Sparse Evolutionary Training
