Advancing Multi-talker ASR Performance with Large Language Models

Mohan Shi; Zengrui Jin; Yaoxun Xu; Yong Xu; Shi-Xiong Zhang; Kun Wei; Yiwen Shao; Chunlei Zhang; Dong Yu

arXiv:2408.17431·eess.AS·September 2, 2024

TL;DR

This paper introduces an LLM-based serialized output training method for multi-talker ASR, leveraging pre-trained speech encoders and language models to improve recognition accuracy in overlapping speech scenarios.

Contribution

The paper proposes a novel LLM-based SOT approach that fine-tunes pre-trained models for multi-talker ASR, outperforming traditional AED methods and achieving state-of-the-art results.

Findings

01 Outperforms traditional AED-based methods on the simulated LibriMix dataset

02 Achieves state-of-the-art results on the AMI dataset

03 Surpasses models trained with significantly more data

Abstract

Recognizing overlapping speech from multiple speakers in conversational scenarios is one of the most challenging problems for automatic speech recognition (ASR). Serialized output training (SOT) is a classic method to address multi-talker ASR, with the idea of concatenating transcriptions from multiple speakers according to the emission times of their speech for training. However, SOT-style transcriptions, derived from concatenating multiple related utterances in a conversation, depend significantly on modeling long contexts. Therefore, compared to traditional methods that primarily emphasize encoder performance in attention-based encoder-decoder (AED) architectures, a novel approach utilizing large language models (LLMs) that leverages the capabilities of pre-trained decoders may be better suited for such complex and challenging scenarios. In this paper, we propose an LLM-based SOT…
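
To make the SOT idea in the abstract concrete, here is a minimal sketch of how a serialized training target could be built: utterances from different speakers are ordered by the time their speech starts and concatenated into one transcription. The `Utterance` class, the `build_sot_target` helper, and the `<sc>` speaker-change token are illustrative assumptions; the abstract does not specify the separator or any data format used by the authors.

```python
from dataclasses import dataclass
from typing import Iterable

# Assumed separator marking a change of speaker; not named in the abstract.
SPEAKER_CHANGE = "<sc>"


@dataclass
class Utterance:
    """One speaker's utterance inside an overlapping-speech mixture (hypothetical type)."""
    speaker: str
    start: float  # emission (start) time in seconds
    text: str


def build_sot_target(utterances: Iterable[Utterance], sep: str = SPEAKER_CHANGE) -> str:
    """Concatenate transcriptions in order of emission time, first in, first out."""
    ordered = sorted(utterances, key=lambda u: u.start)
    return f" {sep} ".join(u.text for u in ordered)


# Two overlapping utterances: speaker B starts before speaker A.
mixture = [
    Utterance(speaker="A", start=1.2, text="how are you"),
    Utterance(speaker="B", start=0.3, text="hello there"),
]
sot_target = build_sot_target(mixture)
print(sot_target)
```

Because the target is a single long token sequence spanning several related utterances, the decoder has to model long context, which is the motivation the abstract gives for using a pre-trained LLM as the decoder.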

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems

Methods: Sparse Evolutionary Training