CMT-LLM: Contextual Multi-Talker ASR Utilizing Large Language Models

Jiajun He; Naoki Sawada; Koichi Miyazaki; Tomoki Toda

arXiv:2506.12059·eess.AS·June 17, 2025

Open Access

TL;DR

This paper introduces a unified ASR framework that combines multi-talker speech recognition and contextual biasing using pretrained speech encoders and large language models, improving recognition of overlapping speech and rare words.

Contribution

The paper merges multi-talker recognition and contextual biasing into a single task, and adds a two-stage filtering algorithm that selects relevant rare words from large biasing lists and places them in the LLM's prompt.

Findings

01 Achieves 7.9% WER on LibriMix

02 Achieves 32.9% WER on AMI SDM

03 Outperforms traditional biasing methods
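
For readers unfamiliar with the metric, the sketch below computes a plain word error rate (WER): the word-level edit distance between a reference and a hypothesis, divided by the number of reference words. It is a minimal single-reference illustration. The function name `word_error_rate` is hypothetical, and the scoring protocol the paper uses for multi-talker output (for example, how speaker streams are matched to references) is not described on this page and is not reproduced here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors over 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

A WER of 7.9% therefore means roughly 8 word errors per 100 reference words under whichever alignment the evaluation uses.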

Abstract

In real-world applications, automatic speech recognition (ASR) systems must handle overlapping speech from multiple speakers and recognize rare words like technical terms. Traditional methods address multi-talker ASR and contextual biasing separately, limiting performance in complex scenarios. We propose a unified framework that combines multi-talker overlapping speech recognition and contextual biasing into a single task. Our ASR method integrates pretrained speech encoders and large language models (LLMs), using optimized finetuning strategies. We also introduce a two-stage filtering algorithm to efficiently identify relevant rare words from large biasing lists and incorporate them into the LLM's prompt input, enhancing rare word recognition. Experiments show that our approach outperforms traditional contextual biasing methods, achieving a WER of 7.9% on LibriMix and 32.9% on AMI SDM…
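
The abstract does not say what the two filtering stages are, so the following is only an illustrative sketch of the general idea, not the authors' algorithm. It assumes a first-pass transcript is available, uses a cheap character-bigram overlap as a coarse first stage, re-ranks the survivors with a finer string-similarity score as the second stage, and then places the selected words in a prompt. All names (`filter_biasing_words`, `build_prompt`), the thresholds, and the prompt wording are hypothetical.

```python
from difflib import SequenceMatcher


def char_ngrams(word: str, n: int = 2) -> set:
    """Character n-grams of a lowercased word (the word itself if it is too short)."""
    w = word.lower()
    return {w[i:i + n] for i in range(len(w) - n + 1)} or {w}


def filter_biasing_words(hypothesis, biasing_list,
                         coarse_threshold=0.3, min_ratio=0.6, top_k=5):
    """Assumed two-stage filter: coarse n-gram overlap, then fine similarity ranking."""
    hyp_words = hypothesis.lower().split()
    hyp_grams = [char_ngrams(w) for w in hyp_words]

    # Stage 1 (assumption): keep candidates whose bigram Jaccard overlap
    # with any hypothesis word reaches the coarse threshold.
    survivors = []
    for cand in biasing_list:
        grams = char_ngrams(cand)
        best = max((len(grams & g) / len(grams | g) for g in hyp_grams),
                   default=0.0)
        if best >= coarse_threshold:
            survivors.append(cand)

    # Stage 2 (assumption): score survivors with a finer similarity
    # measure and keep the top_k most similar ones.
    scored = []
    for cand in survivors:
        ratio = max(SequenceMatcher(None, cand.lower(), w).ratio()
                    for w in hyp_words)
        if ratio >= min_ratio:
            scored.append((ratio, cand))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [cand for _, cand in scored[:top_k]]


def build_prompt(selected_words):
    """Hypothetical prompt format that lists the selected rare words."""
    prompt = "Transcribe the overlapping speech."
    if selected_words:
        prompt += " These words may appear: " + ", ".join(selected_words) + "."
    return prompt


if __name__ == "__main__":
    first_pass = "the patient was given acetaminofen and ibuprofin"
    biasing_list = ["acetaminophen", "ibuprofen", "zebra", "quantum"]
    selected = filter_biasing_words(first_pass, biasing_list)
    print(build_prompt(selected))
```

The coarse stage exists so that the more expensive comparison runs only on a small subset of a large biasing list, which keeps the prompt short enough for the LLM.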

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis