CMT-LLM: Contextual Multi-Talker ASR Utilizing Large Language Models
Jiajun He, Naoki Sawada, Koichi Miyazaki, Tomoki Toda

TL;DR
This paper introduces a unified ASR framework that combines multi-talker speech recognition and contextual biasing using pretrained speech encoders and large language models, improving recognition of overlapping speech and rare words.
Contribution
The paper merges multi-talker recognition and contextual biasing into a single task, and adds a two-stage filtering algorithm that selects relevant rare words from a large biasing list and places them in the LLM's prompt.
Findings
Achieves 7.9% WER on LibriMix
Achieves 32.9% WER on AMI SDM
Outperforms traditional contextual biasing methods
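For readers unfamiliar with the metric behind these numbers, word error rate (WER) is the word-level edit distance between a reference transcript and the system output, divided by the number of reference words. The sketch below is a minimal, generic implementation; the function name `word_error_rate` is illustrative, and the paper's own scoring setup (text normalization, handling of multiple speakers) is not described here and may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, a WER of 7.9% corresponds to roughly 8 word errors (substitutions, deletions, or insertions) per 100 reference words.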
Abstract
In real-world applications, automatic speech recognition (ASR) systems must handle overlapping speech from multiple speakers and recognize rare words like technical terms. Traditional methods address multi-talker ASR and contextual biasing separately, limiting performance in complex scenarios. We propose a unified framework that combines multi-talker overlapping speech recognition and contextual biasing into a single task. Our ASR method integrates pretrained speech encoders and large language models (LLMs), using optimized finetuning strategies. We also introduce a two-stage filtering algorithm to efficiently identify relevant rare words from large biasing lists and incorporate them into the LLM's prompt input, enhancing rare word recognition. Experiments show that our approach outperforms traditional contextual biasing methods, achieving a WER of 7.9% on LibriMix and 32.9% on AMI SDM…
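The abstract does not specify how the two-stage filtering works, so the following is only a hypothetical sketch of the general idea: a cheap first pass that prunes a large biasing list, followed by a more precise second pass whose survivors are placed in the LLM prompt. Everything here is an assumption made for illustration: the use of a first-pass hypothesis, character-bigram Jaccard overlap for the coarse stage, `difflib` similarity for the fine stage, the thresholds, and the prompt template. None of it is taken from the paper.

```python
from difflib import SequenceMatcher


def char_ngrams(word: str, n: int = 2) -> set[str]:
    """Character n-grams of a lowercased word, padded with '#' boundaries."""
    padded = f"#{word.lower()}#"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def coarse_filter(biasing_list, hypothesis_words, min_overlap=0.3):
    """Stage 1 (assumed): keep candidates whose bigram Jaccard overlap
    with any hypothesis word reaches min_overlap."""
    hyp_grams = [char_ngrams(w) for w in hypothesis_words]
    kept = []
    for cand in biasing_list:
        cand_grams = char_ngrams(cand)
        best = max(
            (len(cand_grams & g) / len(cand_grams | g) for g in hyp_grams),
            default=0.0,
        )
        if best >= min_overlap:
            kept.append(cand)
    return kept


def fine_filter(candidates, hypothesis_words, top_k=5, min_ratio=0.6):
    """Stage 2 (assumed): rank survivors by string similarity to the
    closest hypothesis word and keep the top_k above min_ratio."""
    scored = []
    for cand in candidates:
        best = max(
            (SequenceMatcher(None, cand.lower(), w.lower()).ratio()
             for w in hypothesis_words),
            default=0.0,
        )
        if best >= min_ratio:
            scored.append((best, cand))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [cand for _, cand in scored[:top_k]]


def two_stage_filter(biasing_list, hypothesis_words):
    """Run the coarse pass, then the fine pass on what remains."""
    return fine_filter(coarse_filter(biasing_list, hypothesis_words),
                       hypothesis_words)


def build_prompt(rare_words):
    """Hypothetical prompt template; the paper's actual prompt is not given."""
    hint = ", ".join(rare_words) if rare_words else "none"
    return f"Transcribe the speech. Possible rare words: {hint}."
```

The design choice illustrated is the split itself: the set-overlap test is cheap enough to run over a very large list, while the more expensive similarity scoring only sees the candidates that survive it, which keeps the prompt short.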
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
