Improving Contextual ASR via Multi-grained Fusion with Large Language Models
Shilin Zhou, Zhenghua Li

TL;DR
This paper introduces a multi-grained fusion method combining token-level and phrase-level techniques with Large Language Models to improve contextual keyword recognition in end-to-end ASR systems, achieving state-of-the-art results.
Contribution
It proposes a novel late-fusion approach that jointly leverages token and phrase-level information with LLMs for enhanced ASR performance.
Findings
Achieves state-of-the-art keyword recognition performance on Chinese and English datasets.
Both token-level and phrase-level components significantly improve performance.
The joint multi-grained framework balances the fine-grained accuracy of token-level fusion with the holistic understanding of phrase-level fusion.
Abstract
While end-to-end Automatic Speech Recognition (ASR) models have shown impressive performance in transcribing general speech, they often struggle to accurately recognize contextually relevant keywords, such as proper nouns or user-specific entities. Previous approaches have explored leveraging keyword dictionaries in the textual modality to improve keyword recognition, either through token-level fusion that guides token-by-token generation or phrase-level fusion that enables direct copying of keyword phrases. However, these methods operate at different granularities and have their own limitations. In this paper, we propose a novel multi-grained fusion approach that jointly leverages the strengths of both token-level and phrase-level fusion with Large Language Models (LLMs). Our approach incorporates a late-fusion strategy that elegantly combines ASR's acoustic information with…
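To make the token-level side of the idea concrete, here is a minimal sketch of a generic late-fusion step, in which an ASR model's per-token scores are combined with a language model's scores by log-linear interpolation. This is an assumption-laden illustration, not the authors' method: the function names, the `llm_weight` parameter, the toy vocabulary, and the scores are all hypothetical, and the phrase-level copying mechanism described in the abstract is not modeled.

```python
import math


def log_softmax(scores):
    """Convert raw scores to log-probabilities in a numerically stable way."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def fuse_token_scores(asr_logits, llm_logits, llm_weight=0.3):
    """Hypothetical late fusion: ASR log-prob plus a weighted LLM log-prob.

    Both inputs are assumed to be scores over the same vocabulary.
    """
    if len(asr_logits) != len(llm_logits):
        raise ValueError("ASR and LLM scores must cover the same vocabulary")
    asr_lp = log_softmax(asr_logits)
    llm_lp = log_softmax(llm_logits)
    return [a + llm_weight * l for a, l in zip(asr_lp, llm_lp)]


def pick_token(vocab, asr_logits, llm_logits, llm_weight=0.3):
    """Return the vocabulary entry with the highest fused score."""
    fused = fuse_token_scores(asr_logits, llm_logits, llm_weight)
    best = max(range(len(vocab)), key=fused.__getitem__)
    return vocab[best]


# Toy example (made-up numbers): the acoustics slightly prefer "Lee",
# while a keyword-aware language model strongly prefers "Li".
vocab = ["Lee", "Li", "Leigh"]
asr_logits = [2.0, 1.8, 0.5]
llm_logits = [0.0, 3.0, 0.0]

acoustic_only = pick_token(vocab, asr_logits, llm_logits, llm_weight=0.0)
with_fusion = pick_token(vocab, asr_logits, llm_logits, llm_weight=0.5)
```

With the weight set to zero the decision rests on acoustics alone; raising it lets textual context override an acoustically similar but contextually unlikely token. In practice such a weight would be tuned on held-out data.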
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling
