Improving Contextual ASR via Multi-grained Fusion with Large Language Models

Shilin Zhou; Zhenghua Li

arXiv:2507.12252·cs.CL·July 17, 2025

TL;DR

This paper introduces a multi-grained fusion method combining token-level and phrase-level techniques with Large Language Models to improve contextual keyword recognition in end-to-end ASR systems, achieving state-of-the-art results.

Contribution

The paper proposes a novel late-fusion approach that jointly leverages token-level and phrase-level information with LLMs to enhance ASR performance.

Findings

01. Achieves state-of-the-art keyword recognition metrics on Chinese and English datasets.

02. Both token-level and phrase-level components significantly improve performance.

03. The joint multi-grained framework balances fine-grained accuracy with holistic understanding.

Abstract

While end-to-end Automatic Speech Recognition (ASR) models have shown impressive performance in transcribing general speech, they often struggle to accurately recognize contextually relevant keywords, such as proper nouns or user-specific entities. Previous approaches have explored leveraging keyword dictionaries in the textual modality to improve keyword recognition, either through token-level fusion that guides token-by-token generation or phrase-level fusion that enables direct copying of keyword phrases. However, these methods operate at different granularities and have their own limitations. In this paper, we propose a novel multi-grained fusion approach that jointly leverages the strengths of both token-level and phrase-level fusion with Large Language Models (LLMs). Our approach incorporates a late-fusion strategy that elegantly combines ASR's acoustic information with…
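
To make the two granularities concrete, here is a minimal Python sketch of what late fusion at decoding time could look like. It is an illustration under stated assumptions, not the paper's formulation, which the truncated abstract does not spell out. The function names (`fuse_token_scores`, `pick_next`), the log-linear interpolation with a single `weight`, and the idea of scoring keyword phrases on the same scale as tokens are all hypothetical choices made for this example.

```python
import math

# Hypothetical sketch: none of these names or formulas come from the paper.

def fuse_token_scores(asr_logp, llm_logp, weight=0.3):
    """Token-level late fusion (assumed form): log-linear interpolation of
    ASR and LLM log-probabilities for the next token."""
    floor = -1e9  # score for a token that one of the two models does not propose
    vocab = set(asr_logp) | set(llm_logp)
    return {
        tok: (1.0 - weight) * asr_logp.get(tok, floor)
        + weight * llm_logp.get(tok, floor)
        for tok in vocab
    }

def pick_next(asr_logp, llm_logp, phrase_scores, weight=0.3):
    """Choose the next decoding action: emit one fused token, or copy a whole
    keyword phrase from the dictionary (phrase-level fusion).

    phrase_scores maps keyword phrases to scores that are assumed to be
    comparable to the fused token scores; how such scores would be obtained
    is outside this sketch."""
    fused = fuse_token_scores(asr_logp, llm_logp, weight)
    candidates = {("token", tok): score for tok, score in fused.items()}
    candidates.update({("phrase", p): s for p, s in phrase_scores.items()})
    return max(candidates, key=candidates.get)

if __name__ == "__main__":
    asr = {"jon": math.log(0.6), "john": math.log(0.4)}
    llm = {"jon": math.log(0.1), "john": math.log(0.9)}
    print(pick_next(asr, llm, phrase_scores={}, weight=0.5))
    print(pick_next(asr, llm, phrase_scores={"John Smith": -0.2}, weight=0.5))
```

In this toy setup the token path lets the LLM's textual evidence shift an acoustically ambiguous choice, while the phrase path lets a dictionary entry be emitted in one step when its score is competitive. That is the kind of complementarity the abstract attributes to combining the two granularities.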

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling