RAG-Boost: Retrieval-Augmented Generation Enhanced LLM-based Speech Recognition

Pengcheng Wang; Sheng Li; Takahiro Shinozaki

arXiv:2508.14048·eess.AS·August 21, 2025

TL;DR

This paper introduces RAG-Boost, a system that enhances speech recognition accuracy by integrating retrieval-augmented generation with large language models, enabling on-the-fly correction of recognition errors.

Contribution

The paper presents a retrieval-augmented approach for LLM-based speech recognition: during decoding, each partial hypothesis retrieves relevant audio-text pairs and domain terms from a vector store, and the retrieved results are fused with the live hypotheses to improve accuracy.

Findings

01 Improved speech recognition accuracy on the MLC-SLM Challenge

02 Effective on-the-fly correction of recognition errors

03 Integration of retrieval-augmented generation with LLMs enhances performance

Abstract

In this paper, we propose RAG-Boost (ST-ShinozakiLab Task I system), which enhances the baseline LLM-based ASR system of the MLC-SLM Challenge (task I) with a retrieval-augmented generation (RAG) module on the fly. Each partial ASR hypothesis queries a vector store of audio-text pairs and domain terms, and the retrieved results are fused with the live ASR hypotheses to fix recognition errors. The fused hypotheses are passed to the LLM, yielding improved responses.
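The retrieve-then-fuse loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the character n-gram embedding stands in for a learned encoder, the store holds only text terms (no audio), fusion is a simple word-level replacement, and the names `embed`, `VectorStore`, `fuse`, `build_prompt`, and the `threshold` parameter are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def embed(text, n=3):
    """Toy embedding (assumption): bag of character n-grams, not a learned encoder."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class VectorStore:
    """Hypothetical in-memory store of domain terms with brute-force search."""

    def __init__(self, entries):
        self.entries = [(entry, embed(entry)) for entry in entries]

    def query(self, text, top_k=1):
        q = embed(text)
        scored = sorted(((cosine(q, vec), entry) for entry, vec in self.entries),
                        reverse=True)
        return scored[:top_k]


def fuse(hypothesis, store, threshold=0.5):
    """Word-level fusion (assumption): swap a word for its nearest stored
    term when the similarity clears the threshold, otherwise keep it."""
    fixed = []
    for word in hypothesis.split():
        hits = store.query(word)
        if hits and hits[0][0] >= threshold:
            fixed.append(hits[0][1])
        else:
            fixed.append(word)
    return " ".join(fixed)


def build_prompt(fused):
    """Hypothetical prompt wrapper for handing the fused text to an LLM."""
    return f"Retrieval-corrected transcript: {fused}\nRefine this transcript."


if __name__ == "__main__":
    store = VectorStore(["Shinozaki", "retrieval"])
    partial = "professor shinosaki spoke"
    print(fuse(partial, store))  # professor Shinozaki spoke
```

In this toy run the misrecognized name shares enough character trigrams with the stored term to be replaced, while unrelated words fall below the threshold and pass through unchanged. A real system would presumably use dense embeddings and an approximate nearest-neighbor index rather than brute-force search.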

Taxonomy

Topics: Speech Recognition and Synthesis