RAG-Boost: Retrieval-Augmented Generation Enhanced LLM-based Speech Recognition
Pengcheng Wang, Sheng Li, Takahiro Shinozaki

TL;DR
This paper introduces RAG-Boost, a system that enhances speech recognition accuracy by integrating retrieval-augmented generation with large language models, enabling on-the-fly correction of recognition errors.
Contribution
The paper presents a novel retrieval-augmented approach for LLM-based speech recognition that improves accuracy by dynamically incorporating relevant audio-text data during decoding.
Findings
Improved speech recognition accuracy on the MLC-SLM Challenge
Effective on-the-fly correction of recognition errors
Integration of retrieval-augmented generation with LLMs enhances performance
Abstract
In this paper, we propose RAG-Boost (ST-ShinozakiLab Task I system), which enhances the baseline LLM-based ASR system of the MLC-SLM Challenge (Task I) with a retrieval-augmented generation (RAG) module that operates on the fly. Each partial ASR hypothesis queries a vector store of audio-text pairs and domain terms, and the retrieved results are fused with the live ASR hypotheses to fix recognition errors. The fused hypotheses are then passed to the LLM, yielding improved responses.
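The retrieve-then-fuse loop described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the character n-gram `embed` function stands in for whatever learned encoder the system uses, `VectorStore`, `fuse`, and the prompt layout are hypothetical names and formats, the store holds only text terms rather than audio-text pairs, and the final LLM call is omitted.

```python
import math
from collections import Counter


def embed(text, n=3):
    """Toy embedding: bag of character n-grams (stand-in for a learned encoder)."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class VectorStore:
    """Hypothetical in-memory store of domain terms, searched by brute force."""

    def __init__(self, entries):
        self.items = [(entry, embed(entry)) for entry in entries]

    def query(self, text, k=2):
        """Return the k stored entries most similar to the query text."""
        q = embed(text)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [entry for entry, _ in ranked[:k]]


def fuse(hypothesis, retrieved):
    """Combine the live hypothesis with retrieved context into one LLM prompt (assumed format)."""
    hints = "; ".join(retrieved)
    return (
        f"ASR hypothesis: {hypothesis}\n"
        f"Retrieved context: {hints}\n"
        "Corrected transcript:"
    )


# Illustrative domain terms; a real store would hold audio-text pairs as well.
store = VectorStore([
    "MLC-SLM Challenge",
    "retrieval-augmented generation",
    "speech recognition",
    "Shinozaki Lab",
])

partial = "retrieval augmented generashun"  # partial hypothesis with a recognition error
retrieved = store.query(partial, k=1)
prompt = fuse(partial, retrieved)
print(prompt)
```

In this sketch the misrecognized phrase still shares most of its character n-grams with the correct domain term, so that term is retrieved and placed next to the hypothesis in the prompt. A real system would run this query for each partial hypothesis as decoding proceeds and pass the fused prompt to the LLM.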
Taxonomy
Topics: Speech Recognition and Synthesis
