Retrieval-Augmented Speech Recognition Approach for Domain Challenges
Peng Shen, Xugang Lu, Hisashi Kawai

TL;DR
This paper presents a retrieval-augmented speech recognition method that leverages domain-specific textual data at inference to improve accuracy in domain mismatch scenarios, inspired by RAG techniques for LLMs.
Contribution
It introduces a novel LLM-based retrieval-augmented approach that uses domain-specific data during inference, not training, to enhance speech recognition performance.
Findings
Significantly improves recognition accuracy on the CSJ dataset.
Achieves state-of-the-art results without full training data.
Efficiently accesses domain-specific documents during inference.
Abstract
Speech recognition systems often face challenges due to domain mismatch, particularly in real-world applications where domain-specific data is unavailable because of data accessibility and confidentiality constraints. Inspired by Retrieval-Augmented Generation (RAG) techniques for large language models (LLMs), this paper introduces an LLM-based retrieval-augmented speech recognition method that incorporates domain-specific textual data at the inference stage to enhance recognition performance. Rather than relying on domain-specific textual data during the training phase, our model is trained to learn how to use textual information provided in prompts to the LLM decoder to improve speech recognition performance. Benefiting from the advantages of the RAG retrieval mechanism, our approach efficiently accesses locally available domain-specific documents, ensuring a convenient and effective…
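The retrieve-then-prompt idea described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the authors' implementation: the bag-of-words cosine retriever, the use of a first-pass transcript as the query, the function names (`tokenize`, `cosine`, `retrieve`, `build_prompt`), and the prompt wording are all hypothetical. The excerpt above does not specify which retriever or prompt format the paper uses.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, documents, top_k=2):
    """Return up to top_k local documents most similar to the query.

    Assumption: the query is some text available at inference time,
    for example a first-pass transcript. Documents with zero overlap
    are dropped.
    """
    query_vec = Counter(tokenize(query))
    scored = [(cosine(query_vec, Counter(tokenize(doc))), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:top_k] if score > 0]


def build_prompt(retrieved):
    """Join retrieved passages into a text prompt (hypothetical format)."""
    context = "\n".join(f"- {doc}" for doc in retrieved)
    return f"Domain context:\n{context}\nTranscribe the speech:"
```

In this sketch, the string returned by `build_prompt` stands in for the textual information that would be passed to the LLM decoder alongside the speech input. Because retrieval runs over local documents at inference time, the domain text never has to be part of the training data, which is the property the abstract emphasizes.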
Taxonomy
Methods: Attention Is All You Need · Weight Decay · Linear Layer · Layer Normalization · Byte Pair Encoding · WordPiece · Dense Connections · Attention Dropout · Residual Connection
