Speech Retrieval-Augmented Generation without Automatic Speech Recognition
Do June Min, Karel Mundnich, Andy Lapastora, Erfan Soltanmohammadi, Srikanth Ronanki, and Kyu Han

TL;DR
SpeechRAG is a speech retrieval-augmented generation framework that retrieves spoken passages and generates answers from them directly, without relying on automatic speech recognition (ASR), which avoids propagating transcription errors into retrieval and generation.
Contribution
The paper presents a speech retrieval-augmented generation method that aligns speech and text embeddings, enabling audio retrieval and answer generation without ASR; it outperforms cascaded systems when transcripts have a high word error rate (WER).
Findings
Direct speech retrieval matches text-based baseline performance.
Speech retrieval outperforms cascaded systems with high WER.
Using a speech language model improves answer generation from audio.
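WER (word error rate) in the findings above is the standard ASR metric: the word-level edit distance between a reference transcript and the ASR hypothesis, divided by the number of reference words. The following is a minimal sketch of that definition only; the function name `word_error_rate` and the whitespace tokenization are illustrative assumptions, not something taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are split on whitespace with no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of six reference words.
wer_example = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
```

A higher WER means more of the transcript is wrong, which is the regime where a cascaded ASR-then-text-RAG pipeline has the most errors to pass downstream.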
Abstract
One common approach for question answering over speech data is to first transcribe speech using automatic speech recognition (ASR) and then employ text-based retrieval-augmented generation (RAG) on the transcriptions. While this cascaded pipeline has proven effective in many practical settings, ASR errors can propagate to the retrieval and generation steps. To overcome this limitation, we introduce SpeechRAG, a novel framework designed for open-question answering over spoken data. Our proposed approach fine-tunes a pre-trained speech encoder into a speech adapter fed into a frozen large language model (LLM)-based retrieval model. By aligning the embedding spaces of text and speech, our speech retriever directly retrieves audio passages from text-based queries, leveraging the retrieval capacity of the frozen text retriever. Our retrieval experiments on spoken question answering datasets…
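Once speech and text embeddings live in one shared space, retrieving audio passages for a text query reduces to nearest-neighbor search over vectors. The sketch below illustrates only that final ranking step, under stated assumptions: the toy three-dimensional vectors stand in for the outputs of a text retriever and a speech adapter, cosine similarity is an assumed scoring function, and the names `cosine_similarity`, `retrieve`, and the `clip_*` identifiers are hypothetical. It is not the authors' implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve(query_embedding, passage_embeddings, k=2):
    """Return the ids of the k passages most similar to the query.

    passage_embeddings maps a passage id to its embedding vector.
    """
    scored = [
        (cosine_similarity(query_embedding, vector), passage_id)
        for passage_id, vector in passage_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage_id for _, passage_id in scored[:k]]


# Toy stand-ins: in a real system these vectors would come from the
# speech adapter (audio passages) and the text retriever (query).
audio_passage_embeddings = {
    "clip_a": [0.9, 0.1, 0.0],
    "clip_b": [0.1, 0.9, 0.2],
    "clip_c": [0.7, 0.3, 0.1],
}
text_query_embedding = [1.0, 0.0, 0.0]

top_clips = retrieve(text_query_embedding, audio_passage_embeddings, k=2)
```

Because the ranking never touches a transcript, a recognition error cannot change which passage is returned; the quality of retrieval depends instead on how well the two embedding spaces are aligned.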
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and dialogue systems
Methods: Adapter
