Speech Retrieval-Augmented Generation without Automatic Speech Recognition

Do June Min, Karel Mundnich, Andy Lapastora, Erfan Soltanmohammadi, Srikanth Ronanki, and Kyu Han

arXiv:2412.16500 · eess.AS · January 6, 2025

Open Access

TL;DR

SpeechRAG is a speech retrieval-augmented generation framework that retrieves spoken passages and generates answers directly from audio, without automatic speech recognition (ASR), so transcription errors do not propagate into retrieval or generation.

Contribution

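The paper presents a speech retrieval-augmented generation method that aligns speech embeddings with the embedding space of a text retriever, enabling direct audio retrieval and answer generation without ASR. The method outperforms cascaded systems when their transcripts have a high word error rate (WER).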

Findings

01. Direct speech retrieval matches the performance of the text-based baseline.

02. Speech retrieval outperforms cascaded systems whose ASR transcripts have a high word error rate (WER).

03. Using a speech language model improves answer generation from audio.

Abstract

One common approach for question answering over speech data is to first transcribe speech using automatic speech recognition (ASR) and then employ text-based retrieval-augmented generation (RAG) on the transcriptions. While this cascaded pipeline has proven effective in many practical settings, ASR errors can propagate to the retrieval and generation steps. To overcome this limitation, we introduce SpeechRAG, a novel framework designed for open-question answering over spoken data. Our proposed approach fine-tunes a pre-trained speech encoder into a speech adapter fed into a frozen large language model (LLM)-based retrieval model. By aligning the embedding spaces of text and speech, our speech retriever directly retrieves audio passages from text-based queries, leveraging the retrieval capacity of the frozen text retriever. Our retrieval experiments on spoken question answering datasets…
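As a rough illustration of the retrieval step the abstract describes, the sketch below ranks audio passages against a text query once both live in a shared embedding space. It is not the authors' implementation: the function names, the toy 3-dimensional vectors, the file names, and the choice of cosine similarity as the scoring function are all assumptions made for illustration. In a real system the query vector would come from the frozen text retriever and the passage vectors from the speech adapter.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (norm_a * norm_b)


def retrieve_audio_passages(
    query_embedding: Sequence[float],
    passage_embeddings: Dict[str, Sequence[float]],
    top_k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the top_k (passage_id, score) pairs, best match first.

    Hypothetical helper: assumes the text query and the audio passages
    were already embedded into the same vector space.
    """
    scored = [
        (passage_id, cosine_similarity(query_embedding, embedding))
        for passage_id, embedding in passage_embeddings.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy vectors standing in for speech-adapter outputs (assumed values).
passages = {
    "clip_weather.wav": [0.9, 0.1, 0.0],
    "clip_sports.wav": [0.0, 0.8, 0.6],
    "clip_finance.wav": [0.1, 0.0, 0.95],
}
# Toy vector standing in for the text retriever's query embedding.
query = [1.0, 0.2, 0.0]

results = retrieve_audio_passages(query, passages, top_k=2)
for passage_id, score in results:
    print(f"{passage_id}\t{score:.3f}")
```

Because no transcript is produced at any point, the ranking depends only on how well the speech embeddings are aligned with the text embedding space, which is the property the fine-tuned speech adapter is meant to provide.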

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems

Methods: Adapter