VoxRAG: A Step Toward Transcription-Free RAG Systems in Spoken Question Answering

Zackary Rackauckas; Julia Hirschberg

arXiv:2505.17326·cs.IR·August 8, 2025

TL;DR

VoxRAG is a speech-to-speech retrieval-augmented generation system that retrieves relevant audio segments directly from spoken queries without transcription. On a 50-query test set it reaches a Recall@10 of 0.60 for somewhat relevant segments and a mean answer relevance of 0.84 on a 0–2 scale.

Contribution

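This work presents VoxRAG, a modular system for spoken question answering that bypasses transcription. It combines silence-aware segmentation, speaker diarization, CLAP audio embeddings, and FAISS retrieval over L2-normalized vectors, and reports initial retrieval and answer-quality results.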

Findings

01 Recall@10 for very relevant segments is 0.34.

02 Recall@10 for somewhat relevant segments is 0.60, with nDCG@10 of 0.27.

03 Mean answer scores on a 0–2 scale are 0.84 for relevance, 0.58 for accuracy, 0.56 for completeness, and 0.46 for precision.
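The retrieval metrics reported in the findings can be made concrete with a short sketch. The function names, segment IDs, and graded gain values below are illustrative assumptions; the source does not say how the authors mapped the "very relevant" and "somewhat relevant" judgments to gains, so treat this as one common formulation and not as the paper's evaluation code.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the relevant items that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)

def ndcg_at_k(ranked_ids, gains, k=10):
    """nDCG@k with linear gains; `gains` maps segment id -> graded relevance.

    Assumption: gains such as 2 = very relevant, 1 = somewhat relevant.
    """
    dcg = sum(gains.get(seg, 0) / math.log2(rank + 2)
              for rank, seg in enumerate(ranked_ids[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical ranking for one query (segment ids are made up).
ranked = ["s3", "s7", "s1", "s9"]
judged = {"s1": 2, "s2": 1}

print(recall_at_k(ranked, judged.keys(), k=3))
print(ndcg_at_k(ranked, judged, k=3))
```

Averaging these per-query values over all queries in a test set would give corpus-level figures of the kind quoted above.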

Abstract

We introduce VoxRAG, a modular speech-to-speech retrieval-augmented generation system that bypasses transcription to retrieve semantically relevant audio segments directly from spoken queries. VoxRAG employs silence-aware segmentation, speaker diarization, CLAP audio embeddings, and FAISS retrieval using L2-normalized cosine similarity. We construct a 50-query test set recorded as spoken input by a native English speaker. Retrieval quality was evaluated using LLM-as-a-judge annotations. For very relevant segments, cosine similarity achieved a Recall@10 of 0.34. For somewhat relevant segments, Recall@10 rose to 0.60 and nDCG@10 to 0.27, highlighting strong topical alignment. Answer quality was judged on a 0–2 scale across relevance, accuracy, completeness, and precision, with mean scores of 0.84, 0.58, 0.56, and 0.46 respectively. While precision and retrieval quality remain key…
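The phrase "L2-normalized cosine similarity" rests on a simple identity: once vectors are scaled to unit length, their inner product equals their cosine similarity. The sketch below shows this with NumPy only. The toy 2-D vectors stand in for CLAP embeddings, and the function names are hypothetical. An inner-product index such as FAISS's `IndexFlatIP` over normalized vectors would give the same ranking, although the source does not say which index type the authors used.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit L2 norm (eps guards against zero vectors)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)

def cosine_top_k(query_vec, segment_vecs, k=10):
    """Return (indices, scores) of the k segments most similar to the query."""
    q = l2_normalize(np.asarray(query_vec, dtype=np.float32).reshape(1, -1))
    s = l2_normalize(np.asarray(segment_vecs, dtype=np.float32))
    scores = (s @ q.T).ravel()          # inner product of unit vectors = cosine
    k = min(k, scores.shape[0])
    order = np.argsort(-scores)[:k]     # highest similarity first
    return order, scores[order]

# Toy stand-ins for audio-segment embeddings and a spoken-query embedding.
segments = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 0.0]]
query = [3.0, 0.0]

idx, sims = cosine_top_k(query, segments, k=2)
print(idx, sims)
```

Normalizing first makes the ranking independent of vector magnitude: in the toy data, segment 0 ranks highest because it points in the same direction as the query, even though its length differs.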

Taxonomy

Methods: Attention Is All You Need · Linear Layer · Attention Dropout · Softmax · WordPiece · Weight Decay · Multi-Head Attention · Layer Normalization · Byte Pair Encoding