LongAudio-RAG: Event-Grounded Question Answering over Multi-Hour Long Audio

Naveen Vakada, Kartik Hegde, Arvind Krishna Sridhar, Yinyi Guo, and Erik Visser

arXiv:2602.14612 · eess.AS · March 10, 2026

Open Access

TL;DR

LongAudio-RAG is a hybrid system that improves long-audio question answering by grounding language models in timestamped acoustic events, enabling precise, efficient, and scalable answers for multi-hour recordings.

Contribution

We introduce a novel hybrid framework that grounds LLM outputs in timestamped acoustic events stored in a database, enhancing long-audio question answering accuracy and efficiency.

Findings

01 Structured event retrieval outperforms vanilla RAG and text-to-SQL methods.

02 The system achieves low-latency event extraction on edge devices.

03 Our benchmark demonstrates effective question answering over multi-hour audio.

Abstract

Long-duration audio is increasingly common in industrial and consumer settings, yet reviewing multi-hour recordings is impractical, motivating systems that answer natural-language queries with precise temporal grounding and minimal hallucination. Existing audio-language models show promise, but long-audio question answering remains difficult due to context-length limits. We introduce LongAudio-RAG (LA-RAG), a hybrid framework that grounds Large Language Model outputs in retrieved, timestamped acoustic event detections rather than raw audio. Multi-hour streams are converted into structured event records stored in an SQL database, and at inference time the system resolves natural-language time references, classifies intent, retrieves only the relevant events, and generates answers using this constrained evidence. To evaluate performance, we construct a synthetic long-audio benchmark by…
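
The retrieval step described in the abstract (timestamped events stored in SQL, a natural-language time reference resolved to an interval, and only the matching events passed on as evidence) can be illustrated with a minimal sketch. This is not the authors' implementation: the `events` table schema, the function names `build_db`, `resolve_window`, and `retrieve_events`, the two-phrase time resolver, and the sample records are all assumptions made for illustration. Intent classification and answer generation are omitted.

```python
import sqlite3
from datetime import datetime, timedelta

# Hypothetical schema: one row per detected acoustic event.
SCHEMA = """
CREATE TABLE events (
    label    TEXT NOT NULL,  -- detected event class, e.g. 'dog_bark'
    start_ts TEXT NOT NULL,  -- ISO-8601 start time (second resolution)
    end_ts   TEXT NOT NULL   -- ISO-8601 end time (second resolution)
)
"""

def build_db(records):
    """Load (label, start_ts, end_ts) tuples into an in-memory SQLite DB."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", records)
    return conn

def resolve_window(phrase, recording_start):
    """Toy time resolver: maps a known phrase to a one-hour interval."""
    offsets = {"first hour": 0, "second hour": 1}  # assumed vocabulary
    start = recording_start + timedelta(hours=offsets[phrase])
    return start, start + timedelta(hours=1)

def retrieve_events(conn, label, start, end):
    """Return only events of `label` that start inside [start, end)."""
    rows = conn.execute(
        "SELECT label, start_ts, end_ts FROM events "
        "WHERE label = ? AND start_ts >= ? AND start_ts < ? "
        "ORDER BY start_ts",
        (label,
         start.isoformat(timespec="seconds"),
         end.isoformat(timespec="seconds")),
    )
    return rows.fetchall()

# Made-up sample data for a recording that starts at 09:00.
t0 = datetime(2026, 1, 1, 9, 0, 0)
records = [
    ("dog_bark", "2026-01-01T09:05:00", "2026-01-01T09:05:03"),
    ("glass_break", "2026-01-01T09:40:00", "2026-01-01T09:40:01"),
    ("dog_bark", "2026-01-01T10:15:00", "2026-01-01T10:15:02"),
]
conn = build_db(records)

# "Did a dog bark in the first hour?" -> constrain evidence to that window.
start, end = resolve_window("first hour", t0)
evidence = retrieve_events(conn, "dog_bark", start, end)
print(evidence)
```

In a design of this kind, the language model would see only the rows in `evidence` rather than raw audio or the full event log, which is what keeps the context short and the answer tied to explicit timestamps.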

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Topic Modeling