LongAudio-RAG: Event-Grounded Question Answering over Multi-Hour Long Audio
Naveen Vakada, Kartik Hegde, Arvind Krishna Sridhar, Yinyi Guo, and Erik Visser

TL;DR
LongAudio-RAG is a hybrid system that improves long-audio question answering by grounding language models in timestamped acoustic events, enabling precise, efficient, and scalable answers for multi-hour recordings.
Contribution
We introduce a novel hybrid framework that grounds LLM outputs in timestamped acoustic events stored in a database, enhancing long-audio question answering accuracy and efficiency.
Findings
Structured event retrieval outperforms vanilla RAG and text-to-SQL methods.
The system achieves low-latency event extraction on edge devices.
Our benchmark demonstrates effective question answering over multi-hour audio.
Abstract
Long-duration audio is increasingly common in industrial and consumer settings, yet reviewing multi-hour recordings is impractical, motivating systems that answer natural-language queries with precise temporal grounding and minimal hallucination. Existing audio-language models show promise, but long-audio question answering remains difficult due to context-length limits. We introduce LongAudio-RAG (LA-RAG), a hybrid framework that grounds Large Language Model outputs in retrieved, timestamped acoustic event detections rather than raw audio. Multi-hour streams are converted into structured event records stored in an SQL database, and at inference time the system resolves natural-language time references, classifies intent, retrieves only the relevant events, and generates answers using this constrained evidence. To evaluate performance, we construct a synthetic long-audio benchmark by…
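The retrieval step described in the abstract can be illustrated with a minimal sketch. It assumes a hypothetical `events` table (label, start and end time in seconds, detector confidence) and an already-resolved time window. The schema, the event labels, and the function names are illustrative assumptions, not the authors' implementation. Intent classification and LLM answer generation are left out.

```python
import sqlite3

# Hypothetical detections: (label, start_s, end_s, confidence).
SAMPLE_EVENTS = [
    ("dog_bark", 120.0, 123.5, 0.91),
    ("dog_bark", 3700.0, 3702.0, 0.76),
    ("glass_break", 4000.0, 4001.2, 0.88),
    ("dog_bark", 7300.0, 7301.0, 0.30),
]


def build_event_store(events):
    """Load timestamped event records into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events ("
        "label TEXT, start_s REAL, end_s REAL, confidence REAL)"
    )
    conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", events)
    conn.commit()
    return conn


def retrieve_events(conn, label, window_start_s, window_end_s, min_conf=0.5):
    """Return events with the given label that overlap the time window."""
    query = (
        "SELECT label, start_s, end_s, confidence FROM events "
        "WHERE label = ? AND start_s < ? AND end_s > ? AND confidence >= ? "
        "ORDER BY start_s"
    )
    params = (label, window_end_s, window_start_s, min_conf)
    return conn.execute(query, params).fetchall()


def hms(seconds):
    """Format a second offset as HH:MM:SS (fractions are truncated)."""
    minutes, secs = divmod(int(seconds), 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def format_evidence(rows):
    """Turn retrieved rows into short evidence lines for an LLM prompt."""
    return [
        f"{label} from {hms(start)} to {hms(end)} (confidence {conf:.2f})"
        for label, start, end, conf in rows
    ]


if __name__ == "__main__":
    conn = build_event_store(SAMPLE_EVENTS)
    # "Did a dog bark during the second hour?" resolved to 3600 s .. 7200 s.
    rows = retrieve_events(conn, "dog_bark", 3600, 7200)
    for line in format_evidence(rows):
        print(line)
```

Only the rows returned by the query would be passed to the language model, so the prompt size depends on the number of matching events, not on the length of the recording.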
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Topic Modeling
