PlanRAG-Audio: Planning and Retrieval Augmented Generation for Long-form Audio Understanding
Masao Someki, Chien-yu Huang, Siddhant Arora, Samuele Cornell, Markus Müller, Nathan Susanj, Rupak V Swaminathan, Grant P Strimel, Jing Liu, Shinji Watanabe

TL;DR
PlanRAG-Audio introduces a planning-based retrieval-augmented framework that enhances long-form audio understanding by selectively retrieving relevant information, improving reasoning accuracy and scalability.
Contribution
The paper presents a planning and retrieval approach that lets large audio language models (LALMs) handle long recordings efficiently by retrieving only the modalities and temporal segments relevant to a given query.
Findings
Improves reasoning accuracy on complex audio queries.
Stabilizes performance as audio duration increases.
Reduces input length for large language models.
Abstract
Long-form audio understanding poses significant challenges for large audio language models (LALMs) due to the extreme length of audio sequences and the need to reason over heterogeneous acoustic cues distributed over time, such as speech content, speaker identity, emotion, and sound events. To address these challenges, we propose PlanRAG-Audio, a planning-based retrieval-augmented generation framework for scalable long-form audio understanding. Rather than having audio LALMs process entire recordings directly, PlanRAG-Audio explicitly plans which modalities and temporal spans are required for a given query, and retrieves only query-relevant information from a structured text and audio database. This retrieval planning enables effective reasoning over complex, cross-domain audio queries while substantially reducing the input length passed to the large language models.…
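To make the plan-then-retrieve idea in the abstract concrete, here is a minimal Python sketch. It is not the authors' implementation: the `Entry` and `Plan` types, the `CUES` keyword table, and the functions `plan_query`, `retrieve`, and `build_context` are all hypothetical names introduced for illustration. The keyword-matching planner stands in for whatever planning model the paper uses, the temporal span is supplied by the caller rather than inferred from the query, and the database holds only text annotations.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """One timestamped annotation in a toy structured database (assumed schema)."""
    modality: str  # "speech", "speaker", "emotion", or "sound_event"
    start: float   # seconds
    end: float     # seconds
    text: str


@dataclass(frozen=True)
class Plan:
    """Which modalities and which time span a query needs."""
    modalities: frozenset
    span: tuple  # (start, end) in seconds


# Hypothetical keyword cues per modality; a real planner would likely be a model.
CUES = {
    "speech": {"say", "said", "talk", "mention"},
    "speaker": {"who", "speaker"},
    "emotion": {"emotion", "angry", "happy", "feel"},
    "sound_event": {"sound", "noise", "music"},
}


def plan_query(query, span):
    """Pick modalities whose cue words appear in the query; fall back to all."""
    words = set(re.findall(r"[a-z]+", query.lower()))
    chosen = {m for m, cues in CUES.items() if words & cues}
    return Plan(frozenset(chosen or CUES), span)


def retrieve(db, plan):
    """Keep entries of a planned modality that overlap the planned span."""
    lo, hi = plan.span
    return [e for e in db
            if e.modality in plan.modalities and e.start < hi and e.end > lo]


def build_context(entries):
    """Render retrieved entries as a compact text context for a language model."""
    ordered = sorted(entries, key=lambda e: e.start)
    return "\n".join(
        f"[{e.start:.0f}-{e.end:.0f}s] {e.modality}: {e.text}" for e in ordered
    )


db = [
    Entry("speech", 5, 12, "Let's review the budget."),
    Entry("speaker", 5, 12, "spk_A"),
    Entry("emotion", 40, 48, "angry"),
    Entry("speaker", 40, 48, "spk_B"),
    Entry("sound_event", 50, 52, "door slam"),
    Entry("emotion", 300, 310, "happy"),
]

plan = plan_query("Which speaker was angry during the first minute?", (0, 60))
context = build_context(retrieve(db, plan))
print(context)
```

In this toy run, only the speaker and emotion entries inside the first 60 seconds reach the context; the transcript, the sound event, and the later emotion entry are left out, which is the sense in which planning shortens the input passed downstream.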
