OmniRAG-Agent: Agentic Omnimodal Reasoning for Low-Resource Long Audio-Video Question Answering

Yifan Zhu; Xinyu Mu; Tao Feng; Zhonghong Ou; Yuning Gong; Haoran Luo

arXiv:2602.03707·cs.CL·March 31, 2026

TL;DR

OmniRAG-Agent introduces an agentic, retrieval-augmented approach for low-resource long audio-video question answering, enabling efficient reasoning over multiple modalities with improved accuracy.

Contribution

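The paper presents an agentic omnimodal QA framework for low-resource long audio-video reasoning. It combines three components: image-audio retrieval-augmented generation, a multi-turn planning and tool-calling agent loop, and optimization with group relative policy optimization.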

Findings

01. Outperforms prior methods in low-resource settings on multiple benchmarks.

02. Effective retrieval of relevant frames and audio snippets improves answer accuracy.

03. Ablation studies validate the contribution of each component.
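
The retrieval finding can be made concrete with a minimal sketch. It assumes frames and audio snippets are stored as (id, embedding) pairs in an external bank and ranked by cosine similarity against a query embedding. The function names, bank layout, and toy vectors are hypothetical illustrations, not the paper's implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Guard against zero vectors instead of dividing by zero.
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve_top_k(query_vec, bank, k=2):
    """Return the ids of the k bank entries most similar to the query.

    bank is a list of (item_id, embedding) pairs (hypothetical layout).
    """
    scored = [(cosine(query_vec, emb), item_id) for item_id, emb in bank]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item_id for _, item_id in scored[:k]]

# Toy banks with made-up 3-dimensional embeddings (assumption for illustration).
frame_bank = [
    ("frame_0010", [1.0, 0.0, 0.0]),
    ("frame_0450", [0.0, 1.0, 0.0]),
    ("frame_0900", [0.7, 0.7, 0.0]),
]
audio_bank = [
    ("audio_00:10", [0.9, 0.1, 0.0]),
    ("audio_07:30", [0.0, 0.2, 1.0]),
]

query = [1.0, 0.1, 0.0]
top_frames = retrieve_top_k(query, frame_bank, k=2)
top_audio = retrieve_top_k(query, audio_bank, k=1)
```

Only the few selected frames and snippets would then be passed to the model, rather than a dense encoding of the whole video.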

Abstract

Long-horizon omnimodal question answering requires reasoning over text, images, audio, and video. Despite recent progress on OmniLLMs, low-resource long audio-video QA still suffers from costly dense encoding, weak fine-grained retrieval, limited proactive planning, and the lack of clear end-to-end optimization. To address these issues, we propose OmniRAG-Agent, an agentic omnimodal QA method for budgeted long audio-video reasoning. It builds an image-audio retrieval-augmented generation module that lets an OmniLLM fetch short, relevant frames and audio snippets from external banks. It also uses an agent loop that plans, calls tools across turns, and merges retrieved evidence to answer complex queries. Finally, we apply group relative policy optimization (GRPO) to jointly improve tool use and answer quality over time. Experiments on OmniVideoBench, WorldSense, and Daily-Omni show…
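
The abstract names group relative policy optimization without giving details. As a minimal sketch of its core idea, the snippet below normalizes each sampled rollout's reward against the mean and standard deviation of its own group, which is the standard GRPO advantage. The reward values, the use of the population standard deviation, and the `eps` stabilizer are assumptions for illustration and do not come from the paper.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts for the same question.

    Each advantage is (reward - group mean) / (group std + eps).
    eps is an assumed stabilizer for groups where all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an implementation choice
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four rollouts of one question,
# e.g. 1.0 for a correct final answer and 0.0 otherwise.
rewards = [1.0, 0.0, 1.0, 0.0]
advantages = group_relative_advantages(rewards)
```

Rollouts that score above their group average receive a positive advantage and are reinforced, while those below are discouraged, so no separate value network is needed.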

Code & Models

Datasets

JackMuX3Y/OmniRAG-Agent
dataset · 565 downloads