Multi-RAG: A Multimodal Retrieval-Augmented Generation System for Adaptive Video Understanding

Mingyang Mao; Mariela M. Perez-Cabarcas; Utteja Kallakuri; Nicholas R. Waytowich; Xiaomin Lin; Tinoosh Mohsenin

arXiv:2505.23990 · cs.AI · November 13, 2025

Open Access

TL;DR

Multi-RAG is a multimodal system that enhances adaptive human assistance by integrating video, audio, and text understanding, outperforming existing models in efficiency and accuracy for dynamic, real-world scenarios.

Contribution

The paper introduces Multi-RAG, a novel multimodal retrieval-augmented generation system that improves situational understanding and reduces cognitive load in human-assistance tasks.
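As a rough illustration of the general retrieval-augmented pattern named here, the sketch below retrieves the most relevant chunk from a mixed pool of video, audio, and text items and assembles a prompt from it. It is not the paper's pipeline: the `Chunk` class, the `retrieve` and `build_prompt` helpers, and the sample data are all hypothetical, it assumes each modality has already been converted to a text description, and a simple bag-of-words cosine similarity stands in for whatever retrieval method the authors actually use. The generation step (passing the prompt to a language model) is left out.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical container: one retrievable item from any modality."""
    modality: str  # "video", "audio", or "text"
    content: str   # assumed text form (e.g. a frame caption or a transcript)


def tokenize(text):
    # Lowercase and split into alphanumeric tokens.
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    # Cosine similarity between two token-count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, chunks, k=2):
    # Rank all chunks, regardless of modality, by similarity to the query.
    q = Counter(tokenize(query))
    ranked = sorted(
        chunks,
        key=lambda c: cosine(q, Counter(tokenize(c.content))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, retrieved):
    # Tag each retrieved chunk with its source modality in the context.
    context = "\n".join(f"[{c.modality}] {c.content}" for c in retrieved)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


# Made-up sample data, for illustration only.
chunks = [
    Chunk("video", "a person places a red mug on the kitchen counter"),
    Chunk("audio", "speaker says the meeting starts at noon"),
    Chunk("text", "manual: press the green button to start the machine"),
]
query = "where is the red mug"
top = retrieve(query, chunks, k=1)
prompt = build_prompt(query, top)
print(prompt)
```

Keeping every modality in one text-indexed pool makes the retrieval step modality-agnostic; a real system would more likely use learned embeddings rather than token counts.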

Findings

01 Outperforms existing Video-LLMs and LVLMs in benchmark tests.

02 Uses fewer resources and less input data than comparable models.

03 Demonstrates potential for practical human-robot assistance in real-world contexts.

Abstract

To effectively engage in human society, the ability to adapt, filter information, and make informed decisions in ever-changing situations is critical. As robots and intelligent agents become more integrated into human life, there is a growing opportunity, and need, to offload the cognitive burden on humans to these systems, particularly in dynamic, information-rich scenarios. To fill this critical need, we present Multi-RAG, a multimodal retrieval-augmented generation system designed to provide adaptive assistance to humans in information-intensive circumstances. Our system aims to improve situational understanding and reduce cognitive load by integrating and reasoning over multi-source information streams, including video, audio, and text. As an enabling step toward long-term human-robot partnerships, Multi-RAG explores how multimodal information understanding can serve as a foundation…

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Video Analysis and Summarization