mR$^2$AG: Multimodal Retrieval-Reflection-Augmented Generation for Knowledge-Based VQA
Tao Zhang, Ziqi Zhang, Zongyang Ma, Yuxin Chen, Zhongang Qi, Chunfeng Yuan, Bing Li, Junfu Pu, Yuxuan Zhao, Zehua Xie, Jin Ma, Ying Shan, Weiming Hu

TL;DR
The paper introduces mR$^2$AG, a novel framework that enhances multimodal large language models for knowledge-based VQA by enabling adaptive retrieval and evidence localization through reflection operations, improving accuracy and efficiency.
Contribution
The paper proposes mR$^2$AG, a generalized retrieval-reflection-augmented generation framework that addresses limitations of existing methods, with adaptive retrieval, evidence localization, and easy integration into existing models.
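The control flow implied by "adaptive retrieval" and "evidence localization" can be illustrated with a minimal sketch. This is not the authors' implementation: the function `answer_with_reflection`, the `Answer` type, and the four callables are hypothetical names introduced here, the toy stubs stand in for decisions a real MLLM would make, and the image input is omitted for brevity.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Answer:
    text: str
    evidence: Optional[str]  # passage used as support, or None
    retrieved: bool          # whether the retriever was called


def answer_with_reflection(
    question: str,
    needs_retrieval: Callable[[str], bool],       # hypothetical reflection step 1
    retrieve: Callable[[str], List[str]],         # hypothetical retriever
    is_evidence: Callable[[str, str], bool],      # hypothetical reflection step 2
    generate: Callable[[str, Optional[str]], str],
) -> Answer:
    # Adaptive retrieval: skip the retriever when no external knowledge is needed.
    if not needs_retrieval(question):
        return Answer(generate(question, None), None, False)

    # Evidence localization: answer from the first passage judged to support the query.
    for passage in retrieve(question):
        if is_evidence(question, passage):
            return Answer(generate(question, passage), passage, True)

    # No supporting passage found: fall back to answering without evidence.
    return Answer(generate(question, None), None, True)


# Toy stand-ins (assumptions for illustration only).
KB = [
    "The Eiffel Tower is in Paris.",
    "The Eiffel Tower was completed in 1889.",
]


def toy_needs_retrieval(question: str) -> bool:
    return "when" in question.lower()


def toy_retrieve(question: str) -> List[str]:
    return KB


def toy_is_evidence(question: str, passage: str) -> bool:
    return any(ch.isdigit() for ch in passage)


def toy_generate(question: str, evidence: Optional[str]) -> str:
    return evidence if evidence is not None else "answered from model knowledge"


if __name__ == "__main__":
    for q in ["When was this tower completed?", "What color is the sky in this image?"]:
        print(q, "->", answer_with_reflection(
            q, toy_needs_retrieval, toy_retrieve, toy_is_evidence, toy_generate))
```

In this sketch the retrieval decision and the evidence check are plain predicates passed in as arguments, so no separate filtering module is wired into the pipeline; how the actual framework makes these two decisions is described in the paper itself.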
Findings
Outperforms state-of-the-art MLLMs on INFOSEEK and Encyclopedic-VQA.
Reduces unnecessary retrieval calls and model complexity.
Maintains strong performance across various visual tasks.
Abstract
Advanced Multimodal Large Language Models (MLLMs) struggle with recent Knowledge-based VQA tasks, such as INFOSEEK and Encyclopedic-VQA, due to their limited and frozen knowledge scope, often leading to ambiguous and inaccurate responses. Thus, multimodal Retrieval-Augmented Generation (mRAG) is naturally introduced to provide MLLMs with comprehensive and up-to-date knowledge, effectively expanding the knowledge scope. However, current mRAG methods have inherent drawbacks, including: 1) Performing retrieval even when external knowledge is not needed. 2) Lacking identification of the evidence that supports the query. 3) Increasing model complexity due to additional information filtering modules or rules. To address these shortcomings, we propose a novel generalized framework called multimodal Retrieval-Reflection-Augmented Generation…
Taxonomy
Topics: Image Retrieval and Classification Techniques · Web Data Mining and Analysis
Methods: Balanced Selection
