Multimodal Adaptive Retrieval Augmented Generation through Internal Representation Learning

Ruoshuang Du; Xin Sun; Qiang Liu; Bowen Song; Zhongqi Chen; Weiqiang Wang; Liang Wang

arXiv:2603.00511·cs.CV·March 3, 2026

Open Access

TL;DR

This paper introduces MMA-RAG, a multimodal retrieval-augmented generation method that assesses the model's confidence in its own internal knowledge to decide whether retrieved external information should be used, with the aim of making visual question answering more accurate and reliable.

Contribution

The paper presents MMA-RAG, an adaptive retrieval framework for multimodal VQA. A decision classifier over the model's joint internal visual and textual representations determines when external knowledge is incorporated.

Findings

01. Significant performance improvements on three VQA datasets.

02. Internal representations are crucial for adaptive retrieval decisions.

03. MMA-RAG balances external knowledge use and inference robustness.

Abstract

Visual Question Answering systems face reliability issues due to hallucinations, where models generate answers misaligned with visual input or factual knowledge. While Retrieval Augmented Generation frameworks mitigate this issue by incorporating external knowledge, static retrieval often introduces irrelevant or conflicting content, particularly in visual RAG settings where visually similar but semantically incorrect evidence may be retrieved. To address this, we propose Multimodal Adaptive RAG (MMA-RAG), which dynamically assesses the confidence in the internal knowledge of the model to decide whether to incorporate the retrieved external information into the generation process. Central to MMA-RAG is a decision classifier trained through a layer-wise analysis, which leverages joint internal visual and textual representations to guide the use of reverse image retrieval. Experiments…
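The gating idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes the "decision classifier" is a simple logistic-regression probe over concatenated visual and textual hidden-state vectors, trained on labels that say whether the model answered correctly without retrieval. All names (`train_probe`, `should_retrieve`, `toy_example`), the toy data, and the 0.5 threshold are hypothetical. The layer-wise analysis used to choose which layer's representations to read is not shown.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_probe(features, labels, lr=0.5, epochs=200):
    """Fit a logistic-regression probe with batch gradient descent.

    features: list of joint (visual + textual) vectors (assumed inputs).
    labels:   1 if the model was correct without retrieval, else 0.
    """
    dim = len(features[0])
    w, b, n = [0.0] * dim, 0.0, len(features)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def should_retrieve(visual_h, text_h, w, b, threshold=0.5):
    """Return True when predicted internal confidence falls below threshold."""
    joint = list(visual_h) + list(text_h)
    confidence = sigmoid(sum(wi * xi for wi, xi in zip(w, joint)) + b)
    return confidence < threshold


# Toy stand-in for hidden states: "confident" examples cluster near +1,
# "unconfident" ones near -1. Real features would come from the model.
random.seed(0)


def toy_example(label):
    center = 1.0 if label else -1.0
    return [random.gauss(center, 0.3) for _ in range(4)]


labels = [i % 2 for i in range(40)]
features = [toy_example(y) for y in labels]
w, b = train_probe(features, labels)

# Retrieval is triggered only for the low-confidence representation.
print(should_retrieve([1.0, 1.0], [1.0, 1.0], w, b))
print(should_retrieve([-1.0, -1.0], [-1.0, -1.0], w, b))
```

In a real system the retrieval branch would call a reverse image retrieval step, and the threshold would be tuned on held-out data. Both are left out here because the abstract does not specify them.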

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning