Multimodal Adaptive Retrieval Augmented Generation through Internal Representation Learning
Ruoshuang Du, Xin Sun, Qiang Liu, Bowen Song, Zhongqi Chen, Weiqiang Wang, Liang Wang

TL;DR
This paper introduces MMA-RAG, a multimodal retrieval-augmented generation method that assesses the model's confidence in its own internal knowledge to decide whether to use retrieved evidence, improving the accuracy and reliability of visual question answering.
Contribution
The paper presents MMA-RAG, a new adaptive retrieval framework that uses internal representations to decide when to incorporate external knowledge, enhancing multimodal VQA performance.
Findings
MMA-RAG yields significant performance improvements on three VQA datasets.
The model's internal representations are crucial for deciding when to retrieve.
MMA-RAG balances the use of external knowledge against inference robustness.
Abstract
Visual Question Answering systems face reliability issues due to hallucinations, where models generate answers misaligned with visual input or factual knowledge. While Retrieval Augmented Generation frameworks mitigate this issue by incorporating external knowledge, static retrieval often introduces irrelevant or conflicting content, particularly in visual RAG settings where visually similar but semantically incorrect evidence may be retrieved. To address this, we propose Multimodal Adaptive RAG (MMA-RAG), which dynamically assesses the confidence in the internal knowledge of the model to decide whether to incorporate the retrieved external information into the generation process. Central to MMA-RAG is a decision classifier trained through a layer-wise analysis, which leverages joint internal visual and textual representations to guide the use of reverse image retrieval. Experiments…
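The gating idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the logistic-regression probe, the toy 2+2-dimensional "visual" and "textual" vectors, the label convention (1 = the model answered correctly without retrieval), the 0.5 threshold, and the names train_probe, confidence, and should_retrieve are all assumptions made for illustration. In a real system the features would come from a chosen layer of the multimodal model; the abstract does not say which layer or which classifier is used.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_probe(features, labels, lr=0.5, epochs=200):
    """Fit a logistic-regression probe with batch gradient descent.

    features: list of joint (visual + textual) representation vectors.
    labels:   1 if the model's internal knowledge sufficed, else 0 (assumed convention).
    """
    dim = len(features[0])
    n = len(features)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def confidence(visual_repr, text_repr, probe):
    # Probability that internal knowledge suffices, from the joint representation.
    w, b = probe
    x = list(visual_repr) + list(text_repr)
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def should_retrieve(visual_repr, text_repr, probe, threshold=0.5):
    # Use retrieved evidence only when the probe's confidence is below the threshold.
    return confidence(visual_repr, text_repr, probe) < threshold


# Toy stand-in data: "confident" examples cluster near +1, the rest near -1.
rng = random.Random(0)
toy_features, toy_labels = [], []
for i in range(40):
    label = i % 2
    center = 1.0 if label == 1 else -1.0
    toy_features.append([rng.gauss(center, 0.5) for _ in range(4)])
    toy_labels.append(label)

probe = train_probe(toy_features, toy_labels)
print(should_retrieve([1.0, 1.0], [1.0, 1.0], probe))
print(should_retrieve([-1.0, -1.0], [-1.0, -1.0], probe))
```

In this sketch the classifier gates whether retrieved evidence is used at all, which mirrors the adaptive decision the abstract describes; how the paper selects layers and trains its classifier is not specified in the text above.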
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
