TL;DR
ReAG is a novel multimodal retrieval-augmented model that improves knowledge-based visual question answering by combining multi-stage retrieval, a critic filter, and reinforcement learning for better reasoning and accuracy.
Contribution
The paper introduces ReAG, a reasoning-augmented multimodal RAG approach that improves knowledge-based VQA by combining coarse- and fine-grained retrieval, a critic model that filters irrelevant passages, and a multi-stage training strategy that uses reinforcement learning.
Findings
ReAG outperforms prior methods on the Encyclopedic-VQA and InfoSeek datasets.
ReAG improves answer accuracy and interpretability by grounding responses in retrieved evidence.
Multi-stage retrieval and critic filtering reduce the number of irrelevant passages in the context passed to the model.
Abstract
Multimodal Large Language Models (MLLMs) have shown impressive capabilities in jointly understanding text, images, and videos, often evaluated via Visual Question Answering (VQA). However, even state-of-the-art MLLMs struggle with domain-specific or knowledge-intensive queries, where relevant information is underrepresented in pre-training data. Knowledge-based VQA (KB-VQA) addresses this by retrieving external documents to condition answer generation, but current retrieval-augmented approaches suffer from low precision, noisy passages, and limited reasoning. To address this, we propose ReAG, a novel Reasoning-Augmented Multimodal RAG approach that combines coarse- and fine-grained retrieval with a critic model that filters irrelevant passages, ensuring high-quality additional context. The model follows a multi-stage training strategy leveraging reinforcement learning to enhance…
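The retrieve-then-filter flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`retrieve_and_filter`, `overlap`, `toy_critic`), the token-overlap scoring, and the 0.5 threshold are all assumptions made for illustration. The image side of the query is omitted entirely, and the learned retrievers and critic model are replaced by a toy text similarity.

```python
from typing import Callable, Dict, List


def tokens(text: str) -> set:
    """Lowercased word set; a toy stand-in for a learned embedding."""
    return set(text.lower().split())


def overlap(query: str, text: str) -> float:
    """Fraction of query tokens that also appear in the text."""
    q = tokens(query)
    return len(q & tokens(text)) / len(q) if q else 0.0


def toy_critic(query: str, passage: str) -> bool:
    """Hypothetical critic: accept a passage only if overlap is at least 0.5."""
    return overlap(query, passage) >= 0.5


def retrieve_and_filter(
    query: str,
    docs: List[Dict],
    critic: Callable[[str, str], bool],
    top_docs: int = 2,
    top_passages: int = 3,
) -> List[str]:
    # Coarse stage: rank whole documents by similarity to their titles.
    coarse = sorted(docs, key=lambda d: overlap(query, d["title"]), reverse=True)
    coarse = coarse[:top_docs]
    # Fine stage: rank individual passages from the shortlisted documents.
    passages = [p for d in coarse for p in d["passages"]]
    fine = sorted(passages, key=lambda p: overlap(query, p), reverse=True)
    fine = fine[:top_passages]
    # Critic stage: drop passages the critic judges irrelevant.
    return [p for p in fine if critic(query, p)]


# Made-up example data, for illustration only.
docs = [
    {"title": "Eiffel Tower",
     "passages": ["The Eiffel Tower was completed in 1889.",
                  "The tower is repainted every seven years."]},
    {"title": "Golden Gate Bridge",
     "passages": ["The bridge opened in 1937."]},
    {"title": "Tower Bridge",
     "passages": ["Tower Bridge crosses the River Thames."]},
]
kept = retrieve_and_filter("when was the eiffel tower completed", docs, toy_critic)
print(kept)
```

In this toy run the coarse stage shortlists two documents, the fine stage ranks their three passages, and the critic keeps only the passage that states the completion date. In a real system each stage would be a learned model, and the surviving passages would be supplied to the MLLM as additional context for answer generation.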
