TL;DR
R3G is a modular framework for vision-centric retrieval in VQA. It combines reasoning, retrieval, and reranking to select evidence images and use them when generating an answer, and it achieves state-of-the-art overall performance on MRAG-Bench.
Contribution
The paper introduces R3G, a Reasoning-Retrieval-Reranking framework that improves how evidence images are selected and integrated into the reasoning of multimodal models for visual question answering.
Findings
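R3G improves accuracy across six MLLM backbones and nine sub-scenarios on MRAG-Bench.
Ablations show that sufficiency-aware reranking and the reasoning steps are complementary: together they help the model both choose the right images and use them well.
R3G achieves state-of-the-art overall performance on MRAG-Bench.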
Abstract
Vision-centric retrieval for VQA requires retrieving images to supply missing visual cues and integrating them into the reasoning process. However, selecting the right images and integrating them effectively into the model's reasoning remains challenging. To address this challenge, we propose R3G, a modular Reasoning-Retrieval-Reranking framework. It first produces a brief reasoning plan that specifies the required visual cues, then adopts a two-stage strategy, with coarse retrieval followed by fine-grained reranking, to select evidence images. On MRAG-Bench, R3G improves accuracy across six MLLM backbones and nine sub-scenarios, achieving state-of-the-art overall performance. Ablations show that sufficiency-aware reranking and reasoning steps are complementary, helping the model both choose the right images and use them well. We release code and data at https://github.com/czh24/R3G.
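The plan-then-retrieve-then-rerank flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation (see the linked repository for that). Everything here is an assumption made for illustration: the `Candidate` type, the hand-written embeddings and cue sets, cosine similarity as the coarse retriever, and a simple cue-coverage fraction standing in for the sufficiency-aware reranking score. In the real system the reasoning plan would come from an MLLM; here it is given directly as a set of required cues.

```python
import math
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Candidate:
    """Hypothetical evidence image: an id, a toy embedding, and toy cue labels."""
    image_id: str
    embedding: List[float]
    cues: Set[str]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def coarse_retrieve(query: List[float], pool: List[Candidate], k: int) -> List[Candidate]:
    """Stage 1 (assumed): keep the k candidates most similar to the query embedding."""
    ranked = sorted(pool, key=lambda c: cosine(query, c.embedding), reverse=True)
    return ranked[:k]


def rerank(plan_cues: Set[str], shortlist: List[Candidate], top_n: int) -> List[Candidate]:
    """Stage 2 (assumed): order the shortlist by the fraction of planned cues each image covers."""
    def coverage(c: Candidate) -> float:
        return len(plan_cues & c.cues) / len(plan_cues) if plan_cues else 0.0
    return sorted(shortlist, key=coverage, reverse=True)[:top_n]


# Toy data: the plan says the answer needs a view of the wheel and the grille.
plan = {"wheel", "grille"}
query_embedding = [1.0, 0.0]
pool = [
    Candidate("a", [1.0, 0.0], {"wheel"}),
    Candidate("b", [0.9, 0.1], {"wheel", "grille"}),
    Candidate("c", [0.0, 1.0], {"wheel", "grille", "badge"}),
    Candidate("d", [0.8, 0.2], set()),
]

shortlist = coarse_retrieve(query_embedding, pool, k=3)   # drops "c" (dissimilar embedding)
selected = rerank(plan, shortlist, top_n=2)
selected_ids = [c.image_id for c in selected]             # ["b", "a"]
```

In this toy run the coarse stage ranks image "a" first by similarity, but the reranking stage promotes "b" because it covers both cues named in the plan. That is the kind of interaction between the reasoning plan and the reranker that the ablations describe as complementary.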
