Improving Visual Reasoning with Iterative Evidence Refinement
Zeru Shi, Kai Mei, Yihao Quan, Dimitris N. Metaxas, Ruixiang Tang

TL;DR
This paper introduces SIEVE, an end-to-end framework that enhances visual reasoning by internally re-engaging visual evidence through learned embeddings, eliminating the need for external image operations and improving accuracy.
Contribution
SIEVE leverages internal representations for iterative evidence refinement in visual reasoning, avoiding external image manipulations and using reinforcement learning to control the revisiting process.
Findings
SIEVE improves performance by 8% on average across benchmarks.
It enables models to re-engage visual evidence without external image re-encoding.
Experiments show consistent gains in reasoning accuracy.
Abstract
Vision language models (VLMs) are increasingly capable of reasoning over images, but robust visual reasoning often requires re-grounding intermediate steps in the underlying visual evidence. Recent approaches typically rely on external image operations such as zooming or cropping to re-access fine-grained details during inference, which requires additional image re-encoding and can disrupt the reasoning trajectory. We argue that VLMs already provide strong internal signals for identifying and reusing visual evidence, and that these signals can be directly leveraged to support image-grounded reasoning. Motivated by this insight, we propose an end-to-end self-revisit framework, SIEVE, that trains models to re-engage image evidence through internal representations. SIEVE automatically extracts embeddings of salient image regions and injects them into the reasoning chain when additional…
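The self-revisit idea described in the abstract (pick out embeddings of salient image regions, then feed them back into the reasoning sequence) can be illustrated with a toy sketch. This is not the authors' implementation. The function names `select_salient_patches` and `inject_evidence`, the per-patch saliency scores, and the top-k selection rule are all assumptions made for illustration. The reinforcement-learning controller that decides when to revisit is not modeled here.

```python
from typing import List, Sequence

Vector = List[float]


def select_salient_patches(
    patch_embeddings: Sequence[Vector],
    saliency: Sequence[float],
    k: int,
) -> List[Vector]:
    """Return the embeddings of the k highest-saliency patches.

    Hypothetical helper: `saliency` stands in for whatever internal
    signal the model exposes (for example, attention mass per patch).
    Selected patches are returned in their original spatial order.
    """
    if len(patch_embeddings) != len(saliency):
        raise ValueError("need exactly one saliency score per patch")
    if k < 0:
        raise ValueError("k must be non-negative")
    # Rank patch indices by saliency, highest first, and keep the top k.
    ranked = sorted(range(len(saliency)), key=lambda i: saliency[i], reverse=True)
    keep = sorted(ranked[:k])
    return [list(patch_embeddings[i]) for i in keep]


def inject_evidence(
    reasoning_embeddings: Sequence[Vector],
    evidence: Sequence[Vector],
) -> List[Vector]:
    """Append already-computed evidence embeddings to the reasoning sequence.

    Assumption: injection is modeled as simple concatenation, so no image
    is cropped or re-encoded; the existing patch embeddings are reused.
    """
    return [list(v) for v in reasoning_embeddings] + [list(v) for v in evidence]


# Toy data: four 2-d patch embeddings and made-up saliency scores.
patches = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.2, 0.8]]
scores = [0.10, 0.70, 0.05, 0.15]
chain = [[9.0, 9.0]]  # embedding of a single reasoning token so far

evidence = select_salient_patches(patches, scores, k=2)
extended_chain = inject_evidence(chain, evidence)
```

In this toy setup the two highest-scoring patches (indices 1 and 3) are appended to the chain, so the sequence grows from one to three embeddings without touching the image again.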
Taxonomy
Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Generative Adversarial Networks and Image Synthesis
