DeepScan: A Training-Free Framework for Visually Grounded Reasoning in Large Vision-Language Models
Yangfu Li, Hongjian Zhan, Jiawei Chen, Yuning Gong, Qi Liu, Yue Lu

TL;DR
DeepScan is a training-free framework that enhances large vision-language models' ability to perform visually grounded reasoning by localizing and aggregating evidence in a bottom-up manner, improving accuracy and interpretability.
Contribution
DeepScan introduces a training-free approach that combines Hierarchical Scanning, Refocusing, and Evidence-Enhanced Reasoning to improve visually grounded reasoning in LVLMs.
Findings
Achieves 90.6% accuracy on V* with Qwen2.5-VL-7B.
Significantly improves the performance of diverse LVLM architectures without extra training.
Enhances fine-grained visual understanding and interpretability.
Abstract
Humans can robustly localize visual evidence and provide grounded answers even in noisy environments by identifying critical cues and then relating them to the full context in a bottom-up manner. Inspired by this, we propose DeepScan, a training-free framework that combines Hierarchical Scanning, Refocusing, and Evidence-Enhanced Reasoning for visually grounded reasoning in Large Vision-Language Models (LVLMs). Unlike existing methods that pursue one-shot localization of complete evidence, Hierarchical Scanning performs local cue exploration and multi-scale evidence extraction to recover evidence in a bottom-up manner, effectively mitigating the impacts of distractive context. Refocusing then optimizes the localized evidence view through collaboration of LVLMs and visual experts. Finally, Evidence-Enhanced Reasoning aggregates multi-granular views via a hybrid evidence memory and yields…
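The bottom-up flow described in the abstract (find local cues, extract evidence at several scales, then pick and keep the best views) can be illustrated with a toy sketch. This is not the authors' implementation: every name here (`EvidenceView`, `scan_and_collect`, `make_overlap_scorer`) is hypothetical, the scorer is a hand-written stand-in for what an LVLM or visual expert would provide, and the real method's details are not given in the text above.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# (left, top, right, bottom); right and bottom are exclusive.
Box = Tuple[int, int, int, int]


@dataclass
class EvidenceView:
    box: Box
    score: float


def grid_patches(width: int, height: int, patch: int) -> List[Box]:
    """Tile the image into non-overlapping patches (edge patches may be smaller)."""
    return [
        (x, y, min(x + patch, width), min(y + patch, height))
        for y in range(0, height, patch)
        for x in range(0, width, patch)
    ]


def expand(box: Box, margin: int, width: int, height: int) -> Box:
    """Grow a box by `margin` pixels on every side, clipped to the image."""
    l, t, r, b = box
    return (max(l - margin, 0), max(t - margin, 0),
            min(r + margin, width), min(b + margin, height))


def make_overlap_scorer(target: Box) -> Callable[[Box], float]:
    """Toy scorer (assumption): fraction of a view's area covered by a known target."""
    tl, tt, tr, tb = target

    def score(box: Box) -> float:
        l, t, r, b = box
        area = (r - l) * (b - t)
        inter_w = max(0, min(r, tr) - max(l, tl))
        inter_h = max(0, min(b, tb) - max(t, tt))
        return (inter_w * inter_h) / area if area else 0.0

    return score


def scan_and_collect(
    width: int,
    height: int,
    score_fn: Callable[[Box], float],
    patch: int = 4,
    margins: Tuple[int, ...] = (0, 2, 4),
    threshold: float = 0.5,
) -> List[EvidenceView]:
    # Stand-in for Hierarchical Scanning: look for local cues patch by patch,
    # then grow each cue to several scales.
    cues = [p for p in grid_patches(width, height, patch) if score_fn(p) >= threshold]
    candidates = {expand(c, m, width, height) for c in cues for m in margins}
    # Stand-in for Refocusing: re-score every candidate view and rank them.
    views = [EvidenceView(b, score_fn(b)) for b in candidates]
    views.sort(key=lambda v: (-v.score, v.box))
    # Stand-in for the evidence memory: return all ranked multi-scale views.
    return views


views = scan_and_collect(16, 16, make_overlap_scorer((5, 5, 7, 7)), threshold=0.2)
print(views[0].box)  # (4, 4, 8, 8)
```

In this toy setup the tightest view around the target ranks first, while the wider views are kept as additional context, loosely mirroring the idea of aggregating multi-granular views before answering.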
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
