DeepScan: A Training-Free Framework for Visually Grounded Reasoning in Large Vision-Language Models

Yangfu Li; Hongjian Zhan; Jiawei Chen; Yuning Gong; Qi Liu; Yue Lu

arXiv:2603.03857·cs.CV·March 5, 2026

Open Access

TL;DR

DeepScan is a training-free framework that enhances large vision-language models' ability to perform visually grounded reasoning by localizing and aggregating evidence in a bottom-up manner, improving accuracy and interpretability.

Contribution

DeepScan introduces a training-free approach that combines Hierarchical Scanning, Refocusing, and Evidence-Enhanced Reasoning to improve visually grounded reasoning in LVLMs.

Findings

01 Achieves 90.6% accuracy on V* with Qwen2.5-VL-7B.

02 Improves performance across diverse LVLM architectures without additional training.

03 Enhances fine-grained visual understanding and interpretability.

Abstract

Humans can robustly localize visual evidence and provide grounded answers even in noisy environments by identifying critical cues and then relating them to the full context in a bottom-up manner. Inspired by this, we propose DeepScan, a training-free framework that combines Hierarchical Scanning, Refocusing, and Evidence-Enhanced Reasoning for visually grounded reasoning in Large Vision-Language Models (LVLMs). Unlike existing methods that pursue one-shot localization of complete evidence, Hierarchical Scanning performs local cue exploration and multi-scale evidence extraction to recover evidence in a bottom-up manner, effectively mitigating the impacts of distractive context. Refocusing then optimizes the localized evidence view through collaboration of LVLMs and visual experts. Finally, Evidence-Enhanced Reasoning aggregates multi-granular views via a hybrid evidence memory and yields…
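
The three stages named in the abstract can be pictured with a small, purely illustrative Python sketch. It is not the authors' implementation: the patch-grid scan, the margin-based refocusing, the `EvidenceMemory` class, and every function and parameter name below are assumptions made for illustration. In the paper, Refocusing relies on LVLMs and visual experts, and the final answer comes from an LVLM; here a caller-supplied `score_cue` function stands in for any model, and the sketch stops at collecting multi-granular views.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass
class EvidenceMemory:
    """Hypothetical stand-in for the abstract's 'hybrid evidence memory'."""
    views: List[Tuple[str, Box]] = field(default_factory=list)

    def add(self, granularity: str, box: Box) -> None:
        self.views.append((granularity, box))


def hierarchical_scan(width: int, height: int, patch: int,
                      score_cue: Callable[[Box], float],
                      threshold: float) -> List[Box]:
    """Bottom-up step (assumed form): score local patches, keep likely cues."""
    cues = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            box = (left, top, min(left + patch, width), min(top + patch, height))
            if score_cue(box) >= threshold:
                cues.append(box)
    return cues


def merge_boxes(boxes: List[Box]) -> Box:
    """Smallest box enclosing all cue patches: a crude evidence view."""
    lefts, tops, rights, bottoms = zip(*boxes)
    return (min(lefts), min(tops), max(rights), max(bottoms))


def refocus(box: Box, width: int, height: int, margin: int) -> Box:
    """Assumed refocusing: pad the view by a margin, clamped to the image."""
    left, top, right, bottom = box
    return (max(0, left - margin), max(0, top - margin),
            min(width, right + margin), min(height, bottom + margin))


def deepscan_sketch(width: int, height: int,
                    score_cue: Callable[[Box], float],
                    patch: int = 100, threshold: float = 0.5,
                    margin: int = 20) -> EvidenceMemory:
    """Collect cue-level, evidence-level, and global views for later reasoning."""
    memory = EvidenceMemory()
    cues = hierarchical_scan(width, height, patch, score_cue, threshold)
    for cue in cues:
        memory.add("cue", cue)
    if cues:
        memory.add("evidence", refocus(merge_boxes(cues), width, height, margin))
    memory.add("global", (0, 0, width, height))  # full context is always kept
    return memory
```

As a usage example, a toy `score_cue` that returns 1.0 only for patches containing a known target point would produce one cue view, one padded evidence view, and the global view; a real system would pass these views, together with the question, to an LVLM.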

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications