Beyond Static Cropping: Layer-Adaptive Visual Localization and Decoding Enhancement
Zipeng Zhu, Zhanghao Hu, Qinglin Zhu, Yuxi Hong, Yijun Liu, Jingyong Su, Yulan He, Lin Gui

TL;DR
This paper introduces a dynamic, layer-adaptive approach to visual grounding in large vision-language models, improving reasoning and localization by selecting task-specific layers during inference.
Contribution
It proposes VAQ, a new metric for identifying relevant layers, and LASER, a training-free method that adaptively enhances visual localization and reasoning tasks.
Findings
LASER improves VQA accuracy across diverse benchmarks
Layer-wise sensitivity analysis reveals different layers for simple and complex tasks
VAQ effectively identifies task-relevant layers for visual grounding
Abstract
Large Vision-Language Models (LVLMs) have advanced rapidly by aligning visual patches with the text embedding space, but a fixed visual-token budget forces images to be resized to a uniform pretraining resolution, often erasing fine-grained details and causing hallucinations via over-reliance on language priors. Recent attention-guided enhancement (e.g., cropping or region-focused attention allocation) alleviates this, yet it commonly hinges on a static "magic layer" empirically chosen on simple recognition benchmarks and thus may not transfer to complex reasoning tasks. In contrast to this static assumption, we propose a dynamic perspective on visual grounding. Through a layer-wise sensitivity analysis, we demonstrate that visual grounding is a dynamic process: while simple object recognition tasks rely on middle layers, complex visual search and reasoning tasks require visual…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
