Beyond Static Cropping: Layer-Adaptive Visual Localization and Decoding Enhancement

Zipeng Zhu; Zhanghao Hu; Qinglin Zhu; Yuxi Hong; Yijun Liu; Jingyong Su; Yulan He; Lin Gui

arXiv:2602.04304 · cs.CV · February 5, 2026

Open Access

TL;DR

This paper introduces a dynamic, layer-adaptive approach to visual grounding in large vision-language models, improving reasoning and localization by selecting task-specific layers during inference.

Contribution

The paper proposes VAQ, a metric for identifying task-relevant layers, and LASER, a training-free method that adaptively enhances visual localization and reasoning.

Findings

01. LASER improves VQA accuracy across diverse benchmarks.

02. Layer-wise sensitivity analysis shows that simple and complex tasks rely on different layers.

03. VAQ effectively identifies task-relevant layers for visual grounding.

Abstract

Large Vision-Language Models (LVLMs) have advanced rapidly by aligning visual patches with the text embedding space, but a fixed visual-token budget forces images to be resized to a uniform pretraining resolution, often erasing fine-grained details and causing hallucinations via over-reliance on language priors. Recent attention-guided enhancement (e.g., cropping or region-focused attention allocation) alleviates this, yet it commonly hinges on a static "magic layer" empirically chosen on simple recognition benchmarks and thus may not transfer to complex reasoning tasks. In contrast to this static assumption, we propose a dynamic perspective on visual grounding. Through a layer-wise sensitivity analysis, we demonstrate that visual grounding is a dynamic process: while simple object recognition tasks rely on middle layers, complex visual search and reasoning tasks require visual…
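
The contrast the abstract draws between a static "magic layer" and per-query layer selection can be illustrated with a small sketch. It is a minimal illustration under stated assumptions, not the paper's method: the abstract shown here does not define VAQ or LASER, so the code substitutes a hypothetical stand-in score (one minus normalized attention entropy), and the function names `concentration_scores`, `select_layer`, and `crop_box`, the `keep_mass` parameter, and the `(num_layers, H, W)` input layout are all assumptions made for illustration.

```python
import numpy as np


def concentration_scores(attn):
    """Score each layer by how peaked its attention over image patches is.

    attn: array of shape (num_layers, H, W) with positive attention weights.
    Returns one score per layer in [0, 1]; higher means more concentrated.
    ASSUMPTION: this entropy-based score is a stand-in, not the paper's VAQ.
    """
    num_layers = attn.shape[0]
    flat = attn.reshape(num_layers, -1).astype(np.float64)
    probs = flat / flat.sum(axis=1, keepdims=True)
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    return 1.0 - entropy / np.log(flat.shape[1])


def select_layer(attn):
    """Pick the layer per query instead of using one fixed layer index."""
    return int(np.argmax(concentration_scores(attn)))


def crop_box(attn_map, keep_mass=0.5):
    """Bounding box of the top patches that hold `keep_mass` of the attention.

    attn_map: array of shape (H, W). Returns (top, left, bottom, right)
    in patch coordinates, with bottom and right exclusive.
    """
    h, w = attn_map.shape
    probs = attn_map.ravel() / attn_map.sum()
    order = np.argsort(probs)[::-1]          # patches, most attended first
    cum = np.cumsum(probs[order])
    n_keep = int(np.searchsorted(cum, keep_mass)) + 1
    rows, cols = np.unravel_index(order[:n_keep], (h, w))
    return int(rows.min()), int(cols.min()), int(rows.max()) + 1, int(cols.max()) + 1
```

In use, one would collect the attention maps for a given question, call `select_layer` to choose a layer for that question, and pass that layer's map to `crop_box` to obtain the region to re-encode at higher resolution. A static approach corresponds to hard-coding the layer index instead of calling `select_layer`.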

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications