AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition
Zichuan Lin, Yicheng Liu, Yang Yang, Lvfang Tao, Deheng Ye

TL;DR
AdaptVision introduces an adaptive visual token acquisition method for vision-language models, inspired by human active vision, which selectively gathers visual information to improve efficiency and accuracy in visual question answering tasks.
Contribution
The paper presents AdaptVision, a novel framework that enables VLMs to dynamically determine the necessary visual tokens using reinforcement learning and a decoupled turn policy optimization approach.
Findings
Achieves higher accuracy with fewer visual tokens compared to existing methods.
Demonstrates significant efficiency improvements across multiple VQA benchmarks.
Outperforms state-of-the-art efficient VLM approaches in experiments.
Abstract
Vision-Language Models (VLMs) have achieved remarkable success in visual question answering tasks, but their reliance on large numbers of visual tokens introduces significant computational overhead. While existing efficient VLM approaches reduce visual tokens through fixed-ratio compression, they operate passively and lack the ability to adapt to varying task requirements. This motivates a fundamental question: Can VLMs autonomously determine the minimum number of visual tokens required for each sample? Inspired by human active vision mechanisms, we introduce AdaptVision, an efficient VLM paradigm that enables adaptive visual token acquisition through a coarse-to-fine approach. Our model initially processes compressed visual tokens from low-resolution images and selectively acquires additional visual information by invoking a bounding box tool to crop key regions when necessary. We…
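The coarse-to-fine acquisition described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the 28-pixel patch size, the 4x downscale factor, the `Turn` record, and the `policy` callable (a stand-in for the VLM) are all hypothetical, and only image sizes are tracked rather than pixels.

```python
import math
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

BBox = Tuple[int, int, int, int]  # (left, top, right, bottom), full-resolution pixels
Size = Tuple[int, int]            # (width, height)


def count_tokens(width: int, height: int, patch: int = 28) -> int:
    """Assumed token count: one visual token per patch x patch tile."""
    return math.ceil(width / patch) * math.ceil(height / patch)


@dataclass
class Turn:
    """Hypothetical model output: either a final answer or a crop request."""
    answer: Optional[str] = None
    bbox: Optional[BBox] = None


def adaptive_answer(full_size: Size,
                    policy: Callable[[Size, Optional[Size]], Turn],
                    downscale: int = 4,
                    patch: int = 28) -> Tuple[Optional[str], int]:
    """Answer from a low-resolution view, cropping one region only if requested."""
    width, height = full_size
    coarse = (width // downscale, height // downscale)
    used = count_tokens(coarse[0], coarse[1], patch)

    turn = policy(coarse, None)  # first turn: coarse view only
    if turn.bbox is not None:
        left, top, right, bottom = turn.bbox
        # Clamp the requested box to the image bounds.
        left, top = max(0, left), max(0, top)
        right, bottom = min(width, right), min(height, bottom)
        crop = (right - left, bottom - top)
        used += count_tokens(crop[0], crop[1], patch)
        turn = policy(coarse, crop)  # second turn: coarse view plus crop
    return turn.answer, used


def direct_policy(coarse: Size, crop: Optional[Size]) -> Turn:
    """Stub policy that always answers from the coarse view."""
    return Turn(answer="stub answer")


def crop_once_policy(coarse: Size, crop: Optional[Size]) -> Turn:
    """Stub policy that requests one 280 x 280 crop before answering."""
    if crop is None:
        return Turn(bbox=(0, 0, 280, 280))
    return Turn(answer="stub answer")
```

Under these assumed settings, a 1120 x 1120 image costs 1600 tokens at full resolution, 100 tokens when `direct_policy` answers from the coarse view, and 200 tokens when `crop_once_policy` requests its crop. The actual token accounting in AdaptVision may differ.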
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
* The proposed method combines agentic visual tool use with token compression, which is interesting and inspiring.
* The proposed method improves performance while consuming fewer visual tokens, which could be effective and efficient in some circumstances.
* The proposed method is clear and easy to follow.
* An analysis of failure cases appears to be missing. In what situations does the proposed method fail?
* The discussion of the model appears insufficient, for example on the configurations of $\alpha$ and $\theta$.
* Although the work claims an adaptive visual acquisition method, it lacks a statistical analysis of the adaptivity and mainly reports averaged values.
* How does the method adapt the number of selected visual tokens, and how does performance vary, across different samples?
* Doe
- The true contribution of this paper is DTPO. While "active perception" or "zoom-in" mechanisms have been explored, the training of such a policy is non-trivial. The authors' analysis of GRPO's failings (ambiguous credit and imbalanced optimization) is insightful, and their solution (decoupling the objective and advantage) is elegant and well-justified.
- The data in Table 1 is impressive. Achieving 97.9% of the vanilla model's performance while using only 33% of the visual tokens is a SOTA-le
- The model is trained on VQA datasets where the task is often "find a specific detail." This "coarse-to-fine-crop" policy is perfectly suited for this. How would this policy fare on tasks requiring holistic scene understanding (e.g., "Describe the overall mood of the image") or complex, multi-object reasoning ("Are the person on the left and the person on the right related?") where a single crop is insufficient? The policy might be overfit to VQA-style problems.
- The two key reward component
- The coarse-to-fine visual cropping approach is simple and intuitive.
- The tool-call-based approach integrates easily with the existing LLM ecosystem without breaking standard architectures.
- Decoupled RL (DTPO) is a simple but effective idea: splitting the advantage into a "tool part" and an "answer part" is a tidy fix to balance multi-turn vs. single-turn answers.
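The reviewers' description of splitting the advantage into a tool part and an answer part could be illustrated with the sketch below. It is not the paper's DTPO formulation, which this page does not give. It assumes a group-relative normalization in the style of GRPO, applied separately to each part, and the `tool_reward` and `answer_reward` field names are hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, List, Optional, Tuple


def group_normalize(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantage: subtract the group mean, divide by the group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def decoupled_advantages(
    rollouts: List[Dict[str, Optional[float]]],
) -> Tuple[List[Optional[float]], List[float]]:
    """Return (tool_advantages, answer_advantages) for one group of rollouts.

    Assumption: each rollout carries an 'answer_reward', plus a 'tool_reward'
    that is None when the rollout answered directly without a tool call.
    """
    # The answer part is normalized over every rollout in the group.
    answer_adv = group_normalize([r["answer_reward"] for r in rollouts])

    # The tool part is normalized only over rollouts that called the tool,
    # so single-turn rollouts do not dilute the tool-turn statistics.
    tool_idx = [i for i, r in enumerate(rollouts) if r["tool_reward"] is not None]
    tool_adv: List[Optional[float]] = [None] * len(rollouts)
    if tool_idx:
        normed = group_normalize([rollouts[i]["tool_reward"] for i in tool_idx])
        for i, adv in zip(tool_idx, normed):
            tool_adv[i] = adv
    return tool_adv, answer_adv


example_group = [
    {"tool_reward": 1.0, "answer_reward": 1.0},   # useful crop, correct answer
    {"tool_reward": 0.0, "answer_reward": 0.0},   # poor crop, wrong answer
    {"tool_reward": None, "answer_reward": 1.0},  # answered without the tool
]
tool_adv, answer_adv = decoupled_advantages(example_group)
```

In this sketch, tokens from the tool-call turn would be weighted by `tool_adv` and tokens from the answer turn by `answer_adv`, so a rollout that answers directly contributes no signal to the tool part.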
- Extra latency due to the potential use of multiple inference turns.
- Learned cropping using RL has previously been explored in aesthetic/summarization contexts (e.g. [a]), and these prior approaches should have been benchmarked as off-the-shelf baselines. [a] Cropper: Vision-Language Model for Image Cropping through In-Context Learning. Seung Hyun Lee et al. CVPR 2025.
- Some recent works such as ZoomEye [b] have explored learnable zooming/cropping capabilities for MLLMs, these should be reported
1. The method proposed by the authors effectively reduces token usage while maintaining the original performance of the model.
1. Some design choices lack the necessary explanation.
2. The authors' design lacks the necessary motivation. Detailed in Questions.
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
