AdaptInfer: Adaptive Token Pruning for Vision-Language Model Inference with Dynamical Text Guidance
Weichen Zhang, Zhui Zhu, Ningbo Li, Shilong Tao, Kebin Liu, and Yunhao Liu

TL;DR
AdaptInfer introduces a dynamic, text-guided token pruning method for vision-language models that significantly reduces inference latency while maintaining high accuracy, outperforming static pruning approaches.
Contribution
The paper presents a novel adaptive token pruning framework that leverages dynamic text guidance and attention analysis to improve efficiency in vision-language model inference.
Findings
Reduces CUDA latency by 61.3%
Maintains 93.1% accuracy on LLaVA-1.5-7B
Outperforms state-of-the-art methods under same token budget
Abstract
Vision-language models (VLMs) have achieved impressive performance on multimodal reasoning tasks such as visual question answering and image captioning, but their inference cost remains a significant challenge due to the large number of vision tokens processed during the prefill stage. Existing pruning methods often rely directly on attention patterns or on static text-prompt guidance, failing to exploit the dynamic internal signals generated during inference. To address these issues, we propose AdaptInfer, a plug-and-play framework for adaptive vision token pruning in VLMs. First, we introduce a fine-grained, dynamic text-guided pruning mechanism that reuses layer-wise text-to-text attention maps to construct soft priors over text-token importance, allowing more informed scoring of vision tokens at each stage. Second, we perform an offline analysis of cross-modal…
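The first mechanism in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name `prune_vision_tokens`, the use of a single head-averaged attention matrix, the column-mean prior, and the top-k selection rule are all assumptions made for illustration. It shows one plausible way to turn a text-to-text attention block into a soft prior over text tokens, and then use that prior to weight text-to-vision attention when scoring vision tokens.

```python
import numpy as np


def prune_vision_tokens(attn, vision_idx, text_idx, keep):
    """Score vision tokens with a text-token prior and keep the top `keep`.

    Hypothetical helper, not the paper's code.
    attn: (seq, seq) head-averaged attention, rows = queries, cols = keys.
    vision_idx / text_idx: positions of vision and text tokens in the sequence.
    """
    attn = np.asarray(attn, dtype=float)
    vision_idx = np.asarray(vision_idx)
    text_idx = np.asarray(text_idx)

    # Text-to-text block: how much each text token is attended to by text queries.
    t2t = attn[np.ix_(text_idx, text_idx)]
    prior = t2t.mean(axis=0)
    total = prior.sum()
    # Assumption: fall back to a uniform prior if the block carries no mass.
    prior = prior / total if total > 0 else np.full(len(text_idx), 1.0 / len(text_idx))

    # Text-to-vision block, weighted by the soft prior over text tokens.
    t2v = attn[np.ix_(text_idx, vision_idx)]
    scores = prior @ t2v

    # Keep the highest-scoring vision tokens, restored to sequence order.
    top = np.argsort(-scores, kind="stable")[:keep]
    kept = np.sort(vision_idx[top])
    return kept, scores
```

In a layer-wise setting, a function like this would be called at each pruning layer with that layer's attention map, so the prior changes as inference proceeds; how the paper aggregates heads and normalizes the prior is not stated in the excerpt above.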
Peer Reviews
Decision · Submitted to ICLR 2026
- The method is both training-free and, critically, reuses attention maps that are already computed during the forward pass. This means it introduces almost no additional overhead (Sec 3.2.2), making it an extremely practical solution for real-world inference acceleration.
- The paper shows clear and consistent performance gains over its closest SOTA-level competitor (SparseVLM) across multiple benchmarks and token budgets (Table 1, Fig 3). The latency test (Table 3) confirms that these theoret
- The specific schedule (layers 1, 10, 20) is architecture-dependent (LLaMA-7B). While the method for finding the schedule is general, it requires a new, non-trivial offline analysis (running 1000+ samples through the model and performing change-point detection) for every new backbone.
- The main paper emphasizes the locations of pruning, but the amount to prune at each location is also a critical hyperparameter. Appendix G.3 mentions that these "pruning ratios" are also selected, following a me
- The proposed dynamic cross-attention-guided visual token pruning is enhanced by reusing text-token attention to put a higher focus on important text tokens.
- The proposed AdaptInfer leverages the observed shift in attention distribution to guide the choice of hyperparameters such as the insertion location of the pruning layers.
- The proposed method provides notable savings in inference latency.
- The proposed method is only verified on 5 different VQA datasets, whereas the baseline methods are usually evaluated on much more diverse benchmarks. For example, PyramidDrop is evaluated on 16 different benchmarks.
- Some baselines, including PyramidDrop and SparseVLM, are also evaluated on video VQA benchmarks. A more well-rounded comparison would be more convincing, especially since the improvement over SparseVLM is not consistent.
- The newer SOTA baseline VisionZip [1] is not included in
1. **Reasonable Design.** Using important text tokens to mine important vision tokens is reasonable.
1. **Weak Baseline.** This paper uses LLaVA-1.5 as a baseline, which is an open-source VLM released 2 years ago. It's too weak and completely outdated. For a training-free method, the evaluation should be conducted on more recent, competitive and widely used VLMs, such as Qwen2.5-VL.
2. **Lack Generalizability.** The consistent attention inflection point found in this paper is based on LLaVA-1.5. The author should check other base VLMs like Qwen2-VL and Qwen2.5-VL for a similar situation.
3. *
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Natural Language Processing Techniques
