AgilePruner: An Empirical Study of Attention and Diversity for Adaptive Visual Token Pruning in Large Vision-Language Models
Changwoo Baek, Jouwon Song, Sohyeon Kim, Kyeongbo Kong

TL;DR
This paper empirically analyzes attention-based and diversity-based visual token pruning in large vision-language models, revealing the strengths and limitations of each, and proposes adaptive strategies that improve performance and reduce hallucinations.
Contribution
It provides a comprehensive empirical analysis of pruning methods, introduces image-aware adjustments, and presents a simple adaptive pruning mechanism with improved results.
Findings
Diversity-oriented pruning preserves less feature diversity than intended.
Attention-based pruning is more effective on simple images.
Diversity-based methods handle complex images better.
Abstract
Large Vision-Language Models (LVLMs) have adopted visual token pruning strategies to mitigate substantial computational overhead incurred by extensive visual token sequences. While prior works primarily focus on either attention-based or diversity-based pruning methods, in-depth analysis of these approaches' characteristics and limitations remains largely unexplored. In this work, we conduct thorough empirical analysis using effective rank (erank) as a measure of feature diversity and attention score entropy to investigate visual token processing mechanisms and analyze the strengths and weaknesses of each approach. Our analysis reveals two insights: (1) Our erank-based quantitative analysis shows that many diversity-oriented pruning methods preserve substantially less feature diversity than intended; moreover, analysis using the CHAIR dataset reveals that the diversity they do retain is…
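The two diagnostics named in the abstract can be illustrated with a minimal sketch. It assumes the commonly used definition of effective rank (the exponential of the Shannon entropy of the normalized singular values) and treats attention entropy as the Shannon entropy of a normalized attention-score vector; the paper's exact formulation may differ. The function names `effective_rank` and `attention_entropy` are hypothetical and chosen only for illustration.

```python
import numpy as np


def effective_rank(features: np.ndarray, eps: float = 1e-12) -> float:
    """Effective rank of a (num_tokens, dim) feature matrix.

    Assumed definition: exp of the Shannon entropy of the singular
    values after normalizing them to sum to 1.
    """
    singular_values = np.linalg.svd(features, compute_uv=False)
    p = singular_values / (singular_values.sum() + eps)
    p = p[p > eps]  # drop numerically zero entries before taking logs
    return float(np.exp(-(p * np.log(p)).sum()))


def attention_entropy(scores: np.ndarray, eps: float = 1e-12) -> float:
    """Shannon entropy (in nats) of non-negative attention scores.

    The scores are renormalized to sum to 1, so either raw or
    softmax-normalized attention weights can be passed in.
    """
    p = scores / (scores.sum() + eps)
    p = p[p > eps]
    return float(-(p * np.log(p)).sum())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(576, 64))  # hypothetical visual token features
    attn = rng.random(576)               # hypothetical attention scores
    print("erank:", effective_rank(tokens))
    print("attention entropy:", attention_entropy(attn))
```

Under this definition, a feature matrix whose tokens all point in one direction has an effective rank close to 1, while a matrix with equally strong orthogonal directions has an effective rank equal to the number of those directions. Likewise, attention concentrated on a single token has entropy near 0, and uniform attention over n tokens has entropy log(n).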
Peer Reviews
Decision: ICLR 2026 Poster
1. Provides diverse experiments to validate its approach.
2. The paper is well-written and easy to follow.
1. I am not sure the findings of the paper are novel enough. [1] shows that "a satisfied pruning method should jointly take the token importance and diversity into account" to preserve both local (important) and global (diverse) information, which is what the paper proposes to do.
2. I think an important related work [2] is missing. It also determines the pruning threshold adaptively based on the input instance.
3. From my understanding, the number of visual tokens is fixed in the experiments. Why didn’t
Comprehensive empirical analysis and validation:
- It goes beyond performance reporting and explores why each method behaves differently, grounded in measurable concepts like attention entropy and effective rank (erank).
- Extensive experiments on nine multimodal benchmarks (VQAv2, GQA, TextVQA, ScienceQA, MMBench, etc.) and the CHAIR hallucination dataset demonstrate the robustness of the approach.
Insightful findings with practical relevance:
- The study reveals clear patterns: attention-base
Limited novelty in algorithmic design:
- The proposed adaptive pruning framework (AdaVTP) mainly combines two existing ideas, attention-based and diversity-based pruning, using an adaptive threshold determined by image complexity.
- While insightful, this combination strategy is heuristic rather than fundamentally new in algorithmic form.
Limited scope of model diversity:
- Most experiments are based on a single LVLM backbone (LLaVA-1.5-7B).
- The generalizability of the findings to other arc
1. The paper provides a thorough and systematic empirical comparison between attention-based and diversity-based pruning strategies, which has not been explored in depth before.
2. The adoption of effective rank and attention entropy as quantitative measures of image complexity is conceptually reasonable.
1. The proposed adaptive thresholding strategy is relatively simple and heuristic (a logarithmic mapping between erank and threshold). It does not provide strong methodological or theoretical innovation beyond straightforward empirical observations.
2. The proposed adaptive thresholding strategy introduces several hyperparameters, notably the scaling coefficients and other implementation choices. These parameters may influence pruning behavior, yet the paper does not provide a clear justificati
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Adversarial Robustness in Machine Learning
