Energy-Driven Adaptive Visual Token Pruning for Efficient Vision-Language Models
Jialuo He, Huangxun Chen

TL;DR
E-AdaPrune is an energy-driven adaptive visual token pruning method that allocates tokens based on image information density, improving efficiency and performance in vision-language models without extra learnable parameters.
Contribution
The paper introduces E-AdaPrune, a spectral-energy-based adaptive pruning framework that sets the visual token budget per image without additional learnable parameters, improving model efficiency.
Findings
Achieves an average accuracy improvement of up to 0.6% across benchmarks under matched average token budgets.
Boosts performance on the MMVet reasoning benchmark by a relative 5.1%.
Maintains low latency of 8 ms per image with randomized SVD.
Abstract
Visual token reduction is critical for accelerating Vision-Language Models (VLMs), yet most existing approaches rely on a fixed budget shared across all inputs, overlooking the substantial variation in image information density. We propose E-AdaPrune, an energy-driven adaptive pruning framework that determines the token budget from the singular value spectrum of the visual feature space. By preserving a certain proportion of spectral energy, our method allocates more tokens to information-dense scenes while aggressively compressing redundant ones, without introducing additional learnable parameters. We evaluate E-AdaPrune on nine benchmarks and three VLM backbones: LLaVA-1.5-7B, LLaVA-1.5-13B, and LLaVA-NeXT-8B. Under matched average token budgets, E-AdaPrune consistently yields an average improvement of up to 0.6%, including a significant +5.1% relative boost on the MMVet reasoning…
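The budget rule described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `energy_token_budget`, the `energy_ratio` parameter, and the choice of squared singular values as "energy" are assumptions made for illustration, and it uses an exact SVD from NumPy rather than a randomized SVD. It only computes how many tokens to keep; how the retained tokens are then chosen is not covered here.

```python
import numpy as np


def energy_token_budget(features, energy_ratio=0.9, min_tokens=1):
    """Return the smallest k whose top-k singular values hold `energy_ratio`
    of the total spectral energy of `features` (shape: tokens x channels).

    Assumption: energy is measured as the sum of squared singular values.
    """
    # Singular values are returned in descending order.
    singular_values = np.linalg.svd(features, compute_uv=False)
    energy = singular_values ** 2
    total = energy.sum()
    if total == 0.0:
        return min_tokens  # degenerate all-zero input

    cumulative = np.cumsum(energy) / total
    # First index where the cumulative energy reaches the threshold.
    k = int(np.searchsorted(cumulative, energy_ratio, side="left")) + 1
    return max(min_tokens, min(k, len(singular_values)))


# Hypothetical inputs: a highly redundant feature map (every token is a
# scaled copy of one vector) and a dense one (independent random tokens).
rng = np.random.default_rng(0)
redundant = np.outer(np.ones(64), rng.normal(size=32))
dense = rng.normal(size=(64, 32))

budget_redundant = energy_token_budget(redundant)
budget_dense = energy_token_budget(dense)
```

Under these assumptions, the redundant input gets a much smaller budget than the dense one, which matches the stated behavior of compressing redundant scenes more aggressively while keeping more tokens for information-dense ones.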
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
