TL;DR
ERASE is a two-stage vision token pruning framework that adapts its pruning strategy to image complexity. It cuts the number of vision tokens a vision-language model must process while preserving most of its accuracy: at 85% token pruning it retains 89.46% of accuracy on Qwen2.5-VL-7B.
Contribution
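ERASE introduces a two-stage token pruning method whose pruning strategy adapts to the complexity of the input image. Prior approaches, by contrast, rely predominantly on learned semantic features within the model to capture visual redundancy and lack such adaptive mechanisms.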
Findings
At 85% token pruning, ERASE retains 89.46% of accuracy on Qwen2.5-VL-7B.
ERASE outperforms previous methods, which retain only 78.1% accuracy at the same pruning ratio.
The framework effectively balances token reduction and model performance.
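To make the figures above concrete, here is a small arithmetic sketch. The token count of 1000 is a hypothetical input chosen for illustration, not a number from the paper, and the gap calculation assumes both retention figures are measured on the same basis.

```python
def tokens_kept(num_vision_tokens: int, pruning_percent: int) -> int:
    """Vision tokens remaining after pruning `pruning_percent` percent of them.

    Integer arithmetic avoids floating-point rounding surprises.
    """
    if not 0 <= pruning_percent < 100:
        raise ValueError("pruning_percent must be in [0, 100)")
    pruned = (num_vision_tokens * pruning_percent) // 100
    return num_vision_tokens - pruned


def retention_gap(ours_percent: float, baseline_percent: float) -> float:
    """Difference in retained accuracy, in percentage points."""
    return ours_percent - baseline_percent


# Hypothetical image that yields 1000 vision tokens (assumed value).
kept = tokens_kept(1000, 85)
# Figures reported in the findings above.
gap = retention_gap(89.46, 78.1)
print(f"tokens kept: {kept}, retention gap: {gap:.2f} points")
```

Under these assumptions, 85% pruning leaves 150 of 1000 tokens, and the reported gap over previous methods is about 11.36 percentage points.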
Abstract
Recent advancements in Vision-Language Models (VLMs) enable large language models (LLMs) to process high-resolution images, significantly improving real-world multimodal understanding. However, this capability introduces a large number of vision tokens, resulting in substantial computational overhead. To mitigate this issue, various vision token pruning methods have been proposed. Nevertheless, existing approaches predominantly rely on learned semantic features within the model to capture visual redundancy. Moreover, they lack adaptive mechanisms to adjust pruning strategies according to the complexity of the input image. In this paper, we propose ERASE, a two-stage vision token pruning framework that identifies and retains salient tokens through pruning strategies adaptive to image complexity. Experimental results demonstrate that ERASE significantly reduces vision tokens while…
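The general idea of complexity-adaptive pruning can be sketched as follows. This is a minimal illustration and not the ERASE algorithm: the abstract does not specify the two stages, the complexity measure, or the saliency criterion, so the variance-based complexity proxy, the L2-norm saliency score, and the `min_keep`, `max_keep`, and `scale` parameters are all assumptions made for this example.

```python
import math
from statistics import pvariance


def complexity_score(tokens):
    """Hypothetical complexity proxy: mean per-dimension variance across tokens."""
    dims = list(zip(*tokens))
    return sum(pvariance(d) for d in dims) / len(dims)


def adaptive_keep_count(num_tokens, complexity,
                        min_keep=0.05, max_keep=0.30, scale=1.0):
    """Map a complexity score to a token budget (all parameters are assumed).

    Simple images keep about `min_keep` of their tokens; the budget
    approaches `max_keep` as complexity grows.
    """
    squashed = complexity / (complexity + scale)  # in [0, 1)
    fraction = min_keep + (max_keep - min_keep) * squashed
    return max(1, round(num_tokens * fraction))


def l2_norm_scores(tokens):
    """Hypothetical saliency: the L2 norm of each token's feature vector."""
    return [math.sqrt(sum(x * x for x in t)) for t in tokens]


def prune_tokens(tokens, scores, keep):
    """Keep the `keep` highest-scoring tokens, preserving their original order."""
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    return [tokens[i] for i in sorted(ranked[:keep])]


# A uniform "flat" image versus a varied "busy" one (toy 2-D features).
flat = [[1.0, 1.0] for _ in range(20)]
busy = [[float(i), float(-i)] for i in range(20)]

for name, toks in (("flat", flat), ("busy", busy)):
    budget = adaptive_keep_count(len(toks), complexity_score(toks))
    kept = prune_tokens(toks, l2_norm_scores(toks), budget)
    print(name, "keeps", len(kept), "of", len(toks), "tokens")
```

In this toy setup the uniform image is pruned more aggressively than the varied one, which is the behavior an adaptive mechanism is meant to provide. A real system would score tokens with signals from the vision encoder or the LLM rather than raw feature norms.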
