TL;DR
LearnPruner is a two-stage token pruning method for vision-language models. Guided by an analysis of attention behavior in vision encoders and LLMs, it cuts computational load while preserving approximately 95% of original performance with only 5.5% of vision tokens.
Contribution
The paper proposes LearnPruner, a novel token pruning framework that improves efficiency in vision-language models by analyzing and utilizing attention biases in both vision encoders and LLMs.
Findings
Preserves approximately 95% of original performance with only 5.5% of vision tokens.
Achieves 3.2× inference acceleration.
Finds that vision encoders suffer from attention sink, which weakens their focus on informative foreground regions, and examines attention bias in LLMs.
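To put the 5.5% figure in concrete terms, the short sketch below computes how many vision tokens would survive at a given keep ratio. The 576-token input length is an assumption chosen for illustration (it corresponds to a 24 x 24 patch grid) and is not stated in this summary; `kept_tokens` is a hypothetical helper, not part of LearnPruner.

```python
def kept_tokens(total_tokens: int, keep_ratio: float) -> int:
    """Return how many tokens remain at the given keep ratio.

    Rounds to the nearest integer and always keeps at least one token.
    """
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    return max(1, round(total_tokens * keep_ratio))


# Assumed input length: 576 vision tokens (24 x 24 patches); not from the source.
total = 576
kept = kept_tokens(total, 0.055)   # 5.5% budget reported in the findings
pruned = total - kept
print(f"kept {kept} of {total} tokens, pruned {pruned}")
```

Under that assumed input length, a 5.5% budget leaves a few dozen tokens for the LLM to process, which is where the reduction in compute comes from.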
Abstract
Vision-Language Models (VLMs) have recently demonstrated remarkable capabilities in visual understanding and reasoning, but they also impose significant computational burdens due to long visual sequence inputs. Recent works address this issue by pruning unimportant visual tokens, achieving substantial computational reduction while maintaining model performance. The core of token pruning lies in determining token importance, with current approaches primarily relying on attention scores from vision encoders or Large Language Models (LLMs). In this paper, we analyze the effectiveness of attention mechanisms in both vision encoders and LLMs. We find that vision encoders suffer from attention sink, leading to poor focus on informative foreground regions, while in LLMs, although prior studies have identified attention bias toward token positions, text-to-vision attention demonstrates…
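The attention-score baseline that the abstract attributes to current approaches can be sketched as a simple top-k selection. This is a minimal illustration of that generic idea, not LearnPruner's two-stage method; the function name `prune_by_attention`, the token labels, and the scores are all hypothetical.

```python
from typing import List, Sequence


def prune_by_attention(tokens: Sequence[str],
                       scores: Sequence[float],
                       keep: int) -> List[str]:
    """Keep the `keep` tokens with the highest importance scores.

    The surviving tokens are returned in their original order so that
    positional structure is preserved for the downstream model.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0 < keep <= len(tokens):
        raise ValueError("keep must be between 1 and len(tokens)")
    # Rank token indices by score, highest first, and take the top `keep`.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    selected = sorted(ranked[:keep])  # restore original token order
    return [tokens[i] for i in selected]


# Hypothetical patch labels and attention scores, for illustration only.
patches = ["sky", "cat", "grass", "ball"]
attention = [0.05, 0.50, 0.10, 0.35]
survivors = prune_by_attention(patches, attention, keep=2)
print(survivors)
```

A scheme like this is only as good as its scores. If attention sink pushes high scores onto uninformative tokens, as the abstract reports for vision encoders, the top-k selection will keep the wrong tokens.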