IWP: Token Pruning as Implicit Weight Pruning in Large Vision Language Models
Dong-Jae Lee, Sunghyun Baek, and Junmo Kim

TL;DR
This paper introduces a novel, training-free token pruning method for large vision-language models based on attention's dual form, improving efficiency without retraining.
Contribution
It reformulates attention as an implicit linear layer, enabling optimal token subset selection through a new metric and a progressive selection algorithm.
Findings
Achieves better performance-efficiency trade-offs in experiments.
Provides a new perspective on existing token pruning methods.
Extends the dual form perspective to standard softmax attention.
Abstract
Large Vision Language Models (LVLMs) show impressive performance across image and video understanding tasks, yet their computational cost grows rapidly with the number of visual tokens. Existing token pruning methods mitigate this issue through empirical approaches while overlooking the internal mechanism of attention. In this paper, we propose a novel training-free token pruning framework grounded in the dual-form perspective of attention. We reformulate attention as an implicit linear layer whose weight matrix is the sum of rank-1 outer products, each generated by a single token's key-value pair. Token pruning thus reduces to selecting an optimal subset of these rank-1 updates that best approximates the original dual weight matrix. Extending this perspective to standard softmax attention in LVLMs, we derive a novel metric quantifying both a token's information magnitude and information…
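To make the dual-form idea in the abstract concrete, here is a minimal NumPy sketch for the simplest case, unnormalized linear attention. It checks that attending over n tokens gives the same output as applying a single weight matrix W = sum_i v_i k_i^T to the query, and then picks a token subset whose rank-1 terms approximate W. The selection rule (`greedy_select`, greedy reduction of the Frobenius error) is a hypothetical stand-in chosen for illustration. It is not the paper's metric or its progressive selection algorithm, and the softmax extension described in the abstract is not covered here. All names, shapes, and the random data are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d = 16, 8
K = rng.standard_normal((n_tokens, d))  # one key per token (illustrative data)
V = rng.standard_normal((n_tokens, d))  # one value per token
q = rng.standard_normal(d)              # a single query

# Primal form of unnormalized linear attention: sum_i (k_i . q) v_i
primal = (K @ q) @ V

# Dual form: an implicit linear layer W = sum_i v_i k_i^T applied to q
W = sum(np.outer(V[i], K[i]) for i in range(n_tokens))
dual = W @ q

assert np.allclose(primal, dual)


def greedy_select(K, V, budget):
    """Hypothetical selector (NOT the paper's metric): at each step keep the
    token whose rank-1 term v_i k_i^T leaves the smallest Frobenius residual
    against the full dual weight matrix."""
    residual = V.T @ K  # full dual weight matrix, equal to sum_i v_i k_i^T
    selected, remaining = [], list(range(len(K)))
    for _ in range(budget):
        errs = [np.linalg.norm(residual - np.outer(V[i], K[i])) for i in remaining]
        best = remaining[int(np.argmin(errs))]
        selected.append(best)
        remaining.remove(best)
        residual = residual - np.outer(V[best], K[best])
    return selected, float(np.linalg.norm(residual))


kept, err = greedy_select(K, V, budget=4)
print("kept tokens:", kept, "residual Frobenius norm:", err)
```

Under this view, pruning a token removes one rank-1 term from W, so the quality of a pruned token set can be read off as how closely the remaining terms reproduce W.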
