Dynamic Token Reduction during Generation for Vision Language Models
Xiaoyu Liang, Chaofeng Guan, Jiaying Lu, Huiyao Chen, Huan Wang, Haoji Hu

TL;DR
This paper introduces DyRate, a dynamic token pruning method for Vision-Language Models that adjusts visual token compression during generation, reducing computation while preserving response quality.
Contribution
It proposes a novel dynamic pruning strategy that adapts compression rates during generation based on attention analysis, improving efficiency without sacrificing performance.
Findings
Reduces computational complexity of VLMs during generation.
Maintains response quality despite aggressive token pruning.
Demonstrates effectiveness across multiple multimodal tasks.
Abstract
Vision-Language Models (VLMs) have achieved notable success in multimodal tasks but face practical limitations due to the quadratic complexity of decoder attention mechanisms and autoregressive generation. Existing methods like FASTV and VTW have achieved notable results in reducing redundant visual tokens, but these approaches focus on pruning tokens in a single forward pass without systematically analyzing the redundancy of visual tokens throughout the entire generation process. In this paper, we introduce a dynamic pruning strategy tailored for VLMs, named Dynamic Rate (DyRate), which progressively adjusts the compression rate during generation. Our analysis of the distribution of attention reveals that the importance of visual tokens decreases throughout the generation process, inspiring us to adopt a more aggressive compression rate. By integrating a lightweight predictor based on…
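The idea of tightening visual-token compression as generation proceeds can be illustrated with a minimal sketch. This is not DyRate itself: the abstract mentions a lightweight predictor whose details are not given here, so the sketch substitutes a hypothetical linear schedule (`keep_ratio`) and a simple attention-ranked selection (`prune_visual_tokens`). All function names, parameters, and default values below are assumptions made for illustration.

```python
# Illustrative sketch only; not the authors' implementation.
# Assumption: the keep ratio follows a fixed linear schedule instead of
# being produced by the learned predictor mentioned in the abstract.

def keep_ratio(step, total_steps, start=0.5, end=0.1):
    """Hypothetical schedule: keep fewer visual tokens as decoding proceeds."""
    if total_steps <= 1:
        return start
    # Fraction of the generation completed, clamped to [0, 1].
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return start + (end - start) * frac


def prune_visual_tokens(attn_row, visual_idx, ratio):
    """Return the visual token positions to keep, ranked by attention received.

    attn_row:   attention weights from the current query token to each context token
    visual_idx: positions of the visual tokens within the context
    ratio:      fraction of visual tokens to keep (at least one is always kept)
    """
    n_keep = max(1, round(len(visual_idx) * ratio))
    ranked = sorted(visual_idx, key=lambda i: attn_row[i], reverse=True)
    # Restore positional order so the kept tokens stay in sequence order.
    return sorted(ranked[:n_keep])


if __name__ == "__main__":
    attn = [0.05, 0.30, 0.10, 0.02, 0.20, 0.33]  # toy attention row
    visual = [1, 2, 3, 4]                         # toy visual token positions
    for step in range(5):
        r = keep_ratio(step, total_steps=5)
        print(step, prune_visual_tokens(attn, visual, r))
```

In this toy setup the set of kept visual tokens shrinks as the step index grows, mirroring the observation that visual tokens receive less attention later in generation. A real system would also need to drop the corresponding key/value cache entries, which is omitted here.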
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques
Methods: Softmax · Attention Is All You Need · ADaptive gradient method with the OPTimal convergence rate · Pruning · Focus
