Dynamic Token Reduction during Generation for Vision Language Models

Xiaoyu Liang; Chaofeng Guan; Jiaying Lu; Huiyao Chen; Huan Wang; Haoji Hu

arXiv:2501.14204·cs.CV·January 27, 2025

TL;DR

This paper introduces DyRate, a dynamic token pruning method for Vision-Language Models that adjusts visual token compression during generation, reducing computation while preserving response quality.

Contribution

It proposes a novel dynamic pruning strategy that adapts compression rates during generation based on attention analysis, improving efficiency without sacrificing performance.

Findings

01. Reduces computational complexity of VLMs during generation.

02. Maintains response quality despite aggressive token pruning.

03. Demonstrates effectiveness across multiple multimodal tasks.

Abstract

Vision-Language Models (VLMs) have achieved notable success in multimodal tasks but face practical limitations due to the quadratic complexity of decoder attention mechanisms and autoregressive generation. Existing methods such as FastV and VTW reduce redundant visual tokens effectively, but they prune tokens in a single forward pass and do not systematically analyze the redundancy of visual tokens throughout the entire generation process. In this paper, we introduce a dynamic pruning strategy tailored for VLMs, named Dynamic Rate (DyRate), which progressively adjusts the compression rate during generation. Our analysis of the attention distribution reveals that the importance of visual tokens decreases throughout the generation process, which motivates a more aggressive compression rate as generation proceeds. By integrating a lightweight predictor based on…
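To make the idea of a compression rate that tightens as generation proceeds concrete, here is a minimal Python sketch. It is not the DyRate implementation: the paper's lightweight predictor is not reproduced, and the linear schedule, the function names `keep_ratio` and `prune_visual_tokens`, the `start`/`end` defaults, and the toy attention scores are all assumptions made for illustration.

```python
def keep_ratio(step, total_steps, start=0.5, end=0.1):
    """Fraction of visual tokens kept at a decoding step.

    ASSUMPTION: a fixed linear schedule from `start` to `end`. The paper
    uses a learned predictor instead; this only mimics the trend that
    visual tokens matter less later in generation.
    """
    if total_steps <= 1:
        return start
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return start + (end - start) * frac


def prune_visual_tokens(attn_scores, ratio):
    """Return the indices (in original order) of the visual tokens to keep.

    `attn_scores` holds one hypothetical importance score per visual token,
    e.g. the attention mass it receives from the current text token.
    At least one token is always kept.
    """
    n_keep = max(1, round(len(attn_scores) * ratio))
    ranked = sorted(range(len(attn_scores)),
                    key=lambda i: attn_scores[i], reverse=True)
    return sorted(ranked[:n_keep])


if __name__ == "__main__":
    # Toy scores for 10 visual tokens (made up for this example).
    scores = [0.30, 0.05, 0.20, 0.01, 0.15, 0.02, 0.10, 0.04, 0.08, 0.05]
    for step in (0, 5, 9):
        ratio = keep_ratio(step, total_steps=10)
        print(step, round(ratio, 2), prune_visual_tokens(scores, ratio))
```

Running the script shows the kept set shrinking as the step index grows. In a real VLM the pruned indices would be used to drop entries from the visual part of the key/value cache, which is where the compute savings would come from.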

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques

Methods: Softmax · Attention Is All You Need · ADaptive gradient method with the OPTimal convergence rate · Pruning · Focus