Towards Lossless Ultimate Vision Token Compression for VLMs
Dehua Zheng, Mouxiao Huang, Borui Jiang, Hailin Hu, Xinghao Chen

TL;DR
This paper introduces LUVC, a framework for lossless visual token compression in vision-language models (VLMs). It combines iterative token merging with spectrum pruning to speed up inference while maintaining accuracy.
Contribution
The paper proposes a lossless token compression method that combines iterative merging in the visual encoder with spectrum pruning in the LLM. The method improves efficiency without retraining and generalizes across various VLMs.
Findings
Achieves 2x inference speedup with negligible accuracy loss.
Compatible with modern attention mechanisms like FlashAttention.
Enables immediate deployment without additional training.
Abstract
Visual language models encounter challenges in computational efficiency and latency, primarily due to the substantial redundancy in the token representations of high-resolution images and videos. Current attention/similarity-based compression algorithms suffer from either position bias or class imbalance, leading to significant accuracy degradation. They also fail to generalize to shallow LLM layers, which exhibit weaker cross-modal interactions. To address this, we extend token compression to the visual encoder through an effective iterative merging scheme that is orthogonal in spatial axes to accelerate the computation across the entire VLM. Furthermore, we integrate a spectrum pruning unit into the LLM through an attention/similarity-free low-pass filter, which gradually prunes redundant visual tokens and is fully compatible with modern FlashAttention. On this basis, we propose Lossless…
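The two ideas named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`merge_along_axis`, `iterative_merge`, `lowpass_prune`), the pairwise averaging rule, and the FFT-based truncation are all assumptions chosen for illustration, and the paper's actual merging scheme and filter may differ. The sketch only shows how tokens can be reduced one spatial axis at a time, and how a low-pass filter can shorten a token sequence without reading any attention or similarity scores.

```python
import numpy as np


def merge_along_axis(tokens, axis):
    """Average adjacent token pairs along one spatial axis.

    tokens: array of shape (H, W, D). axis 0 = height, axis 1 = width.
    Pairwise averaging is an assumed merge rule, not the paper's.
    """
    h, w, d = tokens.shape
    if axis == 0:
        assert h % 2 == 0, "height must be even to merge pairs"
        return tokens.reshape(h // 2, 2, w, d).mean(axis=1)
    assert w % 2 == 0, "width must be even to merge pairs"
    return tokens.reshape(h, w // 2, 2, d).mean(axis=2)


def iterative_merge(tokens, steps):
    """Halve the token grid once per step, alternating width and height."""
    for i in range(steps):
        tokens = merge_along_axis(tokens, axis=1 if i % 2 == 0 else 0)
    return tokens


def lowpass_prune(seq, keep):
    """Shorten a token sequence from N to `keep` tokens with a low-pass filter.

    seq: array of shape (N, D). The filter keeps only the lowest
    frequencies along the token axis and resamples the result, so it
    uses no attention weights or pairwise similarities.
    """
    n, _ = seq.shape
    assert 0 < keep <= n, "keep must be between 1 and the sequence length"
    spectrum = np.fft.rfft(seq, axis=0)        # (N // 2 + 1, D)
    low = spectrum[: keep // 2 + 1]            # drop high frequencies
    # irfft divides by `keep` instead of N, so rescale to keep amplitudes.
    return np.fft.irfft(low, n=keep, axis=0) * (keep / n)


if __name__ == "__main__":
    grid = np.random.default_rng(0).normal(size=(8, 8, 16))
    merged = iterative_merge(grid, steps=2)    # 8x8 grid -> 4x4 grid
    flat = merged.reshape(-1, merged.shape[-1])
    pruned = lowpass_prune(flat, keep=8)       # 16 tokens -> 8 tokens
    print(merged.shape, pruned.shape)
```

Because `lowpass_prune` depends only on the token values, a step of this kind does not need the attention matrix, which fused attention kernels do not expose. That is the property the abstract points to when it describes the filter as attention/similarity-free.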
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Advanced Data Compression Techniques
