A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for Accelerating Large VLMs
Wangbo Zhao, Yizeng Han, Jiasheng Tang, Zhikai Li, Yibing Song, Kai Wang, Zhangyang Wang, Yang You

TL;DR
This paper introduces SGL, a training-free method that uses small vision-language models to guide token pruning in large models, significantly accelerating inference while maintaining high accuracy.
Contribution
The paper proposes a novel, training-free approach leveraging small VLM attention maps to efficiently prune tokens in large VLMs, with an early exit mechanism for improved speed and accuracy.
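As a rough illustration of the early-exit idea, the sketch below lets the small VLM answer on its own when it is confident enough and otherwise defers to the large VLM. The confidence measure (maximum softmax probability), the function names, and the 0.9 threshold are all assumptions made for this example; the summary above does not state how the paper's exit criterion is defined.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def should_exit_early(small_logits, threshold=0.9):
    """Decide whether the small model's answer can be returned directly.

    Assumption: confidence is the maximum softmax probability of the small
    model's prediction, and `threshold` is a hypothetical tuning knob.
    Returns (exit_early, confidence).
    """
    probs = softmax(np.asarray(small_logits, dtype=float))
    confidence = float(probs.max())
    return confidence >= threshold, confidence

# A peaked distribution exits early; a flat one defers to the large VLM.
print(should_exit_early([10.0, 0.0, 0.0]))
print(should_exit_early([0.0, 0.0, 0.0]))
```

A higher threshold sends more inputs to the large model, trading speed for accuracy.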
Findings
Achieves up to 91% token pruning with minimal performance loss
Global attention maps from small VLMs closely resemble those of large VLMs
Method outperforms existing pruning techniques across 11 benchmarks
Abstract
Vision-language models (VLMs) have shown remarkable success across various multi-modal tasks, yet large VLMs encounter significant efficiency challenges due to processing numerous visual tokens. A promising approach to accelerating large VLM inference is using partial information, such as attention maps from specific layers, to assess token importance and prune less essential tokens. However, our study reveals three key insights: (i) Partial attention information is insufficient for accurately identifying critical visual tokens, resulting in suboptimal performance, especially at low token retention ratios; (ii) Global attention information, such as the attention map aggregated across all layers, more effectively preserves essential tokens and maintains comparable performance under aggressive pruning. However, the attention maps from all layers require a full inference pass, which…
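The idea of ranking visual tokens by a global attention map can be sketched as follows. This is a minimal illustration, not the paper's implementation: the aggregation rule (a plain mean over layers, heads, and query positions), the tensor layout, and the function names are assumptions. A retention ratio of 0.09 corresponds to the 91% pruning figure quoted under Findings.

```python
import numpy as np

def global_token_scores(attn, visual_idx):
    """Score each visual token by the attention it receives.

    attn: array of shape (layers, heads, queries, keys), assumed to hold
          attention weights collected from a small VLM.
    visual_idx: positions of the visual tokens along the key axis.
    Assumption: aggregation is a simple mean over layers, heads, and queries.
    """
    per_key = attn.mean(axis=(0, 1)).mean(axis=0)  # shape: (keys,)
    return per_key[visual_idx]

def keep_top_tokens(scores, retention_ratio):
    """Return indices of the highest-scoring tokens, in original order."""
    k = max(1, int(round(len(scores) * retention_ratio)))
    top = np.argsort(scores)[-k:]
    return np.sort(top)

# Toy example: 2 layers, 2 heads, 4 queries, 4 keys (all keys are visual).
attn = np.zeros((2, 2, 4, 4))
attn[..., 0], attn[..., 1], attn[..., 2], attn[..., 3] = 0.10, 0.15, 0.70, 0.05
scores = global_token_scores(attn, np.arange(4))
print(keep_top_tokens(scores, retention_ratio=0.5))
```

The retained indices would then select which visual tokens are passed to the large VLM.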
Taxonomy
Topics: Particle Accelerators and Free-Electron Lasers · Particle accelerators and beam dynamics
Methods: Softmax · Attention Is All You Need · Early exiting using confidence measures · Pruning
