A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for Accelerating Large VLMs

Wangbo Zhao; Yizeng Han; Jiasheng Tang; Zhikai Li; Yibing Song; Kai Wang; Zhangyang Wang; Yang You

arXiv:2412.03324·cs.CV·December 6, 2024

TL;DR

This paper introduces SGL, a training-free method that uses small vision-language models to guide token pruning in large models, significantly accelerating inference while maintaining high accuracy.

Contribution

The paper proposes a novel, training-free approach leveraging small VLM attention maps to efficiently prune tokens in large VLMs, with an early exit mechanism for improved speed and accuracy.
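The early exit idea can be sketched as a confidence check on the small VLM's own answer: if the small model is confident enough, its answer is returned and the large VLM is skipped. The sketch below is only illustrative. The confidence measure (geometric mean of the per-step maximum token probability), the function names, and the threshold of 0.9 are all assumptions made for this example and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def answer_confidence(step_logits):
    """Geometric mean of the top token probability over all generated steps.

    step_logits: list of per-step logit lists from the small VLM
    (hypothetical input format, assumed for this sketch).
    """
    log_sum = 0.0
    for logits in step_logits:
        log_sum += math.log(max(softmax(logits)))
    return math.exp(log_sum / len(step_logits))


def should_exit_early(step_logits, threshold=0.9):
    """Return True when the small VLM's answer is confident enough to keep.

    The threshold value is an assumption, not a number from the paper.
    """
    return answer_confidence(step_logits) >= threshold
```

When `should_exit_early` returns `False`, the large VLM would be run, with its visual tokens pruned under the small VLM's guidance.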

Findings

01 Achieves up to 91% token pruning with minimal performance loss

02 Global attention maps from small VLMs closely resemble those of large VLMs

03 Method outperforms existing pruning techniques across 11 benchmarks

Abstract

Vision-language models (VLMs) have shown remarkable success across various multi-modal tasks, yet large VLMs encounter significant efficiency challenges due to processing numerous visual tokens. A promising approach to accelerating large VLM inference is using partial information, such as attention maps from specific layers, to assess token importance and prune less essential tokens. However, our study reveals three key insights: (i) Partial attention information is insufficient for accurately identifying critical visual tokens, resulting in suboptimal performance, especially at low token retention ratios; (ii) Global attention information, such as the attention map aggregated across all layers, more effectively preserves essential tokens and maintains comparable performance under aggressive pruning. However, obtaining the attention maps from all layers requires a full inference pass, which…
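The pruning step described in the abstract can be sketched as follows: aggregate the attention each visual token receives across all layers and heads into one global score, then keep only the top-scoring tokens at a given retention ratio. This is a minimal sketch under stated assumptions. The nested-list input layout, the use of a plain mean as the aggregation, and the function names are hypothetical choices for illustration; in the method summarized above, the scores would come from the small VLM and the kept indices would be applied to the large VLM.

```python
def aggregate_attention(attn_per_layer):
    """Average the attention received by each visual token over layers and heads.

    attn_per_layer[l][h][i] is the attention mass that visual token i
    receives at layer l, head h (layout assumed for this sketch).
    """
    num_tokens = len(attn_per_layer[0][0])
    scores = [0.0] * num_tokens
    count = 0
    for layer in attn_per_layer:
        for head in layer:
            for i, a in enumerate(head):
                scores[i] += a
            count += 1
    return [s / count for s in scores]


def select_tokens(scores, retention_ratio):
    """Return indices of the highest-scoring tokens, in original order."""
    keep = max(1, round(len(scores) * retention_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


# Two layers, one head each, four visual tokens.
attn = [[[0.1, 0.6, 0.3, 0.0]], [[0.0, 0.5, 0.4, 0.1]]]
kept = select_tokens(aggregate_attention(attn), retention_ratio=0.5)
```

Here `kept` holds the indices of the two tokens with the highest global score. A retention ratio of 0.09 would correspond to the 91% pruning level mentioned in the findings.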

Code & Models

Repositories

NUS-HPC-AI-Lab/SGL (official; tagged jax)

Taxonomy

Topics: Particle Accelerators and Free-Electron Lasers · Particle accelerators and beam dynamics

Methods: Softmax · Attention Is All You Need · Early exiting using confidence measures · Pruning