SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference

Yuan Zhang; Chun-Kai Fan; Junpeng Ma; Wenzhao Zheng; Tao Huang; Kuan Cheng; Denis Gudovskiy; Tomoyuki Okuno; Yohei Nakata; Kurt Keutzer; Shanghang Zhang

arXiv:2410.04417·cs.CV·June 4, 2025

TL;DR

SparseVLM introduces a training-free, text-guided method for visual token sparsification in vision-language models, significantly reducing computational costs while maintaining high accuracy in image and video understanding tasks.

Contribution

It proposes a text-guided token pruning strategy, with per-layer adaptive sparsification and token recycling, that improves VLM inference efficiency without any extra training.

Findings

01 Achieves 54% reduction in FLOPs on LLaVA

02 Decreases CUDA latency by 37%

03 Maintains 97% of original accuracy

Abstract

In vision-language models (VLMs), visual tokens usually incur significant computational overhead even though they carry sparser information than text tokens. To address this, most existing methods learn a network to prune redundant visual tokens using certain training data. In contrast, we propose a text-guided, training-free token optimization mechanism dubbed SparseVLM that eliminates the need for extra parameters or fine-tuning costs. Given that visual tokens complement text tokens in a VLM's linguistic reasoning, we select relevant text tokens to rate the significance of visual tokens using self-attention matrices and then prune visual tokens using the proposed strategy to maximize sparsity while retaining information. In particular, we introduce a rank-based strategy to adaptively determine the sparsification ratio for each layer, alongside a token recycling…
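The text-guided scoring idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy attention matrix, and the fixed `keep` budget are all assumptions made for illustration. In particular, the sketch uses a fixed number of kept tokens instead of the paper's rank-based adaptive ratio, and it omits token recycling entirely.

```python
from typing import List, Sequence


def score_visual_tokens(
    attn: Sequence[Sequence[float]],
    text_idx: Sequence[int],
    visual_idx: Sequence[int],
) -> List[float]:
    """Mean attention that the selected text tokens (rows) pay to each visual token (column)."""
    return [
        sum(attn[t][v] for t in text_idx) / len(text_idx)
        for v in visual_idx
    ]


def prune_visual_tokens(
    attn: Sequence[Sequence[float]],
    text_idx: Sequence[int],
    visual_idx: Sequence[int],
    keep: int,
) -> List[int]:
    """Return the `keep` highest-scoring visual token indices, in their original order.

    `keep` is a hypothetical fixed budget; the paper instead derives the
    sparsification ratio adaptively per layer.
    """
    scores = score_visual_tokens(attn, text_idx, visual_idx)
    ranked = sorted(range(len(visual_idx)), key=lambda i: scores[i], reverse=True)
    kept_positions = sorted(ranked[:keep])
    return [visual_idx[i] for i in kept_positions]


# Toy self-attention matrix: tokens 0-3 are visual, tokens 4-5 are text.
# Only the text rows matter here, so the visual rows are left at zero.
attn = [
    [0.0] * 6,
    [0.0] * 6,
    [0.0] * 6,
    [0.0] * 6,
    [0.10, 0.40, 0.05, 0.25, 0.10, 0.10],
    [0.05, 0.30, 0.05, 0.40, 0.10, 0.10],
]

kept = prune_visual_tokens(attn, text_idx=[4, 5], visual_idx=[0, 1, 2, 3], keep=2)
print(kept)  # [1, 3]
```

Visual tokens 1 and 3 receive the most attention from the two text tokens, so they survive pruning. In a real VLM the matrix would come from a decoder layer's self-attention, and the choice of which text tokens count as relevant is itself part of the method.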

Code & Models

Repositories

gumpest/sparsevlms (PyTorch, official)

Videos

SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning