SparseVILA: Decoupling Visual Sparsity for Efficient VLM Inference

Samir Khaki; Junxian Guo; Jiaming Tang; Shang Yang; Yukang Chen; Konstantinos N. Plataniotis; Yao Lu; Song Han; Zhijian Liu

arXiv:2510.17777 · cs.CV · October 21, 2025

TL;DR

SparseVILA decouples visual sparsity across inference stages: it prunes redundant visual tokens during prefill and retrieves only query-relevant tokens during decoding, accelerating large vision-language models by up to 2.6x end to end on long-video tasks while maintaining accuracy.

Contribution

The paper proposes a training-free, architecture-agnostic framework that decouples prefill-stage visual token pruning from decode-stage token retrieval, enabling faster inference without sacrificing model performance.

Findings

01 Achieves up to 4.0x faster prefilling

02 Achieves 2.5x faster decoding

03 Overall 2.6x end-to-end speedup on long-video tasks

Abstract

Vision Language Models (VLMs) have rapidly advanced in integrating visual and textual reasoning, powering applications across high-resolution image understanding, long-video analysis, and multi-turn conversation. However, their scalability remains limited by the growing number of visual tokens that dominate inference latency. We present SparseVILA, a new paradigm for efficient VLM inference that decouples visual sparsity across the prefilling and decoding stages. SparseVILA distributes sparsity across stages by pruning redundant visual tokens during prefill and retrieving only query-relevant tokens during decoding. This decoupled design matches leading prefill pruning methods while preserving multi-turn fidelity by retaining most of the visual cache so that query-aware tokens can be retrieved at each conversation round. Built on an AWQ-optimized inference pipeline, SparseVILA achieves…
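
The two-stage idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `salience` scores, the dot-product similarity, and the function names `prune_at_prefill` and `retrieve_at_decode` are all hypothetical stand-ins chosen for illustration. The sketch only shows the decoupling itself: a query-agnostic pruning pass that keeps most of the visual cache, followed by a per-round, query-aware selection from that retained cache.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def top_k_indices(scores, k):
    """Indices of the k largest scores, returned in original token order."""
    k = max(0, min(k, len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

def prune_at_prefill(visual_tokens, salience, keep_ratio):
    """Query-agnostic pruning (assumed scoring): keep the most salient tokens."""
    k = math.ceil(keep_ratio * len(visual_tokens))
    kept = top_k_indices(salience, k)
    return [visual_tokens[i] for i in kept], kept

def retrieve_at_decode(cache, query, budget):
    """Query-aware retrieval (assumed scoring): pick the cached tokens most
    similar to the current query. The cache itself is left untouched."""
    scores = [dot(tok, query) for tok in cache]
    picked = top_k_indices(scores, budget)
    return [cache[i] for i in picked], picked

# Toy data: eight 2-D "visual tokens" with made-up salience scores.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9],
          [0.7, 0.3], [0.3, 0.7], [0.5, 0.5], [0.0, 0.0]]
salience = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.2, 0.1]

# Prefill: drop only the least salient tokens, retaining most of the cache.
cache, kept = prune_at_prefill(tokens, salience, keep_ratio=0.75)

# Decoding: each conversation round retrieves a different subset of the cache.
_, round1 = retrieve_at_decode(cache, query=[1.0, 0.0], budget=2)
_, round2 = retrieve_at_decode(cache, query=[0.0, 1.0], budget=2)
print(kept, round1, round2)
```

Because the prefill pass here is mild and the retrieval pass never mutates the cache, a second query with a different focus can still reach tokens the first query ignored, which is the multi-turn property the abstract describes.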

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Generative Adversarial Networks and Image Synthesis