Beyond Attention Scores: SVD-Based Vision Token Pruning for Efficient Vision-Language Models

Yvon Apedo; Martyna Poreba; Michal Szczepanski; Samia Bouchafa

arXiv:2604.11530 · cs.CV · May 21, 2026

TL;DR

This paper introduces SVD-Prune, a training-free token pruning method for vision-language models. It applies SVD to the vision token feature matrix and keeps the top-k tokens ranked by statistical leverage score, reducing computation while maintaining performance.

Contribution

The paper presents SVD-Prune, a new SVD-based, training-free token pruning technique that outperforms existing methods, especially at high pruning ratios, by preserving globally significant tokens.

Findings

01 SVD-Prune outperforms prior methods at extreme token budgets.

02 It maintains strong performance with only 16 or 32 vision tokens.

03 The method is training-free and plug-and-play.

Abstract

Vision-Language Models (VLMs) have revolutionized multi-modal learning by jointly processing visual and textual information. Yet, they face significant challenges due to the high computational and memory demands of processing long sequences of vision tokens. Many existing methods rely on local heuristics, such as attention scores or token norms. However, these criteria suffer from positional bias and information dispersion, limiting their ability to preserve essential content at high pruning ratios and leading to performance degradation on visually detailed images. To address these issues, we propose SVD-Prune, a training-free, plug-and-play token pruning method based on Singular Value Decomposition. It decomposes the vision token feature matrix and selects the top-k tokens using statistical leverage scores, ensuring only tokens contributing most to the dominant global variance are…
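
The selection step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `leverage_score_prune`, the `rank` parameter (how many leading singular directions count as "dominant"), and the choice not to center the features are all assumptions made here for illustration, since the abstract does not specify them. The sketch uses the standard definition of a row leverage score, the squared norm of a token's row in the top-`rank` left singular vectors.

```python
import numpy as np


def leverage_score_prune(tokens, k, rank):
    """Keep the k tokens with the highest rank-restricted leverage scores.

    tokens: (n, d) array, one row per vision token.
    k:      number of tokens to keep.
    rank:   hypothetical parameter, number of leading singular
            directions used to compute the scores (an assumption,
            not a value taken from the paper).
    Returns (keep, scores): sorted indices of kept tokens, and the
    leverage score of every token.
    """
    n = tokens.shape[0]
    # Thin SVD: tokens = U @ diag(S) @ Vt, with U of shape (n, min(n, d)).
    U, S, Vt = np.linalg.svd(tokens, full_matrices=False)
    r = min(rank, U.shape[1])
    # Leverage score of token i: squared norm of row i of U[:, :r].
    scores = np.sum(U[:, :r] ** 2, axis=1)
    # Take the k largest scores, then sort indices to preserve token order.
    k = min(k, n)
    keep = np.sort(np.argsort(scores)[-k:])
    return keep, scores
```

For example, calling `leverage_score_prune(features, k=32, rank=8)` on a `(576, 1024)` feature matrix would return 32 token indices in their original order, so `features[keep]` can be passed on in place of the full sequence. Because the columns of `U` are orthonormal, the scores sum to `rank` whenever the matrix has at least that rank, which gives a quick sanity check.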
