ResPrune: Text-Conditioned Subspace Reconstruction for Visual Token Pruning in Large Vision-Language Models

Xu Li; Yi Zheng; Yuxuan Liang; Zhe Liu; Xiaolei Chen; Haotian Chen; Rui Zhu; Xiangyang Xue

arXiv:2603.21105 · cs.LG · March 24, 2026

TL;DR

ResPrune is a training-free, cross-modal guided visual token pruning method for large vision-language models (LVLMs). It reduces inference computation by keeping a compact, informative, and text-relevant subset of visual tokens, selected via subspace reconstruction.

Contribution

ResPrune introduces a novel subspace reconstruction approach to token pruning that is conditioned on textual relevance and requires neither retraining nor architectural changes.
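As a rough illustration only, one way to state a text-conditioned subspace reconstruction objective is shown below. The notation ($x_i$, $w_i$, $X_S$, $P_S$, $k$) and the weighted least-squares form are assumptions made here for clarity, not the paper's stated formulation. Let $x_1,\dots,x_N \in \mathbb{R}^d$ be the visual tokens, $w_i \ge 0$ a text-relevance weight for token $i$, $k$ the token budget, and $X_S$ the matrix whose columns are the tokens indexed by $S$:

$$
\min_{S \subseteq \{1,\dots,N\},\ |S| = k} \; \sum_{i=1}^{N} w_i \,\bigl\| (I - P_S)\, x_i \bigr\|_2^2,
\qquad P_S = X_S \,(X_S^\top X_S)^{+} X_S^\top .
$$

Here $(I - P_S)\,x_i$ is the residual of token $i$ after projection onto the span of the kept tokens, and its squared norm is that token's residual energy. A common greedy heuristic for this kind of column-subset problem adds, at each step, the token with the largest weighted residual energy, $j^\star = \arg\max_{j \notin S} \, w_j \,\bigl\| (I - P_S)\, x_j \bigr\|_2^2$, and then sets $S \leftarrow S \cup \{j^\star\}$.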

Findings

1. Outperforms existing pruning methods on multiple benchmarks.
2. Significantly reduces computation, memory usage, and inference latency.
3. Compatible with various LVLM backbones.

Abstract

Large Vision-Language Models (LVLMs) rely on dense visual tokens to capture fine-grained visual information, but processing all these tokens incurs substantial computational and memory overhead during inference. To address this issue, we propose ResPrune, a training-free visual token pruning framework that enables efficient LVLM inference by selecting a compact yet informative subset of visual tokens. ResPrune formulates visual token pruning as a subspace reconstruction problem and employs a greedy subspace expansion strategy guided by residual energy, allowing it to preserve the geometric structure of the original visual token space. To further incorporate cross-modal alignment, the selection process is conditioned on textual relevance, encouraging the retention of tokens that are both informative and instruction-relevant. The proposed method is lightweight and model-agnostic, and can…
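The greedy subspace expansion guided by residual energy, conditioned on text relevance, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The relevance weight `exp(cos(x_i, t) / temperature)`, the score "weight × squared residual norm", and the function names `text_relevance_weights` and `greedy_residual_prune` are all hypothetical. Tokens and the text embedding are plain Python lists so the sketch runs without dependencies; a real LVLM pipeline would operate on batched tensors.

```python
import math
from typing import List, Sequence


def _dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def text_relevance_weights(tokens, text_emb, temperature=1.0):
    """Hypothetical relevance weight: exp(cosine(token, text) / temperature)."""
    t_norm = math.sqrt(_dot(text_emb, text_emb))
    weights = []
    for x in tokens:
        x_norm = math.sqrt(_dot(x, x))
        # Zero-norm vectors get cosine 0 to avoid division by zero.
        cos = _dot(x, text_emb) / (x_norm * t_norm) if x_norm > 0 and t_norm > 0 else 0.0
        weights.append(math.exp(cos / temperature))
    return weights


def greedy_residual_prune(tokens, weights, k, eps=1e-12):
    """Greedily keep up to k token indices by weighted residual energy (assumed score)."""
    residuals = [list(x) for x in tokens]  # empty subspace: residual = token itself
    selected: List[int] = []
    chosen = set()
    for _ in range(min(k, len(tokens))):
        # Score each unselected token by text weight * squared residual norm.
        best, best_score = -1, eps
        for i, r in enumerate(residuals):
            if i in chosen:
                continue
            score = weights[i] * _dot(r, r)
            if score > best_score:
                best, best_score = i, score
        if best < 0:  # all remaining tokens already lie in the kept subspace
            break
        selected.append(best)
        chosen.add(best)
        # Expand the subspace with the normalized residual of the chosen token.
        norm = math.sqrt(_dot(residuals[best], residuals[best]))
        q = [v / norm for v in residuals[best]]
        # Gram-Schmidt deflation: remove the q-component from every residual.
        for i, r in enumerate(residuals):
            c = _dot(r, q)
            residuals[i] = [v - c * u for v, u in zip(r, q)]
    return selected


tokens = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
w = text_relevance_weights(tokens, text_emb=[1.0, 0.0, 0.0])
print(greedy_residual_prune(tokens, w, k=3))  # → [0, 2, 3]
```

On this toy input, token 1 is nearly a duplicate of token 0. Once token 0 is kept, token 1's residual energy drops to 0.01, so the orthogonal tokens 2 and 3 are kept instead even though token 1 is highly text-relevant. This illustrates why a residual-based criterion can avoid the redundancy that a relevance-only ranking would retain.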

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning