GRIP-VLM: Group-Relative Importance Pruning for Efficient Vision-Language Models

Mingzhe Huang; Weijun Wang; Xin Ding; Liang Mi; Hao Wen; Yuanchun Li; Lichen Pang; Shansong Yang; Yunxin Liu; Ting Cao

arXiv:2605.13375·cs.CV·May 14, 2026

TL;DR

GRIP-VLM is a reinforcement learning framework for pruning visual tokens in vision-language models. It optimizes the discrete token selection directly, aiming to speed up inference without sacrificing accuracy.

Contribution

The paper presents an RL-driven pruning method that formulates visual token selection as a Markov Decision Process, avoiding the continuous-gradient relaxations that earlier training-aware pruning methods rely on.

Findings

01 Achieves up to 15% inference speedup while maintaining accuracy.

02 Outperforms heuristic and supervised baselines across benchmarks.

03 Demonstrates superior Pareto efficiency in model compression.

Abstract

In Vision-Language Models (VLMs), processing a massive number of visual tokens incurs prohibitive computational overhead. While recent training-aware pruning methods attempt to selectively discard redundant tokens, they largely rely on continuous-gradient relaxations. However, visual token pruning is inherently a discrete, non-convex combinatorial problem; consequently, these continuous approximations frequently trap the optimization in sub-optimal local minima, especially under aggressive compression budgets. To overcome this fundamental bottleneck, we propose GRIP-VLM, a Group-Relative Importance Pruning framework driven by Reinforcement Learning. Rather than relying on smooth-gradient assumptions, GRIP-VLM formulates pruning as a Markov Decision Process, employing a Group Relative Policy Optimization (GRPO) paradigm anchored by supervised warm-up to directly explore the discrete…
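
The group-relative idea named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the uniform keep probabilities, the `toy_reward` function, the saliency scores, and the token budget are all hypothetical stand-ins chosen for illustration. The sketch samples a group of discrete keep/drop masks for the same input, scores each mask, and normalizes every reward against the mean and standard deviation of its own group, which is the usual way GRPO derives advantages without a learned value function.

```python
import math
import random


def sample_keep_mask(keep_probs, rng):
    """Sample one discrete keep/drop decision per visual token (1 = keep)."""
    return [1 if rng.random() < p else 0 for p in keep_probs]


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and std of its own group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def toy_reward(mask, token_scores, budget):
    """Hypothetical reward: saliency retained, minus a penalty per token over budget."""
    kept = sum(mask)
    retained = sum(s for m, s in zip(mask, token_scores) if m)
    return retained - max(0, kept - budget)


rng = random.Random(0)
token_scores = [0.9, 0.1, 0.8, 0.05, 0.7, 0.2]  # assumed per-token saliency
keep_probs = [0.5] * len(token_scores)          # assumed (untrained) policy output

# One group: several candidate masks sampled for the same input.
group = [sample_keep_mask(keep_probs, rng) for _ in range(8)]
rewards = [toy_reward(m, token_scores, budget=3) for m in group]
advantages = group_relative_advantages(rewards)

for mask, adv in zip(group, advantages):
    print(mask, round(adv, 3))
```

Masks that score above the group average receive positive advantages and would be reinforced by a policy-gradient update, while below-average masks are discouraged. Because the masks are sampled rather than relaxed, no gradient has to flow through the discrete selection itself.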
