GRIP-VLM: Group-Relative Importance Pruning for Efficient Vision-Language Models
Mingzhe Huang, Weijun Wang, Xin Ding, Liang Mi, Hao Wen, Yuanchun Li, Lichen Pang, Shansong Yang, Yunxin Liu, Ting Cao

TL;DR
GRIP-VLM is a reinforcement learning framework for pruning visual tokens in vision-language models. It optimizes the discrete token selection directly, aiming to speed up inference without sacrificing accuracy.
Contribution
It presents an RL-driven pruning method that formulates visual token selection as a Markov Decision Process, avoiding the continuous-gradient relaxations that earlier training-aware pruning methods rely on.
Findings
Achieves up to 15% inference speedup while maintaining accuracy.
Outperforms heuristic and supervised baselines across benchmarks.
Demonstrates superior Pareto efficiency in model compression.
Abstract
In Vision-Language Models (VLMs), processing a massive number of visual tokens incurs prohibitive computational overhead. While recent training-aware pruning methods attempt to selectively discard redundant tokens, they largely rely on continuous-gradient relaxations. However, visual token pruning is inherently a discrete, non-convex combinatorial problem; consequently, these continuous approximations frequently trap the optimization in sub-optimal local minima, especially under aggressive compression budgets. To overcome this fundamental bottleneck, we propose GRIP-VLM, a Group-Relative Importance Pruning framework driven by Reinforcement Learning. Rather than relying on smooth-gradient assumptions, GRIP-VLM formulates pruning as a Markov Decision Process, employing a Group Relative Policy Optimization (GRPO) paradigm anchored by supervised warm-up to directly explore the discrete…
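The group-relative idea behind GRPO can be illustrated with a small sketch. Several discrete keep/drop masks are sampled for the same input, each mask receives a scalar reward, and each reward is then normalized against the mean and standard deviation of its own group. The abstract does not give the paper's reward or policy, so `toy_reward`, `token_scores`, `keep_probs`, and `budget` below are hypothetical stand-ins chosen only for illustration, not the authors' method.

```python
import random
import statistics


def sample_keep_mask(keep_probs, rng):
    """Sample one discrete keep (1) / drop (0) decision per visual token."""
    return [1 if rng.random() < p else 0 for p in keep_probs]


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def toy_reward(mask, token_scores, budget):
    """Hypothetical reward (an assumption, not from the paper):
    sum of kept-token scores minus a penalty for exceeding the budget."""
    kept = sum(mask)
    quality = sum(s for s, m in zip(token_scores, mask) if m)
    return quality - max(0, kept - budget)


rng = random.Random(0)
token_scores = [0.9, 0.1, 0.8, 0.05, 0.7, 0.2]  # made-up per-token usefulness
keep_probs = [0.5] * len(token_scores)           # uniform initial policy

# Sample a group of candidate masks for the same input and score each one.
group = [sample_keep_mask(keep_probs, rng) for _ in range(8)]
rewards = [toy_reward(m, token_scores, budget=3) for m in group]
advantages = group_relative_advantages(rewards)

for mask, adv in zip(group, advantages):
    print(mask, round(adv, 3))
```

Masks with a positive advantage did better than their group's average and would be reinforced by a policy-gradient update; those with a negative advantage would be discouraged. Because the baseline comes from the group itself, no separate value network is needed, and no gradient has to flow through the discrete sampling step.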
