PLPHP: Per-Layer Per-Head Vision Token Pruning for Efficient Large Vision-Language Models

Yu Meng; Kaiyuan Li; Chenran Huang; Chen Gao; Xinlei Chen; Yong Li; Xiaoping Zhang

arXiv:2502.14504 · cs.CV · February 21, 2025

TL;DR

PLPHP introduces a dynamic, fine-grained token pruning method for large vision-language models, significantly improving inference speed and reducing memory usage with minimal performance loss.

Contribution

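The paper proposes a two-level pruning approach. Layer-Level Retention Rate Allocation sets how many vision tokens each decoder layer keeps, based on how strongly that layer attends to visual information. Head-Level Vision Token Pruning then lets each attention head within a layer retain its own set of vision tokens.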

Findings

1. 18% faster decoding speed
2. Over 50% reduction in KV cache size
3. 0.46% average performance drop

Abstract

Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities across a range of multimodal tasks. However, their inference efficiency is constrained by the large number of visual tokens processed during decoding. To address this challenge, we propose Per-Layer Per-Head Vision Token Pruning (PLPHP), a two-level fine-grained pruning method including Layer-Level Retention Rate Allocation and Head-Level Vision Token Pruning. Motivated by the Vision Token Re-attention phenomenon across decoder layers, we dynamically adjust token retention rates layer by layer. Layers that exhibit stronger attention to visual information preserve more vision tokens, while layers with lower vision attention are aggressively pruned. Furthermore, PLPHP applies pruning at the attention head level, enabling different heads within the same layer to independently retain critical context. Experiments…
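The two levels described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not give the allocation rule, so the thresholds (`low`, `high`), the three retention rates, the use of a single query token's attention row per head, and every function name below are hypothetical choices made only to show the idea.

```python
from typing import List, Sequence, Tuple

# attn[h][t] is assumed to be head h's attention weight from the current
# query token to token t (each row sums to 1). vision_idx lists the
# positions of vision tokens. Both conventions are assumptions.


def vision_attention_share(attn: List[List[float]], vision_idx: List[int]) -> float:
    """Average, over heads, of the attention mass placed on vision tokens."""
    per_head = [sum(row[t] for t in vision_idx) for row in attn]
    return sum(per_head) / len(per_head)


def allocate_retention_rate(
    share: float,
    low: float = 0.2,                            # hypothetical threshold
    high: float = 0.5,                           # hypothetical threshold
    rates: Sequence[float] = (0.1, 0.5, 0.9),    # hypothetical rates
) -> float:
    """Layer level: layers that attend more to vision keep more vision tokens."""
    if share < low:
        return rates[0]
    if share < high:
        return rates[1]
    return rates[2]


def prune_per_head(
    attn: List[List[float]], vision_idx: List[int], rate: float
) -> List[List[int]]:
    """Head level: each head keeps its own top-k vision tokens by attention."""
    k = max(1, round(rate * len(vision_idx)))
    kept = []
    for row in attn:
        top = sorted(vision_idx, key=lambda t: row[t], reverse=True)[:k]
        kept.append(sorted(top))
    return kept


def plphp_layer_sketch(
    attn: List[List[float]], vision_idx: List[int]
) -> Tuple[float, List[List[int]]]:
    """Apply both levels to one decoder layer; returns (rate, kept indices per head)."""
    share = vision_attention_share(attn, vision_idx)
    rate = allocate_retention_rate(share)
    return rate, prune_per_head(attn, vision_idx, rate)
```

In this sketch, two heads in the same layer can end up keeping different vision tokens, which is the point of pruning per head rather than per layer. The pruned positions would then be dropped from each head's KV cache, which is where the memory saving comes from.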

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition

Methods: Softmax · Attention Is All You Need · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Pruning