ATP-LLaVA: Adaptive Token Pruning for Large Vision Language Models

Xubing Ye; Yukang Gan; Yixiao Ge; Xiao-Ping Zhang; Yansong Tang

arXiv:2412.00447 · cs.CV · December 3, 2024

TL;DR

ATP-LLaVA introduces an adaptive token pruning method for large vision language models that chooses layer-wise and instance-wise pruning ratios, reducing the average visual token count by 75% with only 1.9% performance degradation.

Contribution

The paper proposes an adaptive token pruning module and a spatial augmented pruning strategy that adjust token retention dynamically per input and per layer, improving inference efficiency.

Findings

01 Reduces token count by 75% on average

02 Maintains performance with only 1.9% degradation

03 Effective across seven benchmark datasets

Abstract

Large Vision Language Models (LVLMs) have achieved significant success across multi-modal tasks. However, the computational cost of processing long visual tokens can be prohibitively expensive on resource-limited devices. Previous methods have identified redundancy in visual tokens within the Large Language Model (LLM) decoder layers and have mitigated this by pruning tokens using a pre-defined or fixed ratio, thereby reducing computational overhead. Nonetheless, we observe that the impact of pruning ratio varies across different LLM layers and instances (image-prompt pairs). Therefore, it is essential to develop a layer-wise and instance-wise vision token pruning strategy to balance computational cost and model performance effectively. We propose ATP-LLaVA, a novel approach that adaptively determines instance-specific token pruning ratios for each LLM layer. Specifically, we introduce…
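
To make the contrast between fixed-ratio and instance-wise pruning concrete, the following is a minimal sketch and not the authors' implementation. It assumes each visual token already has an importance score (for example, attention received from text tokens); the function names, the score values, and the threshold of 0.10 are all hypothetical. A fixed ratio keeps the same number of tokens for every input, while a threshold keeps a number that depends on how the scores are distributed.

```python
def prune_fixed_ratio(scores, keep_ratio):
    """Fixed-ratio baseline: keep the top `keep_ratio` fraction of tokens by score."""
    k = max(1, round(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])  # restore original token order


def prune_by_threshold(scores, threshold):
    """Instance-wise sketch: keep every token whose score reaches the threshold.

    The number of kept tokens varies with the input. This is an illustrative
    stand-in, not the pruning module described in the paper.
    """
    kept = [i for i, s in enumerate(scores) if s >= threshold]
    if not kept:  # always keep at least the highest-scoring token
        kept = [max(range(len(scores)), key=lambda i: scores[i])]
    return kept


# Two hypothetical instances, each with eight visual tokens.
concentrated = [0.90, 0.05, 0.02, 0.01, 0.01, 0.01, 0.0, 0.0]
spread = [0.20, 0.18, 0.15, 0.15, 0.12, 0.10, 0.05, 0.05]

fixed_a = prune_fixed_ratio(concentrated, 0.25)   # 2 tokens kept
fixed_b = prune_fixed_ratio(spread, 0.25)         # 2 tokens kept
adaptive_a = prune_by_threshold(concentrated, 0.10)  # 1 token kept
adaptive_b = prune_by_threshold(spread, 0.10)        # 6 tokens kept

print(len(fixed_a), len(fixed_b), len(adaptive_a), len(adaptive_b))
```

With the fixed ratio, both instances keep two tokens even though the first needs only one and the second spreads its importance over six. A layer-wise variant would call prune_by_threshold with a different threshold at each decoder layer; how ATP-LLaVA actually derives its ratios is not covered by the truncated abstract above.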

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling

Methods: Pruning