Efficient Vision-Language Reasoning via Adaptive Token Pruning
Xue Li, Xiaonan Song, Henry Hu

TL;DR
This paper presents Adaptive Token Pruning (ATP), a dynamic method that reduces computational costs in vision-language models by selectively retaining the most relevant tokens, leading to faster inference with minimal accuracy loss.
Contribution
The paper introduces ATP, a novel, input-adaptive token pruning mechanism that improves efficiency without altering the backbone architecture of existing vision-language models.
Findings
Reduces inference FLOPs by around 40% in preliminary evaluations on VQAv2, GQA, and COCO.
Achieves a roughly 1.5x latency speedup.
Loses less than 1% accuracy.
Abstract
Real-world deployment of Vision-Language Models (VLMs) is hindered by high computational demands, as existing architectures inefficiently process all tokens uniformly. We introduce Adaptive Token Pruning (ATP), a dynamic inference mechanism that retains only the most informative tokens based on contextual relevance. ATP operates at the vision-language interface, assigning a hybrid importance score combining ViT CLS attention (intra-modal saliency) and CLIP text-image similarity (inter-modal relevance) to keep top-K tokens for the LLM. Unlike static compression, ATP adapts to each input without modifying the backbone. Proposed as a lightweight gating module, ATP is compatible with popular backbones like BLIP-2, LLaVA, and Flamingo. Preliminary evaluations across VQAv2, GQA, and COCO indicate that ATP reduces inference FLOPs by around 40% and achieves roughly 1.5x speedups in end-to-end…
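The scoring and top-K step described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the min-max normalization, the linear mixing weight `alpha`, and the use of per-token cosine similarity as a stand-in for CLIP text-image similarity are all hypothetical choices made here for clarity. The source only states that the score combines ViT CLS attention with CLIP text-image similarity and that the top-K tokens are kept.

```python
import math


def min_max_normalize(values):
    """Rescale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def hybrid_scores(cls_attention, token_embeddings, text_embedding, alpha=0.5):
    """Mix intra-modal saliency with inter-modal relevance per visual token.

    ASSUMPTION: alpha and the linear combination are illustrative; the
    source does not specify how the two signals are combined.
    """
    saliency = min_max_normalize(cls_attention)
    relevance = min_max_normalize(
        [cosine_similarity(tok, text_embedding) for tok in token_embeddings]
    )
    return [alpha * s + (1.0 - alpha) * r for s, r in zip(saliency, relevance)]


def prune_top_k(scores, k):
    """Return indices of the k highest-scoring tokens, in original order."""
    k = max(0, min(k, len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy input: four visual tokens with 2-D embeddings (hypothetical values).
cls_attention = [0.1, 0.4, 0.2, 0.3]
token_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
text_embedding = [1.0, 0.0]

scores = hybrid_scores(cls_attention, token_embeddings, text_embedding)
print(prune_top_k(scores, k=2))  # prints [1, 2]
```

Returning the kept indices in their original order preserves the spatial ordering of the surviving tokens before they are passed to the LLM. Setting `alpha` to 1.0 or 0.0 in this sketch reduces the score to saliency only or relevance only.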
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Topic Modeling
