CLIP Tricks You: Training-free Token Pruning for Efficient Pixel Grounding in Large Vision-Language Models

Sangin Lee; Yukyung Choi

arXiv:2605.13178·cs.CV·May 14, 2026

TL;DR

LiteLVLM is a training-free, text-guided token pruning method that improves pixel grounding efficiency in vision-language models by selectively retaining referent region tokens, achieving significant speed and memory savings.

Contribution

The paper introduces a training-free token pruning strategy that reverses CLIP's visual-text similarity ranking, so that tokens covering the referent region are retained for pixel grounding.

Findings

1. Outperforms existing methods by over 5% across various token budgets.

2. Maintains 90% of original performance with 22% speedup.

3. Reduces memory usage by 2.3 times without training or fine-tuning.

Abstract

In large vision-language models, visual tokens typically constitute the majority of input tokens, leading to substantial computational overhead. To address this, recent studies have explored pruning redundant or less informative visual tokens for image understanding tasks. However, these methods struggle with pixel grounding tasks, where token importance is highly contingent on the input text. Through an in-depth analysis of CLIP, we observe that visual tokens located within referent regions often exhibit low similarity to the textual representation. Motivated by this insight, we introduce LiteLVLM, a training-free, text-guided token pruning strategy for efficient pixel grounding inference. By reversing the ranking of CLIP's visual-text similarity, LiteLVLM effectively retains visual tokens covering the referent regions, while recovering context tokens to enable clear…
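The reversed ranking described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `keep` and `n_context` parameters, and the toy 2-D embeddings are all hypothetical. Because the abstract is cut off before it explains how context tokens are recovered, the evenly spaced sampling below is only a placeholder assumption.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def reverse_similarity_prune(visual_tokens, text_embedding, keep, n_context=0):
    """Return the sorted indices of the visual tokens to retain.

    visual_tokens: list of embedding vectors, one per visual token
        (assumed to live in the same space as the text embedding).
    text_embedding: embedding vector of the input text.
    keep: number of tokens chosen by the reversed similarity ranking.
    n_context: hypothetical extra budget of context tokens.
    """
    sims = [cosine(tok, text_embedding) for tok in visual_tokens]

    # Reversed ranking: sort ascending so the LOWEST-similarity tokens come
    # first, following the observation that referent-region tokens often
    # have low similarity to the text.
    order = sorted(range(len(sims)), key=lambda i: sims[i])
    selected = order[:keep]

    # ASSUMPTION (not from the paper): recover context by taking evenly
    # spaced tokens from those that were not selected.
    rest = sorted(order[keep:])
    context = []
    if n_context > 0 and rest:
        n = min(n_context, len(rest))
        step = len(rest) / n
        context = [rest[int(i * step)] for i in range(n)]

    return sorted(selected + context)


# Toy example with made-up 2-D embeddings.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0], [0.5, 0.5], [0.1, 0.9]]
text = [1.0, 0.0]
kept = reverse_similarity_prune(tokens, text, keep=2, n_context=2)
pruned_tokens = [tokens[i] for i in kept]
print(kept)
```

Returning sorted indices keeps the surviving tokens in their original spatial order, which a downstream language model would typically expect. A conventional text-guided pruner would instead sort in descending order and keep the most similar tokens.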

Code & Models

Repositories

sejong-rcv/LiteLVLM (GitHub)