Patch Ranking: Efficient CLIP by Learning to Rank Local Patches

Cheng-En Wu; Jinhong Lin; Yu Hen Hu; Pedro Morgado

arXiv:2409.14607 · cs.CV · December 2, 2024

TL;DR

This paper introduces a method to efficiently prune patch tokens in CLIP's ViT backbone using a learned ranking system, reducing computation with minimal accuracy loss across multiple datasets.

Contribution

The paper proposes a token pruning framework that pairs a lightweight ranking predictor with learnable visual tokens, improving CLIP's efficiency while largely maintaining accuracy.

Findings

01 Pruned 40% of patch tokens in CLIP's ViT with only a 0.3% loss in accuracy

02 Developed greedy search methods to establish a "Golden Ranking" of patch tokens

03 Investigated token pruning systematically across seven datasets

Abstract

Contrastive image-text pre-trained models such as CLIP have shown remarkable adaptability to downstream tasks. However, they face challenges due to the high computational requirements of the Vision Transformer (ViT) backbone. Current strategies to boost ViT efficiency focus on pruning patch tokens but fall short in addressing the multimodal nature of CLIP and identifying the optimal subset of tokens for maximum performance. To address this, we propose greedy search methods to establish a "Golden Ranking" and introduce a lightweight predictor specifically trained to approximate this Ranking. To compensate for any performance degradation resulting from token pruning, we incorporate learnable visual tokens that aid in restoring and potentially enhancing the model's performance. Our work presents a comprehensive and systematic investigation of pruning tokens within the ViT backbone of CLIP…
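The abstract names a greedy search for a "Golden Ranking" but does not spell out the procedure. The block below is a minimal sketch of one plausible reading, not the authors' implementation: it repeatedly drops the token whose removal hurts a score the least, then keeps the top-ranked fraction. The names `greedy_ranking`, `keep_top`, `toy_score`, and the per-token `weights` are hypothetical; in practice the score would come from evaluating the CLIP model on the kept tokens.

```python
from typing import Callable, List, Sequence


def greedy_ranking(num_tokens: int,
                   score: Callable[[Sequence[int]], float]) -> List[int]:
    """Order token indices from most to least important.

    At each step, drop the token whose removal leaves the highest score.
    Tokens dropped earlier are treated as less important.
    """
    remaining = list(range(num_tokens))
    removal_order: List[int] = []
    while len(remaining) > 1:
        # Token whose removal is cheapest under the (assumed) score function.
        cheapest = max(
            remaining,
            key=lambda t: score([i for i in remaining if i != t]),
        )
        remaining.remove(cheapest)
        removal_order.append(cheapest)
    removal_order.extend(remaining)  # the last survivor ranks first
    return removal_order[::-1]


def keep_top(ranking: Sequence[int], keep_ratio: float) -> List[int]:
    """Keep the best `keep_ratio` share of tokens, in original order."""
    n_keep = max(1, round(len(ranking) * keep_ratio))
    return sorted(ranking[:n_keep])


# Toy stand-in for model quality: each token has a fixed, made-up weight.
weights = [0.1, 0.9, 0.3, 0.7, 0.5]


def toy_score(kept: Sequence[int]) -> float:
    return sum(weights[i] for i in kept)


ranking = greedy_ranking(len(weights), toy_score)
kept = keep_top(ranking, keep_ratio=0.6)  # prune 40% of the tokens
print(ranking, kept)
```

This search calls the score function on the order of N² times for N tokens, which is one reason a lightweight predictor trained to approximate the resulting ranking is attractive at inference time.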

Code & Models

Repositories

CEWu/PatchRanking (PyTorch, official)

Taxonomy

Topics: Rough Sets and Fuzzy Logic · Image Retrieval and Classification Techniques · Text and Document Classification Technologies

Methods: Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Label Smoothing · Byte Pair Encoding · Absolute Position Encodings · Vision Transformer · Softmax · Layer Normalization · Dropout