Rethinking Visual Content Refinement in Low-Shot CLIP Adaptation

Jinda Lu; Shuo Wang; Yanbin Hao; Haifeng Liu; Xiang Wang; Meng Wang

arXiv:2407.14117 · cs.CV · July 22, 2024

Open Access · 1 Repo

TL;DR

This paper introduces a Visual Content Refinement method that enhances low-shot CLIP adaptation by focusing on local image details through multi-scale decomposition and view selection, leading to significant performance improvements.

Contribution

The proposed VCR method refines image content before adaptation, improving low-shot CLIP performance without introducing additional trainable parameters.

Findings

01  Achieves about 2% average improvement over Tip-Adapter on few-shot classification.

02  Effective across 13 datasets and 3 benchmark tasks.

03  Enhances focus on global and local image features without extra training.

Abstract

Recent adaptations can boost the low-shot capability of Contrastive Vision-Language Pre-training (CLIP) by effectively facilitating knowledge transfer. However, these adaptation methods usually operate on the global view of an input image and therefore have a biased perception of the image's local details. To solve this problem, we propose Visual Content Refinement (VCR), applied before the adaptation calculation during the test stage. Specifically, we first decompose the test image into different scales to shift the feature extractor's attention to the details of the image. Then, we select the image view with the maximum prediction margin at each scale to filter out noisy image views, where the prediction margins are calculated from the pre-trained CLIP model. Finally, we merge the content of the selected image views based on their scales to construct a new robust…
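The selection and merging steps described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes that view features have already been extracted by a CLIP image encoder for each scale, that the "prediction margin" is the gap between the top-1 and top-2 class probabilities, and that merging is a weighted average of the selected features. The function names, the `temperature` value, and the `weights` parameter are all hypothetical.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def prediction_margin(probs):
    # Assumed definition: top-1 probability minus top-2 probability.
    top2 = np.sort(probs, axis=-1)[..., -2:]
    return top2[..., 1] - top2[..., 0]

def select_view_per_scale(view_feats_by_scale, text_feats, temperature=100.0):
    # view_feats_by_scale: list of (num_views, dim) arrays, one per scale.
    # text_feats: (num_classes, dim) array of L2-normalized class text features.
    selected = []
    for feats in view_feats_by_scale:
        feats = feats / np.linalg.norm(feats, axis=-1, keepdims=True)
        probs = softmax(temperature * feats @ text_feats.T)
        # Keep only the view with the largest margin at this scale.
        best = int(np.argmax(prediction_margin(probs)))
        selected.append(feats[best])
    return np.stack(selected)

def merge_views(selected, weights=None):
    # Hypothetical merge: weighted average of the per-scale features,
    # re-normalized to unit length. Uniform weights by default.
    if weights is None:
        weights = np.ones(len(selected))
    weights = np.asarray(weights, dtype=float)
    merged = (weights[:, None] * selected).sum(axis=0) / weights.sum()
    return merged / np.linalg.norm(merged)

# Toy usage: 3 classes, 3-dim features, one global view and two local views.
text = np.eye(3)
scales = [
    np.array([[1.0, 0.0, 0.0]]),                   # scale 0: global view
    np.array([[1.0, 1.0, 0.0], [0.0, 1.0, 0.1]]),  # scale 1: two local views
]
selected = select_view_per_scale(scales, text)
refined = merge_views(selected)
```

In the toy example, the first local view is split evenly between two classes and has zero margin, so the second local view is kept for that scale. The refined feature would then be passed to whatever adaptation method is in use in place of the global image feature.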

Code & Models

Repositories

injadlu/VCR
PyTorch · Official

Taxonomy

Topics: Image Enhancement Techniques · Advanced Vision and Imaging · Video Analysis and Summarization

Methods: Softmax · Attention Is All You Need · Contrastive Language-Image Pre-training · Adapter · Focus