Rethinking Visual Content Refinement in Low-Shot CLIP Adaptation
Jinda Lu, Shuo Wang, Yanbin Hao, Haifeng Liu, Xiang Wang, Meng Wang

TL;DR
This paper introduces a Visual Content Refinement method that enhances low-shot CLIP adaptation by focusing on local image details through multi-scale decomposition and view selection, leading to significant performance improvements.
Contribution
The proposed VCR method refines image content before adaptation, improving low-shot CLIP performance without additional training parameters.
Findings
Achieves about 2% average improvement over Tip-Adapter on few-shot classification.
Effective across 13 datasets and 3 benchmark tasks.
Enhances focus on global and local image features without extra training.
Abstract
Recent adaptation methods can boost the low-shot capability of Contrastive Vision-Language Pre-training (CLIP) by effectively facilitating knowledge transfer. However, these methods usually operate on the global view of an input image and thus have a biased perception of partial local details of the image. To solve this problem, we propose Visual Content Refinement (VCR), applied before the adaptation calculation during the test stage. Specifically, we first decompose the test image into different scales to shift the feature extractor's attention to the details of the image. Then, we select the image view with the maximum prediction margin in each scale to filter out noisy image views, where the prediction margins are calculated from the pre-trained CLIP model. Finally, we merge the content of the selected image views based on their scales to construct a new robust…
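The select-and-merge steps described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes each scale has already been cropped into views and passed through CLIP, so that every view has a feature vector and a row of zero-shot class logits. The function names, the definition of the margin as top-1 minus top-2 probability, and the weighted-average merge with `scale_weights` are all assumptions made for illustration; the abstract is truncated before it specifies the merge.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def prediction_margin(logits):
    """Assumed margin: top-1 minus top-2 class probability, one value per view."""
    probs = softmax(logits)
    top2 = np.sort(probs, axis=-1)[:, -2:]  # ascending, so column 1 is the top-1
    return top2[:, 1] - top2[:, 0]

def select_view(features, logits):
    """Pick the view whose prediction margin is largest within one scale."""
    best = int(np.argmax(prediction_margin(logits)))
    return features[best], best

def refine(features_per_scale, logits_per_scale, scale_weights):
    """Merge the selected view of each scale with a weighted average (assumed rule)."""
    selected = [
        select_view(f, l)[0]
        for f, l in zip(features_per_scale, logits_per_scale)
    ]
    w = np.asarray(scale_weights, dtype=float)
    w = w / w.sum()
    merged = sum(wi * fi for wi, fi in zip(w, selected))
    return merged / np.linalg.norm(merged)  # unit-normalize, as is usual for CLIP features
```

In this sketch, `refine` returns a single refined feature that could be handed to a downstream adapter in place of the global-view feature. A view whose logits are nearly tied between two classes gets a margin close to zero and is therefore never selected over a confidently classified view.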
Taxonomy
Topics: Image Enhancement Techniques · Advanced Vision and Imaging · Video Analysis and Summarization
Methods: Softmax · Attention Is All You Need · Contrastive Language-Image Pre-training · Adapter · Focus
