AutoCLIP: Auto-tuning Zero-Shot Classifiers for Vision-Language Models
Jan Hendrik Metzen, Piyapat Saranrittichai, Chaithanya Kumar Mummadi

TL;DR
AutoCLIP introduces an unsupervised method for auto-tuning zero-shot classifiers by dynamically weighting class descriptors per image, significantly improving accuracy across various vision-language models and datasets.
Contribution
It proposes a novel auto-tuning approach for zero-shot classifiers that adjusts descriptor weights at inference time, enhancing performance without supervision.
Findings
AutoCLIP outperforms baseline methods by up to 3 percentage points in accuracy.
The method is fully unsupervised with minimal additional computation.
AutoCLIP is easy to implement and effective across multiple models and datasets.
Abstract
Classifiers built upon vision-language models such as CLIP have shown remarkable zero-shot performance across a broad range of image classification tasks. Prior work has studied different ways of automatically creating descriptor sets for every class based on prompt templates, ranging from manually engineered templates over templates obtained from a large language model to templates built from random words and characters. Up until now, deriving zero-shot classifiers from the respective encoded class descriptors has remained nearly unchanged, i.e., classify to the class that maximizes cosine similarity between its averaged encoded class descriptors and the image encoding. However, weighing all class descriptors equally can be suboptimal when certain descriptors match visual clues on a given image better than others. In this work, we propose AutoCLIP, a method for auto-tuning zero-shot…
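To make the classification rule in the abstract concrete, here is a minimal NumPy sketch. It assumes the image and descriptor encodings are already computed, and that the same K prompt templates are used for every class, so one weight vector of length K applies per image. The function names, array shapes, and the `temperature` parameter are illustrative assumptions. In particular, `similarity_weights` is a simple softmax-over-similarities stand-in for "weighting descriptors per image"; it is not the update rule from the paper, which the truncated abstract does not spell out.

```python
import numpy as np


def l2_normalize(x, axis=-1):
    """Scale vectors along `axis` to unit Euclidean norm."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


def baseline_logits(image_enc, descriptor_enc):
    """Standard zero-shot rule: cosine similarity between the image encoding
    and the uniformly averaged descriptor encodings of each class.

    image_enc: shape (d,); descriptor_enc: shape (C, K, d) for C classes and
    K descriptors per class (hypothetical layout).
    """
    class_queries = l2_normalize(descriptor_enc).mean(axis=1)  # (C, d)
    return l2_normalize(class_queries) @ l2_normalize(image_enc)  # (C,)


def weighted_logits(image_enc, descriptor_enc, weights):
    """Same rule, but descriptors are combined with per-image weights (K,)."""
    class_queries = np.einsum("k,ckd->cd", weights, l2_normalize(descriptor_enc))
    return l2_normalize(class_queries) @ l2_normalize(image_enc)


def similarity_weights(image_enc, descriptor_enc, temperature=0.1):
    """Illustrative heuristic (an assumption, not the paper's method):
    weight each template by a softmax over its best similarity to the image."""
    sims = l2_normalize(descriptor_enc) @ l2_normalize(image_enc)  # (C, K)
    scores = sims.max(axis=0) / temperature  # best match per template, (K,)
    scores = scores - scores.max()  # subtract max for numerical stability
    w = np.exp(scores)
    return w / w.sum()


# Usage with random stand-in encodings (real ones would come from a VLM).
rng = np.random.default_rng(0)
image = rng.normal(size=8)
descriptors = rng.normal(size=(3, 4, 8))
weights = similarity_weights(image, descriptors)
predicted_class = int(np.argmax(weighted_logits(image, descriptors, weights)))
```

With uniform weights of 1/K, `weighted_logits` reduces to `baseline_logits`, which is the equal-weighting case the abstract calls potentially suboptimal.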
Peer Reviews
Decision · Submitted to ICLR 2024
The paper is well written, and the proposed idea is easy to understand and follow. The motivation is reasonable: it leverages the knowledge of vision-language models (VLMs) and automatically tunes per-image weights for each prompt template at inference time. In addition, the authors discuss automatically tuning AutoCLIP's step size to control the entropy of the prompt-template weights. Overall, the proposed method is simple, modifying only a few steps in a general zero-shot classifier algor
Although the proposed method is simple and effective, it looks like a naive modification of existing algorithms that replaces the uniformly weighted average of descriptor encodings with a weighted average, which may limit the paper's novelty.
1. This paper proposes a simple and effective method for increasing model performance. 2. Extensive experiments on multiple datasets are conducted and highlight the performance of the proposed method.
1. The main concern with this work is its novelty. The idea of weighting the text features from different prompts has already been proposed [a]. It is unclear how the proposed method differs from [a]. 2. While adopting the weighting strategy of [a] at test time is one way to boost classification performance, one could simply average the K text prompts with the highest cosine similarity to the image feature. Is the proposed method better? 3. The author is suggested to put the arg
1. The idea and motivation of AutoCLIP make sense. It is a reasonable way to improve the similarity calculation during zero-shot classification. 2. AutoCLIP does not require additional training beyond the vision and language model. Therefore, it is easy to apply AutoCLIP to existing zero-shot classifiers for improving accuracy. 3. The experiments are thoroughly conducted covering many classification datasets and ranging from CLIP to CoCa. 4. Experimental results suggest AutoCLIP is an effective
1. There is one limitation: current experiments suggest AutoCLIP can only be used for zero-shot classification, which is only one useful aspect of large vision-language models. Models like CLIP are not just about zero-shot classification. For real applications, their impact also lies in downstream tasks, for example fine-tuning from CLIP-pretrained parameters or using CLIP to directly assist downstream tasks. It's better to explain some zero-shot
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: Contrastive Language-Image Pre-training
