On the test-time zero-shot generalization of vision-language models: Do we really need prompt learning?
Maxime Zanella, Ismail Ben Ayed

TL;DR
This paper introduces MTA, a robust, training-free test-time augmentation method for vision-language models that improves zero-shot generalization without prompt tuning or ad hoc rules.
Contribution
We propose MTA, a novel, hyperparameter-free test-time augmentation technique that outperforms prompt-based methods and requires neither additional training nor ad hoc rules (e.g., confidence thresholds) for filtering augmented views.
Findings
MTA surpasses prompt-based methods, including test-time prompt tuning, in zero-shot generalization.
MTA is computationally efficient and easy to deploy.
MTA consistently improves performance across 15 datasets.
Abstract
The development of large vision-language models, notably CLIP, has catalyzed research into effective adaptation techniques, with a particular focus on soft prompt tuning. Conjointly, test-time augmentation, which utilizes multiple augmented views of a single image to enhance zero-shot generalization, is emerging as a significant area of interest. This has predominantly directed research efforts toward test-time prompt tuning. In contrast, we introduce a robust MeanShift for Test-time Augmentation (MTA), which surpasses prompt-based methods without requiring this intensive training procedure. This positions MTA as an ideal solution for both standalone and API-based applications. Additionally, our method does not rely on ad hoc rules (e.g., confidence threshold) used in some previous test-time augmentation techniques to filter the augmented views. Instead, MTA incorporates a quality…
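The abstract describes aggregating several augmented views of one image with a MeanShift-style procedure rather than tuning prompts. The following is a minimal sketch of that general idea, not the authors' MTA: it runs plain Gaussian-kernel mean shift over view embeddings and omits the additional component that the truncated abstract only begins to describe. It assumes the view and class-text embeddings have already been computed (for instance by CLIP's image and text encoders). The function names, the fixed bandwidth, and the synthetic data are all hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def mean_shift_mode(views, bandwidth, n_iters=50, tol=1e-6):
    """Plain Gaussian-kernel mean shift over the rows of `views`.

    Starts from the first row (assumed here to be the unaugmented image)
    and moves to the kernel-weighted mean until the update is tiny.
    """
    mode = views[0].copy()
    for _ in range(n_iters):
        sq_dist = np.sum((views - mode) ** 2, axis=1)
        weights = np.exp(-sq_dist / (2.0 * bandwidth ** 2))
        new_mode = weights @ views / weights.sum()
        converged = np.linalg.norm(new_mode - mode) < tol
        mode = new_mode
        if converged:
            break
    return mode


def classify_views(view_embeddings, text_embeddings, bandwidth=0.5):
    """Predict a class from the mode of the augmented-view embeddings.

    `bandwidth` is a hand-picked assumption for this sketch.
    """
    views = l2_normalize(view_embeddings)
    texts = l2_normalize(text_embeddings)
    mode = l2_normalize(mean_shift_mode(views, bandwidth))
    scores = texts @ mode  # cosine similarity to each class embedding
    return int(np.argmax(scores)), scores


# Synthetic stand-ins for encoder outputs: 3 classes in an 8-dim space.
rng = np.random.default_rng(0)
text_embeddings = np.eye(3, 8)
inliers = text_embeddings[1] + 0.05 * rng.standard_normal((6, 8))
outliers = text_embeddings[2] + 0.05 * rng.standard_normal((2, 8))
view_embeddings = np.vstack([inliers, outliers])

predicted_class, class_scores = classify_views(view_embeddings, text_embeddings)
print(predicted_class, class_scores)
```

Because the kernel weights decay with distance from the current mode, views that land far from the dominant cluster contribute little to the final embedding, which is why no explicit confidence threshold appears in this sketch.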
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
Methods: High-Order Consensuses · Focus · Contrastive Language-Image Pre-training
