On the test-time zero-shot generalization of vision-language models: Do we really need prompt learning?

Maxime Zanella; Ismail Ben Ayed

arXiv:2405.02266·cs.CV·May 6, 2024

TL;DR

This paper introduces MTA, a robust, training-free test-time augmentation method for vision-language models that improves zero-shot generalization without prompt tuning or ad hoc rules.

Contribution

We propose MTA, a novel, hyperparameter-free test-time augmentation technique that outperforms prompt-based methods and requires neither additional training nor ad hoc rules for filtering augmented views.

Findings

1. MTA surpasses prompt tuning in zero-shot generalization.

2. MTA is computationally efficient and easy to deploy.

3. MTA consistently improves performance across 15 datasets.

Abstract

The development of large vision-language models, notably CLIP, has catalyzed research into effective adaptation techniques, with a particular focus on soft prompt tuning. Conjointly, test-time augmentation, which utilizes multiple augmented views of a single image to enhance zero-shot generalization, is emerging as a significant area of interest. This has predominantly directed research efforts toward test-time prompt tuning. In contrast, we introduce a robust MeanShift for Test-time Augmentation (MTA), which surpasses prompt-based methods without requiring this intensive training procedure. This positions MTA as an ideal solution for both standalone and API-based applications. Additionally, our method does not rely on ad hoc rules (e.g., confidence threshold) used in some previous test-time augmentation techniques to filter the augmented views. Instead, MTA incorporates a quality…
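
As an illustration of the general idea in the abstract (aggregating several augmented views of one image without any training), the following is a minimal sketch of plain Gaussian-kernel mean shift applied to view embeddings. It is not the authors' implementation: the abstract is truncated before it describes what MTA adds to this procedure, so nothing beyond standard mean shift is shown. The function names, the `bandwidth` value, and the synthetic embeddings standing in for CLIP features are all assumptions made for this example.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def mean_shift_mode(views, start, bandwidth=0.5, n_iters=50, tol=1e-6):
    """Standard mean shift: move `start` toward the nearest density mode.

    views: (N, D) embeddings of the augmented views.
    start: (D,) initial point, e.g. the embedding of the original image.
    bandwidth: Gaussian kernel width (a hypothetical value, chosen for the demo).
    """
    mode = start.copy()
    for _ in range(n_iters):
        sq_dist = np.sum((views - mode) ** 2, axis=1)
        weights = np.exp(-sq_dist / (2.0 * bandwidth ** 2))
        new_mode = weights @ views / weights.sum()  # kernel-weighted mean
        converged = np.linalg.norm(new_mode - mode) < tol
        mode = new_mode
        if converged:
            break
    return mode


def classify(views, class_embeddings, bandwidth=0.5):
    """Predict the class whose text embedding is most similar to the mode."""
    views = l2_normalize(views)
    mode = mean_shift_mode(views, start=views[0], bandwidth=bandwidth)
    scores = l2_normalize(class_embeddings) @ l2_normalize(mode)
    return int(np.argmax(scores)), mode


# Synthetic stand-ins for CLIP features (assumption: no real model is loaded).
rng = np.random.default_rng(0)
class_embeddings = np.eye(3, 16)  # three orthogonal "text" embeddings
good_views = class_embeddings[1] + 0.05 * rng.normal(size=(12, 16))
bad_views = class_embeddings[2] + 0.05 * rng.normal(size=(4, 16))
views = np.vstack([good_views, bad_views])  # row 0 plays the original image

predicted_class, mode = classify(views, class_embeddings)
print("predicted class:", predicted_class)
```

In this toy setup the four off-target views are far from the starting point, so the Gaussian kernel gives them near-zero weight and no explicit confidence threshold is needed to discard them. No parameters are trained at any point, which is the property the abstract contrasts with test-time prompt tuning.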

Code & Models

Repositories

maxzanella/mta (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning

Methods: High-Order Consensuses · Focus · Contrastive Language-Image Pre-training