Learning to Rank Pre-trained Vision-Language Models for Downstream Tasks
Yuhe Ding, Bo Jiang, Aihua Zheng, Qin Xu, Jian Liang

TL;DR
This paper introduces VEGA, an unsupervised method for selecting the best vision-language model for downstream tasks by measuring modality alignment without labeled data.
Contribution
The paper proposes VEGA, a novel unsupervised approach for VLM selection based on graph alignment, eliminating the need for annotations or large language models.
Findings
VEGA accurately predicts VLM performance across multiple benchmarks.
VEGA outperforms existing class-name-only selection methods.
The approach is effective in diverse application scenarios.
Abstract
Vision-language models (VLMs) like CLIP show stellar zero-shot capability on classification benchmarks. However, selecting the VLM with the highest performance on the unlabeled downstream task is non-trivial. Existing VLM selection methods focus on the class-name-only setting, relying on a supervised large-scale dataset and large language models, which may not be accessible or feasible during deployment. This paper introduces the problem of unsupervised vision-language model selection, where only unsupervised downstream datasets are available, with no additional information provided. To solve this problem, we propose a method termed Visual-tExtual Graph Alignment (VEGA), to select VLMs without any annotations by measuring the alignment of the VLM between the two modalities on the downstream task. VEGA is motivated by the pretraining paradigm of VLMs, which aligns features with…
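The abstract is cut off before it describes how VEGA builds or compares its graphs, so the sketch below is not the authors' method. It is only a simplified proxy for the general idea of scoring a VLM by how well its visual and textual feature structures agree on unlabeled data. Everything here is an assumption made for illustration: the function names (`graph_alignment_score`, `rank_models`), the use of nearest-class pseudo-labels to form visual prototypes, cosine-similarity graphs, and Pearson correlation of edge weights as the alignment score. Features are assumed to be precomputed by each candidate model.

```python
import numpy as np


def l2_normalize(x):
    # Scale each row to unit length; guard against zero vectors.
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, 1e-12)


def graph_alignment_score(image_feats, text_feats):
    """Hypothetical alignment proxy (an assumption, not the paper's score).

    image_feats: (N, d) features of unlabeled images from one VLM.
    text_feats:  (C, d) features of the class-name prompts from the same VLM.
    """
    img = l2_normalize(np.asarray(image_feats, dtype=float))
    txt = l2_normalize(np.asarray(text_feats, dtype=float))

    # Pseudo-label each image with its most similar class-name embedding.
    pseudo = (img @ txt.T).argmax(axis=1)
    present = np.unique(pseudo)
    if present.size < 3:
        return 0.0  # too few nodes to compare graph structure

    # Visual nodes: mean feature of the images assigned to each class.
    protos = l2_normalize(
        np.stack([img[pseudo == c].mean(axis=0) for c in present])
    )

    # Edges: cosine similarity between nodes, within each modality.
    vis_graph = protos @ protos.T
    txt_graph = txt[present] @ txt[present].T

    # Compare the two graphs on their off-diagonal edge weights.
    iu = np.triu_indices(present.size, k=1)
    v, t = vis_graph[iu], txt_graph[iu]
    if v.std() == 0 or t.std() == 0:
        return 0.0
    return float(np.corrcoef(v, t)[0, 1])


def rank_models(features_by_model):
    # features_by_model: {model_name: (image_feats, text_feats)}
    scores = {
        name: graph_alignment_score(img, txt)
        for name, (img, txt) in features_by_model.items()
    }
    return sorted(scores, key=scores.get, reverse=True), scores
```

With per-model features in hand, `rank_models({"model_a": (img_a, txt_a), "model_b": (img_b, txt_b)})` returns the model names ordered by this proxy score together with the raw scores. Because the pseudo-labels come from the same features being scored, the proxy can be optimistic for poorly aligned models, so it should be read as a rough illustration rather than a selection criterion.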
Taxonomy
Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications
Methods: VEGA · Contrastive Language-Image Pre-training · Focus
