Learning to Rank Pre-trained Vision-Language Models for Downstream Tasks

Yuhe Ding; Bo Jiang; Aihua Zheng; Qin Xu; Jian Liang

arXiv:2412.20682·cs.CV·December 31, 2024

TL;DR

This paper introduces VEGA, an unsupervised method for selecting the best vision-language model for downstream tasks by measuring modality alignment without labeled data.

Contribution

The paper proposes VEGA, a novel unsupervised approach for VLM selection based on graph alignment, eliminating the need for annotations or large language models.

Findings

01. VEGA accurately predicts VLM performance across multiple benchmarks.

02. VEGA outperforms existing class-name-only selection methods.

03. The approach is effective in diverse application scenarios.

Abstract

Vision language models (VLMs) like CLIP show stellar zero-shot capability on classification benchmarks. However, selecting the VLM with the highest performance on the unlabeled downstream task is non-trivial. Existing VLM selection methods focus on the class-name-only setting, relying on a supervised large-scale dataset and large language models, which may not be accessible or feasible during deployment. This paper introduces the problem of unsupervised vision-language model selection, where only unsupervised downstream datasets are available, with no additional information provided. To solve this problem, we propose a method termed Visual-tExtual Graph Alignment (VEGA), to select VLMs without any annotations by measuring the alignment of the VLM between the two modalities on the downstream task. VEGA is motivated by the pretraining paradigm of VLMs, which aligns features with…
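
The abstract is cut off before the VEGA formulation, so the sketch below is not the paper's method. It only illustrates the general idea of scoring a VLM by how well a graph built from its image features matches a graph built from its class-name text features, with no labels involved. Everything specific here is an assumption made for illustration: the function names `alignment_score` and `rank_models`, the use of zero-shot pseudo-labels to form visual nodes, cosine similarity as the edge weight, and Pearson correlation as the agreement measure.

```python
import numpy as np


def _unit_rows(x, eps=1e-12):
    """L2-normalize each row so dot products become cosine similarities."""
    x = np.asarray(x, dtype=float)
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def alignment_score(image_feats, text_feats):
    """Illustrative visual-textual graph agreement (hypothetical, not the paper's VEGA score).

    image_feats: (n_images, dim) features of unlabeled downstream images.
    text_feats:  (n_classes, dim) features of the class-name prompts.
    """
    img = _unit_rows(image_feats)
    txt = _unit_rows(text_feats)

    # Assumption: pseudo-label each image with its most similar class prompt.
    pseudo = (img @ txt.T).argmax(axis=1)
    present = np.unique(pseudo)
    if present.size < 3:
        raise ValueError("need at least 3 predicted classes to compare graph edges")

    # Visual graph nodes: mean image feature of each predicted class.
    visual_nodes = _unit_rows(
        np.stack([img[pseudo == c].mean(axis=0) for c in present])
    )
    # Textual graph nodes: the matching class-name features.
    textual_nodes = txt[present]

    # Graph edges: pairwise cosine similarity between nodes.
    visual_edges = visual_nodes @ visual_nodes.T
    textual_edges = textual_nodes @ textual_nodes.T

    # Compare the off-diagonal upper triangles of the two edge matrices.
    iu = np.triu_indices(present.size, k=1)
    return float(np.corrcoef(visual_edges[iu], textual_edges[iu])[0, 1])


def rank_models(features_by_model):
    """Sort models by alignment score, highest first.

    features_by_model maps a model name to (image_feats, text_feats).
    """
    scores = {
        name: alignment_score(img, txt)
        for name, (img, txt) in features_by_model.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

In use, one would extract image and prompt features from each candidate VLM on the same unlabeled dataset, pass them to `rank_models`, and treat the top entry as the predicted best model. Whether such a score tracks downstream accuracy depends on the actual formulation, which the truncated abstract does not give.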

Taxonomy

Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications

Methods: VEGA · Contrastive Language-Image Pre-training · Focus