LOVM: Language-Only Vision Model Selection
Orr Zohar, Shih-Cheng Huang, Kuan-Chieh Wang, Serena Yeung

TL;DR
This paper introduces LOVM, a new benchmark and task for selecting the best pre-trained vision-language models based solely on text descriptions, eliminating the need for dataset-specific evaluations.
Contribution
We propose a novel task and benchmark for zero-shot model selection using only text descriptions, enabling efficient VLM evaluation without access to downstream datasets.
Findings
Established the LOVM benchmark with evaluations of 35 VLMs across 23 datasets.
Demonstrated the effectiveness of text-based model ranking methods.
Provided insights into zero-shot performance prediction for VLMs.
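As a rough illustration of how a text-based ranking method could be scored against ground truth, the sketch below compares predicted and measured model accuracies using top-k recall and Kendall's tau. The choice of these two metrics, the function names, and the model names and numbers are all assumptions made for illustration; none of them is taken from this summary, and the text-only prediction step itself is not shown.

```python
from itertools import combinations


def rank_models(scores):
    """Return model names sorted from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


def top_k_recall(predicted, actual, k):
    """Fraction of the true top-k models that also appear in the predicted top-k.

    Assumes k is no larger than the number of models.
    """
    predicted_top = set(rank_models(predicted)[:k])
    actual_top = set(rank_models(actual)[:k])
    return len(predicted_top & actual_top) / k


def kendall_tau(predicted, actual):
    """Kendall's tau-a: (concordant - discordant) / total pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    models = list(actual)
    concordant = 0
    discordant = 0
    for a, b in combinations(models, 2):
        agreement = (predicted[a] - predicted[b]) * (actual[a] - actual[b])
        if agreement > 0:
            concordant += 1
        elif agreement < 0:
            discordant += 1
    total_pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / total_pairs


# Hypothetical toy values, NOT results from the paper.
predicted = {"model_a": 0.71, "model_b": 0.64, "model_c": 0.58, "model_d": 0.40}
actual = {"model_a": 0.69, "model_b": 0.55, "model_c": 0.62, "model_d": 0.35}

print(rank_models(predicted))            # predicted order of the candidate models
print(top_k_recall(predicted, actual, k=2))
print(kendall_tau(predicted, actual))
```

Top-k recall rewards a method for surfacing the best few models, while Kendall's tau measures agreement over the whole ordering, so the two can disagree when a method gets the head of the list right but scrambles the tail.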
Abstract
Pre-trained multi-modal vision-language models (VLMs) are becoming increasingly popular due to their exceptional performance on downstream vision applications, particularly in the few- and zero-shot settings. However, selecting the best-performing VLM for some downstream applications is non-trivial, as it is dataset- and task-dependent. Meanwhile, the exhaustive evaluation of all available VLMs on a novel application is not only time-consuming and computationally demanding but also necessitates the collection of a labeled dataset for evaluation. As the number of open-source VLM variants increases, there is a need for an efficient model selection strategy that does not require access to a curated evaluation dataset. This paper proposes a novel task and benchmark for efficiently evaluating VLMs' zero-shot performance on downstream applications without access to the downstream task dataset.…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
