Bridge the Modality and Capability Gaps in Vision-Language Model Selection
Chao Yi, Yu-Hang He, De-Chuan Zhan, Han-Jia Ye

TL;DR
This paper introduces SWAB, a method that effectively predicts the performance of vision-language models on specific image classification tasks by bridging modality and capability gaps using optimal transport, without needing dataset images.
Contribution
The paper proposes SWAB, a novel approach that leverages optimal transport to bridge modality and capability gaps, enabling accurate VLM selection solely from text data.
Findings
SWAB accurately predicts VLM performance rankings on target datasets.
SWAB outperforms baseline methods in zero-shot image classification tasks.
The method effectively bridges modality and capability gaps in VLM selection.
Abstract
Vision Language Models (VLMs) excel in zero-shot image classification by pairing images with textual category names. The expanding variety of Pre-Trained VLMs enhances the likelihood of identifying a suitable VLM for specific tasks. To better reuse the VLM resource and fully leverage its potential on different zero-shot image classification tasks, a promising strategy is selecting appropriate Pre-Trained VLMs from the VLM Zoo, relying solely on the text data of the target dataset without access to the dataset's images. In this paper, we analyze two inherent challenges in assessing the ability of a VLM in this Language-Only VLM selection: the "Modality Gap" - the disparity in a VLM's embeddings across two different modalities, making text a less reliable substitute for images; and the "Capability Gap" - the discrepancy between the VLM's overall ranking and its ranking for the target dataset,…
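The "Capability Gap" described in the abstract can be made concrete with a small sketch. The snippet below is an illustration only, not SWAB and not the paper's evaluation protocol: it assumes a rank-correlation measure (Kendall's tau) is a reasonable way to compare a model zoo's overall ranking against its ranking on one target dataset, and every accuracy value in it is a made-up placeholder rather than a reported result.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score lists over the same models.

    Pairs tied in either list count as neither concordant nor discordant.
    """
    concordant = discordant = 0
    for i, j in combinations(range(len(scores_a)), 2):
        sign = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(scores_a) * (len(scores_a) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical accuracies for four placeholder models (NOT from the paper).
average_acc = [0.70, 0.65, 0.60, 0.55]  # averaged over many datasets
target_acc = [0.40, 0.62, 0.58, 0.30]   # on one specific target dataset

# A value well below 1.0 means the overall ranking is a poor guide
# to which model is best on this particular dataset.
tau = kendall_tau(average_acc, target_acc)
print(f"Kendall tau between overall and target ranking: {tau:.3f}")
```

In this toy setup the model with the best average accuracy ranks only third on the target dataset, which is the kind of mismatch the abstract calls the Capability Gap.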
Taxonomy
Topics: Multimodal Machine Learning Applications
