Rethinking Model Selection in VLM Through the Lens of Gromov-Wasserstein Distance
Muyang Li, Yucheng Liu, Jianbo Ma, Elliot Osborne, Bo Han, Tongliang Liu

TL;DR
This paper introduces a Gromov-Wasserstein distance-based metric to better select vision encoders for vision-language models, outperforming traditional size or accuracy metrics in predicting model performance.
Contribution
It reveals the importance of structural similarity across modalities for VLMs and proposes a new metric for effective model selection based on this insight.
Findings
Common selection heuristics such as encoder size or zero-shot accuracy show only weak to moderate correlation with VLM performance.
Gromov-Wasserstein distance effectively predicts VLM performance before training.
The proposed metric outperforms existing model selection strategies in empirical tests.
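To make the structural-similarity idea concrete, here is a minimal NumPy sketch of the square-loss Gromov-Wasserstein objective evaluated at a fixed coupling. It is an illustration under stated assumptions, not the paper's metric: the excerpt does not specify how the authors build their distance matrices or solve for the coupling, and the names `pairwise_dist` and `gw_cost` are hypothetical. The true Gromov-Wasserstein distance minimizes this objective over all couplings, so the value at any single coupling is only an upper bound.

```python
import numpy as np

def pairwise_dist(X):
    """Euclidean distance matrix between the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def gw_cost(C1, C2, T):
    """Square-loss GW objective for a fixed coupling T (hypothetical helper).

    Computes sum_{i,j,k,l} (C1[i,k] - C2[j,l])**2 * T[i,j] * T[k,l]
    by expanding the square, which avoids building a 4-D tensor.
    """
    p = T.sum(axis=1)  # marginal of T on the first space
    q = T.sum(axis=0)  # marginal of T on the second space
    term1 = p @ (C1 ** 2) @ p
    term2 = q @ (C2 ** 2) @ q
    cross = np.sum((C1 @ T @ C2.T) * T)
    return term1 + term2 - 2.0 * cross

rng = np.random.default_rng(0)
n, d = 6, 4
X = rng.normal(size=(n, d))                   # stand-in for one set of embeddings
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))  # random orthogonal matrix
Y = X @ Q                                     # rotated copy: same pairwise geometry
Z = 3.0 * X                                   # rescaled copy: distorted geometry

T = np.eye(n) / n  # coupling that matches sample i to sample i (an assumption)

cost_same = gw_cost(pairwise_dist(X), pairwise_dist(Y), T)
cost_scaled = gw_cost(pairwise_dist(X), pairwise_dist(Z), T)
print(cost_same, cost_scaled)
```

The rotated copy preserves every pairwise distance, so its cost is zero up to floating-point error, while the rescaled copy yields a strictly positive cost under the same coupling. For an actual distance one would optimize over the coupling, for example with an optimal-transport library, rather than fixing it as done here.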
Abstract
Vision-Language Models (VLMs) have enhanced traditional LLMs with visual capabilities through the integration of vision encoders. While recent works have explored various combinations of vision encoders and LLMs, a principled understanding of what makes a vision encoder suitable for VLM alignment is still lacking. In this paper, we systematically investigate this question via comprehensive experiments on a curated collection of 19 pre-trained vision encoders from diverse sources. We first demonstrate that common practices, such as choosing encoders with the largest size or highest zero-shot accuracy, consistently fail to identify optimal models. In fact, these metrics show only weak to moderate correlation with VLM performance. This intriguing finding raises a fundamental question: What factors of vision encoders matter in VLMs? Through comprehensive analysis, we identify that the…
