VHELM: A Holistic Evaluation of Vision Language Models
Tony Lee, Haoqin Tu, Chi Heem Wong, Wenhao Zheng, Yiyang Zhou, Yifan Mai, Josselin Somerville Roberts, Michihiro Yasunaga, Huaxiu Yao, Cihang Xie, Percy Liang

TL;DR
VHELM provides a comprehensive, multi-dimensional evaluation framework for vision-language models, covering perception, reasoning, bias, multilinguality, and safety, enabling fair comparison and revealing new insights.
Contribution
This work extends the HELM framework to VLMs, standardizes evaluation procedures, and offers a lightweight, automatic benchmark for holistic assessment of models.
Findings
Efficiency-focused models perform worse than their full-size counterparts on bias benchmarks.
Full models outperform lightweight ones on some aspects, but not on all of them.
The framework enables fair, comprehensive comparison of 22 VLMs.
Abstract
Current benchmarks for assessing vision-language models (VLMs) often focus on their perception or problem-solving capabilities and neglect other critical aspects such as fairness, multilinguality, or toxicity. Furthermore, they differ in their evaluation procedures and the scope of the evaluation, making it difficult to compare models. To address these issues, we extend the HELM framework to VLMs to present the Holistic Evaluation of Vision Language Models (VHELM). VHELM aggregates various datasets to cover one or more of the 9 aspects: visual perception, knowledge, reasoning, bias, fairness, multilinguality, robustness, toxicity, and safety. In doing so, we produce a comprehensive, multi-dimensional view of the capabilities of the VLMs across these important factors. In addition, we standardize the inference parameters, methods of prompting, and evaluation metrics to enable…
Taxonomy
Topics: Language, Metaphor, and Cognition · Categorization, perception, and language · Multimodal Machine Learning Applications
MethodsFocus
