VHELM: A Holistic Evaluation of Vision Language Models

Tony Lee; Haoqin Tu; Chi Heem Wong; Wenhao Zheng; Yiyang Zhou; Yifan Mai; Josselin Somerville Roberts; Michihiro Yasunaga; Huaxiu Yao; Cihang Xie; Percy Liang

arXiv:2410.07112 · cs.CV · October 25, 2024

TL;DR

VHELM provides a comprehensive, multi-dimensional evaluation framework for vision-language models, covering perception, reasoning, bias, multilinguality, and safety, enabling fair comparison and revealing new insights.

Contribution

This work extends the HELM framework to VLMs, standardizes evaluation procedures, and offers a lightweight, automatic benchmark for holistic assessment of models.

Findings

1. Efficiency-focused models perform worse on bias benchmarks.
2. Full models outperform lightweight ones on certain aspects.
3. The framework enables fair, comprehensive comparison of 22 VLMs.

Abstract

Current benchmarks for assessing vision-language models (VLMs) often focus on their perception or problem-solving capabilities and neglect other critical aspects such as fairness, multilinguality, or toxicity. Furthermore, they differ in their evaluation procedures and the scope of the evaluation, making it difficult to compare models. To address these issues, we extend the HELM framework to VLMs to present the Holistic Evaluation of Vision Language Models (VHELM). VHELM aggregates various datasets to cover one or more of the nine aspects: visual perception, knowledge, reasoning, bias, fairness, multilinguality, robustness, toxicity, and safety. In doing so, we produce a comprehensive, multi-dimensional view of the capabilities of the VLMs across these important factors. In addition, we standardize the inference parameters, methods of prompting, and evaluation metrics to enable…
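The aggregation idea in the abstract, where each dataset covers one or more aspects and results are reported per aspect, can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration: the dataset names (`dataset_a` and so on), their aspect assignments, the scores, and the use of a plain mean are hypothetical and are not taken from the paper or from the `stanford-crfm/helm` code.

```python
from collections import defaultdict
from statistics import mean

# The nine aspects named in the abstract.
ASPECTS = (
    "visual perception", "knowledge", "reasoning", "bias", "fairness",
    "multilinguality", "robustness", "toxicity", "safety",
)

# Hypothetical mapping: each dataset covers one or more aspects.
DATASET_ASPECTS = {
    "dataset_a": ["visual perception", "knowledge"],
    "dataset_b": ["reasoning"],
    "dataset_c": ["bias", "fairness"],
}


def aggregate_by_aspect(scores, dataset_aspects):
    """Group per-dataset scores by aspect and average them.

    A plain mean is an assumption of this sketch, not the paper's metric.
    """
    buckets = defaultdict(list)
    for dataset, score in scores.items():
        for aspect in dataset_aspects.get(dataset, []):
            if aspect not in ASPECTS:
                raise ValueError(f"unknown aspect: {aspect}")
            buckets[aspect].append(score)
    return {aspect: mean(values) for aspect, values in buckets.items()}


# Made-up scores for one hypothetical model.
scores = {"dataset_a": 0.5, "dataset_b": 0.75, "dataset_c": 0.25}
report = aggregate_by_aspect(scores, DATASET_ASPECTS)
for aspect, value in sorted(report.items()):
    print(f"{aspect}: {value:.2f}")
```

A dataset that covers two aspects contributes its score to both, which is why the mapping stores a list per dataset rather than a single label.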

Code & Models

Repositories

stanford-crfm/helm (Official)

Taxonomy

Topics: Language, Metaphor, and Cognition · Categorization, perception, and language · Multimodal Machine Learning Applications

Methods: Focus