VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models

Haodong Duan; Xinyu Fang; Junming Yang; Xiangyu Zhao; Yuxuan Qiao; Mo Li; Amit Agarwal; Zhe Chen; Lin Chen; Yuan Liu; Yubo Ma; Hailong Sun; Yifan Zhang; Shiyin Lu; Tack Hwa Wong; Weiyun Wang; Peiheng Zhou; Xiaozhe Li; Chaoyou Fu; Junbo Cui; Jixuan Chen; Enxin Song; Song Mao; Shengyuan Ding; Tianhao Liang; Zicheng Zhang; Xiaoyi Dong; Yuhang Zang; Pan Zhang; Jiaqi Wang; Dahua Lin; Kai Chen

arXiv:2407.11691·cs.CV·August 29, 2025·3 cites

TL;DR

VLMEvalKit is an open-source toolkit for evaluating large multi-modality models. It supports over 200 models and more than 80 benchmarks, and it aims to make evaluation results reproducible, both for today's vision-language models and for other modalities in future updates.

Contribution

The paper introduces a unified, user-friendly framework for evaluating multi-modality models that streamlines data handling, inference, and metric calculation, and it hosts a leaderboard to track progress.

Findings

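01. Implements over 200 models and more than 80 benchmarks.

02. New models are added through a single interface; the toolkit handles the rest of the evaluation pipeline.

03. The design is compatible with future updates that add modalities such as audio and video.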

Abstract

We present VLMEvalKit: an open-source toolkit for evaluating large multi-modality models based on PyTorch. The toolkit aims to provide a user-friendly and comprehensive framework for researchers and developers to evaluate existing multi-modality models and publish reproducible evaluation results. In VLMEvalKit, we implement more than 200 different large multi-modality models, including both proprietary APIs and open-source models, as well as more than 80 different multi-modal benchmarks. Because every model is implemented behind a single interface, new models can be easily added to the toolkit, while the toolkit automatically handles the remaining workloads, including data preparation, distributed inference, prediction post-processing, and metric calculation. Although the toolkit is currently mainly used for evaluating large vision-language models, its design is compatible with future updates that incorporate…
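The single-interface idea in the abstract can be illustrated with a small, self-contained sketch. This is not VLMEvalKit's actual API: the names `MultiModalModel`, `Sample`, `AlwaysA`, `extract_choice`, and `evaluate` are hypothetical and introduced only for illustration. The sketch assumes a multiple-choice benchmark and omits distributed inference entirely. It shows how a model author could implement one method while shared code handles inference, prediction post-processing, and metric calculation.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Sample:
    """One benchmark item: an image path, a question, and the gold option letter."""
    image: str
    question: str
    answer: str


class MultiModalModel(ABC):
    """Hypothetical single interface: a new model implements only `generate`."""

    @abstractmethod
    def generate(self, image: str, question: str) -> str:
        """Return the model's free-form text response."""


class AlwaysA(MultiModalModel):
    """Toy stand-in for a real model; it ignores its inputs."""

    def generate(self, image: str, question: str) -> str:
        return "The answer is A."


def extract_choice(prediction: str, choices: str = "ABCD") -> str:
    """Post-process free-form text into the first standalone option letter.

    Returns an empty string when no option letter is found.
    """
    cleaned = prediction.replace(".", " ").replace(",", " ")
    for token in cleaned.split():
        if len(token) == 1 and token in choices:
            return token
    return ""


def evaluate(model: MultiModalModel, samples: list[Sample]) -> float:
    """Shared pipeline: run inference, post-process, and compute accuracy."""
    if not samples:
        return 0.0
    correct = 0
    for sample in samples:
        raw = model.generate(sample.image, sample.question)
        if extract_choice(raw) == sample.answer:
            correct += 1
    return correct / len(samples)


if __name__ == "__main__":
    data = [
        Sample("img0.jpg", "Which option is shown?", "A"),
        Sample("img1.jpg", "Which option is shown?", "B"),
    ]
    print(evaluate(AlwaysA(), data))
```

In this sketch, supporting another model means writing one more subclass of `MultiModalModel`; `evaluate` and `extract_choice` stay unchanged, which is the division of labor the abstract describes.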

Code & Models

Models

🤗 tuandunghcmut/vlmeval (model)

Taxonomy

Topics: Natural Language Processing Techniques