LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models

Kaichen Zhang; Bo Li; Peiyuan Zhang; Fanyi Pu; Joshua Adrian Cahyono; Kairui Hu; Shuai Liu; Yuanhan Zhang; Jingkang Yang; Chunyuan Li; Ziwei Liu

arXiv:2407.12772·cs.CL·September 19, 2025·1 cites

TL;DR

This paper introduces LMMS-EVAL, a unified benchmark framework for evaluating large multimodal models, examines the trade-off among coverage, cost, and contamination, and proposes LMMS-EVAL LITE and LIVEBENCH as practical steps toward more reliable assessment.

Contribution

The paper presents a unified multimodal evaluation framework (LMMS-EVAL), a pruned toolkit for efficient evaluation (LMMS-EVAL LITE), and a live benchmark (Multimodal LIVEBENCH) that assesses models on continuously updated real-world content.

Findings

1. LMMS-EVAL offers extensive task coverage but, on its own, falls short of low cost and zero contamination.

2. LMMS-EVAL LITE is a pruned toolkit that balances coverage with evaluation efficiency.

3. LIVEBENCH enables continuous, low-cost, zero-contamination model evaluation in real-world settings.

Abstract

The advances of large foundation models necessitate wide-coverage, low-cost, and zero-contamination benchmarks. Despite continuous exploration of language model evaluations, comprehensive studies on the evaluation of Large Multi-modal Models (LMMs) remain limited. In this work, we introduce LMMS-EVAL, a unified and standardized multimodal benchmark framework with over 50 tasks and more than 10 models to promote transparent and reproducible evaluations. Although LMMS-EVAL offers comprehensive coverage, we find it still falls short in achieving low cost and zero contamination. To approach this evaluation trilemma, we further introduce LMMS-EVAL LITE, a pruned evaluation toolkit that emphasizes both coverage and efficiency. Additionally, we present Multimodal LIVEBENCH that utilizes continuously updating news and online forums to assess models' generalization abilities in the wild,…

Code & Models

Repositories

evolvinglmms-lab/lmms-eval
PyTorch · Official

Datasets

lmms-lab/LiveBench
dataset · 1.5k downloads

Taxonomy

Topics: Natural Language Processing Techniques · Speech and dialogue systems