LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models
Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang, Jingkang Yang, Chunyuan Li, Ziwei Liu

TL;DR
This paper introduces LMMS-EVAL, a unified benchmark framework for evaluating large multimodal models with wide task coverage. Because coverage alone does not solve the problems of evaluation cost and data contamination, it also proposes LMMS-EVAL LITE, a pruned and cheaper toolkit, and LIVEBENCH, a benchmark built from continuously updated sources, for more reliable assessments.
Contribution
It presents a unified multimodal evaluation framework, introduces a pruned toolkit for efficiency, and develops a live benchmarking platform to assess models in real-world scenarios.
Findings
LMMS-EVAL offers extensive task coverage but faces cost and contamination issues.
LMMS-EVAL LITE improves evaluation efficiency and coverage balance.
LIVEBENCH enables continuous, low-cost, zero-contamination model evaluation in real-world settings.
Abstract
The advances of large foundation models necessitate wide-coverage, low-cost, and zero-contamination benchmarks. Despite continuous exploration of language model evaluations, comprehensive studies on the evaluation of Large Multi-modal Models (LMMs) remain limited. In this work, we introduce LMMS-EVAL, a unified and standardized multimodal benchmark framework with over 50 tasks and more than 10 models to promote transparent and reproducible evaluations. Although LMMS-EVAL offers comprehensive coverage, we find it still falls short in achieving low cost and zero contamination. To approach this evaluation trilemma, we further introduce LMMS-EVAL LITE, a pruned evaluation toolkit that emphasizes both coverage and efficiency. Additionally, we present Multimodal LIVEBENCH that utilizes continuously updating news and online forums to assess models' generalization abilities in the wild,…
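The abstract describes LMMS-EVAL LITE as a pruned toolkit but, as excerpted here, does not say how items are selected. Purely as an illustration of what pruning for coverage can look like, the sketch below uses a generic greedy k-center selection over item embeddings. The choice of k-center, the function name `k_center_greedy`, and the toy embeddings are all assumptions made for this example, not details taken from the paper.

```python
import math


def k_center_greedy(points, k):
    """Return k indices chosen so that every point lies near some chosen point.

    Generic greedy k-center heuristic; an assumed stand-in for a pruning step,
    not the method described in the paper.
    """
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    chosen = [0]  # arbitrary but deterministic starting item
    # Distance from each point to its nearest chosen point so far.
    nearest = [math.dist(p, points[0]) for p in points]
    while len(chosen) < k:
        # Add the item that is currently covered worst.
        idx = max(range(len(points)), key=nearest.__getitem__)
        chosen.append(idx)
        nearest = [
            min(d, math.dist(p, points[idx])) for d, p in zip(nearest, points)
        ]
    return chosen


# Hypothetical 2-D embeddings of six benchmark items forming three clusters.
embeddings = [
    (0.0, 0.0), (0.1, 0.0),    # cluster A
    (5.0, 5.0), (5.1, 5.0),    # cluster B
    (10.0, 0.0), (10.0, 0.2),  # cluster C
]
subset = k_center_greedy(embeddings, 3)
print(subset)
```

With a budget of three items, the selection keeps one item from each cluster, which is the intuition behind trading a small loss of detail for a much cheaper evaluation run.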
Taxonomy
Topics: Natural Language Processing Techniques · Speech and dialogue systems
