LIME: Less Is More for MLLM Evaluation

King Zhu; Qianbo Zang; Shian Jia; Siwei Wu; Feiteng Fang; Yizhi Li; Shawn Gavin; Tuney Zheng; Jiawei Guo; Bo Li; Haoning Wu; Xingwei Qu; Jian Yang; Zachary Liu; Xiang Yue; J.H. Liu; Chenghua Lin; Min Yang; Shiwen Ni; Wenhao Huang; Ge Zhang

arXiv:2409.06851·cs.CV·October 15, 2024

TL;DR

LIME is a curated, efficient benchmark for evaluating multimodal large language models that filters out uninformative samples, reducing evaluation time and improving the assessment of model capabilities.

Contribution

We introduce LIME, a more effective and efficient benchmark curated through a semi-automated pipeline that filters out uninformative samples and focuses on tasks requiring image-based understanding.

Findings

01

LIME reduces sample size by 76% and evaluation time by 77%.

02

Traditional metrics like CIDEr are inadequate for captioning evaluation.

03

Excluding captioning-task scores from the overall score gives a more accurate picture of model performance.

Abstract

Multimodal Large Language Models (MLLMs) are evaluated on various benchmarks, such as image captioning, visual question answering, and reasoning. However, many of these benchmarks include overly simple or uninformative samples, complicating the effective distinction of different MLLMs' performance. Furthermore, evaluating models across numerous benchmarks incurs a significant computational burden. To address these issues, we propose LIME (Less Is More for MLLM Evaluation), a refined and efficient benchmark curated through a semi-automated pipeline. This pipeline filters out uninformative samples and eliminates answer leakage by focusing on tasks that necessitate image-based understanding. Our experiments indicate that LIME reduces the number of samples by 76% and evaluation time by 77%, while also providing a more effective means of distinguishing the capabilities of different models.…
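The filtering idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Sample` record, the `filter_samples` function, and the `max_easy_rate` threshold are all hypothetical names and values chosen for illustration. The sketch assumes that each sample has already been scored by several judge models with the image, and by a text-only model without it. It drops samples that nearly every judge answers correctly (uninformative) and samples that can be answered without the image (answer leakage). The semi-automated screening stage is not modeled.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """Hypothetical record of judge results for one benchmark sample."""
    sample_id: str
    # Whether each judge model answered correctly given image + question.
    correct_with_image: list[bool]
    # Whether a text-only model answered correctly WITHOUT the image.
    correct_text_only: bool


def filter_samples(samples, max_easy_rate=0.9):
    """Keep samples that are neither too easy nor answerable without the image.

    max_easy_rate is an assumed threshold, not a value from the paper.
    """
    kept = []
    for s in samples:
        pass_rate = sum(s.correct_with_image) / len(s.correct_with_image)
        if pass_rate >= max_easy_rate:
            continue  # almost every judge gets it right: uninformative
        if s.correct_text_only:
            continue  # answerable without the image: answer leakage
        kept.append(s)
    return kept


samples = [
    Sample("a", [True, True, True, True], False),      # too easy
    Sample("b", [True, False, False, True], False),    # kept
    Sample("c", [False, True, False, False], True),    # leaked
    Sample("d", [False, False, False, False], False),  # kept (hard)
]
kept_ids = [s.sample_id for s in filter_samples(samples)]
print(kept_ids)
```

In this toy example only samples "b" and "d" survive. A real pipeline would also need a step to catch samples that no model answers correctly because the question or label is faulty, which this sketch leaves to manual review.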

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 4

Strengths

* Originality:
  * Novel approach to benchmark curation that focuses on quality over quantity.
  * Creative use of MLLMs themselves as judges for data filtering.
  * Innovative three-stage filtering pipeline (model judgment, semi-automated screening, leakage elimination).
* Clarity:
  * Well-structured presentation of the methodology.
  * Clear visualization of data statistics and filtering results.
* Quality:
  * Comprehensive empirical validation across multiple models and benchmarks.

Weaknesses

- The filtering pipeline heavily relies on existing MLLMs as judges, which could potentially introduce biases from these models into the benchmark. While the authors attempt to mitigate this by using multiple models, a more thorough analysis of potential inherited biases would strengthen the work.
- The paper does not fully explore whether the reduced dataset size might affect the statistical significance of evaluation results. While efficiency gains are clear, more discussion of the tradeoffs b
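The sample-size concern raised above can be made concrete with a rough calculation. The sketch below is an illustration only, under strong assumptions that do not come from the paper: samples are treated as independent, accuracy is treated as a simple binomial proportion, and the benchmark size (`full_n`) and accuracy (`acc`) are hypothetical. Only the 76% reduction is taken from the reported findings.

```python
import math


def accuracy_standard_error(accuracy, n_samples):
    """Standard error of an accuracy estimate under a binomial model."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)


full_n = 10_000                         # hypothetical original benchmark size
reduced_n = round(full_n * (1 - 0.76))  # 76% fewer samples, as reported
acc = 0.6                               # hypothetical model accuracy

se_full = accuracy_standard_error(acc, full_n)
se_reduced = accuracy_standard_error(acc, reduced_n)
ratio = se_reduced / se_full            # equals sqrt(full_n / reduced_n)

print(f"samples: {full_n} -> {reduced_n}")
print(f"standard error grows by a factor of about {ratio:.2f}")
```

Under these assumptions, keeping 24% of the samples roughly doubles the standard error of a single accuracy estimate. This captures only one side of the tradeoff: filtering also changes which samples remain, so score gaps between models on the retained samples may be larger than on the original set.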

Reviewer 02 · Rating 5 · Confidence 3

Strengths

The problem is important and interesting to the community. Evaluation is an important part of multimodal LLM development. This work dives deep into existing benchmarks and conducts a comprehensive analysis of the specific questions in those benchmarks. The motivation shown in Figures 1 and 2 is clear and important.

Weaknesses

1. My biggest concern is that the approach only filters samples from existing benchmarks. Do we need to consider adding other metrics or domains to evaluate MLLMs?
2. Another interesting point is that an MLLM may sometimes not "read" the image but answer the question directly from the knowledge of the underlying LLM. Do we need to consider adding this to the benchmark?

Reviewer 03 · Rating 8 · Confidence 4

Strengths

1. This paper uncovers problems in existing benchmarks, and the proposed filtering method is reasonable and meaningful.
2. The filtered benchmark provides a more rigorous evaluation of existing MLLMs and will have practical significance for future MLLM evaluations.
3. The experimental results are comprehensive and insightful.

Weaknesses

1. The paper does not compare against other general MLLM benchmarks such as MMMU or MMBench. I would also like to see whether easy samples or answer-leakage samples exist in these benchmarks.
