LIME: Less Is More for MLLM Evaluation
King Zhu, Qianbo Zang, Shian Jia, Siwei Wu, Feiteng Fang, Yizhi Li, Shawn Gavin, Tuney Zheng, Jiawei Guo, Bo Li, Haoning Wu, Xingwei Qu, Jian Yang, Zachary Liu, Xiang Yue, J.H. Liu, Chenghua Lin, Min Yang, Shiwen Ni, Wenhao Huang, Ge Zhang

TL;DR
LIME is a curated, efficient benchmark for evaluating multimodal large language models that filters out uninformative samples, reducing evaluation time and improving the assessment of model capabilities.
Contribution
We introduce LIME, a semi-automated pipeline that creates a more effective and efficient benchmark by filtering uninformative samples and focusing on image-based understanding tasks.
Findings
LIME reduces sample size by 76% and evaluation time by 77%.
Traditional metrics like CIDEr are inadequate for captioning evaluation.
Excluding caption scores improves overall model performance assessment.
Abstract
Multimodal Large Language Models (MLLMs) are evaluated on various benchmarks, such as image captioning, visual question answering, and reasoning. However, many of these benchmarks include overly simple or uninformative samples, complicating the effective distinction of different MLLMs' performance. Furthermore, evaluating models across numerous benchmarks incurs a significant computational burden. To address these issues, we propose LIME (Less Is More for MLLM Evaluation), a refined and efficient benchmark curated through a semi-automated pipeline. This pipeline filters out uninformative samples and eliminates answer leakage by focusing on tasks that necessitate image-based understanding. Our experiments indicate that LIME reduces the number of samples by 76% and evaluation time by 77%, while also providing a more effective means of distinguishing the capabilities of different models.…
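The filtering idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the `Sample` fields, the `filter_samples` helper, and the `easy_threshold` value are all hypothetical, chosen only to show how "too easy" samples and answer-leakage samples might be dropped. It assumes each sample already carries per-judge correctness results and a flag for whether a text-only model answered it without seeing the image.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    # All fields are illustrative assumptions, not the paper's data format.
    sample_id: str
    judge_correct: list[bool]  # one entry per judge MLLM, answered WITH the image
    text_only_correct: bool    # True if answered correctly WITHOUT the image


def filter_samples(samples: list[Sample], easy_threshold: float = 0.9) -> list[Sample]:
    """Keep samples that are neither too easy nor answerable without the image.

    easy_threshold is a hypothetical cutoff on the fraction of judges that
    answered correctly; the paper's actual criteria may differ.
    """
    kept = []
    for s in samples:
        if not s.judge_correct:
            raise ValueError(f"sample {s.sample_id} has no judge results")
        pass_rate = sum(s.judge_correct) / len(s.judge_correct)
        if pass_rate >= easy_threshold:
            continue  # uninformative: nearly every judge gets it right
        if s.text_only_correct:
            continue  # answer leakage: solvable without image understanding
        kept.append(s)
    return kept


samples = [
    Sample("a", [True, True, True, True], False),      # dropped: too easy
    Sample("b", [True, False, False, True], False),    # kept
    Sample("c", [False, True, False, False], True),    # dropped: leakage
    Sample("d", [False, False, False, False], False),  # kept
]
kept = filter_samples(samples)
kept_ids = [s.sample_id for s in kept]
reduction = 1 - len(kept) / len(samples)
print(kept_ids, f"reduction={reduction:.0%}")
```

In a semi-automated setting, samples that no judge answers correctly (such as `d` here) would plausibly be routed to manual review rather than kept unconditionally; the sketch omits that step.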
Peer Reviews
Decision·Submitted to ICLR 2025
* Originality:
  * Novel approach to benchmark curation that focuses on quality over quantity.
  * Creative use of MLLMs themselves as judges for data filtering.
  * Innovative three-stage filtering pipeline (model judgment, semi-automated screening, leakage elimination).
* Clarity:
  * Well-structured presentation of the methodology.
  * Clear visualization of data statistics and filtering results.
* Quality:
  * Comprehensive empirical validation across multiple models and benchmarks.
- The filtering pipeline heavily relies on existing MLLMs as judges, which could potentially introduce biases from these models into the benchmark. While the authors attempt to mitigate this by using multiple models, a more thorough analysis of potential inherited biases would strengthen the work.
- The paper does not fully explore whether the reduced dataset size might affect the statistical significance of evaluation results. While efficiency gains are clear, more discussion of the tradeoffs b
The problem is important and interesting to the community. Evaluation is an important part of work on multimodal LLMs. This work dives deep into existing benchmarks and conducts a comprehensive analysis of the specific questions in those benchmarks. The motivation presented in Figures 1 and 2 is clear and important.
1. My biggest concern is that the approach only filters samples from existing benchmarks. Should other metrics or domains be added to evaluate MLLMs?
2. Another interesting point is that an MLLM may sometimes not "read" the image and instead answer the question directly from the LLM's knowledge. Should this be accounted for in the benchmark?
1. This paper uncovers problems in existing benchmarks, and the proposed filtering method is reasonable and meaningful.
2. The filtered benchmark provides a more rigorous evaluation of existing MLLMs and will have practical significance for future MLLM evaluations.
3. The experimental results are comprehensive and insightful.
1. The paper does not compare with other general MLLM benchmarks such as MMMU or MMBench. I would also like to see whether easy samples or answer-leakage samples exist in these benchmarks.
