MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks
Jiacheng Chen, Tianhao Liang, Sherman Siu, Zhengqing Wang, Kai Wang, Yubo Wang, Yuansheng Ni, Wang Zhu, Ziyan Jiang, Bohan Lyu, Dongfu Jiang, Xuan He, Yuan Liu, Hexiang Hu, Xiang Yue, Wenhu Chen

TL;DR
MEGA-Bench introduces a comprehensive evaluation suite with over 500 diverse real-world multimodal tasks, enabling detailed assessment of model capabilities across multiple formats and dimensions.
Contribution
It provides the first large-scale, heterogeneous multimodal benchmark with diverse output formats and over 40 metrics for in-depth model evaluation.
Findings
Models show varied strengths across different task types.
MEGA-Bench enables detailed capability profiling of vision-language models.
The benchmark covers a wide range of real-world multimodal applications.
Abstract
We present MEGA-Bench, an evaluation suite that scales multimodal evaluation to over 500 real-world tasks, to address the highly heterogeneous daily use cases of end users. Our objective is to optimize for a set of high-quality data samples that cover a highly diverse and rich set of multimodal tasks, while enabling cost-effective and accurate model evaluation. In particular, we collected 505 realistic tasks encompassing over 8,000 samples from 16 expert annotators to extensively cover the multimodal task space. Instead of unifying these problems into standard multi-choice questions (like MMMU, MMBench, and MMT-Bench), we embrace a wide range of output formats like numbers, phrases, code, \LaTeX, coordinates, JSON, free-form, etc. To accommodate these formats, we developed over 40 metrics to evaluate these tasks. Unlike existing benchmarks, MEGA-Bench offers a fine-grained capability…
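The abstract describes scoring heterogeneous output formats with format-specific metrics rather than a single multiple-choice accuracy. A minimal sketch of that general idea follows. It is not MEGA-Bench's actual evaluation code: the metric functions, the `METRICS` registry, the `format` field, and the tolerance value are all hypothetical names and choices made for illustration.

```python
import json
import math
from collections import defaultdict
from typing import Callable, Dict, List


def exact_match(pred: str, ref: str) -> float:
    """Case- and whitespace-insensitive string match (hypothetical phrase metric)."""
    return 1.0 if pred.strip().lower() == ref.strip().lower() else 0.0


def numeric_close(pred: str, ref: str, rel_tol: float = 1e-2) -> float:
    """Score 1.0 when both strings parse as floats within a relative tolerance."""
    try:
        return 1.0 if math.isclose(float(pred), float(ref), rel_tol=rel_tol) else 0.0
    except ValueError:
        # An unparseable answer counts as wrong instead of crashing the run.
        return 0.0


def json_equal(pred: str, ref: str) -> float:
    """Compare parsed JSON values, so key order and spacing do not matter."""
    try:
        return 1.0 if json.loads(pred) == json.loads(ref) else 0.0
    except json.JSONDecodeError:
        return 0.0


# Hypothetical registry: one metric per declared output format.
METRICS: Dict[str, Callable[[str, str], float]] = {
    "phrase": exact_match,
    "number": numeric_close,
    "json": json_equal,
}


def score_sample(sample: Dict[str, str]) -> float:
    """Dispatch to the metric registered for the sample's output format."""
    metric = METRICS[sample["format"]]
    return metric(sample["prediction"], sample["reference"])


def mean_score_by_format(samples: List[Dict[str, str]]) -> Dict[str, float]:
    """Average scores per output format, giving a small capability breakdown."""
    buckets: Dict[str, List[float]] = defaultdict(list)
    for sample in samples:
        buckets[sample["format"]].append(score_sample(sample))
    return {fmt: sum(vals) / len(vals) for fmt, vals in buckets.items()}
```

Grouping by a per-sample key, as `mean_score_by_format` does, is one simple way to get a breakdown by dimension instead of a single aggregate score. The same grouping could use any other task attribute in place of the output format.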
Peer Reviews
Decision: ICLR 2025 Poster
# Overall assessment
This work presents an interesting contribution in a much-needed space (benchmarks for multimodal large models). To address the current scattershot approach to multimodal model benchmarking, the authors attempt to create a single, highly diverse, comprehensive benchmark for a variety of image-language tasks (including video). To construct the benchmark the authors develop and refine a task taxonomy, but some details around the taxonomy and its construction are unclear. I hav
see above
1. MEGA-Bench has a large scale and coverage, containing over 500 diverse real-world tasks, which allows for an in-depth assessment of multimodal models across various applications and skills.
2. It offers a sophisticated, fine-grained analysis capability by categorizing tasks along multiple dimensions, providing a nuanced understanding of model performance in specific areas and revealing strengths and weaknesses that aggregate scores might obscure.
3. The benchmark's design emphasizes cost-ef
1. While MEGA-Bench offers a vast array of tasks, its large scale may increase the computational cost and complexity of evaluation, potentially limiting its accessibility for further research and extensive exploration.
2. MEGA-Bench's focus on breadth may result in some tasks being too specific or niche, which could limit how well the benchmark results generalize to a broader range of multimodal problems and applications.
S1: The proposed open-source benchmark includes a large number of diverse tasks for LLMs that can potentially address the limitations of existing benchmarks. It provides a valuable resource for the community.
S2: The paper also provides extensive experiments and analysis of popular LLMs using MEGA-Bench, which yield many interesting findings.
S3: The paper is well-written and easy to read.
### Major weaknesses
W1: The rationale behind the task taxonomy tree is not well explained. Section 3.1 can be strengthened by discussing the design considerations for the draft taxonomy tree. For example, why do we want perception, planning, and reasoning? Are these the limitations of existing benchmarks? How do we know this taxonomy is comprehensive and reflects the real usage of LLMs?
W2: The introduction highlights MEGA-Bench's contributions in multimodal tasks. However, there is limited infor
Taxonomy
Topics: Speech and dialogue systems
Methods: Sparse Evolutionary Training
