MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks

Jiacheng Chen; Tianhao Liang; Sherman Siu; Zhengqing Wang; Kai Wang; Yubo Wang; Yuansheng Ni; Wang Zhu; Ziyan Jiang; Bohan Lyu; Dongfu Jiang; Xuan He; Yuan Liu; Hexiang Hu; Xiang Yue; Wenhu Chen

arXiv:2410.10563·cs.CV·July 15, 2025

TL;DR

MEGA-Bench introduces a comprehensive evaluation suite with over 500 diverse real-world multimodal tasks, enabling detailed assessment of model capabilities across multiple formats and dimensions.

Contribution

It provides the first large-scale, heterogeneous multimodal benchmark with diverse output formats and over 40 metrics for in-depth model evaluation.

Findings

01

Models show varied strengths across different task types.

02

MEGA-Bench enables detailed capability profiling of vision-language models.

03

Benchmark covers a wide range of real-world multimodal applications.

Abstract

We present MEGA-Bench, an evaluation suite that scales multimodal evaluation to over 500 real-world tasks, to address the highly heterogeneous daily use cases of end users. Our objective is to optimize for a set of high-quality data samples that cover a highly diverse and rich set of multimodal tasks, while enabling cost-effective and accurate model evaluation. In particular, we collected 505 realistic tasks encompassing over 8,000 samples from 16 expert annotators to extensively cover the multimodal task space. Instead of unifying these problems into standard multi-choice questions (like MMMU, MMBench, and MMT-Bench), we embrace a wide range of output formats like numbers, phrases, code, LaTeX, coordinates, JSON, free-form, etc. To accommodate these formats, we developed over 40 metrics to evaluate these tasks. Unlike existing benchmarks, MEGA-Bench offers a fine-grained capability…
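To make the idea of format-specific scoring concrete, here is a minimal Python sketch of how answers in different output formats might each be routed to a different metric. It is an illustration only and not MEGA-Bench's actual implementation: the function names, the three format labels (`phrase`, `number`, `json`), and the sample dictionary keys are all assumptions introduced for this example.

```python
import json
import math


def score_exact(pred: str, ref: str) -> float:
    """Case-insensitive exact match for short phrases (hypothetical metric)."""
    return float(pred.strip().lower() == ref.strip().lower())


def score_number(pred: str, ref: str, rel_tol: float = 1e-6) -> float:
    """Numeric match within a relative tolerance; unparsable input scores 0."""
    try:
        return float(math.isclose(float(pred), float(ref), rel_tol=rel_tol))
    except ValueError:
        return 0.0


def score_json(pred: str, ref: str) -> float:
    """Structural equality of parsed JSON; key order does not matter."""
    try:
        return float(json.loads(pred) == json.loads(ref))
    except json.JSONDecodeError:
        return 0.0


# Hypothetical mapping from an output-format label to its metric.
METRICS = {"phrase": score_exact, "number": score_number, "json": score_json}


def evaluate(samples):
    """Return the mean score per output format for a list of sample dicts."""
    per_format = {}
    for s in samples:
        score = METRICS[s["format"]](s["prediction"], s["reference"])
        per_format.setdefault(s["format"], []).append(score)
    return {fmt: sum(v) / len(v) for fmt, v in per_format.items()}
```

Reporting one mean per format, rather than a single aggregate, mirrors in miniature the per-dimension breakdown that the abstract contrasts with a uniform multiple-choice score.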

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01·Rating 6·Confidence 4

Strengths

Overall assessment: This work presents an interesting contribution in a much-needed space (benchmarks for multimodal large models). To address the current scattershot approach to multimodal model benchmarking, the authors attempt to create a single, highly diverse, comprehensive benchmark for a variety of image-language tasks (including video). To construct the benchmark, the authors develop and refine a task taxonomy, but some details around the taxonomy and its construction are unclear. I hav

Weaknesses

see above

Reviewer 02·Rating 8·Confidence 3

Strengths

1. MEGA-BENCH has a large scale and coverage, containing over 500 diverse real-world tasks, which allows for an in-depth assessment of multimodal models across various applications and skills.
2. It offers a sophisticated, fine-grained analysis capability by categorizing tasks along multiple dimensions, providing a nuanced understanding of model performance in specific areas and revealing strengths and weaknesses that aggregate scores might obscure.
3. The benchmark's design emphasizes cost-ef

Weaknesses

1. While MEGA-BENCH offers a vast array of tasks, its large scale may lead to increased computational costs and complexity in evaluation, potentially limiting its accessibility for further research and extensive exploration.
2. MEGA-BENCH's focus on breadth may result in some tasks being too specific or niche, which could limit the generalizability of the benchmark results to a broader range of multimodal problems and applications.

Reviewer 03·Rating 8·Confidence 3

Strengths

S1: The proposed open-source benchmark includes a large number of diverse tasks for LLMs that can potentially address the limitations of existing benchmarks. It provides a valuable resource for the community.
S2: The paper also provides extensive experiments and analysis of popular LLMs using MEGA-Bench, yielding many interesting findings.
S3: The paper is well-written and easy to read.

Weaknesses

Major weaknesses
W1: The rationale behind the task taxonomy tree is not well-explained. Section 3.1 can be strengthened by discussing the design considerations for the draft taxonomy tree. For example, why do we want perception, planning, reasoning? Are these the limitations of existing benchmarks? How do we know this taxonomy is comprehensive and reflects the real usage of LLMs?
W2: The introduction highlights Mega Bench's contributions in multimodal tasks. However, there is limited infor

Code & Models

Repositories

TIGER-AI-Lab/MEGA-Bench
none·Official

Datasets

TIGER-Lab/MEGA-Bench
dataset·222 dl

Taxonomy

Topics·Speech and dialogue systems

Methods·Sparse Evolutionary Training