MMMG: a Comprehensive and Reliable Evaluation Suite for Multitask Multimodal Generation

Jihan Yao; Yushi Hu; Yujie Yi; Bin Han; Shangbin Feng; Guang Yang; Bingbing Wen; Ranjay Krishna; Lucy Lu Wang; Yulia Tsvetkov; Noah A. Smith; Banghua Zhu

arXiv:2505.17613 · cs.AI · May 26, 2025

TL;DR

MMMG is a comprehensive benchmark for evaluating multimodal generation models, aligning closely with human judgment and covering diverse tasks and modalities to identify strengths and gaps in current models.

Contribution

The paper introduces MMMG, a new benchmark with 49 tasks across four modalities, designed for reliable automatic and human-aligned evaluation of multimodal generation models.

Findings

01 MMMG achieves 94.3% agreement with human evaluation.

02 GPT Image reaches 78.3% accuracy in image generation.

03 Significant challenges remain in multimodal reasoning and audio generation.
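
As an illustration of what an agreement figure like the one in finding 01 measures, here is a minimal sketch that compares automatic pass/fail verdicts with human verdicts on the same items. It assumes a simple per-item match rate; the paper's exact agreement protocol is not given on this page, and the function name `agreement_rate` and the toy verdict lists are hypothetical.

```python
from typing import Sequence


def agreement_rate(auto: Sequence[bool], human: Sequence[bool]) -> float:
    """Return the fraction of items where the automatic and human verdicts match.

    This is an assumed, simplified definition of agreement, not necessarily
    the one used by the MMMG authors.
    """
    if len(auto) != len(human):
        raise ValueError("verdict lists must have the same length")
    if not auto:
        raise ValueError("need at least one item")
    matches = sum(a == h for a, h in zip(auto, human))
    return matches / len(auto)


# Hypothetical toy verdicts (not data from the paper).
auto_verdicts = [True, True, False, True, False]
human_verdicts = [True, False, False, True, False]

rate = agreement_rate(auto_verdicts, human_verdicts)
print(f"{rate:.1%}")  # 4 of 5 verdicts match
```

A per-item match rate is the simplest choice; chance-corrected statistics such as Cohen's kappa are a common alternative when one verdict class dominates.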

Abstract

Automatically evaluating multimodal generation presents a significant challenge, as automated metrics often struggle to align reliably with human evaluation, especially for complex tasks that involve multiple modalities. To address this, we present MMMG, a comprehensive and human-aligned benchmark for multimodal generation across 4 modality combinations (image, audio, interleaved text and image, interleaved text and audio), with a focus on tasks that present significant challenges for generation models, while still enabling reliable automatic evaluation through a combination of models and programs. MMMG encompasses 49 tasks (including 29 newly developed ones), each with a carefully designed evaluation pipeline, and 937 instructions to systematically assess reasoning, controllability, and other key capabilities of multimodal generation models. Extensive validation demonstrates that MMMG…

Code & Models

Datasets

UW-FMRL2/MMMG
dataset · 38 downloads

Taxonomy

Methods: Dropout · Attention Dropout · Cosine Annealing · Linear Warmup With Cosine Annealing · Discriminative Fine-Tuning · ALIGN · Byte Pair Encoding · Layer Normalization · Dense Connections