MMIG-Bench: Towards Comprehensive and Explainable Evaluation of Multi-Modal Image Generation Models

Hang Hua; Ziyun Zeng; Yizhi Song; Yunlong Tang; Liu He; Daniel Aliaga; Wei Xiong; Jiebo Luo

arXiv:2505.19415 · cs.CV · May 29, 2025

TL;DR

MMIG-Bench is a comprehensive and explainable benchmark for evaluating multi-modal image generation models, integrating diverse tasks and introducing new metrics to better assess quality, alignment, and human preferences.

Contribution

It introduces MMIG-Bench, a unified multi-modal image generation benchmark with novel evaluation metrics, including an Aspect Matching Score, and provides extensive benchmarking of 17 models.

Findings

01. The Aspect Matching Score correlates strongly with human judgments.

02. MMIG-Bench reveals strengths and weaknesses of current models.

03. Benchmarking results offer insights into architecture and data impacts.

Abstract

Recent multimodal image generators such as GPT-4o, Gemini 2.0 Flash, and Gemini 2.5 Pro excel at following complex instructions, editing images, and maintaining concept consistency. However, they are still evaluated by disjoint toolkits: text-to-image (T2I) benchmarks that lack multi-modal conditioning, and customized image generation benchmarks that overlook compositional semantics and common knowledge. We propose MMIG-Bench, a comprehensive Multi-Modal Image Generation Benchmark that unifies these tasks by pairing 4,850 richly annotated text prompts with 1,750 multi-view reference images across 380 subjects, spanning humans, animals, objects, and artistic styles. MMIG-Bench is equipped with a three-level evaluation framework: (1) low-level metrics for visual artifacts and identity preservation of objects; (2) a novel Aspect Matching Score (AMS): a VQA-based mid-level metric that…

Code & Models

Datasets

hhua2/MMIG-Bench
dataset · 14 dl

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Aesthetic Perception and Analysis