ImagenWorld: Stress-Testing Image Generation Models with Explainable Human Evaluation on Open-ended Real-World Tasks

Samin Mahdizadeh Sani; Max Ku; Nima Jamali; Matina Mahdizadeh Sani; Paria Khoshtab; Wei-Chieh Sun; Parnian Fazel; Zhi Rui Tam; Thomas Chong; Edisy Kin Wai Chan; Donald Wai Tong Tsang; Chiao-Wei Hsu; Ting Wai Lam; Ho Yin Sam Ng; Chiafeng Chu; Chak-Wing Mak; Keming Wu; Hiu Tung Wong; Yik Chun Ho; Chi Ruan; Zhuofeng Li; I-Sheng Fang; Shih-Ying Yeh; Ho Kei Cheng; Ping Nie; Wenhu Chen

arXiv:2603.27862·cs.GR·March 31, 2026

TL;DR

ImagenWorld is a comprehensive benchmark with human annotations and explainable evaluation for assessing and diagnosing the performance of image generation models across diverse tasks and domains.

Contribution

Introduces ImagenWorld, a large-scale, multi-task benchmark with explainable error tagging and human annotations to evaluate image generation models comprehensively.

Findings

01

Models struggle more with editing than generation, especially local edits.

02

Models perform better in artistic and photorealistic domains than in symbolic domains.

03

Closed-source models outperform open-source ones overall.

Abstract

Advances in diffusion, autoregressive, and hybrid models have enabled high-quality image synthesis for tasks such as text-to-image generation, editing, and reference-guided composition. Yet existing benchmarks remain limited: they either focus on isolated tasks, cover only narrow domains, or provide opaque scores without explaining failure modes. We introduce ImagenWorld, a benchmark of 3.6K condition sets spanning six core tasks (generation and editing, with single or multiple references) and six topical domains (artworks, photorealistic images, information graphics, textual graphics, computer graphics, and screenshots). The benchmark is supported by 20K fine-grained human annotations and an explainable evaluation schema that tags localized object-level and segment-level errors, complementing automated VLM-based metrics. Our large-scale evaluation of 14 models yields several insights: (1)…

Code & Models

Repositories

tiger-ai-lab/ImagenWorld
github

Videos

ImagenWorld: Stress-Testing Image Generation Models with Explainable Human Evaluation on Open-ended Real-World Tasks · slideslive