CREval: An Automated Interpretable Evaluation for Creative Image Manipulation under Complex Instructions

Chonghuinan Wang; Zihan Chen; Yuxiang Wei; Tianyi Jiang; Xiaohe Wu; Fan Li; Wangmeng Zuo; Hongxun Yao

arXiv:2603.26174 · cs.CV · March 30, 2026

TL;DR

CREval is an automated, interpretable evaluation framework and benchmark for creative image manipulation under complex instructions. It targets the lack of a systematic, human-aligned way to assess models on such tasks.

Contribution

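The paper presents CREval, a fully automated QA-based evaluation pipeline, and CREval-Bench, a benchmark for creative image editing under complex instructions that covers three categories and nine creative dimensions with over 800 editing samples and 13K evaluation queries. Both are used to evaluate a range of state-of-the-art open-source and closed-source models.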

Findings

01. Closed-source models outperform open-source ones on complex tasks.
02. All models still face significant challenges in effective creative editing.
03. CREval's metrics strongly align with human judgments.

Abstract

Instruction-based multimodal image manipulation has recently made rapid progress. However, existing evaluation methods lack a systematic and human-aligned framework for assessing model performance on complex and creative editing tasks. To address this gap, we propose CREval, a fully automated question-answer (QA)-based evaluation pipeline that overcomes the incompleteness and poor interpretability of opaque Multimodal Large Language Model (MLLM) scoring. Simultaneously, we introduce CREval-Bench, a comprehensive benchmark specifically designed for creative image manipulation under complex instructions. CREval-Bench covers three categories and nine creative dimensions, comprising over 800 editing samples and 13K evaluation queries. Leveraging this pipeline and benchmark, we systematically evaluate a diverse set of state-of-the-art open-source and closed-source models. The results reveal that…
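
To make the idea of QA-based scoring concrete, here is a minimal sketch of how per-question answers could be aggregated into an interpretable score for one edited image. It is not the CREval implementation: the names `QAItem`, `qa_score`, and `failed_questions` are hypothetical, and matching answers by normalized string equality is an assumption made for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QAItem:
    """One evaluation query for an edited image (hypothetical structure)."""
    question: str
    expected: str   # answer that indicates the edit satisfied the instruction
    predicted: str  # answer returned by some answering model (assumed given)


def _norm(answer: str) -> str:
    # Compare answers case-insensitively and ignore surrounding whitespace.
    return answer.strip().lower()


def qa_score(items: List[QAItem]) -> float:
    """Fraction of queries whose predicted answer matches the expected one."""
    if not items:
        raise ValueError("no evaluation queries provided")
    hits = sum(_norm(i.predicted) == _norm(i.expected) for i in items)
    return hits / len(items)


def failed_questions(items: List[QAItem]) -> List[str]:
    """Questions that were answered incorrectly, in their original order."""
    return [i.question for i in items if _norm(i.predicted) != _norm(i.expected)]
```

Under these assumptions, the list returned by `failed_questions` is what makes the score inspectable: a reader can see which specific queries an edit failed rather than receiving a single opaque number.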

Code & Models

Repositories

chonghuinanwang/CREval
github

Datasets

ChonghuinanWang/CREval
dataset · 35 dl
