CREval: An Automated Interpretable Evaluation for Creative Image Manipulation under Complex Instructions
Chonghuinan Wang, Zihan Chen, Yuxiang Wei, Tianyi Jiang, Xiaohe Wu, Fan Li, Wangmeng Zuo, Hongxun Yao

TL;DR
CREval introduces an automated, interpretable evaluation framework and benchmark for assessing creative image manipulation models under complex instructions, addressing gaps in existing evaluation methods.
Contribution
The paper presents CREval, a QA-based evaluation pipeline, and CREval-Bench, a benchmark for creative image editing under complex instructions, together with evaluation results for a range of open-source and closed-source models.
Findings
Closed-source models outperform open-source ones on complex tasks.
All models still face significant challenges in effective creative editing.
CREval's metrics strongly align with human judgments.
Abstract
Instruction-based multimodal image manipulation has recently made rapid progress. However, existing evaluation methods lack a systematic and human-aligned framework for assessing model performance on complex and creative editing tasks. To address this gap, we propose CREval, a fully automated question-answer (QA)-based evaluation pipeline that overcomes the incompleteness and poor interpretability of opaque Multimodal Large Language Model (MLLM) scoring. Alongside it, we introduce CREval-Bench, a comprehensive benchmark specifically designed for creative image manipulation under complex instructions. CREval-Bench covers three categories and nine creative dimensions, comprising over 800 editing samples and 13K evaluation queries. Using this pipeline and benchmark, we systematically evaluate a diverse set of state-of-the-art open-source and closed-source models. The results reveal that…
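To make the idea of QA-based scoring concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the `Query` schema, the `qa_scores` function, the dimension labels, and the rule "score = fraction of questions answered yes" are all hypothetical, since the abstract does not specify how answers are aggregated. The `answer` callable stands in for whatever model answers each question about the edited image.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Query:
    """One yes/no evaluation question about an edited image (hypothetical schema)."""
    dimension: str  # placeholder label; the paper's actual dimension names are not given here
    question: str


def qa_scores(queries: List[Query], answer: Callable[[str], bool]) -> Dict[str, float]:
    """Return, per dimension, the fraction of questions answered 'yes'.

    Assumption: simple unweighted averaging; the real pipeline may aggregate differently.
    """
    passed: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for q in queries:
        total[q.dimension] += 1
        if answer(q.question):
            passed[q.dimension] += 1
    return {dim: passed[dim] / total[dim] for dim in total}


# Toy usage: a lookup table stands in for a model that answers each question.
queries = [
    Query("instruction_following", "Is the cat now wearing a hat?"),
    Query("instruction_following", "Is the hat red?"),
    Query("visual_consistency", "Is the background unchanged?"),
]
stub_answers = {
    "Is the cat now wearing a hat?": True,
    "Is the hat red?": False,
    "Is the background unchanged?": True,
}
scores = qa_scores(queries, lambda question: stub_answers[question])
print(scores)
```

Because each score traces back to individual questions and their answers, a low number can be explained by listing the questions that failed, which is the kind of interpretability a single opaque rating does not offer.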
