AtelierEval: Agentic Evaluation of Humans & LLMs as Text-to-Image Prompters
Hanjun Luo, Zhimu Huang, Sylvia Chung, Yiran Wang, Yingbin Jin, Jialin Li, Jiang Li, Xinfeng Li, Hanan Salam

TL;DR
AtelierEval introduces a benchmark and evaluation framework for assessing how well humans and multimodal LLMs write prompts for text-to-image systems, an upstream component that existing benchmarks leave unmeasured.
Contribution
The paper presents the first unified benchmark (AtelierEval) and a skill-based, memory-augmented agentic evaluator (AtelierJudge) for measuring upstream prompting proficiency in T2I systems, with extensive experiments comparing humans and MLLMs.
Findings
AtelierJudge achieves a Spearman correlation of 0.79 with human experts.
Benchmarking shows MLLMs outperform humans in mimicry over planning.
The framework reveals the importance of image-augmented prompting directions.
Abstract
Text-to-image (T2I) systems increasingly rely on upstream prompters, either humans or multimodal large language models (MLLMs), to translate user intent into detailed prompts. Yet current benchmarks fix the prompt and only evaluate T2I models, leaving the prompting proficiency of this upstream component entirely unmeasured. We introduce AtelierEval, the first unified benchmark that quantifies prompting proficiency across 360 expert-crafted tasks. Grounded in a cognitive view, it spans three task categories and instantiates tasks using a taxonomy of real-world challenges, with a dual interface for both humans and MLLMs. To enable scalable and reliable evaluation, we propose AtelierJudge, a skill-based, memory-augmented agentic evaluator. It produces subjective and objective scores for prompt-image pairs, achieving a Spearman correlation of 0.79 with human experts, approaching human…
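The agreement figure quoted above is a Spearman rank correlation between evaluator scores and human expert scores. As a rough illustration of what that statistic measures, here is a minimal sketch in plain Python. The score lists are hypothetical and are not data from the paper, and the function names (`average_ranks`, `pearson`, `spearman`) are introduced here for illustration only; this is not the authors' evaluation code.

```python
from math import sqrt
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores for six prompt-image pairs (NOT from the paper).
judge_scores = [7.5, 6.0, 8.2, 4.1, 9.0, 5.5]
expert_scores = [7.0, 6.5, 8.0, 5.0, 8.5, 4.0]

rho = spearman(judge_scores, expert_scores)
print(f"Spearman rho = {rho:.2f}")
```

Because the statistic depends only on rank order, an automatic judge can score on a different scale than the experts and still correlate highly, as long as it orders the prompt-image pairs similarly.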
