Skill-Aligned Annotation for Reliable Evaluation in Text-to-Image Generation
Abdelrahman Eldesokey, Merey Ramazanova, Ahmad Sait, Ansar Khangeldin, Karen Sanchez, Tong Zhang, Bernard Ghanem

TL;DR
This paper proposes skill-aligned annotation strategies for evaluating text-to-image models and reports higher inter-annotator agreement and better stability across models than uniform annotation methods.
Contribution
The paper introduces a skill-aligned annotation framework for T2I evaluation, shows that it yields more consistent evaluation signals than uniform annotation approaches, and presents an automated pipeline that makes the protocol scalable.
Findings
Skill-aligned annotation yields higher inter-annotator agreement.
Skill-aligned annotation improves evaluation stability across different models.
The automated pipeline enables scalable, fine-grained assessment.
Abstract
Text-to-image (T2I) generation has advanced rapidly, making reliable evaluation critical as performance differences between models narrow. Existing evaluation practices typically apply uniform annotation mechanisms, such as Likert-scale or binary question answering (BQA), across heterogeneous evaluation skills, despite fundamental differences in their nature. In this work, we revisit T2I evaluation through the lens of skill-aligned annotation, where annotation strategies reflect the underlying characteristics of each evaluation skill. We systematically compare skill-aligned annotation against uniform baselines and show that it produces more consistent evaluation signals, with higher inter-annotator agreement and improved stability across models. Finally, we present an automated pipeline that instantiates the proposed evaluation protocol, enabling scalable and fine-grained evaluation…
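The abstract reports higher inter-annotator agreement but this summary does not say which agreement statistic the authors use. As an illustration only, the sketch below computes Cohen's kappa, one common chance-corrected agreement measure for two annotators. The function name `cohens_kappa` and the example labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items where both labels match.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical binary answers (1 = yes, 0 = no) from two annotators
# on the same eight images; assumed data, not results from the paper.
annotator_1 = [1, 1, 0, 1, 0, 0, 1, 1]
annotator_2 = [1, 0, 0, 1, 0, 1, 1, 1]
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

A value near 1 indicates agreement well above chance, and a value near 0 indicates agreement no better than chance. Comparing such a statistic between two annotation schemes on the same images is one way to check a claim like "higher inter-annotator agreement", assuming the labels are categorical.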
