Re-Thinking the Automatic Evaluation of Image-Text Alignment in Text-to-Image Models
Huixuan Zhang, Xiaojun Wan

TL;DR
This paper critically examines current evaluation methods for image-text alignment in text-to-image models, revealing their shortcomings and proposing improvements to establish more trustworthy and comprehensive assessment frameworks.
Contribution
The paper identifies two key aspects that a reliable evaluation should address, demonstrates empirically that current mainstream evaluation frameworks do not fully satisfy them, and offers recommendations for improving evaluation practice.
Findings
Current evaluation metrics do not fully satisfy key properties of trustworthiness.
Existing frameworks show inconsistent agreement with human judgments.
Proposed recommendations aim to improve evaluation reliability.
Abstract
Text-to-image models often struggle to generate images that precisely match textual prompts. Prior research has extensively studied the evaluation of image-text alignment in text-to-image generation. However, existing evaluations primarily focus on agreement with human assessments, neglecting other critical properties of a trustworthy evaluation framework. In this work, we first identify two key aspects that a reliable evaluation should address. We then empirically demonstrate that current mainstream evaluation frameworks fail to fully satisfy these properties across a diverse range of metrics and models. Finally, we propose recommendations for improving image-text alignment evaluation.
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Digital Humanities and Scholarship
Methods: Focus
