Re-Thinking the Automatic Evaluation of Image-Text Alignment in Text-to-Image Models

Huixuan Zhang; Xiaojun Wan

arXiv:2506.08480·cs.CL·June 11, 2025

TL;DR

This paper examines the automatic evaluation of image-text alignment in text-to-image models. It identifies two properties that a reliable evaluation should satisfy beyond agreement with human judgments, shows that current mainstream evaluation frameworks do not fully satisfy them, and recommends improvements.

Contribution

The paper identifies two key properties of a reliable evaluation, demonstrates empirically that existing evaluation frameworks fail to fully satisfy them across a diverse range of metrics and models, and offers recommendations to improve evaluation practice.

Findings

01. Current evaluation metrics do not fully satisfy key properties of trustworthiness.

02. Existing frameworks show inconsistent agreement with human judgments.

03. Proposed recommendations aim to improve evaluation reliability.

Abstract

Text-to-image models often struggle to generate images that precisely match textual prompts. Prior research has extensively studied the evaluation of image-text alignment in text-to-image generation. However, existing evaluations primarily focus on agreement with human assessments, neglecting other critical properties of a trustworthy evaluation framework. In this work, we first identify two key aspects that a reliable evaluation should address. We then empirically demonstrate that current mainstream evaluation frameworks fail to fully satisfy these properties across a diverse range of metrics and models. Finally, we propose recommendations for improving image-text alignment evaluation.

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Digital Humanities and Scholarship
