What makes a good metric? Evaluating automatic metrics for text-to-image consistency

Candace Ross; Melissa Hall; Adriana Romero Soriano; Adina Williams

arXiv:2412.13989·cs.CL·December 19, 2024

Open Access

TL;DR

This paper critically evaluates four recent text-to-image consistency metrics (CLIPScore, TIFA, VPEval, and DSG) and finds that none satisfies all of its construct-validity desiderata: the metrics are insufficiently sensitive to language and visual properties, and the VQA-based ones rely on biases and shortcuts such as yes-bias.

Contribution

The study analyzes the construct validity of four commonly used metrics, defined as a set of desiderata that a text-image consistency metric should satisfy. It identifies key weaknesses in each and argues for more robust evaluation methods for text-to-image consistency.

Findings

01. No metric satisfies all validity criteria.

02. Metrics lack sensitivity to language and visual details.

03. VQA-based metrics rely on biases like yes-bias.
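
To make the third finding concrete, here is a minimal sketch of one way a yes-bias could be probed: compare how often a VQA model answers "yes" on questions whose correct answer is "yes" against how often it does so on questions whose correct answer is "no". The function name `yes_rate` and the answer lists are hypothetical and invented for illustration; this is not the paper's procedure.

```python
def yes_rate(answers):
    """Return the fraction of answers equal to 'yes' (case-insensitive)."""
    if not answers:
        raise ValueError("answers must be non-empty")
    return sum(a.strip().lower() == "yes" for a in answers) / len(answers)


# Hypothetical model outputs, made up for this example.
answers_when_truth_is_yes = ["yes", "yes", "Yes", "no"]
answers_when_truth_is_no = ["yes", "yes", "no", "yes"]

rate_on_yes_questions = yes_rate(answers_when_truth_is_yes)
rate_on_no_questions = yes_rate(answers_when_truth_is_no)

# A model that reads the image should answer "yes" far less often when the
# correct answer is "no". Similar rates in both groups suggest a yes-bias.
print(rate_on_yes_questions, rate_on_no_questions)
```

In this toy data both rates are equal, which is the pattern a yes-biased model would produce: its answers do not depend on what the image actually shows.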

Abstract

Language models are increasingly being incorporated as components in larger AI systems for various purposes, from prompt optimization to automatic evaluation. In this work, we analyze the construct validity of four recent, commonly used methods for measuring text-to-image consistency - CLIPScore, TIFA, VPEval, and DSG - which rely on language models and/or VQA models as components. We define construct validity for text-image consistency metrics as a set of desiderata that text-image consistency metrics should have, and find that no tested metric satisfies all of them. We find that metrics lack sufficient sensitivity to language and visual properties. Next, we find that TIFA, VPEval and DSG contribute novel information above and beyond CLIPScore, but also that they correlate highly with each other. We also ablate different aspects of the text-image consistency metrics and find that not…
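
The abstract's claim that several metrics correlate highly with each other can be illustrated with a rank correlation between two metrics' scores over the same set of text-image pairs. The sketch below is a plain-Python Spearman correlation under stated assumptions: the helper names and the score lists are hypothetical, the numbers are made up, and the paper may use a different statistic.

```python
from statistics import mean


def average_ranks(values):
    """Assign 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-image scores from two metrics (values are invented).
scores_metric_a = [0.2, 0.5, 0.9, 0.4]
scores_metric_b = [0.1, 0.6, 0.8, 0.3]

print(spearman(scores_metric_a, scores_metric_b))
```

A value near 1 means the two metrics order the images almost identically, so the second adds little ranking information beyond the first; a value near 0 means their rankings are largely unrelated.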

Taxonomy

Topics: Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques

Methods: Sparse Evolutionary Training