CROC: Evaluating and Training T2I Metrics with Pseudo- and Human-Labeled Contrastive Robustness Checks

Christoph Leiter; Yuki M. Asano; Margret Keuper; Steffen Eger

arXiv:2505.11314·cs.CV·April 21, 2026

TL;DR

CROC introduces a scalable framework for evaluating and training text-to-image metrics using synthetic and human-labeled contrastive checks, revealing robustness issues in current metrics.

Contribution

The paper presents CROC, a novel framework for automated robustness evaluation and training of T2I metrics, including a large pseudo-labeled dataset and a new state-of-the-art metric.

Findings

01. Many metrics fail on prompts involving negation.

02. All tested open-source metrics fail on at least 24% of cases involving body parts.

03. CROCScore achieves state-of-the-art performance among open-source metrics.

Abstract

The assessment of evaluation metrics (meta-evaluation) is crucial for determining the suitability of existing metrics in text-to-image (T2I) generation tasks. Human-based meta-evaluation is costly and time-intensive, and automated alternatives are scarce. We address this gap and propose CROC: a scalable framework for automated Contrastive Robustness Checks that systematically probes and quantifies metric robustness by synthesizing contrastive test cases across a comprehensive taxonomy of image properties. With CROC, we generate a pseudo-labeled dataset (CROC$^{syn}$) of over 1 million contrastive prompt-image pairs to enable a fine-grained comparison of evaluation metrics. We also use this dataset to train CROCScore, a new metric that achieves state-of-the-art performance among open-source methods, demonstrating an additional key application of our framework. To complement this dataset,…
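As a rough illustration of what a contrastive robustness check measures, the sketch below computes a failure rate for a metric over a handful of test cases. It is not the authors' implementation: the `ContrastiveCase` type, the `failure_rate` function, the pass criterion (the matching prompt must score strictly higher than the contrast prompt on the same image), and the toy word-overlap metric are all assumptions made for illustration. Images are stood in for by plain-text descriptions so the example runs without any model.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

# Hypothetical type: one contrastive test case (not taken from the paper's code).
@dataclass(frozen=True)
class ContrastiveCase:
    prompt: str           # prompt that matches the image
    contrast_prompt: str  # prompt altered in a single property
    image: str            # stand-in for an image: a plain-text description
    category: str         # property being probed, e.g. "negation"

# A metric maps (prompt, image) to an alignment score; higher means better aligned.
Metric = Callable[[str, str], float]

def failure_rate(metric: Metric, cases: Iterable[ContrastiveCase]) -> float:
    """Fraction of cases where the matching prompt does NOT score strictly
    higher than the contrast prompt (assumed pass criterion)."""
    cases = list(cases)
    if not cases:
        raise ValueError("need at least one contrastive case")
    failures = sum(
        1 for c in cases
        if metric(c.prompt, c.image) <= metric(c.contrast_prompt, c.image)
    )
    return failures / len(cases)

# Toy metric (assumption): Jaccard word overlap that ignores negation words,
# mimicking a metric that is blind to negation.
NEGATIONS = {"no", "not", "without"}

def toy_metric(prompt: str, image: str) -> float:
    p = set(prompt.lower().split()) - NEGATIONS
    i = set(image.lower().split()) - NEGATIONS
    return len(p & i) / max(len(p | i), 1)

cases = [
    ContrastiveCase("a red car", "a blue car", "a red car", "color"),
    ContrastiveCase("a street with no cars", "a street with cars",
                    "a street with no cars", "negation"),
]

rate = failure_rate(toy_metric, cases)
print(f"failure rate: {rate:.2f}")
```

The toy metric separates the color case but ties on the negation case, so it fails one of the two checks. Grouping cases by `category` before calling `failure_rate` would give the kind of per-property breakdown that the findings above report.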

Code & Models

Models

🤗
nllg/crocscore
model · 3 dl
