TL;DR
CROC introduces a scalable framework for evaluating and training text-to-image metrics using synthetic and human-labeled contrastive checks, revealing robustness issues in current metrics.
Contribution
The paper presents CROC, a framework for automated robustness evaluation and training of T2I metrics. It comes with a pseudo-labeled dataset of over 1 million contrastive prompt-image pairs and CROCScore, a new metric that is state of the art among open-source methods.
Findings
Many metrics fail on prompts involving negation.
All tested open-source metrics fail on at least 24% of cases involving body parts.
CROCScore achieves state-of-the-art performance among open-source metrics.
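The failure percentages above can be read as the share of contrastive checks a metric gets wrong. The sketch below is only an illustration of that idea, not the paper's implementation. It assumes a check is passed when the metric scores the matching prompt-image pair strictly higher than its contrastive counterpart. The function name `failure_rate` and the toy scores are hypothetical.

```python
from typing import Sequence, Tuple


def failure_rate(score_pairs: Sequence[Tuple[float, float]]) -> float:
    """Return the fraction of contrastive checks a metric fails.

    Each tuple holds (score for the matching pair, score for the
    contrastive pair). Assumption: a check fails when the matching
    score is not strictly higher than the contrastive score.
    """
    if not score_pairs:
        raise ValueError("need at least one score pair")
    failures = sum(1 for matched, contrast in score_pairs if matched <= contrast)
    return failures / len(score_pairs)


# Made-up metric scores, for illustration only.
toy_scores = [(0.9, 0.2), (0.4, 0.6), (0.7, 0.7), (0.8, 0.1)]
rate = failure_rate(toy_scores)
print(f"failure rate: {rate:.0%}")
```

Treating ties as failures is a design choice in this sketch: a metric that cannot separate a matching image from a contrastive one gives no usable signal for that case.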
Abstract
The assessment of evaluation metrics (meta-evaluation) is crucial for determining the suitability of existing metrics in text-to-image (T2I) generation tasks. Human-based meta-evaluation is costly and time-intensive, and automated alternatives are scarce. We address this gap and propose CROC: a scalable framework for automated Contrastive Robustness Checks that systematically probes and quantifies metric robustness by synthesizing contrastive test cases across a comprehensive taxonomy of image properties. With CROC, we generate a pseudo-labeled dataset (CROC) of over 1 million contrastive prompt-image pairs to enable a fine-grained comparison of evaluation metrics. We also use this dataset to train CROCScore, a new metric that achieves state-of-the-art performance among open-source methods, demonstrating an additional key application of our framework. To complement this dataset,…
