GenEval 2: Addressing Benchmark Drift in Text-to-Image Evaluation
Amita Kamath, Kai-Wei Chang, Ranjay Krishna, Luke Zettlemoyer, Yushi Hu, Marjan Ghazvininejad

TL;DR
This paper investigates the problem of benchmark drift in Text-to-Image evaluation, demonstrates its impact on GenEval, and introduces GenEval 2 with an improved evaluation method, Soft-TIFA, to better align with human judgment and reduce drift.
Contribution
The paper identifies benchmark drift in T2I evaluation, analyzes its effects on GenEval, and proposes GenEval 2 with Soft-TIFA for more robust, human-aligned assessment.
Findings
GenEval has drifted significantly from human judgment over time, with an absolute error of as much as 17.7% for current models.
GenEval 2 offers improved coverage and challenge for current models.
Soft-TIFA aligns better with human judgment and reduces drift.
Abstract
Automating Text-to-Image (T2I) model evaluation is challenging; a judge model must be used to score correctness, and test prompts must be selected to be challenging for current T2I models but not the judge. We argue that satisfying these constraints can lead to benchmark drift over time, where the static benchmark judges fail to keep up with newer model capabilities. We show that benchmark drift is a significant problem for GenEval, one of the most popular T2I benchmarks. Although GenEval was well-aligned with human judgment at the time of its release, it has drifted far from human judgment over time -- resulting in an absolute error of as much as 17.7% for current models. This level of drift strongly suggests that GenEval has been saturated for some time, as we verify via a large-scale human study. To help fill this benchmarking gap, we introduce a new benchmark, GenEval 2, with…
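One way to picture the absolute error mentioned in the abstract is as the gap between the pass rate assigned by an automated judge and the pass rate assigned by human annotators on the same images. The sketch below is a minimal illustration under that assumption; the paper's exact definition may differ. The function names (`pass_rate`, `absolute_error`) and the toy verdicts are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def pass_rate(verdicts: Sequence[bool]) -> float:
    """Return the percentage of images judged correct."""
    if not verdicts:
        raise ValueError("verdicts must be non-empty")
    return 100.0 * sum(verdicts) / len(verdicts)


def absolute_error(judge_verdicts: Sequence[bool],
                   human_verdicts: Sequence[bool]) -> float:
    """Absolute gap, in percentage points, between the judge's pass rate
    and the human pass rate on the same set of generated images.

    Assumption: this is one plausible reading of "absolute error";
    it is not confirmed as the paper's exact metric.
    """
    if len(judge_verdicts) != len(human_verdicts):
        raise ValueError("both verdict lists must cover the same images")
    return abs(pass_rate(judge_verdicts) - pass_rate(human_verdicts))


# Toy verdicts for ten generated images (illustrative only, not real data).
judge = [True, False, True, False, False, True, False, True, False, False]
human = [True, True, True, False, True, True, False, True, False, True]

drift = absolute_error(judge, human)
print(f"judge: {pass_rate(judge):.1f}%  human: {pass_rate(human):.1f}%  "
      f"absolute error: {drift:.1f} points")
```

In this toy case the judge under-credits the model relative to human annotators. A gap of this kind that grows as models improve is what the abstract describes as drift.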
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Ethics and Social Impacts of AI
