Revisiting Text-to-Image Evaluation with Gecko: On Metrics, Prompts, and Human Ratings

Olivia Wiles; Chuhan Zhang; Isabela Albuquerque; Ivana Kajić; Su Wang; Emanuele Bugliarello; Yasumasa Onoe; Pinelopi Papalampidi; Ira Ktena; Chris Knutsen; Cyrus Rashtchian; Anant Nawalgaria; Jordi Pont-Tuset; Aida Nematzadeh

arXiv:2404.16820·cs.CV·March 18, 2025

TL;DR

This paper systematically evaluates text-to-image models and metrics, introduces a skills-based benchmark for better model discrimination, and proposes a new auto-eval metric that aligns more closely with human judgments.

Contribution

It presents a comprehensive skills-based benchmark, extensive human rating data, and a novel QA-based auto-eval metric for improved T2I model evaluation.

Findings

01

The skills-based benchmark effectively discriminates models across different prompt complexities.

02

The collected human ratings reveal the impact of prompt ambiguity and model differences.

03

The new auto-eval metric correlates better with human ratings than existing metrics.
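A claim of this kind is usually checked by correlating each metric's per-image scores with the human ratings for the same images. The following is a minimal sketch of that comparison, assuming Spearman rank correlation as the agreement measure; the excerpt here does not say which statistic the paper uses. The names `human`, `metric_a`, and `metric_b` and all their values are hypothetical and are not taken from the paper.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: one human alignment rating and two metric scores per image.
human = [1.0, 0.5, 0.0, 1.0, 0.5, 0.0]
metric_a = [0.9, 0.6, 0.2, 0.8, 0.4, 0.1]
metric_b = [0.5, 0.7, 0.6, 0.4, 0.3, 0.8]

for name, scores in [("metric_a", metric_a), ("metric_b", metric_b)]:
    print(name, round(spearman(human, scores), 3))
```

On this toy data the metric whose ordering follows the human ratings gets the higher coefficient. Rank correlation is used in the sketch because human ratings are often coarse and tied, so only the ordering is compared, not the raw score scale.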

Abstract

While text-to-image (T2I) generative models have become ubiquitous, they do not necessarily generate images that align with a given prompt. While previous work has evaluated T2I alignment by proposing metrics, benchmarks, and templates for collecting human judgements, the quality of these components is not systematically measured. Human-rated prompt sets are generally small and the reliability of the ratings -- and thereby the prompt set used to compare models -- is not evaluated. We address this gap by performing an extensive study evaluating auto-eval metrics and human templates. We provide three main contributions: (1) We introduce a comprehensive skills-based benchmark that can discriminate models across different human templates. This skills-based benchmark categorises prompts into sub-skills, allowing a practitioner to pinpoint not only which skills are challenging, but at what…

Code & Models

Repositories

google-deepmind/gecko_benchmark_t2i (Official)
