Benchmarking Diversity in Image Generation via Attribute-Conditional Human Evaluation

Isabela Albuquerque; Ira Ktena; Olivia Wiles; Ivana Kajić; Amal Rannen-Triki; Cristina Vasconcelos; Aida Nematzadeh

arXiv:2511.10547 · cs.CV · November 14, 2025

Open Access · 3 Reviews

TL;DR

This paper presents a comprehensive human evaluation framework for measuring and comparing diversity in text-to-image models, addressing the challenge of homogeneous outputs and providing insights for future improvements.

Contribution

It introduces a novel human evaluation template, curated prompt sets, and a methodology for model comparison, advancing diversity assessment in T2I models.

Findings

01. Effective ranking of models by diversity

02. Identification of categories where models struggle

03. Comparison of image embeddings for diversity measurement

Abstract

Despite advances in generation quality, current text-to-image (T2I) models often lack diversity, generating homogeneous outputs. This work introduces a framework to address the need for robust diversity evaluation in T2I models. Our framework systematically assesses diversity by evaluating individual concepts and their relevant factors of variation. Key contributions include: (1) a novel human evaluation template for nuanced diversity assessment; (2) a curated prompt set covering diverse concepts with their identified factors of variation (e.g. prompt: An image of an apple, factor of variation: color); and (3) a methodology for comparing models in terms of human annotations via binomial tests. Furthermore, we rigorously compare various image embeddings for diversity measurement. Notably, our principled approach enables ranking of T2I models by diversity, identifying categories where…
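The binomial-test comparison mentioned in the abstract can be illustrated with a minimal sketch. It assumes that each human annotation is a pairwise judgment ("model A is more diverse" or "model B is more diverse"), that ties are dropped, and that the null hypothesis is a 50/50 split. The function names, the tie handling, and the 0.05 threshold are illustrative assumptions, not the paper's exact protocol.

```python
from math import comb


def binomial_two_sided_p(wins: int, trials: int, p0: float = 0.5) -> float:
    """Exact two-sided binomial p-value.

    Sums the probability of every outcome that is no more likely
    than the observed one under the null success rate p0.
    """
    pmf = [
        comb(trials, k) * p0**k * (1 - p0) ** (trials - k)
        for k in range(trials + 1)
    ]
    observed = pmf[wins]
    # Small relative tolerance so floating-point ties are counted.
    total = sum(p for p in pmf if p <= observed * (1 + 1e-7))
    return min(1.0, total)


def compare_models(wins_a: int, wins_b: int, alpha: float = 0.05):
    """Hypothetical helper: decide whether A or B is judged more diverse.

    wins_a / wins_b count annotations preferring each model;
    ties are assumed to have been removed beforehand.
    """
    trials = wins_a + wins_b
    if trials == 0:
        return "no significant difference", 1.0
    p_value = binomial_two_sided_p(wins_a, trials)
    if p_value >= alpha:
        return "no significant difference", p_value
    verdict = "A more diverse" if wins_a > wins_b else "B more diverse"
    return verdict, p_value


if __name__ == "__main__":
    print(compare_models(40, 10))
    print(compare_models(26, 24))
```

Running such a test per concept and attribute, rather than once over pooled annotations, is one way a comparison of this kind could surface the categories where a model struggles.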

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 4

Strengths

S1: The proposed idea is quite interesting to me. The problem formulation helps to de-confound diversity. They formalize per-attribute diversity and explain why generic prompts or generic human templates mix up fidelity or content variation with true diversity. This approach is both conceptually clear and practical to implement. S2: The human study design and the aggregation or statistical methods are serious and thorough. There are 24591 annotations across 5 models, with majority-vote aggregat

Weaknesses

W1: The scope and external validity of the concepts and attributes are the main concern. The prompt set focuses on people-excluded, everyday, ImageNet-like concepts and explicitly leaves out person categories. While suitable for a proof of concept, this skips socially sensitive forms of diversity such as demographics or geographic context, as well as open-world composition where multiple attributes interact, like material combined with style and function. The selection is based on LLM-proposed a

Reviewer 02 · Rating 4 · Confidence 3

Strengths

The authors correctly identify a major gap: the lack of a principled, attribute-grounded approach to diversity evaluation. The extensive annotation effort, comprising 240000 samples with high inter-annotator reliability (α > 0.8), provides a strong empirical foundation.

Weaknesses

The prompt generation pipeline still requires extensive human verification and filtering despite LLM assistance, which limits scalability and reproducibility across domains. The framework focuses primarily on visual diversity, without considering semantic or contextual diversity dimensions that may better reflect real-world generative quality. The evaluation of automatic metrics is narrow, emphasizing the Vendi Score while omitting comparisons to other recent or theoretically grounded diversit

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1) Attribute-conditioned, count-anchored human evaluation reduces ambiguity and improves rater reliability; the procedure is easy to replicate. 2) Comparison shows image-only embeddings with Vendi track human judgments on large-gap cases, offering a usable baseline for fast iteration.
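For context, the Vendi Score referenced in the reviews is commonly defined as the exponential of the Shannon entropy of the eigenvalues of a normalized similarity matrix. The sketch below assumes a cosine-similarity kernel over image embeddings and uses NumPy. The function name and the random or synthetic embeddings are illustrative, and the paper's actual embedding models and kernel choice are not specified here.

```python
import numpy as np


def vendi_score(embeddings: np.ndarray) -> float:
    """Vendi Score with a cosine-similarity kernel (illustrative sketch).

    embeddings: array of shape (n_samples, dim), one row per image.
    Returns a value between 1 (all samples identical) and n_samples
    (all samples mutually orthogonal).
    """
    x = np.asarray(embeddings, dtype=float)
    # Normalize rows so the kernel has ones on the diagonal.
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    kernel = x @ x.T
    # Eigenvalues of K / n are non-negative and sum to 1.
    eigvals = np.linalg.eigvalsh(kernel / len(x))
    eigvals = eigvals[eigvals > 1e-12]  # drop numerical zeros
    entropy = -np.sum(eigvals * np.log(eigvals))
    return float(np.exp(entropy))


if __name__ == "__main__":
    identical = np.ones((5, 3))   # five copies of the same embedding
    orthogonal = np.eye(4)        # four mutually orthogonal embeddings
    print(vendi_score(identical))
    print(vendi_score(orthogonal))
```

The score can be read as an effective number of distinct samples, which is why it is a natural candidate to compare against count-anchored human judgments.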

Weaknesses

1) Limited novelty: the contributions are primarily a protocol, a dataset, and empirical comparisons. 2) The paper does not thoroughly analyze where autoraters fail (e.g., ties and small gaps, rare attribute values, per-concept heatmaps or illustrative disagreements).

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Ethics and Social Impacts of AI · Mobile Crowdsensing and Crowdsourcing