Benchmarking Diversity in Image Generation via Attribute-Conditional Human Evaluation
Isabela Albuquerque, Ira Ktena, Olivia Wiles, Ivana Kajić, Amal Rannen-Triki, Cristina Vasconcelos, Aida Nematzadeh

TL;DR
This paper presents a comprehensive human evaluation framework for measuring and comparing diversity in text-to-image models, addressing the challenge of homogeneous outputs and providing insights for future improvements.
Contribution
It introduces a novel human evaluation template, curated prompt sets, and a methodology for model comparison, advancing diversity assessment in T2I models.
Findings
Effective ranking of models by diversity
Identification of categories where models struggle
Comparison of image embeddings for diversity measurement
Abstract
Despite advances in generation quality, current text-to-image (T2I) models often lack diversity, generating homogeneous outputs. This work introduces a framework to address the need for robust diversity evaluation in T2I models. Our framework systematically assesses diversity by evaluating individual concepts and their relevant factors of variation. Key contributions include: (1) a novel human evaluation template for nuanced diversity assessment; (2) a curated prompt set covering diverse concepts with their identified factors of variation (e.g. prompt: An image of an apple, factor of variation: color); and (3) a methodology for comparing models in terms of human annotations via binomial tests. Furthermore, we rigorously compare various image embeddings for diversity measurement. Notably, our principled approach enables ranking of T2I models by diversity, identifying categories where…
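The abstract says models are compared on human annotations via binomial tests but gives no further detail here. The following is only a minimal sketch of one way such a comparison could work, assuming each prompt yields an aggregated vote for which of two models is more diverse and that ties are dropped (a sign-test style setup). The names `binomial_two_sided_p` and `compare_models` and the vote labels are hypothetical, not taken from the paper.

```python
from math import comb


def binomial_two_sided_p(wins: int, n: int, p: float = 0.5) -> float:
    """Exact two-sided binomial p-value.

    Sums the probability of every outcome that is no more likely
    than the observed one under the null hypothesis P(win) = p.
    """
    pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    observed = pmf[wins]
    # Small relative tolerance so floating-point ties are counted.
    return min(1.0, sum(q for q in pmf if q <= observed * (1 + 1e-9)))


def compare_models(votes):
    """Compare two models from per-prompt votes.

    votes: list of "A", "B", or "tie" (assumed labels), one per prompt,
    saying which model's image set was judged more diverse.
    Ties are excluded before testing (an assumption of this sketch).
    """
    wins_a = votes.count("A")
    wins_b = votes.count("B")
    n = wins_a + wins_b
    p_value = binomial_two_sided_p(wins_a, n) if n else 1.0
    return wins_a, wins_b, p_value


# Hypothetical votes: model A judged more diverse on 8 prompts, B on 2, 3 ties.
votes = ["A"] * 8 + ["B"] * 2 + ["tie"] * 3
wins_a, wins_b, p_value = compare_models(votes)
print(wins_a, wins_b, round(p_value, 4))
```

A small p-value would suggest that one model is preferred more often than chance; repeating the test over all model pairs is one way a diversity ranking could be assembled, though the paper's exact procedure may differ.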
Peer Reviews
Decision·Submitted to ICLR 2026
S1: The proposed idea is quite interesting to me. The problem formulation helps to de-confound diversity. They formalize per-attribute diversity and explain why generic prompts or generic human templates mix up fidelity or content variation with true diversity. This approach is both conceptually clear and practical to implement. S2: The human study design and the aggregation or statistical methods are serious and thorough. There are 24591 annotations across 5 models, with majority-vote aggregat
W1: The scope and external validity of the concepts and attributes are the main concern. The prompt set focuses on people-excluded, everyday, ImageNet-like concepts and explicitly leaves out person categories. While suitable for a proof of concept, this skips socially sensitive forms of diversity such as demographics or geographic context, as well as open-world composition where multiple attributes interact, like material combined with style and function. The selection is based on LLM-proposed a
The authors correctly identify a major gap: the lack of a principled, attribute-grounded approach to diversity evaluation. The extensive annotation effort, comprising 240000 samples with high inter-annotator reliability (α > 0.8), provides a strong empirical foundation.
The prompt generation pipeline still requires extensive human verification and filtering despite LLM assistance, which limits scalability and reproducibility across domains. The framework focuses primarily on visual diversity, without considering semantic or contextual diversity dimensions that may better reflect real-world generative quality. The evaluation of automatic metrics is narrow, emphasizing the Vendi Score while omitting comparisons to other recent or theoretically grounded diversit
1) Attribute-conditioned, count-anchored human evaluation reduces ambiguity and improves rater reliability; the procedure is easy to replicate. 2) Comparison shows image-only embeddings with Vendi track human judgments on large-gap cases, offering a usable baseline for fast iteration.
1) Limited Novelty; contributions are primarily protocol + dataset + empirical comparisons. 2) The paper does not thoroughly analyze where autoraters fail (e.g., ties and small gaps, rare attribute values, per-concept heatmaps or illustrative disagreements).
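Several reviews refer to the Vendi Score as the automatic diversity metric. As a rough illustration only, the sketch below computes it as the exponential of the Shannon entropy of the eigenvalues of a normalized similarity matrix. The cosine-similarity kernel and the `vendi_score` function name are assumptions of this sketch; the page does not state which embeddings or kernel the paper uses.

```python
import numpy as np


def vendi_score(embeddings) -> float:
    """Vendi Score of a set of embeddings (rows = images).

    Assumes a cosine-similarity kernel, so K[i, i] = 1.
    Returns a value between 1 (all items identical) and n (all orthogonal).
    """
    x = np.asarray(embeddings, dtype=float)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)  # unit-normalize rows
    k = x @ x.T                                       # cosine similarity matrix
    eigvals = np.linalg.eigvalsh(k / len(x))          # eigenvalues sum to 1
    eigvals = eigvals[eigvals > 1e-12]                # drop numerical zeros
    entropy = -np.sum(eigvals * np.log(eigvals))      # Shannon entropy
    return float(np.exp(entropy))


# Toy embeddings (hypothetical): four identical vs. four orthogonal vectors.
identical = np.tile([1.0, 2.0, 3.0], (4, 1))
orthogonal = np.eye(4)
print(vendi_score(identical), vendi_score(orthogonal))
```

The score can be read as an effective number of distinct items in the set, which is why it is a natural candidate for comparison against count-anchored human judgments.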
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Ethics and Social Impacts of AI · Mobile Crowdsensing and Crowdsourcing
