GRADE: Quantifying Sample Diversity in Text-to-Image Models
Royi Rassin, Aviv Slobodkin, Shauli Ravfogel, Yanai Elazar, Yoav Goldberg

TL;DR
GRADE is an automatic method that quantifies sample diversity in text-to-image models by leveraging language and visual reasoning, revealing limited variation and default behaviors across models.
Contribution
It introduces GRADE, a novel, automatic, semantically-driven approach to measure diversity in text-to-image models using entropy and concept-specific axes.
Findings
Models show limited diversity and default behaviors.
Stronger models exhibit more deterioration in diversity.
Underspecified captions contribute to low diversity.
Abstract
We introduce GRADE, an automatic method for quantifying sample diversity in text-to-image models. Our method leverages the world knowledge embedded in large language models and visual question-answering systems to identify relevant concept-specific axes of diversity (e.g., "shape" for the concept "cookie"). It then estimates frequency distributions of concepts and their attributes and quantifies diversity using entropy. We use GRADE to measure the diversity of 12 models over a total of 720K images, revealing that all models display limited variation, with clear deterioration in stronger models. Further, we find that models often exhibit default behaviors, a phenomenon where a model consistently generates concepts with the same attributes (e.g., 98% of the cookies are round). Lastly, we show that a key reason for low diversity is underspecified captions in training data. Our work…
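To make the entropy step in the abstract concrete, here is a minimal sketch of scoring one concept-attribute pair from a list of per-image attribute values. It is an illustration under stated assumptions, not the authors' implementation: the function name `attribute_entropy`, the optional normalization by the log of the number of possible values, and the toy cookie data (built to mirror the "98% round" example) are all hypothetical. In the actual pipeline the per-image values would come from a visual question-answering system.

```python
from collections import Counter
import math


def attribute_entropy(values, normalize=False, num_possible_values=None):
    """Shannon entropy (in bits) of the empirical distribution of `values`.

    Assumption: `values` holds one attribute answer per generated image,
    e.g. the shape reported for each image of a cookie.
    If `normalize` is True, the entropy is divided by log2(k), where k is
    `num_possible_values` if given, else the number of observed values.
    This normalization is an illustrative choice, not taken from the source.
    """
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one observation")

    # Empirical frequency distribution over the observed attribute values.
    probs = [c / total for c in counts.values()]
    entropy = sum(p * math.log2(1.0 / p) for p in probs)

    if normalize:
        k = num_possible_values if num_possible_values is not None else len(counts)
        if k <= 1:
            return 0.0  # a single possible value carries no diversity
        entropy /= math.log2(k)
    return entropy


# Hypothetical answers for 100 images generated from a prompt about a cookie.
cookie_shapes = ["round"] * 98 + ["square"] * 2
default_like = attribute_entropy(cookie_shapes)

# A uniform spread over four shapes, for contrast.
varied_shapes = ["round", "square", "star", "heart"] * 25
diverse = attribute_entropy(varied_shapes)
```

With these toy inputs, the heavily skewed cookie distribution scores close to zero bits, while the uniform four-way distribution reaches the maximum of two bits, which is the sense in which a low entropy flags a default behavior.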
Peer Reviews
Decision · Submitted to ICLR 2025
The strength of this paper is its introduction of a fine-grained and interpretable metric that overcomes the limitations of traditional diversity metrics by quantifying concept-specific attribute diversity without relying on reference images. It provides deeper insights into text-to-image model behavior and highlights the impact of training data biases, offering a clear path for advancing generative model evaluation and diversity.
1. Generally, there is a common perception that "cookies" are round, and I share this view. If a square cookie were requested but a round cookie was generated, that would indeed be an issue. However, the problem highlighted by the authors is not about such cases but rather challenges the generalization itself. I find it difficult to relate why generating results that align with common sense is problematic. In other words, it seems the authors are raising an issue with what is an expected outcome
- S1: This work tackles the extremely relevant open research question of evaluating the diversity of text-to-image models.
- S2: The proposed approach, GRADE, measures diversity in terms of images of specific concepts with respect to relevant factors of variation.
- S3: GRADE is a reference-free metric and doesn’t rely on training any new models to be computed.
- W1: The proposed metric does not seem to be effective at discriminating between models. The statistical significance of the differences in the scores reported in Table 4 is not clear. Given the average and standard deviation of GRADE, it seems that confidence intervals for all models would overlap.
- W1.1: There is no rigorous validation of the proposed metric and given that it seems GRADE is not able to properly distinguish 12 generative models, it is unclear to me wheth
- This paper is well-written, well-motivated, and well-organized.
- A key strength is that they provide a clear and structured definition of the diversity metric and the systematic approach in Sec. 3. The method estimates distributions with entropy, and allows for a granular and concept-specific evaluation of the diversity.
- The paper links low diversity in generated images to underspecified captions in training data, suggesting that this lack of diversity might come from biases in the data i
- Since GPT-4o is used to generate prompts, attributes, and attribute values, its lack of version control could affect the consistency of the GRADE framework as GPT-4o evolves. This may alter diversity analysis results, making reproducibility difficult. Changes in GPT-4o’s outputs could lead to inconsistencies when comparing diversity evaluations of T2I models evaluated at different points in time.
- The definition of diversity in this work is focused on specific attribute variations (like shape
Taxonomy
Topics: Computational and Text Analysis Methods
