Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models
George Stein, Jesse C. Cresswell, Rasa Hosseinzadeh, Yi Sui, Brendan Leigh Ross, Valentin Villecroze, Zhaoyan Liu, Anthony L. Caterini, J. Eric T. Taylor, Gabriel Loaiza-Ganem

TL;DR
This paper critically examines the effectiveness of current evaluation metrics for generative models, revealing significant discrepancies with human perception, and proposes improved evaluation methods using self-supervised features and better detection of memorization.
Contribution
It identifies flaws in existing metrics, demonstrates their poor correlation with human judgment, and introduces alternative evaluation techniques using self-supervised models like DINOv2-ViT-L/14.
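To make the role of the feature extractor concrete, here is a minimal sketch of a Fréchet distance, the statistic behind FID, computed on feature arrays that have already been extracted. It is an illustration under stated assumptions, not the authors' implementation: the function name `frechet_distance` and the toy arrays are hypothetical, and the features are assumed to come from whichever encoder is being compared (for example Inception-V3 or DINOv2-ViT-L/14), with extraction left out entirely.

```python
import numpy as np


def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to two (n_samples, dim) feature arrays.

    Hypothetical helper for illustration; the encoder that produced the
    features is assumed to be applied beforehand.
    """
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.atleast_2d(np.cov(feats_real, rowvar=False))
    cov_g = np.atleast_2d(np.cov(feats_gen, rowvar=False))

    # Tr((cov_r cov_g)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_r @ cov_g, which are real and non-negative when
    # both covariances are positive semi-definite. Clipping guards against
    # tiny negative values caused by floating-point error.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * tr_sqrt)


# Toy usage with random stand-in features (NOT real encoder outputs).
rng = np.random.default_rng(0)
real = rng.normal(size=(500, 8))
close = rng.normal(size=(500, 8))                # same distribution as `real`
shifted = rng.normal(loc=1.0, size=(500, 8))     # mean shifted by 1 in every dimension
d_close = frechet_distance(real, close)
d_shifted = frechet_distance(real, shifted)
```

Because the distance depends only on the feature arrays, changing the encoder changes the score without touching this function, which is why the choice of feature extractor matters so much for the resulting model ranking.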
Findings
Current metrics do not correlate well with human perception of realism.
Diffusion models' perceptual quality is underestimated by common metrics.
Existing metrics fail to detect memorization in generative models.
Abstract
We systematically study a wide variety of generative models spanning semantically-diverse image datasets to understand and improve the feature extractors and metrics used to evaluate them. Using best practices in psychophysics, we measure human perception of image realism for generated samples by conducting the largest experiment evaluating generative models to date, and find that no existing metric strongly correlates with human evaluations. Comparing to 17 modern metrics for evaluating the overall performance, fidelity, diversity, rarity, and memorization of generative models, we find that the state-of-the-art perceptual realism of diffusion models as judged by humans is not reflected in commonly reported metrics such as FID. This discrepancy is not explained by diversity in generated samples, though one cause is over-reliance on Inception-V3. We address these flaws through a study of…
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis
Methods: Lib · Diffusion
