Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models

George Stein; Jesse C. Cresswell; Rasa Hosseinzadeh; Yi Sui; Brendan Leigh Ross; Valentin Villecroze; Zhaoyan Liu; Anthony L. Caterini; J. Eric T. Taylor; Gabriel Loaiza-Ganem

arXiv:2306.04675·cs.LG·December 6, 2023·21 cites

TL;DR

This paper critically examines the effectiveness of current evaluation metrics for generative models, revealing significant discrepancies with human perception, and proposes improved evaluation methods using self-supervised features and better detection of memorization.

Contribution

The paper identifies flaws in existing metrics, demonstrates their poor correlation with human judgment, and introduces alternative evaluation techniques built on self-supervised feature extractors such as DINOv2-ViT-L/14.

Findings

01

Current metrics do not correlate well with human perception of realism.

02

Diffusion models' perceptual quality is underestimated by common metrics.

03

Existing metrics fail to detect memorization in generative models.
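
As a rough illustration of what "correlates with human perception" means in Finding 01, the sketch below computes a Spearman rank correlation between a metric's scores and a human realism score across several models. The function names and all numbers are hypothetical placeholders, not values or code from the paper, and the paper's actual statistical procedure may differ.

```python
def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical placeholder scores for five models (NOT results from the paper).
metric_scores = [12.0, 8.5, 20.1, 5.3, 15.7]   # e.g. a lower-is-better metric
human_scores = [0.30, 0.42, 0.25, 0.38, 0.20]  # e.g. a higher-is-better human score
rho = spearman(metric_scores, human_scores)
print(f"Spearman rho = {rho:.2f}")
```

For a lower-is-better metric paired with a higher-is-better human score, a value of rho near -1 would indicate strong agreement in how the two rank the models; values near 0 would indicate that the metric says little about human-judged realism.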

Abstract

We systematically study a wide variety of generative models spanning semantically-diverse image datasets to understand and improve the feature extractors and metrics used to evaluate them. Using best practices in psychophysics, we measure human perception of image realism for generated samples by conducting the largest experiment evaluating generative models to date, and find that no existing metric strongly correlates with human evaluations. Comparing to 17 modern metrics for evaluating the overall performance, fidelity, diversity, rarity, and memorization of generative models, we find that the state-of-the-art perceptual realism of diffusion models as judged by humans is not reflected in commonly reported metrics such as FID. This discrepancy is not explained by diversity in generated samples, though one cause is over-reliance on Inception-V3. We address these flaws through a study of…
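
For readers unfamiliar with FID, the following is a minimal sketch of the Fréchet distance it is based on: each set of image features is summarized by a mean and covariance, and the distance is ||mu_a - mu_b||^2 + Tr(C_a + C_b - 2 (C_a C_b)^(1/2)). The feature extractor (Inception-V3 for standard FID, or a self-supervised encoder such as DINOv2) is not included here; the function name `frechet_distance` and the random arrays standing in for features are assumptions for illustration only, not the authors' implementation.

```python
import numpy as np


def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussian fits to two (n_samples, dim) feature arrays."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # positive semi-definite inputs; clip small negative numerical noise.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_a - mu_b) ** 2)
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)


# Random stand-ins for extracted features (assumption: real use would pass
# encoder outputs for real and generated images instead).
rng = np.random.default_rng(0)
real_feats = rng.normal(size=(500, 4))
fake_feats = real_feats + 3.0  # same covariance, mean shifted by 3 in each dimension

print(frechet_distance(real_feats, real_feats))  # close to 0
print(frechet_distance(real_feats, fake_feats))  # close to 4 * 3**2 = 36
```

Because the distance depends only on the extracted features, swapping the encoder changes the result without changing this computation, which is why the choice of feature extractor matters for the reported score.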

Code & Models

Models

jimmycarter/LibreFLUX (Hugging Face model) · 62 downloads · 173 likes

Videos

Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models · SlidesLive

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis

Methods: Lib · Diffusion