A note on the evaluation of generative models
Lucas Theis, Aäron van den Oord, Matthias Bethge

TL;DR
This paper critically examines common evaluation metrics for image generative models, shows that they are largely independent of one another and limited in what they measure, and argues that models should be evaluated with respect to their intended application.
Contribution
It clarifies that commonly used evaluation criteria are largely independent of each other, advises against using Parzen window estimates, and promotes direct, application-oriented evaluation of generative models.
Findings
Average log-likelihood, Parzen window estimates, and visual fidelity of samples are largely independent of each other when the data is high-dimensional.
Good performance in one metric does not imply good performance in others.
Parzen window estimates should generally be avoided.
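For readers unfamiliar with the term, a Parzen window estimate fits a kernel density estimator to samples drawn from a model and then scores held-out data under that estimator. The sketch below is a minimal illustration of the idea, not the paper's own code: it assumes an isotropic Gaussian kernel, and the function name `parzen_log_likelihood` and the bandwidth parameter `sigma` are hypothetical choices made for this example.

```python
import math


def parzen_log_likelihood(samples, test_points, sigma):
    """Average log-density of test_points under a Gaussian Parzen window
    estimate built from model samples (hypothetical helper for illustration).

    samples, test_points: lists of equal-length lists of floats.
    sigma: kernel bandwidth (assumed isotropic Gaussian kernel).
    """
    n = len(samples)
    d = len(samples[0])
    # Log of the normalizer: n kernels, each a d-dimensional Gaussian.
    log_norm = math.log(n) + 0.5 * d * math.log(2.0 * math.pi * sigma ** 2)

    total = 0.0
    for x in test_points:
        # Log of each unnormalized kernel evaluated at x.
        log_kernels = [
            -sum((xi - si) ** 2 for xi, si in zip(x, s)) / (2.0 * sigma ** 2)
            for s in samples
        ]
        # Numerically stable log-sum-exp over the kernels.
        m = max(log_kernels)
        lse = m + math.log(sum(math.exp(v - m) for v in log_kernels))
        total += lse - log_norm
    return total / len(test_points)


if __name__ == "__main__":
    # Toy 2-D data; in practice the samples would come from a trained model.
    model_samples = [[0.0, 0.0], [1.0, 1.0], [-1.0, 0.5]]
    held_out = [[0.2, 0.1], [0.8, 0.9]]
    print(parzen_log_likelihood(model_samples, held_out, sigma=0.5))
```

The result depends on the bandwidth and on how many samples are used to build the estimator, so a score computed this way reflects those choices as well as the model itself.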
Abstract
Probabilistic generative models can be used for compression, denoising, inpainting, texture synthesis, semi-supervised learning, unsupervised feature learning, and other tasks. Given this wide range of applications, it is not surprising that a lot of heterogeneity exists in the way these models are formulated, trained, and evaluated. As a consequence, direct comparison between models is often difficult. This article reviews mostly known but often underappreciated properties relating to the evaluation and interpretation of generative models with a focus on image models. In particular, we show that three of the currently most commonly used criteria (average log-likelihood, Parzen window estimates, and visual fidelity of samples) are largely independent of each other when the data is high-dimensional. Good performance with respect to one criterion therefore need not imply good…
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Music and Audio Processing · Video Analysis and Summarization
