PQMass: Probabilistic Assessment of the Quality of Generative Models using Probability Mass Estimation
Pablo Lemos, Sammy Sharief, Nikolay Malkin, Salma Salhi, Connor Stone, Laurence Perreault-Levasseur, Yashar Hezaveh

TL;DR
PQMass is a likelihood-free, statistically rigorous method for evaluating and comparing generative models by dividing sample space into regions and applying chi-squared tests, effective across various data modalities and dimensions.
Contribution
Introduces PQMass, a novel likelihood-free assessment method that does not rely on density assumptions or auxiliary models for evaluating generative models.
Findings
Effective in assessing quality, novelty, and diversity of generated samples.
Scales well to moderately high-dimensional data.
Does not require feature extraction or density estimation.
Abstract
We propose a likelihood-free method for comparing two distributions given samples from each, with the goal of assessing the quality of generative models. The proposed approach, PQMass, provides a statistically rigorous method for assessing the performance of a single generative model or the comparison of multiple competing models. PQMass divides the sample space into non-overlapping regions and applies chi-squared tests to the number of data samples that fall within each region, giving a p-value that measures the probability that the bin counts derived from two sets of samples are drawn from the same multinomial distribution. PQMass does not depend on assumptions regarding the density of the true distribution, nor does it rely on training or fitting any auxiliary models. We evaluate PQMass on data of various modalities and dimensions, demonstrating its effectiveness in assessing the…
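The procedure described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes regions are defined by assigning each sample to its nearest reference point under Euclidean distance, with reference points drawn at random from the pooled samples, a choice the abstract does not specify. The function name `pqmass_sketch` and its parameters are hypothetical. The test itself is the standard chi-squared test of homogeneity on the 2 x K table of bin counts, computed here with `scipy.stats.chi2_contingency`.

```python
import numpy as np
from scipy.stats import chi2_contingency


def pqmass_sketch(x, y, num_regions=20, seed=0):
    """Chi-squared two-sample test on region counts (illustrative sketch).

    x, y: arrays of shape (n_samples, n_dims).
    Assumption: regions are nearest-reference-point cells under the
    Euclidean distance; the abstract does not fix this choice.
    """
    rng = np.random.default_rng(seed)

    # Draw reference points from the pooled samples (an assumed choice).
    pooled = np.concatenate([x, y], axis=0)
    idx = rng.choice(len(pooled), size=num_regions, replace=False)
    centers = pooled[idx]

    def region_counts(samples):
        # Distance from every sample to every reference point.
        dists = np.linalg.norm(samples[:, None, :] - centers[None, :, :], axis=-1)
        # Each sample falls in the region of its nearest reference point.
        return np.bincount(dists.argmin(axis=1), minlength=num_regions)

    # 2 x K contingency table: one row per sample set, one column per region.
    table = np.stack([region_counts(x), region_counts(y)])

    # Drop regions that are empty in both sets; they carry no information
    # and would give zero expected counts.
    table = table[:, table.sum(axis=0) > 0]

    stat, p_value, dof, _ = chi2_contingency(table, correction=False)
    return stat, p_value, dof


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    same_a = rng.normal(size=(500, 5))
    same_b = rng.normal(size=(500, 5))
    shifted = rng.normal(loc=3.0, size=(500, 5))
    print(pqmass_sketch(same_a, same_b))   # samples from the same distribution
    print(pqmass_sketch(same_a, shifted))  # samples from different distributions
```

With K non-empty regions the statistic has K - 1 degrees of freedom. A small p-value indicates that the two sets of bin counts are unlikely to come from the same multinomial distribution.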
Peer Reviews
Decision: ICLR 2025 Poster
- PQMass introduces a new, likelihood-free approach for evaluating generative models by comparing sample distributions without density estimation or feature extraction, which is a fresh alternative to traditional metrics like FID and MMD.
- The experiments are comprehensive, testing PQMass across various generative models, data types, and dimensions. The paper includes comparisons with established metrics, scalability tests, and ablation studies, which aim to show PQMass’s robustness, scalabil…
- The use of L2 or L1 distance metrics may limit PQMass's effectiveness, especially when the data resides on a complex manifold. These metrics may not capture meaningful differences in such cases, potentially leading to inaccurate assessments.
- The experimental evaluation mostly focuses on synthetic data or standard datasets like MNIST and CIFAR-10. Testing on more complex, real-world datasets would strengthen the paper's claims about PQMass’s performance.
- Although the paper briefly addr…
The paper is well presented and understandable. A broad range of examples and simulations were presented demonstrating fairly broad applicability. The test statistic is a basic and well known statistical quantity, so the method is fast to implement.
It is not clear whether this is a weakness or a strength, but the proposed test is a well-known, basic statistical test. The contribution of this paper, therefore, is to note that it's easy to apply this test to a broad range of data types in order to detect a difference between the generating distributions of two samples. It's not clear whether this is a substantial enough contribution for ICLR, though I do not wish to imply that substantial contributions need to involve complex methods. Consistency Guarantees:
The strength of the proposed method is in its simplicity. After discretizing the data space, one only needs to count the data points in each region and calculate the test statistic. This makes the method computationally much more efficient compared to existing distribution comparison methods, such as MMD.
A weakness of this study is the limited comparison with other two-sample test methods. The authors simplified the problem by dividing the data space and reducing it to a multinomial distribution comparison. However, various other methods have been proposed for distribution comparison beyond MMD and W2. For example, [Ref1] explores distribution comparison using classifiers and its application to GAN evaluation, while simpler approaches, like those using nearest neighbors, are also available [Re
Taxonomy
Topics: Simulation Techniques and Applications · Scientific Computing and Data Management · Semantic Web and Ontologies
