GMValuator: Similarity-based Data Valuation for Generative Models
Jiaxi Yang, Wenglong Deng, Benlin Liu, Yangsibo Huang, James Zou, Xiaoxiao Li

TL;DR
GMValuator is a novel, training-free, and model-agnostic method for data valuation in generative models, using similarity matching and image quality assessment to efficiently evaluate training data contributions.
Contribution
The paper introduces GMValuator, presented as the first approach to data valuation in generative models that is training-free and model-agnostic, and that combines similarity matching with image quality metrics.
Findings
Effective on benchmark datasets
Works across various generative architectures
Outperforms existing methods in robustness and efficiency
Abstract
Data valuation plays a crucial role in machine learning. Existing data valuation methods focus mainly on discriminative models and overlook generative models, which have gained considerable attention recently. In generative models, data valuation measures the impact of training data on generated datasets. The few existing data valuation methods designed for deep generative models either concentrate on specific models or lack robustness in their outcomes, and their efficiency remains a shortcoming. To bridge these gaps, we formulate the data valuation problem in generative models from a similarity-matching perspective. Specifically, we introduce Generative Model Valuator (GMValuator), the first training-free and model-agnostic approach to providing data valuation for image generation tasks. It empowers efficient data valuation through our innovative similarity matching module,…
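The similarity-matching idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function name `similarity_valuation`, the Euclidean distance, the exponential weighting, the top-`k` matching, and the optional per-sample quality weights are all assumptions chosen for illustration. Inputs are assumed to be precomputed embeddings of training and generated images.

```python
import numpy as np

def similarity_valuation(train_emb, gen_emb, gen_quality=None, k=3):
    """Toy similarity-matching valuation (illustrative; not the paper's scoring rule).

    Each generated sample distributes one unit of credit, scaled by its
    quality weight, over its k nearest training samples in embedding space.
    """
    train_emb = np.asarray(train_emb, dtype=float)
    gen_emb = np.asarray(gen_emb, dtype=float)
    n_train, n_gen = train_emb.shape[0], gen_emb.shape[0]
    k = min(k, n_train)
    if gen_quality is None:
        # Assumption: treat every generated sample as equally good.
        gen_quality = np.ones(n_gen)
    gen_quality = np.asarray(gen_quality, dtype=float)

    # Pairwise Euclidean distances, shape (n_gen, n_train).
    diff = gen_emb[:, None, :] - train_emb[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    scores = np.zeros(n_train)
    for j in range(n_gen):
        nearest = np.argsort(dist[j])[:k]      # indices of the k closest training samples
        weights = np.exp(-dist[j, nearest])    # closer samples receive more credit
        weights /= weights.sum()               # credit from one generated sample sums to 1
        scores[nearest] += gen_quality[j] * weights
    return scores / n_gen                      # average over generated samples

# Usage: three training embeddings, two generated embeddings.
train = [[0.0, 0.0], [10.0, 10.0], [0.0, 1.0]]
generated = [[0.0, 0.1], [0.1, 0.0]]
values = similarity_valuation(train, generated, k=2)
```

In this sketch no model is retrained and no gradients are computed, which is the property the abstract describes as training-free; the scoring rule itself is only one of many plausible choices.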
Peer Reviews
Decision: ICLR 2025 Poster
1. The authors propose GMVALUATOR to tackle the data valuation problem for generative models. GMVALUATOR is innovative and model-agnostic, enabling broad applicability and adaptability across various generative models. In addition, GMVALUATOR does not require retraining of models, which is an advantage in R&D scenarios with limited computational resources.
2. The authors provide detailed theoretical justification for formulating data valuation for generative models as a similarity-matching problem.
1. The manuscript is not well written. For example, the Introduction should define the data valuation problem in general terms (including its input and objective) before discussing existing work. Moreover, the authors do not highlight the urgent need for data valuation in existing generative models, which weakens the motivation of the paper. Most importantly, instead of briefly introducing the principle of the proposed GMValuator (such as why and how t
- This is the first paper on data valuation for generative models. Previous data valuation methods focus on discriminative models and cannot adapt to generative models.
- Compared to retraining-based and influence-based methods, GMValuator is efficient. It does not require any retraining or computation of the Hessian.
- GMValuator is effective on the proposed metrics and improves significantly over the baseline methods.
- For SOTA text-to-image models like Stable Diffusion, the image domain is much wider than that of the tested models. As a result, a large number of generated images may be required for accurate data valuation. Meanwhile, generation with these models is slow. More results and ablations on SOTA text-to-image models such as Stable Diffusion would be helpful.
- While the proposed metrics are intuitively reasonable, they are coarse-grained and may not be able to reflect the effectiveness of data evaluation methods
- The paper introduces a novel and intuitive idea for data valuation in generative models, and the results are promising.
- The experiments are well designed, exploring multiple distance functions and encoders to validate the approach. Multiple test scenarios are also covered, all showing good supporting results for the proposed method.
- The paper is well written and easy to follow, effectively conveying the methodology and findings.
- The paper covers the related literature well.
- The impact of the quantization step on the final results is not explored. Understanding this effect would provide a clearer picture of the method's performance.
- While Section 2 introduces some underlying assumptions and a theoretical motivation for using a similarity-guided data valuation score (illustrated in Figure 1), the framework would benefit from a more rigorous theoretical foundation. Further studies on theoretical support could strengthen the framework's conceptual grounding and its
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Machine Learning and Data Classification · Advanced Neural Network Applications
Methods: Diffusion
