GMValuator: Similarity-based Data Valuation for Generative Models

Jiaxi Yang; Wenglong Deng; Benlin Liu; Yangsibo Huang; James Zou; Xiaoxiao Li

arXiv:2304.10701 · cs.CV · April 15, 2025 · 1 citation

Open Access · 3 Reviews

TL;DR

GMValuator is a novel, training-free, and model-agnostic method for data valuation in generative models, using similarity matching and image quality assessment to efficiently evaluate training data contributions.

Contribution

The paper introduces GMValuator, which the authors describe as the first training-free, model-agnostic approach to data valuation for image generation tasks. It combines similarity matching with image quality assessment.

Findings

01 Effective on benchmark datasets

02 Works across various generative architectures

03 Outperforms existing methods in robustness and efficiency

Abstract

Data valuation plays a crucial role in machine learning. Existing data valuation methods, mainly focused on discriminative models, overlook generative models that have gained attention recently. In generative models, data valuation measures the impact of training data on generated datasets. The few existing data valuation methods designed for deep generative models either concentrate on specific models or lack robustness in their outcomes. Moreover, their efficiency remains a notable shortcoming. To bridge these gaps, we formulate the data valuation problem in generative models from a similarity matching perspective. Specifically, we introduce Generative Model Valuator (GMValuator), the first training-free and model-agnostic approach to providing data valuation for image generation tasks. It empowers efficient data valuation through our innovative similarity matching module,…
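The similarity-matching idea in the abstract can be illustrated with a minimal sketch. This is not the paper's scoring rule: the function name `similarity_valuation`, the Euclidean distance, the exponential weighting, and the `gen_quality` input are all assumptions chosen for illustration. Each generated sample is matched to its `k` nearest training samples in some feature space, and each matched training sample receives a share of credit that shrinks with distance and is scaled by the generated sample's quality score.

```python
import numpy as np


def similarity_valuation(train_feats, gen_feats, gen_quality=None, k=2):
    """Toy similarity-matching valuation (hypothetical; not the paper's formula).

    train_feats: (n_train, d) feature vectors of training samples.
    gen_feats:   (n_gen, d) feature vectors of generated samples.
    gen_quality: optional (n_gen,) quality score per generated sample.
    Returns an (n_train,) array of value scores.
    """
    train_feats = np.asarray(train_feats, dtype=float)
    gen_feats = np.asarray(gen_feats, dtype=float)
    n_train, n_gen = train_feats.shape[0], gen_feats.shape[0]
    if gen_quality is None:
        gen_quality = np.ones(n_gen)  # assumption: equal quality by default
    gen_quality = np.asarray(gen_quality, dtype=float)
    k = min(k, n_train)

    # Pairwise Euclidean distances, shape (n_gen, n_train).
    diffs = gen_feats[:, None, :] - train_feats[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)

    values = np.zeros(n_train)
    for j in range(n_gen):
        # Indices of the k training samples closest to generated sample j.
        nearest = np.argsort(dists[j])[:k]
        # Closer matches get more credit; shares sum to 1 per generated sample.
        weights = np.exp(-dists[j, nearest])
        weights = weights / weights.sum()
        values[nearest] += gen_quality[j] * weights
    # Average over generated samples.
    return values / n_gen


# Usage: three training points, two generated points, one match each.
train = [[0.0, 0.0], [10.0, 10.0], [0.0, 1.0]]
generated = [[0.0, 0.1], [0.0, 0.9]]
print(similarity_valuation(train, generated, k=1))
```

In this toy setup the far-away training point is never matched and so receives no credit. No retraining or gradient computation is involved, which is the property the abstract calls training-free. In practice the features would come from an image encoder rather than raw coordinates.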

Peer Reviews

Decision: ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. The authors propose GMValuator to tackle the data valuation problem for generative models. GMValuator is innovative and model-agnostic, enabling broad applicability and adaptability across various generative models. In addition, GMValuator does not require retraining of models, which is an advantage in R&D scenarios with limited computational resources.
2. The authors provide detailed theoretical justification for formulating data valuation for generative models as a similarity-matching problem

Weaknesses

1. The manuscript is not well written. For example, in the Introduction, before talking about the existing work, it is suggested to generally define/introduce the data valuation problem (including the input and the objective). Moreover, the authors didn't highlight the urgent need for data valuation in existing generative models; this poses a challenge to the motivation of this paper. Most importantly, instead of briefly introducing the principle of the proposed GMValuator (such as why and how t

Reviewer 02 · Rating 6 · Confidence 3

Strengths

- This is the first paper on data valuation for generative models. Previous data valuation methods focus on discriminative models and cannot adapt to generative models.
- Compared to retraining-based and influence-based methods, GMValuator is efficient. It does not require any retraining or computation of the Hessian.
- GMValuator is effective on the proposed metrics and improves significantly on the baseline methods.

Weaknesses

- For SOTA text-to-image models like Stable Diffusion, the image domain is much wider than that of the tested models. As a result, a large number of generated images may be required for accurate data valuation. Meanwhile, generation with these models is slow. More results and ablations on SOTA text-to-image models such as Stable Diffusion would be helpful.
- While the proposed metrics are intuitively reasonable, they are coarse-grained and may not be able to reflect the effectiveness of data evaluation methods

Reviewer 03 · Rating 8 · Confidence 4

Strengths

- The paper introduces a novel and intuitive idea for data valuation in generative models, and the results are promising.
- The experiments are well-designed, exploring multiple distance functions and encoders to validate the approach. Multiple test scenarios are also covered, all showing good supporting results for the proposed method.
- The paper is well-written and easy to follow, effectively conveying the methodology and findings.
- The paper covers the related literature well.

Weaknesses

- The impact of the quantization step on the final results is not explored. Understanding this effect would provide a clearer picture of the method’s performance.
- While section 2 introduces some underlying assumptions and a theoretical motivation for using a similarity-guided data valuation score (illustrated in Figure 1), the framework would benefit from a more rigorous theoretical foundation. Further studies on theoretical support could strengthen the framework’s conceptual grounding and its

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Machine Learning and Data Classification · Advanced Neural Network Applications

Methods: Diffusion