Data value estimation on private gradients

Zijian Zhou; Xinyi Xu; Daniela Rus; Bryan Kian Hsiang Low

arXiv:2412.17008·cs.LG·December 24, 2024

Open Access 3 Reviews

TL;DR

This paper studies data valuation when differential privacy is enforced by perturbing gradients. It shows that injecting i.i.d. Gaussian noise makes the estimation uncertainty grow linearly with the estimation budget, and it proposes a correlated noise injection technique to improve the accuracy of data value estimates.

Contribution

It identifies the limitations of current data valuation methods under DP and introduces a novel correlated noise injection approach to enhance estimation accuracy.

Findings

01

Correlated noise injection improves data value estimation under DP.

02

Existing methods' uncertainty scales linearly with estimation budget.

03

Proposed method is effective in federated learning and dataset valuation.

Abstract

For gradient-based machine learning (ML) methods commonly adopted in practice, such as stochastic gradient descent, the de facto differential privacy (DP) technique is perturbing the gradients with random Gaussian noise. Data valuation attributes the ML performance to the training data and is widely used in privacy-aware applications that require enforcing DP, such as data pricing, collaborative ML, and federated learning (FL). Can existing data valuation methods still be used when DP is enforced via gradient perturbations? We show that the answer is no with the default approach of injecting i.i.d. random noise into the gradients, because the estimation uncertainty of the data value estimate paradoxically scales linearly with the estimation budget, producing estimates that are almost like random guesses. To address this issue, we propose to instead inject carefully correlated noise to provably…
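To make the gradient-perturbation setting concrete, the following is a minimal sketch of why i.i.d. Gaussian noise can get worse as more perturbed gradients are released. It is not the authors' code or analysis. It assumes zero-concentrated DP (zCDP) accounting, under which one Gaussian release with standard deviation sigma costs sensitivity^2 / (2 * sigma^2) and costs add up across releases. The names `per_release_sigma`, `perturb`, `rho_total`, and `sensitivity` are hypothetical and chosen only for illustration.

```python
import numpy as np


def per_release_sigma(k, rho_total, sensitivity=1.0):
    """Noise std per release so that k i.i.d. Gaussian releases
    jointly satisfy rho_total-zCDP (assumed accounting, see lead-in)."""
    # Per-release cost is sensitivity**2 / (2 * sigma**2); k releases cost k times that.
    # Solving k * sensitivity**2 / (2 * sigma**2) = rho_total for sigma:
    return sensitivity * np.sqrt(k / (2.0 * rho_total))


def perturb(gradient, sigma, rng):
    """Add independent Gaussian noise to every coordinate of the gradient."""
    return gradient + rng.normal(0.0, sigma, size=gradient.shape)


rng = np.random.default_rng(0)
gradient = np.ones(1000) / np.sqrt(1000)  # toy gradient with unit L2 norm

for k in (1, 10, 100):
    sigma = per_release_sigma(k, rho_total=0.5)
    noisy = perturb(gradient, sigma, rng)
    # The per-release noise variance grows linearly with the number of releases k.
    print(k, sigma ** 2, np.var(noisy - gradient))
```

Under these assumptions, a fixed total privacy budget forces the per-release noise variance to grow in proportion to the number of releases, which is one way to see how a larger estimation budget can hurt rather than help when the noise is i.i.d.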

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 3 · Confidence 3

Strengths

This is an interesting topic, clearly connected to prior work. There is a lot of interest in estimating data values. The techniques here are non-trivial. It seems plausible to me that the experiments demonstrate a privacy-utility tradeoff that is acceptable for some uses.

Weaknesses

I see three main issues with the submission. I look forward to discussion with the authors and other reviewers. **First,** I did not understand why this notion of estimation uncertainty (Eq 3) is meaningful. An algorithm that always returns $\psi_j=0$ is perfectly private and has no variance. So I don't know how to interpret Proposition 5.3, which says that we can bound the variance by a constant. Is that good? Perhaps the correlated noise technique is simply bringing us closer to the $\psi_j=0

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. Both the motivation and the problem setting are stated clearly, and the proposed adaptive mechanism provably beats the i.i.d. mechanism.
2. Comprehensive experiments are conducted to illustrate the proposed theory.

Weaknesses

1. The statement of assumptions is unclear. In Proposition 5.2, the authors state that the isotropic sub-gaussian assumption is made for the distribution but then introduce the covariance matrix $\Sigma$. Does isotropic mean that the covariance is the identity matrix? (e.g., Definition 3.2.1 in https://www.math.uci.edu/~rvershyn/papers/HDP-book/HDP-book.pdf )
2. While I don't think the isotropic assumption is restrictive, I wonder whether the previously proposed binary counting mechanism( https:/

Reviewer 03 · Rating 6 · Confidence 4

Strengths

- It considers an interesting problem of data value attribution that has wide-ranging applications in many real-world scenarios.
- The authors introduce the concept of estimation uncertainty and, using this metric, devise two algorithms for data value estimation.
- The theoretical analysis of the estimation uncertainty of proposed algorithms shows that their method provably reduces the dependence on the number of evaluations from linear to log-squared k, where k is the number of evaluations.
- T

Weaknesses

- The proposed method is somewhat simple, as it essentially computes a simple weighted average of previously released private gradients. While the simplicity of the proposed method does not imply a lack of novelty, the existence of prior work that also carefully generates the correlated noise to reduce the variance of released statistics suggests that there might be room for further improvement in the proposed approach.
- In essence, the introduced estimation uncertainty is the variance of relea

Taxonomy

Topics: Statistical Methods and Inference · Data Management and Algorithms · Topological and Geometric Data Analysis