Does Training with Synthetic Data Truly Protect Privacy?

Yunpeng Zhao; Jie Zhang

arXiv:2502.12976·cs.CR·February 19, 2025

TL;DR

This paper critically examines whether training with synthetic data genuinely safeguards privacy, highlighting that empirical methods lack rigorous guarantees and may falsely suggest privacy protection.

Contribution

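The paper analyzes four training paradigms that use synthetic data (coreset selection, dataset distillation, data-free knowledge distillation, and data generated by diffusion models), shows that they lead to very different conclusions about privacy preservation, and emphasizes the need for rigorous privacy evaluation.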

Findings

01 · Different synthetic data methods vary greatly in privacy implications

02 · Empirical privacy claims can be misleading without formal guarantees

03 · Rigorous evaluation is essential for trustworthy privacy protection

Abstract

As synthetic data becomes increasingly popular in machine learning tasks, numerous methods--without formal differential privacy guarantees--use synthetic data for training. These methods often claim, either explicitly or implicitly, to protect the privacy of the original training data. In this work, we explore four different training paradigms: coreset selection, dataset distillation, data-free knowledge distillation, and synthetic data generated from diffusion models. While all these methods utilize synthetic data for training, they lead to vastly different conclusions regarding privacy preservation. We caution that empirical approaches to preserving data privacy require careful and rigorous evaluation; otherwise, they risk providing a false sense of privacy.
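To make "careful and rigorous evaluation" concrete, here is a minimal sketch of one way an empirical privacy audit can be scored: the true-positive rate of a membership inference attack at a fixed, low false-positive rate. The metric, the function name `tpr_at_fpr`, and the toy scores are assumptions for illustration and are not taken from this page. The point is only that a handful of highly exposed samples can be invisible to an average-case metric yet obvious to a worst-case one.

```python
def tpr_at_fpr(member_scores, nonmember_scores, target_fpr):
    """True-positive rate of a threshold attack at a fixed false-positive rate.

    A higher score means the (hypothetical) attack believes the sample
    was part of the training set.
    """
    if not member_scores or not nonmember_scores:
        raise ValueError("need at least one score in each group")
    # Number of non-members the attack may flag wrongly (epsilon guards
    # against floating-point rounding just below an integer).
    allowed_fp = int(target_fpr * len(nonmember_scores) + 1e-9)
    ranked = sorted(nonmember_scores, reverse=True)
    # Flag a sample only if its score is strictly above this threshold,
    # so at most `allowed_fp` non-members are flagged.
    threshold = ranked[allowed_fp] if allowed_fp < len(ranked) else float("-inf")
    flagged = sum(1 for s in member_scores if s > threshold)
    return flagged / len(member_scores)


# Toy data (illustrative only): 990 members score like non-members,
# while 10 members are highly exposed.
nonmember_scores = [i / 1000 for i in range(1000)]
member_scores = [i / 1000 for i in range(990)] + [2.0] * 10

# Average-case view: balanced accuracy at a mid-range threshold.
mid = 0.5
tpr_mid = sum(s > mid for s in member_scores) / len(member_scores)
tnr_mid = sum(s <= mid for s in nonmember_scores) / len(nonmember_scores)
balanced_accuracy = (tpr_mid + tnr_mid) / 2

# Worst-case view: how many members are caught at a 0.1% false-positive rate.
low_fpr_tpr = tpr_at_fpr(member_scores, nonmember_scores, 0.001)

print(f"balanced accuracy: {balanced_accuracy:.3f}")
print(f"TPR at 0.1% FPR:   {low_fpr_tpr:.3f}")
```

In this toy setting the balanced accuracy stays close to chance, while the low-FPR metric flags every one of the ten exposed members, which is the kind of gap that an average-only evaluation would miss.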

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

- The paper is very well written and the experiments are well-explained and seem solid.
- The paper can serve as a good reference for showing the strength of DP-SGD for obtaining a good privacy-utility trade-off for synthetic image data.

Weaknesses

- The paper does not truly present many novel ideas, though it is another valuable demonstration of the effectiveness of DP-SGD for obtaining a good privacy-utility tradeoff for ML models when privacy protection is measured via the most vulnerable samples. Regarding the idea of auditing with worst-case samples, for example: it has been studied extensively and is already considered by Carlini et al., 2019, "The secret sharer: Evaluating and testing unintended memorization in neural networks

Reviewer 02 · Rating 6 · Confidence 3

Strengths

- This paper provides an apples-to-apples empirical privacy comparison of several methods for training vision models that claim to preserve privacy to some degree. In doing so, it improves upon prior empirical privacy analyses, some of which were rather flawed.
- The presentation is clear and the text is well-written. The figures throughout are particularly helpful. See question 1 on this point.

Weaknesses

- The DP-SGD comparison is potentially misleading. The "baseline" method does not satisfy differential privacy. See question 2. The paper would benefit from including an additional baseline that satisfies DP, e.g., [4] discusses how to incorporate hyperparameter tuning into the privacy analysis. It may also be interesting to include a fully non-private baseline using a standard training routine, i.e., just train ResNet with SGD.
- The paper would benefit from a discussion of how the formal guarante

Reviewer 03 · Rating 6 · Confidence 2

Strengths

This broad approach offers a thorough understanding of various methodologies in synthetic data utilization and their impact on privacy. The study juxtaposes synthetic data-based techniques with Differential Privacy-SGD (DPSGD) as a baseline, which helps readers contextualize the efficacy of synthetic data methods in privacy preservation compared to a gold-standard approach like DPSGD. The study identifies instances where synthetic data, despite visual dissimilarity from private data, can still l

Weaknesses

The experiments focus on CIFAR-10 and specific models, such as ResNet-18, which may limit the generalizability of findings. The paper’s findings could vary across more complex datasets or architectures, and broader experiments could better represent the implications for privacy in diverse real-world scenarios. Techniques like DPSGD are noted for efficiency, yet they are resource-intensive. The paper briefly mentions but does not deeply engage with the practical constraints of computational cos

Code & Models

Repositories

yunpeng-zhao/syndata-privacy
PyTorch · Official

Taxonomy

Topics: Privacy-Preserving Technologies in Data

Methods: Diffusion