Private Training & Data Generation by Clustering Embeddings

Felix Zhou; Samson Zhou; Vahab Mirrokni; Alessandro Epasto; Vincent Cohen-Addad

arXiv:2506.16661·cs.LG·June 23, 2025

Open Access

TL;DR

This paper presents a differentially private (DP) method for generating synthetic image embeddings: it fits a Gaussian Mixture Model (GMM) in an embedding space using DP clustering, and the resulting synthetic data can be used to train neural networks.

Contribution

The authors introduce a DP clustering approach for generating synthetic embeddings. The method provably learns a GMM under separation conditions, which gives the privacy-preserving data generation step a theoretical guarantee.

Findings

01. Achieves state-of-the-art classification accuracy with synthetic embeddings

02. Generates realistic synthetic images with high downstream task performance

03. Method is scalable and adaptable to different tasks

Abstract

Deep neural networks often use large, high-quality datasets to achieve high performance on many machine learning tasks. When training involves potentially sensitive data, this process can raise privacy concerns, as large models have been shown to unintentionally memorize and reveal sensitive information, including reconstructing entire training samples. Differential privacy (DP) provides a robust framework for protecting individual data. In particular, a new approach to privately training deep neural networks is to approximate the input dataset with a privately generated synthetic dataset before running any subsequent training algorithm. We introduce a novel principled method for DP synthetic image embedding generation, based on fitting a Gaussian Mixture Model (GMM) in an appropriate embedding space using DP clustering. Our method provably learns a GMM under separation conditions.…
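
To make the pipeline in the abstract concrete, here is a minimal sketch of the general idea: cluster embeddings with noise added to the released statistics, treat the result as a spherical GMM, and sample synthetic embeddings from it. This is not the authors' algorithm. It swaps in a noisy Lloyd-style k-means with Gaussian noise, and every name and parameter (`noisy_lloyd_gmm`, `sample_synthetic`, `sigma`, `max_norm`, `component_std`) is hypothetical. Calibrating `sigma` to a target privacy budget is omitted, so the sketch makes no claim about a specific privacy guarantee.

```python
import numpy as np


def clip_rows(x, max_norm):
    """Scale each row so its L2 norm is at most max_norm (bounds sensitivity)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x * np.minimum(1.0, max_norm / np.maximum(norms, 1e-12))


def noisy_lloyd_gmm(emb, k, iters=5, sigma=1.0, max_norm=1.0, seed=0):
    """Noisy Lloyd-style clustering; returns mixture weights and centers.

    Assumption: sigma is a free noise scale. Mapping it to an (epsilon, delta)
    budget across all iterations is deliberately left out of this sketch.
    """
    rng = np.random.default_rng(seed)
    x = clip_rows(np.asarray(emb, dtype=float), max_norm)
    d = x.shape[1]
    # Data-independent initialization, so the starting centers leak nothing.
    centers = clip_rows(rng.normal(size=(k, d)), max_norm)
    for _ in range(iters):
        # Assign every point to its nearest current center.
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        onehot = np.eye(k)[dists.argmin(axis=1)]
        # Release only noisy per-cluster counts and sums.
        counts = onehot.sum(axis=0) + rng.normal(scale=sigma, size=k)
        sums = onehot.T @ x + rng.normal(scale=sigma * max_norm, size=(k, d))
        counts = np.maximum(counts, 1.0)  # avoid dividing by tiny or negative counts
        centers = clip_rows(sums / counts[:, None], max_norm)
    weights = counts / counts.sum()
    return weights, centers


def sample_synthetic(weights, centers, n, component_std=0.1, seed=1):
    """Draw n synthetic embeddings from a spherical GMM.

    Assumption: component_std is fixed by hand rather than privately estimated.
    """
    rng = np.random.default_rng(seed)
    comp = rng.choice(len(weights), size=n, p=weights)
    noise = rng.normal(scale=component_std, size=(n, centers.shape[1]))
    return centers[comp] + noise


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    # Toy stand-in for image embeddings: two well-separated blobs in 4 dimensions.
    emb = np.vstack([rng.normal(loc=c, scale=0.05, size=(200, 4)) for c in (0.4, -0.4)])
    weights, centers = noisy_lloyd_gmm(emb, k=2)
    synthetic = sample_synthetic(weights, centers, n=50)
    print(synthetic.shape)
```

A downstream classifier would then be trained on the sampled embeddings instead of the originals. Since it only touches the noisy statistics, it needs no further access to the sensitive data.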

Taxonomy

Topics: Privacy-Preserving Technologies in Data · Stochastic Gradient Optimization Techniques · Adversarial Robustness in Machine Learning