Privacy-preserving datasets by capturing feature distributions with Conditional VAEs
Francesco Di Salvo, David Tafler, Sebastian Doerrich, and Christian Ledig

TL;DR
This paper presents a novel privacy-preserving data generation method using Conditional Variational Autoencoders trained on features from foundation models, enhancing data diversity and privacy in sensitive domains like medicine.
Contribution
Introduces a CVAE-based approach trained on foundation model features to generate diverse, privacy-preserving synthetic data, outperforming traditional anonymization methods.
Findings
Outperforms traditional anonymization in data diversity
Provides higher robustness against perturbations
Generates privacy-respecting synthetic datasets of arbitrary size
Abstract
Large and well-annotated datasets are essential for advancing deep learning applications, but they are often costly or impossible for a single entity to obtain. In many areas, including the medical domain, approaches relying on data sharing have become critical to address those challenges. While effective in increasing dataset size and diversity, data sharing raises significant privacy concerns. Commonly employed anonymization methods based on the k-anonymity paradigm often fail to preserve data diversity, affecting model robustness. This work introduces a novel approach using Conditional Variational Autoencoders (CVAEs) trained on feature vectors extracted from large pre-trained vision foundation models. Foundation models effectively detect and represent complex patterns across diverse domains, allowing the CVAE to faithfully capture the embedding space of a given data distribution to…
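As a rough illustration of the idea in the abstract, the sketch below shows the standard building blocks of a CVAE objective applied to feature vectors: the KL term against a standard normal prior, the reparameterization trick, and class-conditional sampling of synthetic features. It is a minimal sketch under stated assumptions, not the authors' implementation. The summary does not give the paper's architecture, loss weighting, or feature extractor, so the squared-error reconstruction term, the `beta` weight, and every function name here (including the `toy_decoder` stand-in for a trained network) are hypothetical.

```python
import math
import random


def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, logvar))


def reparameterize(mu, logvar, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1) (reparameterization trick)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


def cvae_loss(x, x_recon, mu, logvar, beta=1.0):
    """Negative ELBO for one sample.

    Assumption: squared-error reconstruction of the feature vector plus a
    beta-weighted KL term; the paper may use a different formulation.
    """
    recon = sum((a - b) ** 2 for a, b in zip(x, x_recon))
    return recon + beta * kl_to_standard_normal(mu, logvar)


def sample_synthetic(decoder, label, latent_dim, n, rng):
    """Generate n synthetic feature vectors for one class by decoding z ~ N(0, I)."""
    samples = []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
        samples.append(decoder(z, label))
    return samples


def toy_decoder(z, label):
    """Hypothetical stand-in for a trained decoder network p(x | z, y)."""
    return [float(label) + 0.1 * zi for zi in z]


if __name__ == "__main__":
    rng = random.Random(0)
    for features in sample_synthetic(toy_decoder, label=1, latent_dim=4, n=3, rng=rng):
        print(features)
```

In a real pipeline the decoder would be a trained network, and the sampled vectors would stand in for the original features when training a downstream classifier, so that only synthetic features need to be shared.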
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Traffic Prediction and Management Techniques · Big Data Technologies and Applications
Methods: Sparse Evolutionary Training · Conditional Variational Auto Encoder
