Privacy-Preserving Synthetic Datasets Over Weakly Constrained Domains
Luke Rodriguez, Bill Howe

TL;DR
This paper introduces a new algorithm for generating differentially private synthetic datasets over large, weakly constrained domains, improving data sharing while maintaining privacy without requiring domain-specific data inspection.
Contribution
The paper presents an algorithm that models the unrepresented part of the domain analytically as a probability distribution, enabling privacy-preserving synthetic data generation in realistic open data scenarios without computing the full domain explicitly.
Findings
Produces sensible results on real datasets
Models unrepresented domains analytically
Balances privacy and utility effectively
Abstract
Techniques to deliver privacy-preserving synthetic datasets take a sensitive dataset as input and produce a similar dataset as output while maintaining differential privacy. These approaches have the potential to improve data sharing and reuse, but they must be accessible to non-experts and tolerant of realistic data. Existing approaches make an implicit assumption that the active domain of the dataset is similar to the global domain, potentially violating differential privacy. In this paper, we present an algorithm for generating differentially private synthetic data over the large, weakly constrained domains we find in realistic open data situations. Our algorithm models the unrepresented domain analytically as a probability distribution to adjust the output and compute noise, avoiding the need to compute the full domain explicitly. We formulate the tradeoff between privacy and…
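The idea of handling unrepresented values through a probability distribution, rather than enumerating them, can be illustrated with a thresholded Laplace histogram. The sketch below is not the authors' algorithm. It is a minimal, generic illustration that assumes an integer domain `0..domain_size-1`, a sensitivity of 1 (one record changes one count by 1), and a non-negative release threshold. The function names `dp_sparse_histogram` and `count_tail_hits` are hypothetical. Values absent from the data have a true count of zero, so the number whose noisy count would exceed the threshold is binomial, and each surviving noisy count follows the Laplace tail, which is the threshold plus an exponential draw.

```python
import math
import random
from collections import Counter


def laplace(scale, rng):
    # The difference of two i.i.d. exponentials is Laplace(0, scale).
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def count_tail_hits(m, p, rng):
    """Sample Binomial(m, p) by jumping between successes with geometric gaps,
    so the cost depends on the number of hits rather than on m."""
    if m <= 0 or p <= 0:
        return 0
    hits, pos = 0, 0
    log_q = math.log1p(-p)
    while True:
        u = 1.0 - rng.random()  # uniform on (0, 1]
        pos += int(math.log(u) / log_q) + 1  # trials until the next success
        if pos > m:
            return hits
        hits += 1


def dp_sparse_histogram(records, domain_size, epsilon, threshold, rng):
    """Illustrative sketch only: noisy counts above `threshold`, without
    visiting every value of a very large domain."""
    assert threshold >= 0 and epsilon > 0
    scale = 1.0 / epsilon  # assumed sensitivity of 1
    counts = Counter(records)
    released = {}

    # Represented values: ordinary Laplace noise, then thresholding.
    for value, c in counts.items():
        noisy = c + laplace(scale, rng)
        if noisy > threshold:
            released[value] = noisy

    # Unrepresented values: true count is 0, so each one passes the
    # threshold independently with probability P(Laplace > threshold).
    m = domain_size - len(counts)
    p = 0.5 * math.exp(-threshold / scale)
    k = count_tail_hits(m, p, rng)
    while k > 0:
        value = rng.randrange(domain_size)
        if value in counts or value in released:
            continue  # reject represented or already chosen values
        # Conditional on exceeding the threshold, the excess is exponential.
        released[value] = threshold + rng.expovariate(1 / scale)
        k -= 1
    return released


rng = random.Random(0)
result = dp_sparse_histogram([3, 3, 3, 7] * 50, 10**12, 1.0, 20.0, rng)
```

Here the domain holds 10^12 possible values but only two appear in the data. The released histogram still contains some unrepresented values, as it would if noise had been added to every cell, yet the work done is proportional to the number of released entries rather than to the domain size.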
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Cryptography and Data Security · Stochastic Gradient Optimization Techniques
