Collaborative Learning From Distributed Data With Differentially Private Synthetic Twin Data
Lukas Prediger, Joonas Jälkö, Antti Honkela, Samuel Kaski

TL;DR
This paper introduces a framework where multiple parties share differentially private synthetic data to collaboratively learn population statistics, improving accuracy especially for small or heterogeneous datasets without compromising privacy.
Contribution
The study demonstrates that sharing differentially private synthetic twin data enables effective collaborative learning from sensitive health data, even with small or diverse datasets.
Findings
Synthetic twin data improves statistical estimates over local data.
More participating parties lead to larger, more consistent improvements.
Sharing synthetic data benefits underrepresented groups in analysis.
Abstract
Consider a setting where multiple parties holding sensitive data aim to collaboratively learn population level statistics, but pooling the sensitive data sets is not possible. We propose a framework in which each party shares a differentially private synthetic twin of their data. We study the feasibility of combining such synthetic twin data sets for collaborative learning on real-world health data from the UK Biobank. We discover that parties engaging in the collaborative learning via shared synthetic data obtain more accurate estimates of target statistics compared to using only their local data. This finding extends to the difficult case of small heterogeneous data sets. Furthermore, the more parties participate, the larger and more consistent the improvements become. Finally, we find that data sharing can especially help parties whose data contain underrepresented groups to perform…
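The workflow described in the abstract (each party releases a differentially private synthetic twin, and a party then combines its own data with the twins shared by others) can be illustrated with a toy sketch. This is not the authors' method: the summary above does not specify their generative model, so the sketch swaps in a deliberately simple one, a one-dimensional Gaussian fitted from Laplace-noised moments. The function names (`dp_synthetic_twin`, `combined_mean`), the public bounds `lo` and `hi`, the even split of `epsilon` across two moments, and the assumption that each party's data set size is public are all hypothetical choices made for illustration.

```python
import random
import statistics


def laplace(rng, scale):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with location 0 and that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_synthetic_twin(data, lo, hi, epsilon, n_synth, rng):
    """Toy epsilon-DP synthetic twin (illustrative assumption, not the paper's model).

    Clips values to the public range [lo, hi], releases the first and second
    moments with the Laplace mechanism, and samples from the fitted Gaussian.
    """
    n = len(data)  # assumed public
    # Clip and rescale to [0, 1] so that each moment has sensitivity 1/n.
    scaled = [(min(max(x, lo), hi) - lo) / (hi - lo) for x in data]
    # Half of the privacy budget goes to each of the two moments.
    noise_scale = (1.0 / n) / (epsilon / 2.0)
    m1 = sum(scaled) / n + laplace(rng, noise_scale)
    m2 = sum(y * y for y in scaled) / n + laplace(rng, noise_scale)
    # Everything below is post-processing and does not consume privacy budget.
    m1 = min(max(m1, 0.0), 1.0)
    sd = max(m2 - m1 * m1, 1e-6) ** 0.5
    return [lo + (hi - lo) * rng.gauss(m1, sd) for _ in range(n_synth)]


def combined_mean(local_data, shared_twins):
    # A party pools its own real data with the synthetic twins it received.
    pooled = list(local_data)
    for twin in shared_twins:
        pooled.extend(twin)
    return statistics.fmean(pooled)


rng = random.Random(0)
# Three heterogeneous parties whose local means differ; the population mean is 50.
parties = [[rng.gauss(mu, 10.0) for _ in range(2000)] for mu in (45.0, 50.0, 55.0)]
twins = [
    dp_synthetic_twin(p, lo=0.0, hi=100.0, epsilon=1.0, n_synth=len(p), rng=rng)
    for p in parties
]

local_only = statistics.fmean(parties[0])            # uses party 0's data only
with_sharing = combined_mean(parties[0], twins[1:])  # adds the other parties' twins
print(f"local only: {local_only:.2f}, with shared twins: {with_sharing:.2f}")
```

In this toy setup the first party's local estimate is biased toward its own subpopulation, while the pooled estimate moves toward the population mean. That mirrors the qualitative claim in the abstract, but says nothing about the paper's actual models, privacy accounting, or UK Biobank results.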
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Data-Driven Disease Surveillance · Survey Methodology and Nonresponse
