Representative & Fair Synthetic Data
Paul Tiwald, Alexandra Ebert, Daniel T. Soukup

TL;DR
This paper introduces a framework for generating synthetic data that is both representative and fair, aiming to reduce societal biases in AI training data while preserving data utility.
Contribution
It proposes a method for incorporating fairness constraints into the self-supervised training of generative models, so that the resulting synthetic datasets are fair with respect to the chosen constraints.
Findings
The framework generates fair synthetic data for the UCI Adult census dataset.
Biases with respect to gender and race are controlled while the relationships in the data are maintained.
Downstream models trained on the synthetic data show less bias than models trained on the original data.
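The bias referred to in the findings above can be quantified in several ways, and this summary does not state which metric the authors use. As a minimal sketch, assuming statistical parity difference (the gap in positive-outcome rates between two groups) as the measure, the following Python computes it on a small hypothetical sample. The function names and the toy data are illustrative assumptions, not taken from the paper.

```python
from typing import Hashable, Sequence


def positive_rate(outcomes: Sequence[int], groups: Sequence[Hashable], group: Hashable) -> float:
    """Share of positive outcomes (1) among the records belonging to `group`."""
    selected = [y for y, g in zip(outcomes, groups) if g == group]
    if not selected:
        raise ValueError(f"no records found for group {group!r}")
    return sum(selected) / len(selected)


def statistical_parity_difference(
    outcomes: Sequence[int],
    groups: Sequence[Hashable],
    group_a: Hashable,
    group_b: Hashable,
) -> float:
    """Positive rate of group_a minus positive rate of group_b.

    Assumption: this is one common fairness measure, not necessarily
    the one used in the paper.
    """
    return positive_rate(outcomes, groups, group_a) - positive_rate(outcomes, groups, group_b)


# Hypothetical toy data: 1 = high income predicted, 0 = low income predicted.
toy_outcomes = [1, 1, 1, 0, 1, 0, 0, 0]
toy_groups = ["male", "male", "male", "male", "female", "female", "female", "female"]

spd = statistical_parity_difference(toy_outcomes, toy_groups, "male", "female")
print(f"statistical parity difference: {spd:.2f}")
```

A value near zero means both groups receive the positive outcome at similar rates. Computing the value once for a model trained on the original data and once for a model trained on the synthetic data is one way to check the third finding.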
Abstract
Algorithms learn rules and associations from the training data they are exposed to. Yet the very same data that teaches machines to understand and predict the world contains societal and historic biases, resulting in biased algorithms that risk further amplifying these biases once put into use for decision support. Synthetic data, on the other hand, emerges with the promise of providing an unlimited amount of representative, realistic training samples that can be shared further without compromising the privacy of individual subjects. We present a framework for incorporating fairness constraints into the self-supervised learning process, which then allows simulating an unlimited amount of synthetic data that is both representative and fair. This framework provides a handle to govern and control for privacy as well as for bias within AI at its very source: the training data. We demonstrate…
Taxonomy
Topics: Explainable Artificial Intelligence (XAI)
