FairCauseSyn: Towards Causally Fair LLM-Augmented Synthetic Data Generation

Nitish Nagesh; Ziyu Wang; Amir M. Rahmani

arXiv:2506.19082 · cs.LG · June 25, 2025

Open Access

TL;DR

This paper introduces FairCauseSyn, a novel LLM-augmented method for generating synthetic health data that maintains causal fairness, reducing bias and improving equitable health research.

Contribution

It is the first to incorporate causal fairness into LLM-based synthetic health data generation, addressing a key gap in existing methods.

Findings

01 Synthetic data deviates by less than 10% from real data on causal fairness metrics.

02 With causally fair predictors, the synthetic data reduces bias on the sensitive attribute by 70%.

03 The method broadens access to fair synthetic health data.

Abstract

Synthetic data generation creates data based on real-world data using generative models. In health applications, generating high-quality data while maintaining fairness for sensitive attributes is essential for equitable outcomes. Existing GAN-based and LLM-based methods focus on counterfactual fairness and are primarily applied in finance and legal domains. Causal fairness provides a more comprehensive evaluation framework by preserving causal structure, but current synthetic data generation methods do not address it in health settings. To fill this gap, we develop the first LLM-augmented synthetic data generation method to enhance causal fairness using real-world tabular health data. Our generated data deviates by less than 10% from real data on causal fairness metrics. When trained on causally fair predictors, synthetic data reduces bias on the sensitive attribute by 70% compared to…
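One way to read the "deviates by less than 10%" claim is as a relative-difference check between a fairness metric computed on real data and the same metric computed on synthetic data. The sketch below illustrates only that comparison. This page does not specify the paper's causal fairness metrics, so `outcome_gap` (a plain difference in mean outcome between two groups) is a stand-in assumption, and the field names `a` and `y`, the helper `make_rows`, and the toy records are all hypothetical.

```python
from statistics import mean


def make_rows(group, positives, total):
    """Build toy records: `total` rows in `group`, the first `positives` with y=1."""
    return [{"a": group, "y": 1 if i < positives else 0} for i in range(total)]


def outcome_gap(rows, attr="a", outcome="y"):
    """Difference in mean outcome between attr=1 and attr=0.

    Assumption: a simple observational disparity used as a placeholder,
    not the causal fairness metrics evaluated in the paper.
    """
    group1 = [r[outcome] for r in rows if r[attr] == 1]
    group0 = [r[outcome] for r in rows if r[attr] == 0]
    return mean(group1) - mean(group0)


def relative_deviation(real_value, synthetic_value):
    """Absolute difference between the two values, relative to the real value."""
    if real_value == 0:
        raise ValueError("relative deviation is undefined when the real value is 0")
    return abs(synthetic_value - real_value) / abs(real_value)


# Toy data (hypothetical): sensitive attribute `a`, binary outcome `y`.
real_rows = make_rows(1, 3, 4) + make_rows(0, 1, 4)
synthetic_rows = make_rows(1, 38, 50) + make_rows(0, 14, 50)

real_gap = outcome_gap(real_rows)
synthetic_gap = outcome_gap(synthetic_rows)
deviation = relative_deviation(real_gap, synthetic_gap)

print(f"real gap={real_gap:.3f}, synthetic gap={synthetic_gap:.3f}, "
      f"relative deviation={deviation:.1%}")
print("within 10% threshold:", deviation < 0.10)
```

The same `relative_deviation` helper could be applied to any scalar fairness metric, causal or otherwise, as long as the metric is computed identically on both datasets.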

Taxonomy

Topics: Privacy-Preserving Technologies in Data · Adversarial Robustness in Machine Learning