FairTabGen: High-Fidelity and Fair Synthetic Health Data Generation from Limited Samples
Nitish Nagesh, Salar Shakibhamedan, Mahdi Bagheri, Ziyu Wang, Nima TaheriNejad, Axel Jantsch, Amir M. Rahmani

TL;DR
FairTabGen is a novel LLM-based framework that generates high-quality, fair synthetic healthcare data from limited samples, addressing privacy concerns and reducing computational requirements in clinical research.
Contribution
It introduces a new method combining in-context learning, prompt curation, and structural constraints for efficient, fair synthetic health data generation from small datasets.
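As a rough illustration of how in-context learning, prompt curation, and structural constraints might fit together, the sketch below builds a few-shot prompt from a handful of real rows and checks generated rows against a schema. The column names, the schema format, and the functions `build_prompt` and `validate_row` are hypothetical and chosen only for illustration; the paper's actual prompts and constraints are not shown in this summary.

```python
import json

# Hypothetical schema (an assumption, not taken from the paper):
# column -> ("int", low, high) or ("cat", allowed_values)
SCHEMA = {
    "age": ("int", 18, 100),
    "gender": ("cat", ["F", "M"]),
    "readmitted": ("cat", [0, 1]),
}

def build_prompt(examples, n_rows, schema=SCHEMA):
    """Assemble a few-shot prompt: constraints first, then example rows."""
    lines = [
        "Generate synthetic patient records as JSON objects, one per line.",
        "Constraints:",
    ]
    for col, spec in schema.items():
        if spec[0] == "int":
            lines.append(f"- {col}: integer in [{spec[1]}, {spec[2]}]")
        else:
            lines.append(f"- {col}: one of {spec[1]}")
    lines.append("Examples:")
    lines.extend(json.dumps(row) for row in examples)
    lines.append(f"Now generate {n_rows} new rows that follow the same schema.")
    return "\n".join(lines)

def validate_row(row, schema=SCHEMA):
    """Return True only if the row has exactly the schema's columns and valid values."""
    if set(row) != set(schema):
        return False
    for col, spec in schema.items():
        value = row[col]
        if spec[0] == "int":
            is_int = isinstance(value, int) and not isinstance(value, bool)
            if not is_int or not (spec[1] <= value <= spec[2]):
                return False
        elif value not in spec[1]:
            return False
    return True
```

In a pipeline of this shape, the prompt would be sent to an LLM and any returned row that fails `validate_row` would be discarded or regenerated; the LLM call itself is left out because the summary does not specify a model or API.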
Findings
Uses 99% less data than traditional methods
Achieves a 50% improvement in fairness through unawareness
Enhances fairness by 10% with bias mitigation techniques
Abstract
Synthetic healthcare data generation offers a promising solution to research limitations in clinical settings caused by privacy and regulatory constraints. However, current synthetic data generation approaches require specialized knowledge of training generative models and substantial computational resources. In this paper, we propose FairTabGen, an LLM-based tabular data generation framework that produces high-quality synthetic healthcare data using only a small subset of the original dataset. Our method combines in-context learning, prompt curation, and embedding structural constraints for data synthesis. We evaluate performance on the MIMIC-IV dataset. Our method uses 99% less data and achieves a 50% improvement in fairness through unawareness while maintaining competitive predictive utility. However, we observe that the data distribution of racial groups is skewed, affecting demographic…
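Fairness through unawareness, the criterion named in the abstract, means that a downstream model never sees protected attributes as explicit inputs. A minimal sketch of that preprocessing step follows; the column names and the helper `drop_sensitive` are hypothetical, and the paper's own evaluation pipeline may differ.

```python
# Hypothetical set of protected attributes (an assumption for illustration).
SENSITIVE = {"race", "gender"}

def drop_sensitive(rows, sensitive=SENSITIVE):
    """Return copies of the rows with protected attributes removed."""
    return [
        {col: value for col, value in row.items() if col not in sensitive}
        for row in rows
    ]

rows = [
    {"age": 64, "race": "A", "gender": "F", "readmitted": 0},
    {"age": 51, "race": "B", "gender": "M", "readmitted": 1},
]
features = drop_sensitive(rows)
```

Removing the columns does not remove correlated proxies for them, so unawareness is a weak guarantee on its own and is usually reported alongside other fairness measures.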
Taxonomy
Topics: Ethics and Social Impacts of AI
