Improving Equity in Health Modeling with GPT4-Turbo Generated Synthetic Data: A Comparative Study
Daniel Smolyak, Arshana Welivita, Margrét V. Bjarnadóttir, Ritu Agarwal

TL;DR
This study explores using GPT4-Turbo to generate synthetic, group-specific medical data to improve fairness and health equity in machine learning models, showing generally positive but context-dependent results.
Contribution
It introduces a pipeline for generating demographic-specific synthetic data with GPT4-Turbo to address data imbalance in health datasets.
Findings
GPT4-Turbo synthetic data often improves model performance.
Group-specific prompts yield limited additional benefits.
The method can enhance fairness in health data modeling.
Abstract
Objective. Demographic groups are often represented at different rates in medical datasets. These differences can create bias in machine learning algorithms, with higher levels of performance for better-represented groups. One promising solution to this problem is to generate synthetic data to mitigate potential adverse effects of non-representative data sets. Methods. We build on recent advances in LLM-based synthetic data generation to create a pipeline where the synthetic data is generated separately for each demographic group. We conduct our study using MIMIC-IV and Framingham "Offspring and OMNI-1 Cohorts" datasets. We prompt GPT4-Turbo to create group-specific data, providing training examples and the dataset context. An exploratory analysis is conducted to ascertain the quality of the generated data. We then evaluate the utility of the synthetic data for augmentation of a…
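The group-specific prompting step described in the Methods can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the function names, the prompt wording, and the toy column names are all hypothetical, and the actual call to GPT4-Turbo is deliberately left out. The sketch only shows how records might be split by demographic group and how a few sampled training examples plus the dataset context could be assembled into one prompt per group.

```python
import random
from collections import defaultdict


def group_records(records, group_key):
    """Split a list of row dicts into per-group lists keyed by group_key."""
    groups = defaultdict(list)
    for row in records:
        groups[row[group_key]].append(row)
    return dict(groups)


def build_group_prompt(context, group_key, group_value, examples, n_new):
    """Assemble a text prompt for one group (wording is an assumption)."""
    header = ", ".join(examples[0].keys())
    rows = "\n".join(", ".join(str(v) for v in ex.values()) for ex in examples)
    return (
        f"{context}\n"
        f"All records below have {group_key} = {group_value}.\n"
        f"Columns: {header}\n"
        f"Examples:\n{rows}\n"
        f"Generate {n_new} new, realistic records for the same group, "
        f"one per line, using the same columns."
    )


def prompts_per_group(records, group_key, context, n_examples=3, n_new=10, seed=0):
    """Return {group_value: prompt}, sampling few-shot examples per group."""
    rng = random.Random(seed)  # fixed seed so the sampled examples are reproducible
    prompts = {}
    for value, rows in group_records(records, group_key).items():
        k = min(n_examples, len(rows))  # small groups may have fewer rows than requested
        prompts[value] = build_group_prompt(
            context, group_key, value, rng.sample(rows, k), n_new
        )
    return prompts
```

Each resulting prompt would then be sent to the language model separately, and the returned rows parsed and appended to the training data for that group. How many examples to include, and how the responses are validated, are design choices this sketch does not settle.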
Taxonomy
Topics: Machine Learning in Healthcare
