Improving Equity in Health Modeling with GPT4-Turbo Generated Synthetic Data: A Comparative Study

Daniel Smolyak; Arshana Welivita; Margrét V. Bjarnadóttir; Ritu Agarwal

arXiv:2412.16335 · cs.LG · December 24, 2024 · 2 cites

Open Access

TL;DR

This study explores using GPT4-Turbo to generate synthetic, group-specific medical data to improve fairness and health equity in machine learning models, showing generally positive but context-dependent results.

Contribution

It introduces a pipeline for generating demographic-specific synthetic data with GPT4-Turbo to address data imbalance in health datasets.

Findings

01. GPT4-Turbo synthetic data often improves model performance.

02. Group-specific prompts yield limited additional benefits.

03. The method can enhance fairness in health data modeling.

Abstract

Objective. Demographic groups are often represented at different rates in medical datasets. These differences can bias machine learning algorithms, which tend to perform better for better-represented groups. One promising solution is to generate synthetic data to mitigate the potential adverse effects of non-representative datasets.

Methods. We build on recent advances in LLM-based synthetic data generation to create a pipeline in which synthetic data is generated separately for each demographic group. We conduct our study using the MIMIC-IV and Framingham "Offspring and OMNI-1 Cohorts" datasets. We prompt GPT4-Turbo to create group-specific data, providing training examples and the dataset context. We conduct an exploratory analysis to ascertain the quality of the generated data. We then evaluate the utility of the synthetic data for augmentation of a…
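The group-specific prompting step described under Methods can be illustrated with a minimal sketch. It is not the authors' pipeline or their actual prompt wording: the column names, toy records, context string, and helper functions below are all hypothetical, and the call that would send each prompt to GPT4-Turbo is deliberately left out. The sketch only shows the idea of splitting training records by demographic group and building one prompt per group from that group's examples plus the dataset context.

```python
import csv
import io
import random
from collections import defaultdict


def split_by_group(records, group_key):
    """Bucket records (dicts) by the value of a demographic column."""
    groups = defaultdict(list)
    for row in records:
        groups[row[group_key]].append(row)
    return dict(groups)


def rows_to_csv(rows, columns):
    """Serialize rows as CSV text so they can be pasted into a prompt."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue().strip()


def build_group_prompt(context, group_key, group_value, rows, columns,
                       n_examples, n_new, rng):
    """Build one prompt that uses examples from a single group only."""
    examples = rng.sample(rows, min(n_examples, len(rows)))
    return (
        f"{context}\n\n"
        f"Example records where {group_key} = {group_value}:\n"
        f"{rows_to_csv(examples, columns)}\n\n"
        f"Generate {n_new} new, realistic records for this group "
        f"in the same CSV format, using the same columns."
    )


def build_prompts(records, group_key, context, n_examples=2, n_new=5, seed=0):
    """Return a dict mapping each group value to its own prompt."""
    rng = random.Random(seed)
    columns = list(records[0].keys())
    return {
        value: build_group_prompt(context, group_key, value, rows, columns,
                                  n_examples, n_new, rng)
        for value, rows in split_by_group(records, group_key).items()
    }


# Hypothetical toy records; these are NOT values from MIMIC-IV or Framingham.
TOY_RECORDS = [
    {"group": "A", "age": 61, "sbp": 130, "outcome": 0},
    {"group": "A", "age": 54, "sbp": 118, "outcome": 0},
    {"group": "A", "age": 67, "sbp": 142, "outcome": 1},
    {"group": "B", "age": 70, "sbp": 150, "outcome": 1},
]

# Hypothetical dataset context; a real one would describe the actual columns.
CONTEXT = "Hypothetical tabular health dataset with columns group, age, sbp, outcome."

PROMPTS = build_prompts(TOY_RECORDS, "group", CONTEXT)

if __name__ == "__main__":
    for value, prompt in PROMPTS.items():
        print(f"--- prompt for group {value} ---")
        print(prompt)
```

Each prompt would then be sent to the model separately and the returned rows parsed and appended to that group's training data. Keeping the prompts separate per group is what lets an under-represented group (here, group B with a single record) be augmented from its own examples rather than from the pooled data.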

Taxonomy

Topics: Machine Learning in Healthcare