Utilizing Large Language Models to Generate Synthetic Data to Increase the Performance of BERT-Based Neural Networks

Chancellor R. Woolsey; Prakash Bisht; Joshua Rothman; Gondy Leroy

arXiv:2405.06695·cs.CL·May 14, 2024·6 cites

Open Access

TL;DR

This study explores using large language models to generate synthetic medical data, with the aim of improving BERT-based neural network performance on autism diagnosis tasks and easing data scarcity in healthcare.

Contribution

It demonstrates the feasibility of using LLM-generated synthetic data to improve model recall in autism-related classification tasks.

Findings

1. Synthetic data contained 83% correct labels
2. Data augmentation increased recall by 13%
3. Precision decreased by 16% with synthetic data

Abstract

An important issue impacting healthcare is a lack of available experts. Machine learning (ML) models could resolve this by aiding in diagnosing patients. However, creating datasets large enough to train these models is expensive. We evaluated large language models (LLMs) for data creation. Using Autism Spectrum Disorders (ASD), we prompted ChatGPT and GPT-Premium to generate 4,200 synthetic observations to augment existing medical data. Our goal is to label behaviors corresponding to autism criteria and improve model accuracy with synthetic training data. We used a BERT classifier pre-trained on biomedical literature to assess differences in performance between models. A random sample (N=140) from the LLM-generated data was evaluated by a clinician and found to contain 83% correct example-label pairs. Augmenting data increased recall by 13% but decreased precision by 16%, correlating…
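The recall and precision changes reported above move in opposite directions, which is a common pattern when training data is augmented with partly mislabeled examples. The following is a minimal sketch of how the two metrics are computed from confusion-matrix counts. The function name `precision_recall` and all counts are hypothetical and chosen only to illustrate the direction of the trade-off; they are not taken from the paper and do not reproduce its reported percentages.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (precision, recall) from true-positive, false-positive,
    and false-negative counts. Empty denominators yield 0.0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts for illustration only (NOT the paper's results).
# Baseline model: finds fewer positives, but most of its predictions are right.
baseline = precision_recall(tp=60, fp=15, fn=40)

# Model trained with synthetic data added: finds more positives (fewer misses),
# at the cost of more false alarms.
augmented = precision_recall(tp=75, fp=40, fn=25)

print(f"baseline  precision={baseline[0]:.3f} recall={baseline[1]:.3f}")
print(f"augmented precision={augmented[0]:.3f} recall={augmented[1]:.3f}")
```

In this illustration the augmented model misses fewer true cases (higher recall) while producing more false positives (lower precision), which is the same qualitative pattern the abstract describes.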

Taxonomy

Topics: Topic Modeling · COVID-19 diagnosis using AI · Advanced Data Processing Techniques

Methods: Linear Layer · Multi-Head Attention · Dense Connections · Attention Dropout · Weight Decay · Dropout · Residual Connection · Adam · Softmax