Utilizing Large Language Models to Generate Synthetic Data to Increase the Performance of BERT-Based Neural Networks
Chancellor R. Woolsey, Prakash Bisht, Joshua Rothman, Gondy Leroy

TL;DR
This study explores using large language models to generate synthetic medical data, aiming to enhance BERT-based neural network performance in autism diagnosis tasks, addressing data scarcity issues in healthcare.
Contribution
It demonstrates the feasibility of using LLM-generated synthetic data to improve model recall in autism-related classification tasks.
Findings
A clinician judged 83% of a random sample (N=140) of synthetic example-label pairs to be correct
Data augmentation increased recall by 13%
Precision decreased by 16% with synthetic data
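The 83% label-correctness figure comes from a sample of 140 items, so it carries sampling uncertainty. The sketch below shows one common way to quantify that, a Wilson score interval. It is an illustration, not the authors' analysis: the count of 116 correct pairs is an assumption (the nearest whole number to 83% of 140), and the function name `wilson_interval` is hypothetical.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

# ASSUMPTION: 116 of 140 sampled pairs were correct (about 83%).
# The source reports only the percentage, not the raw count.
low, high = wilson_interval(116, 140)
print(f"Estimated label accuracy: {116 / 140:.3f}, 95% interval: ({low:.3f}, {high:.3f})")
```

Under these assumptions the interval spans several percentage points on either side of the point estimate, which is worth keeping in mind when reading the headline number.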
Abstract
An important issue impacting healthcare is a lack of available experts. Machine learning (ML) models could resolve this by aiding in diagnosing patients. However, creating datasets large enough to train these models is expensive. We evaluated large language models (LLMs) for data creation. Using Autism Spectrum Disorders (ASD), we prompted ChatGPT and GPT-Premium to generate 4,200 synthetic observations to augment existing medical data. Our goal is to label behaviors corresponding to autism criteria and improve model accuracy with synthetic training data. We used a BERT classifier pre-trained on biomedical literature to assess differences in performance between models. A random sample (N=140) from the LLM-generated data was evaluated by a clinician and found to contain 83% correct example-label pairs. Augmenting data increased recall by 13% but decreased precision by 16%, correlating…
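The recall and precision trade-off reported in the abstract can be made concrete with a small worked example. The sketch below computes both metrics from scratch on toy labels. All of the data is hypothetical and chosen only to show how predicting more positives can raise recall while lowering precision; it does not reproduce the paper's numbers or its BERT classifier.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# HYPOTHETICAL toy data: 1 = sentence matches an autism criterion, 0 = it does not.
y_true         = [1, 1, 1, 1, 0, 0, 0, 0]
baseline_pred  = [1, 1, 0, 0, 0, 0, 0, 0]  # conservative model: few positives
augmented_pred = [1, 1, 1, 0, 1, 1, 0, 0]  # model that predicts positives more readily

baseline = precision_recall(y_true, baseline_pred)
augmented = precision_recall(y_true, augmented_pred)
print("baseline  precision, recall:", baseline)
print("augmented precision, recall:", augmented)
```

In this toy case the second set of predictions finds more of the true positives (higher recall) at the cost of more false alarms (lower precision), the same direction of change the abstract describes for training with synthetic data.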
Taxonomy
Topics: Topic Modeling · COVID-19 diagnosis using AI · Advanced Data Processing Techniques
Methods: Linear Layer · Multi-Head Attention · Dense Connections · Attention Dropout · Weight Decay · Dropout · Residual Connection · Adam · Softmax
