A Scoping Review of Synthetic Data Generation by Language Models in Biomedical Research and Application: Data Utility and Quality Perspectives

Hanshu Rao; Weisi Liu; Haohan Wang; I-Chan Huang; Zhe He; Xiaolei Huang

arXiv:2506.16594 · cs.CL · February 18, 2026

TL;DR

This paper systematically reviews recent advances in synthetic data generation using large language models in biomedical research, highlighting methods, data modalities, evaluation challenges, and future directions for improving data utility and quality.

Contribution

It provides a comprehensive synthesis of recent studies on LLM-based synthetic data in biomedicine, emphasizing evaluation methods, limitations, and future research needs.

Findings

01. Unstructured texts are the most common data modality (78%).

02. Prompting is the predominant generation method (74.6%).

03. Evaluation methods are heterogeneous and lack standardization.

Abstract

Synthetic data generation using large language models (LLMs) demonstrates substantial promise in addressing biomedical data challenges and shows increasing adoption in biomedical research. This study systematically reviews recent advances in synthetic data generation for biomedical applications and clinical research, focusing on how LLMs address data scarcity, utility, and quality issues with different modalities. We conducted a scoping review following PRISMA-ScR guidelines and searched literature published between 2020 and 2025 through PubMed, ACM, Web of Science, and Google Scholar. A total of 59 studies were included based on relevance to synthetic data generation in biomedical contexts. Among the reviewed studies, the predominant data modalities were unstructured texts (78.0%), tabular data (13.6%), and multimodal sources (8.4%). Common generation methods included LLM prompting…
