A Typology of Synthetic Datasets for Dialogue Processing in Clinical Contexts

Steven Bedrick; A. Seza Doğruöz; Sergiu Nisioi

arXiv:2505.03025·cs.CL·March 17, 2026

TL;DR

This paper reviews the creation and use of synthetic clinical dialogue datasets, proposing a new typology to classify and compare different types of data synthesis for healthcare NLP applications.

Contribution

The paper introduces a typology for classifying synthetic datasets in clinical dialogue processing, which helps in comparing and evaluating them.

Findings

01 Synthetic datasets are increasingly used in healthcare NLP.

02 The paper provides an overview of creation and evaluation methods.

03 A new typology helps classify different synthesis approaches.

Abstract

Synthetic datasets are used across linguistic domains and NLP tasks, particularly in scenarios where authentic data is limited (or even non-existent). One such domain is that of clinical (healthcare) contexts, where significant and long-standing challenges (e.g., privacy, anonymization, and data governance) have led to the development of an increasing number of synthetic datasets. One increasingly important category of clinical dataset is that of clinical dialogues, which are especially sensitive and difficult to collect, and as such are commonly synthesized. While such synthetic datasets have been shown to be sufficient in some situations, little theory exists to inform how they may best be used and generalized to new applications. In this paper, we provide an overview of how synthetic datasets are created, evaluated, and used for dialogue-related tasks in the…

Taxonomy

Topics: Topic Modeling · Speech and dialogue systems · Natural Language Processing Techniques