A Typology of Synthetic Datasets for Dialogue Processing in Clinical Contexts
Steven Bedrick, A. Seza Doğruöz, Sergiu Nisioi

TL;DR
This paper reviews the creation and use of synthetic clinical dialogue datasets, proposing a new typology to classify and compare different types of data synthesis for healthcare NLP applications.
Contribution
It introduces a novel typology for classifying synthetic datasets in clinical dialogue processing, aiding in comparison and evaluation.
Findings
Synthetic datasets are increasingly used in healthcare NLP.
The paper provides an overview of creation and evaluation methods.
A new typology helps classify different synthesis approaches.
Abstract
Synthetic data sets are used across linguistic domains and NLP tasks, particularly in scenarios where authentic data is limited (or even non-existent). One such domain is that of clinical (healthcare) contexts, where there exist significant and long-standing challenges (e.g., privacy, anonymization, and data governance) which have led to the development of an increasing number of synthetic datasets. One increasingly important category of clinical dataset is that of clinical dialogues, which are especially sensitive and difficult to collect, and as such are commonly synthesized. While such synthetic datasets have been shown to be sufficient in some situations, little theory exists to inform how they may be best used and generalized to new applications. In this paper, we provide an overview of how synthetic datasets are created, evaluated, and used for dialogue-related tasks in the…
Taxonomy
Topics: Topic Modeling · Speech and dialogue systems · Natural Language Processing Techniques
