Generating High Quality Synthetic Data for Dutch Medical Conversations
Cecilia Kuan, Aditya Kamlesh Parikh, Henk van den Heuvel

TL;DR
This paper presents a pipeline for creating synthetic Dutch medical dialogues using a fine-tuned Large Language Model, aiming to enhance clinical NLP resources while addressing privacy concerns.
Contribution
The study introduces a novel method for generating synthetic Dutch medical conversations, evaluated through both quantitative metrics and expert qualitative review.
Findings
Synthetic dialogues show high lexical variety but scripted turn-taking.
Qualitative review indicates issues with domain specificity and naturalness.
Quantitative metrics alone do not fully capture linguistic quality.
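The kind of surface measurement behind the first finding can be illustrated with a minimal sketch. The summary does not say which metrics the authors computed, so the three measures below (type-token ratio, variation in turn length, and speaker alternation rate) are assumptions chosen for illustration, and the function names and toy dialogue are hypothetical.

```python
from statistics import mean, pstdev


def type_token_ratio(turns):
    """Share of distinct word forms among all tokens (whitespace tokenization)."""
    tokens = [word.lower() for turn in turns for word in turn.split()]
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def turn_length_cv(turns):
    """Coefficient of variation of turn lengths; values near 0 mean very uniform turns."""
    lengths = [len(turn.split()) for turn in turns]
    if not lengths or mean(lengths) == 0:
        return 0.0
    return pstdev(lengths) / mean(lengths)


def alternation_rate(speakers):
    """Fraction of adjacent turns where the speaker changes; 1.0 is strict A-B-A-B."""
    if len(speakers) < 2:
        return 1.0
    switches = sum(a != b for a, b in zip(speakers, speakers[1:]))
    return switches / (len(speakers) - 1)


# Invented toy dialogue, not taken from the paper's data.
speakers = ["arts", "patient", "arts", "patient"]
turns = [
    "Goedemorgen, wat kan ik voor u doen?",
    "Ik heb al een week last van hoofdpijn.",
    "Heeft u ook last van misselijkheid?",
    "Nee, alleen hoofdpijn en vermoeidheid.",
]

print("type-token ratio:", round(type_token_ratio(turns), 2))
print("turn-length CV:", round(turn_length_cv(turns), 2))
print("alternation rate:", alternation_rate(speakers))
```

A dialogue can score well on lexical variety while showing near-uniform turn lengths and perfect alternation, which is one way a conversation can look rich in vocabulary yet read as scripted. None of these numbers says anything about clinical plausibility or naturalness, which is why the expert review is a separate step.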
Abstract
Medical conversations offer insights into clinical communication often absent from Electronic Health Records. However, developing reliable clinical Natural Language Processing (NLP) models is hampered by the scarcity of domain-specific datasets, as clinical data are typically inaccessible due to privacy and ethical constraints. To address these challenges, we present a pipeline for generating synthetic Dutch medical dialogues using a Dutch fine-tuned Large Language Model, with real medical conversations serving as linguistic and structural reference. The generated dialogues were evaluated through quantitative metrics and qualitative review by native speakers and medical practitioners. Quantitative analysis revealed strong lexical variety and overly regular turn-taking, suggesting scripted rather than natural conversation flow. Qualitative review produced slightly below-average scores,…
