Data-Constrained Synthesis of Training Data for De-Identification
Thomas Vakili, Aron Henriksson, and Hercules Dalianis

TL;DR
This paper uses large language models to generate synthetic clinical texts for training de-identification models. NER models trained on the synthetic data lose little predictive performance, particularly when the NER models used to machine-annotate that data are highly accurate.
Contribution
The paper introduces a method for domain-adapting LLMs to generate synthetic clinical data for de-identification, and examines how the quality of the annotating NER model and the amount of domain-adaptation data affect the result.
Findings
Training NER models on synthetic corpora incurs only a small drop in predictive performance.
Smaller datasets can suffice for domain adaptation of LLMs.
Effectiveness depends heavily on the accuracy of the NER models used to machine-annotate the synthetic text.
Abstract
Many sensitive domains -- such as the clinical domain -- lack widely available datasets due to privacy risks. The increasing generative capabilities of large language models (LLMs) have made synthetic datasets a viable path forward. In this study, we domain-adapt LLMs to the clinical domain and generate synthetic clinical texts that are machine-annotated with tags for personally identifiable information using capable encoder-based NER models. The synthetic corpora are then used to train synthetic NER models. The results show that training NER models using synthetic corpora incurs only a small drop in predictive performance. The limits of this process are investigated in a systematic ablation study -- using both Swedish and Spanish data. Our analysis shows that smaller datasets can be sufficient for domain-adapting LLMs for data synthesis. Instead, the effectiveness of this process is…
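The pipeline the abstract describes (generate synthetic text, machine-annotate it for personally identifiable information, then train a new NER model on the result) can be illustrated with a toy sketch. Everything here is an assumption made for illustration and not the authors' implementation: a template filler stands in for the domain-adapted LLM, a gazetteer lookup stands in for the encoder-based NER annotator, a token-to-tag lookup stands in for the trained NER model, and the names, places, and tag labels (`NAME`, `LOCATION`, `O`) are invented.

```python
import random
from collections import Counter, defaultdict
from typing import Callable, Dict, List, Tuple

Sentence = List[Tuple[str, str]]  # (token, tag) pairs

# Invented sample data; a real system would use a domain-adapted LLM instead.
NAMES = ["Anna", "Erik", "Maria"]
PLACES = ["Northside", "Lakeview"]
TEMPLATES = [
    "Patient {name} was admitted to {place} with chest pain .",
    "{name} was discharged from {place} yesterday .",
]


def generate_synthetic_text(rng: random.Random) -> List[str]:
    """Stand-in for LLM generation: fill a random template, split on spaces."""
    template = rng.choice(TEMPLATES)
    text = template.format(name=rng.choice(NAMES), place=rng.choice(PLACES))
    return text.split()


def machine_annotate(tokens: List[str]) -> Sentence:
    """Stand-in for the annotating NER model: tag tokens by gazetteer lookup."""
    tagged = []
    for token in tokens:
        if token in NAMES:
            tagged.append((token, "NAME"))
        elif token in PLACES:
            tagged.append((token, "LOCATION"))
        else:
            tagged.append((token, "O"))
    return tagged


def build_synthetic_corpus(n_sentences: int, seed: int = 0) -> List[Sentence]:
    """Generate text, then machine-annotate it, as in the described pipeline."""
    rng = random.Random(seed)
    return [machine_annotate(generate_synthetic_text(rng)) for _ in range(n_sentences)]


def train_lookup_tagger(corpus: List[Sentence]) -> Callable[[List[str]], List[str]]:
    """Stand-in for NER training: memorise each token's most frequent tag."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for sentence in corpus:
        for token, tag in sentence:
            counts[token][tag] += 1
    model = {token: c.most_common(1)[0][0] for token, c in counts.items()}
    # Tokens never seen during training default to the non-PII tag "O".
    return lambda tokens: [model.get(token, "O") for token in tokens]


corpus = build_synthetic_corpus(20)
tagger = train_lookup_tagger(corpus)
```

Because the downstream tagger learns only from machine-produced labels, any mistakes made by the annotator are passed straight into its training data. This is one way to see why the accuracy of the annotating model matters so much in this setup.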
Taxonomy
Topics: Control Systems and Identification
