Enhancing Clinical Models with Pseudo Data for De-identification
Paul Landes, Aaron J Chaise, Tarak Nath Nandi, Ravi K Madduri

TL;DR
This paper investigates the impact of training clinical de-identification models on redacted versus pseudo text, demonstrating that pseudo data improves model performance and providing resources for future research.
Contribution
It introduces a novel pseudo data generation method for training de-identification models and offers insights and recommendations for training on redacted text.
Findings
Models trained on pseudo data outperform those trained on redacted text.
Pretrained embeddings and fine-tuned models enhance de-identification accuracy.
The study provides publicly available datasets and code for reproducibility.
Abstract
Many models are pretrained on redacted text for privacy reasons. Clinical foundation models are often trained on de-identified text, which uses special-syntax (masked) text in place of protected health information. Although these models have grown in popularity, there has been little effort to understand the effects of training them on redacted text. In this work, we pretrain several encoder-only models on a dataset that contains redacted text and on a version of it in which the redactions are replaced with realistic pseudo text. We then fine-tune the models for the protected health information de-identification task and show that our methods significantly outperform previous baselines. The contributions of this work include: a) our novel and surprising findings, with training recommendations, b) redacted text replacements used to produce the pseudo dataset, c) pretrained embeddings and fine-tuned task specific…
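To make the redacted-versus-pseudo distinction concrete, here is a minimal Python sketch of swapping masks for realistic stand-in values. It is an illustration only, not the authors' method: the `[**Category**]` mask syntax, the `PSEUDO` lookup table, and the `pseudonymize` helper are all assumptions made for this example.

```python
import random
import re

# Assumed mask syntax "[**Category**]"; the abstract does not specify the format.
MASK = re.compile(r"\[\*\*(.+?)\*\*\]")

# Hypothetical pseudo values per category, for illustration only.
PSEUDO = {
    "Name": ["Jane Doe", "John Smith"],
    "Hospital": ["General Hospital"],
    "Date": ["2020-01-15", "2019-07-04"],
}


def pseudonymize(text, rng):
    """Replace each mask with a pseudo value drawn from its category."""

    def repl(match):
        choices = PSEUDO.get(match.group(1))
        # Unknown categories stay masked rather than being guessed.
        return rng.choice(choices) if choices else match.group(0)

    return MASK.sub(repl, text)


redacted = "[**Name**] was admitted to [**Hospital**] on [**Date**]."
pseudo = pseudonymize(redacted, random.Random(0))
print(pseudo)
```

Passing a seeded `random.Random` keeps the replacement reproducible, so the same redacted corpus always yields the same pseudo corpus.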
Taxonomy
Topics: Machine Learning in Healthcare
