Enhancing Clinical Models with Pseudo Data for De-identification

Paul Landes; Aaron J Chaise; Tarak Nath Nandi; Ravi K Madduri

arXiv:2506.12674·cs.CL·June 18, 2025

TL;DR

This paper investigates the impact of training clinical de-identification models on redacted versus pseudo text, demonstrating that pseudo data improves model performance and providing resources for future research.

Contribution

It introduces a novel pseudo data generation method for training de-identification models and offers insights and recommendations for training on redacted text.

Findings

1. Models trained on pseudo data outperform those trained on redacted text.

2. Pretrained embeddings and fine-tuned models enhance de-identification accuracy.

3. The study provides publicly available datasets and code for reproducibility.

Abstract

Many models are pretrained on redacted text for privacy reasons. Clinical foundation models are often trained on de-identified text, in which special-syntax (masked) text takes the place of protected health information. Although these models have grown in popularity, there has been little effort to understand the effects of training them on redacted text. In this work, we pretrain several encoder-only models on a dataset that contains redacted text and on a version in which the redactions are replaced with realistic pseudo text. We then fine-tune the models for the protected health information de-identification task and show that our methods significantly outperform previous baselines. The contributions of this work include: a) our novel and surprising findings, with training recommendations, b) redacted text replacements used to produce the pseudo dataset, c) pretrained embeddings and fine-tuned task specific…
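The difference between redacted text and pseudo text can be illustrated with a small sketch. This is not the authors' replacement method: the bracketed mask syntax, the category names, and the pseudo-value pools below are all assumptions chosen for illustration, and the function name pseudonymize is hypothetical.

```python
import random
import re

# Assumed mask syntax: [**Category**]. The real dataset's syntax may differ.
MASK_PATTERN = re.compile(r"\[\*\*(.+?)\*\*\]")

# Hypothetical pseudo-value pools, keyed by mask category.
PSEUDO_VALUES = {
    "First Name": ["Alice", "Marcus", "Priya"],
    "Last Name": ["Nguyen", "Okafor", "Smith"],
    "Hospital": ["Lakeside General", "Northfield Medical Center"],
}


def pseudonymize(text, rng):
    """Replace each redaction mask with a randomly drawn pseudo value."""

    def substitute(match):
        category = match.group(1).strip()
        pool = PSEUDO_VALUES.get(category)
        # Leave masks of unknown category untouched rather than guess.
        return rng.choice(pool) if pool else match.group(0)

    return MASK_PATTERN.sub(substitute, text)


redacted = "Pt [**First Name**] [**Last Name**] seen at [**Hospital**] on [**Date**]."
pseudo = pseudonymize(redacted, random.Random(0))
print(pseudo)
```

In this sketch, masks with a known category become plausible surface text, while the unrecognized [**Date**] mask is left in place; a fuller pipeline would presumably need a generator for every category that appears in the corpus.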

Code & Models

Repositories

appfl/cpbert (Official)

Taxonomy

Topics: Machine Learning in Healthcare