Generating Synthetic Free-text Medical Records with Low Re-identification Risk using Masked Language Modeling

Samuel Belkadi; Libo Ren; Nicolo Micheletti; Lifeng Han; Goran Nenadic

arXiv:2409.09831 · cs.CL · January 31, 2025

TL;DR

This paper introduces a Masked Language Modeling system for generating synthetic medical records that balance data utility with privacy, achieving high-quality data with low re-identification risk and cost-effective inference.

Contribution

It presents a novel Masked Language Modeling approach for synthetic medical data that preserves privacy while maintaining data diversity and utility.

Findings

01 High-quality synthetic data with 96% HIPAA-compliant PHI recall

02 Re-identification risk reduced to 3.5%

03 Generated data enables effective model training comparable to real data

Abstract

The vast amount of available medical records has the potential to improve healthcare and biomedical research. However, privacy restrictions make these data accessible for internal use only. Recent works have addressed this problem by generating synthetic data using Causal Language Modeling. Unfortunately, by taking this approach, it is often impossible to guarantee patient privacy while offering the ability to control the diversity of generations without increasing the cost of generating such data. In contrast, we present a system for generating synthetic free-text medical records using Masked Language Modeling. The system preserves critical medical information while introducing diversity in the generations and minimising re-identification risk. The system's size is about 120M parameters, minimising inference cost. The results demonstrate high-quality synthetic data with a…
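
The mask-then-fill idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `protected` term set, the 50% masking ratio, and the placeholder `toy_fill` function are all hypothetical, and a real system would replace `toy_fill` with predictions from a trained masked language model. The sketch only shows how keeping selected terms fixed while masking a controllable share of the remaining tokens could trade preservation against diversity.

```python
import random
from typing import Callable, List, Set

MASK = "[MASK]"  # placeholder mask symbol (assumption, not taken from the paper)


def mask_tokens(tokens: List[str], protected: Set[str],
                mask_ratio: float, rng: random.Random) -> List[str]:
    """Mask a fraction of the tokens that are NOT in the protected set.

    `protected` stands in for clinical terms that should be kept verbatim;
    how such terms are identified is outside the scope of this sketch.
    """
    candidates = [i for i, tok in enumerate(tokens)
                  if tok.lower() not in protected]
    n_to_mask = round(mask_ratio * len(candidates))
    chosen = set(rng.sample(candidates, n_to_mask))
    return [MASK if i in chosen else tok for i, tok in enumerate(tokens)]


def fill_masks(masked: List[str],
               fill_fn: Callable[[List[str], int], str]) -> List[str]:
    """Replace every mask with whatever `fill_fn` proposes for that position."""
    filled = list(masked)
    for i, tok in enumerate(filled):
        if tok == MASK:
            filled[i] = fill_fn(filled, i)
    return filled


def toy_fill(context: List[str], position: int) -> str:
    """Hypothetical stand-in for a masked language model prediction."""
    return "<filled>"


if __name__ == "__main__":
    note = "patient was given aspirin for chest pain".split()
    keep = {"aspirin", "chest", "pain"}  # illustrative protected terms
    masked = mask_tokens(note, keep, mask_ratio=0.5, rng=random.Random(0))
    print(" ".join(masked))
    print(" ".join(fill_masks(masked, toy_fill)))
```

Raising `mask_ratio` in this sketch masks more of the unprotected tokens, which is one simple way to picture how generation diversity might be controlled without touching the protected terms.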

Code & Models

Repositories

SamySam0/SynDeidMLM (official)

Taxonomy

Topics: Topic Modeling · Biomedical Text Mining and Ontologies