Improving the Performance of Radiology Report De-identification with Large-Scale Training and Benchmarking Against Cloud Vendor Methods
Eva Prakash, Maayane Attias, Pierre Chambon, Justin Xu, Steven Truong, Jean-Benoit Delbrouck, Tessa Cook, Curtis Langlotz

TL;DR
This study enhances radiology report de-identification by scaling transformer models with large datasets, achieving superior performance and benchmarking against commercial systems, thereby improving privacy and data utility in clinical text processing.
Contribution
Introduces a large-scale, transformer-based de-identification model fine-tuned on radiology reports spanning several imaging modalities, adds an AGE category to the PHI label set, and benchmarks the model against commercial cloud vendor systems.
Findings
Achieved F1 scores of 0.973 and 0.996 on the external (Penn) and internal (Stanford) test sets, respectively.
Outperformed commercial vendor systems in PHI detection.
Synthetic PHI generation maintained high detectability (F1: 0.959).
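The F1 scores above refer to token-level PHI detection. As a minimal sketch of how such a score could be computed, assuming one label per token and an "O" label for non-PHI tokens (the function name, label names, and example sequences are illustrative and not taken from the paper):

```python
from typing import Sequence, Tuple


def token_level_prf(
    gold: Sequence[str], pred: Sequence[str], outside: str = "O"
) -> Tuple[float, float, float]:
    """Micro-averaged token-level precision, recall, and F1 for PHI labels.

    A token is a true positive when the gold label is a PHI category and the
    prediction matches it exactly. A wrong PHI category counts as both a
    false positive and a false negative (an assumed convention).
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g != outside and g == p)
    fp = sum(1 for g, p in pairs if p != outside and g != p)
    fn = sum(1 for g, p in pairs if g != outside and g != p)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return precision, recall, f1


# Hypothetical gold and predicted labels for a seven-token report fragment.
gold_labels = ["O", "PATIENT", "PATIENT", "O", "AGE", "O", "DATE"]
pred_labels = ["O", "PATIENT", "O", "O", "AGE", "DATE", "DATE"]
precision, recall, f1 = token_level_prf(gold_labels, pred_labels)
```

In this toy example one PHI token is missed and one non-PHI token is flagged, so precision and recall differ from 1 for different reasons even though they happen to be equal.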
Abstract
Objective: To enhance automated de-identification of radiology reports by scaling transformer-based models through extensive training datasets and benchmarking performance against commercial cloud vendor systems for protected health information (PHI) detection. Materials and Methods: In this retrospective study, we built upon a state-of-the-art, transformer-based, PHI de-identification pipeline by fine-tuning on two large annotated radiology corpora from Stanford University, encompassing chest X-ray, chest CT, abdomen/pelvis CT, and brain MR reports, and introducing an additional PHI category (AGE) into the architecture. Model performance was evaluated on test sets from Stanford and the University of Pennsylvania (Penn) for token-level PHI detection. We further assessed (1) the stability of synthetic PHI generation using a "hide-in-plain-sight" method and (2) performance against…
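The "hide-in-plain-sight" idea mentioned in the abstract is to replace detected PHI with realistic surrogate values rather than blank placeholders, so that any PHI the detector misses is harder to tell apart from the synthetic values around it. The following is a minimal sketch of that idea only, not the authors' pipeline; the surrogate pools, label names, and function name are hypothetical.

```python
import random
from typing import Dict, List, Sequence

# Hypothetical single-token surrogate pools, one per PHI category.
SURROGATES: Dict[str, List[str]] = {
    "PATIENT": ["Morgan", "Lee", "Alvarez"],
    "DATE": ["03/14/2011", "11/02/2016"],
    "AGE": ["47", "63", "58"],
}


def hide_in_plain_sight(
    tokens: Sequence[str],
    labels: Sequence[str],
    rng: random.Random,
    outside: str = "O",
) -> List[str]:
    """Replace each token tagged as PHI with a random surrogate of its category.

    Tokens labeled `outside` are copied unchanged. Categories without a
    surrogate pool fall back to a bracketed placeholder such as "[HOSPITAL]".
    """
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")

    output: List[str] = []
    for token, label in zip(tokens, labels):
        if label == outside:
            output.append(token)
            continue
        pool = SURROGATES.get(label)
        output.append(rng.choice(pool) if pool else f"[{label}]")
    return output


# Hypothetical report fragment with detector output.
report_tokens = ["Patient", "Smith", "age", "72", "seen", "01/02/2020"]
report_labels = ["O", "PATIENT", "O", "AGE", "O", "DATE"]
deidentified = hide_in_plain_sight(report_tokens, report_labels, random.Random(0))
```

Passing a seeded `random.Random` keeps the replacement reproducible for testing; a production system would also need consistent surrogates for repeated mentions of the same entity, which this sketch does not attempt.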
Taxonomy
Topics: COVID-19 diagnosis using AI · Radiology practices and education · Artificial Intelligence in Healthcare and Education
