RedactOR: An LLM-Powered Framework for Automatic Clinical Data De-Identification

Praphul Singh; Charlotte Dzialo; Jangwon Kim; Sumana Srivatsa; Irfan Bulu; Sri Gadde; Krishnaram Kenthapadi

arXiv:2505.18380 · cs.AI · July 28, 2025

TL;DR

RedactOR is an automated, multi-modal framework that combines rule-based and LLM techniques to improve clinical data de-identification, ensuring privacy while maintaining data utility in healthcare applications.

Contribution

The paper introduces RedactOR, a novel framework integrating multi-modal strategies and efficient LLM use for improved clinical data de-identification.

Findings

01. Achieves competitive de-identification performance on the i2b2 dataset.

02. Reduces LLM token costs through intelligent routing and hybrid rule- and LLM-based strategies.

03. Ensures consistent substitution of protected entities, which keeps de-identified data coherent.
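The token-cost finding can be illustrated with a minimal sketch of rule-first routing: cheap regular expressions handle well-patterned identifiers, and an LLM is called only when a heuristic suggests free-text identifiers remain. Everything here is an assumption for illustration, not the paper's method: the patterns in `RULES`, the `NAME_CUES` heuristic, and the `llm_redact` callable are all hypothetical.

```python
import re

# Hypothetical patterns for identifiers that rules handle well (not from the paper).
RULES = {
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
}

# Hypothetical cue words hinting at names, which simple rules tend to miss.
NAME_CUES = re.compile(r"\b(Dr|Mr|Mrs|Ms|patient)\b\.?", re.IGNORECASE)


def apply_rules(text):
    """Replace every rule match with a bracketed entity label."""
    for label, pattern in RULES.items():
        text = pattern.sub(f"[{label}]", text)
    return text


def needs_llm(text):
    """Assumed routing heuristic: escalate only if a name cue is present."""
    return bool(NAME_CUES.search(text))


def route(text, llm_redact):
    """Run the rule pass first; call the LLM only when the heuristic fires.

    Returns the redacted text and a flag saying whether the LLM was used.
    """
    redacted = apply_rules(text)
    if needs_llm(redacted):
        return llm_redact(redacted), True
    return redacted, False
```

In this sketch a note containing only a phone number and a date never reaches the LLM, which is where the token savings would come from. A production router would need a far more careful escalation criterion, since a missed name is a privacy failure.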

Abstract

Ensuring clinical data privacy while preserving utility is critical for AI-driven healthcare and data analytics. Existing de-identification (De-ID) methods, including rule-based techniques, deep learning models, and large language models (LLMs), often suffer from recall errors, limited generalization, and inefficiencies, limiting their real-world applicability. We propose a fully automated, multi-modal framework, RedactOR, for de-identifying structured and unstructured electronic health records, including clinical audio records. Our framework employs cost-efficient De-ID strategies, including intelligent routing, hybrid rule- and LLM-based approaches, and a two-step audio redaction approach. We present a retrieval-based entity relexicalization approach to ensure consistent substitutions of protected entities, thereby enhancing data coherence for downstream applications. We discuss key…
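To make the idea of consistent substitution concrete, here is a minimal sketch in which every mention of the same protected entity maps to the same surrogate. It is a simplification under stated assumptions: a plain in-memory dictionary stands in for whatever retrieval mechanism the paper uses, and the `Relexicalizer` class, its method names, and the surrogate pools are hypothetical.

```python
import itertools


class Relexicalizer:
    """Maps each distinct protected entity to one stable surrogate value."""

    def __init__(self, surrogates):
        # surrogates: entity type -> list of replacement values (hypothetical pools).
        # cycle() reuses surrogates once a pool is exhausted, so two different
        # entities can collide if the pool is smaller than the entity count.
        self._pools = {k: itertools.cycle(v) for k, v in surrogates.items()}
        self._seen = {}

    def substitute(self, entity_type, value):
        # Normalize case and surrounding whitespace so variants of the same
        # mention look up the same entry.
        key = (entity_type, value.strip().lower())
        if key not in self._seen:
            self._seen[key] = next(self._pools[entity_type])
        return self._seen[key]
```

For example, with a `NAME` pool of `["Alex Doe", "Sam Roe"]`, the mentions `"John Smith"` and `"john smith "` both come back as the same surrogate, while a different name receives the next one, so a downstream reader can still tell that two passages refer to the same person.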

Taxonomy

Topics: Biomedical Text Mining and Ontologies · Electronic Health Records Systems · Machine Learning in Healthcare