RedactOR: An LLM-Powered Framework for Automatic Clinical Data De-Identification
Praphul Singh, Charlotte Dzialo, Jangwon Kim, Sumana Srivatsa, Irfan Bulu, Sri Gadde, Krishnaram Kenthapadi

TL;DR
RedactOR is an automated, multi-modal framework that combines rule-based and LLM techniques to improve clinical data de-identification, ensuring privacy while maintaining data utility in healthcare applications.
Contribution
The paper introduces RedactOR, a novel framework integrating multi-modal strategies and efficient LLM use for improved clinical data de-identification.
Findings
Achieves competitive de-identification performance on the i2b2 dataset.
Reduces LLM token costs through intelligent routing and hybrid rule- and LLM-based strategies.
Ensures consistent entity substitution for data coherence.
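To illustrate what consistent entity substitution means in practice, here is a minimal Python sketch. It is not the paper's implementation: the paper describes a retrieval-based relexicalization approach, whereas this sketch swaps in a plain dictionary lookup on a normalized surface form. The names SurrogateStore, surrogate_for, and relexicalize, and the surrogate pools, are all hypothetical.

```python
import itertools


class SurrogateStore:
    """Maps each (entity type, normalized surface form) to one stable surrogate.

    Assumption: exact-match lookup after lowercasing and whitespace
    normalization stands in for the retrieval step described in the paper.
    """

    def __init__(self, pools):
        # pools: entity type -> list of hypothetical replacement values
        self._pools = {etype: itertools.cycle(values) for etype, values in pools.items()}
        self._assigned = {}

    def surrogate_for(self, etype, text):
        key = (etype, " ".join(text.lower().split()))
        if key not in self._assigned:
            # First time this entity is seen: draw the next surrogate and remember it.
            self._assigned[key] = next(self._pools[etype])
        return self._assigned[key]


def relexicalize(text, entities, store):
    """Replace detected spans (start, end, type), right to left so offsets stay valid."""
    for start, end, etype in sorted(entities, reverse=True):
        text = text[:start] + store.surrogate_for(etype, text[start:end]) + text[end:]
    return text


store = SurrogateStore({"NAME": ["Alex Reed", "Sam Cole"]})
first = relexicalize("John Smith seen today.", [(0, 10, "NAME")], store)
second = relexicalize("Follow-up for JOHN SMITH.", [(14, 24, "NAME")], store)
```

Because both notes share one store, the same patient name receives the same surrogate in each note, which is the property that keeps de-identified records coherent for downstream use.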
Abstract
Ensuring clinical data privacy while preserving utility is critical for AI-driven healthcare and data analytics. Existing de-identification (De-ID) methods, including rule-based techniques, deep learning models, and large language models (LLMs), often suffer from recall errors, limited generalization, and inefficiencies, limiting their real-world applicability. We propose RedactOR, a fully automated, multi-modal framework for de-identifying structured and unstructured electronic health records, including clinical audio records. Our framework employs cost-efficient De-ID strategies, including intelligent routing, hybrid rule- and LLM-based approaches, and a two-step audio redaction approach. We present a retrieval-based entity relexicalization approach to ensure consistent substitutions of protected entities, thereby enhancing data coherence for downstream applications. We discuss key…
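The abstract does not spell out how routing and the hybrid rule and LLM steps fit together, so the following Python sketch is only one plausible reading. It assumes that fields with a predictable shape are handled by regular expressions alone and that only free-text fields are sent to an LLM. The patterns, the FREE_TEXT_FIELDS set, the deidentify function, and the fake_llm stub are all hypothetical and not taken from the paper.

```python
import re

# Hypothetical rules for PHI with a predictable shape; not the paper's rule set.
RULES = {
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
}

# Assumption: the caller knows which fields hold unstructured text.
FREE_TEXT_FIELDS = {"note"}


def apply_rules(text):
    """Mask every rule match with a [TYPE] tag."""
    for etype, pattern in RULES.items():
        text = pattern.sub(f"[{etype}]", text)
    return text


def deidentify(record, llm_redact):
    """Run rules on every field; route only free-text fields to the LLM."""
    out = {}
    llm_calls = 0
    for field, value in record.items():
        masked = apply_rules(value)
        if field in FREE_TEXT_FIELDS:
            # The LLM sees text that the rules have already masked.
            masked = llm_redact(masked)
            llm_calls += 1
        out[field] = masked
    return out, llm_calls


def fake_llm(text):
    """Stand-in for a real LLM call, hard-coded for this example only."""
    return text.replace("John Smith", "[NAME]")


record = {"phone": "555-123-4567", "note": "John Smith called on 3/4/2024."}
result, calls = deidentify(record, fake_llm)
```

In this sketch the structured phone field never reaches the LLM, so the number of LLM calls, and with it the token cost, scales with the amount of free text rather than with the size of the whole record.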
Taxonomy
Topics: Biomedical Text Mining and Ontologies · Electronic Health Records Systems · Machine Learning in Healthcare
