Sensitive Data Detection with High-Throughput Machine Learning Models in Electrical Health Records
Kai Zhang, Xiaoqian Jiang

TL;DR
This paper presents a machine learning approach to automatically identify sensitive protected health information in electronic health records, enabling more efficient data sharing and de-identification across diverse healthcare datasets.
Contribution
It introduces a novel feature engineering method based on metadata differences and demonstrates high accuracy in detecting PHI fields across multiple datasets.
Findings
Achieved 99% accuracy in PHI detection on unseen datasets
Engineered over 30 features from metadata for classification
Effective across heterogeneous EHR data sources
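To make the idea of classifying fields from metadata concrete, the following is a minimal sketch of how one column could be summarized as a feature vector. It is not the authors' feature set: this page does not list the 30+ features, so the function name `column_metadata_features`, the five features, the keyword list, and the ISO date pattern are all illustrative assumptions.

```python
import re
from statistics import mean

# Assumption: dates are stored as ISO strings (YYYY-MM-DD).
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

# Assumption: hypothetical keywords that often appear in PHI column names.
NAME_HINTS = ("name", "ssn", "dob", "birth", "phone", "address", "mrn", "email")


def column_metadata_features(name, values):
    """Summarize one column as a small, hypothetical feature vector."""
    strs = [str(v) for v in values if v is not None]
    n = len(strs)
    if n == 0:
        # An empty column carries no value-level signal.
        return {"name_hint": 0.0, "unique_ratio": 0.0, "mean_len": 0.0,
                "digit_ratio": 0.0, "date_ratio": 0.0}
    total_chars = sum(len(s) for s in strs)
    digit_chars = sum(ch.isdigit() for s in strs for ch in s)
    return {
        # 1.0 if the column name contains any PHI-like keyword.
        "name_hint": float(any(h in name.lower() for h in NAME_HINTS)),
        # Share of distinct values; identifiers tend toward 1.0.
        "unique_ratio": len(set(strs)) / n,
        # Average string length of the values.
        "mean_len": mean(len(s) for s in strs),
        # Share of characters that are digits.
        "digit_ratio": digit_chars / total_chars if total_chars else 0.0,
        # Share of values that look like ISO dates.
        "date_ratio": sum(bool(DATE_RE.match(s)) for s in strs) / n,
    }


features = column_metadata_features(
    "patient_dob", ["1980-01-02", "1975-12-30", "1980-01-02"]
)
```

A vector like this, computed per column and labeled PHI or non-PHI, could be passed to any standard classifier. Because the features describe the shape of a column rather than its exact schema, the same extraction can run unchanged on a database the model has not seen.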
Abstract
In the era of big data, there is an increasing need for healthcare providers, communities, and researchers to share data and collaborate to improve health outcomes, generate valuable insights, and advance research. The Health Insurance Portability and Accountability Act of 1996 (HIPAA) is a federal law designed to protect sensitive health information by defining regulations for protected health information (PHI). However, it does not provide efficient tools for detecting or removing PHI before data sharing. One of the challenges in this area of research is the heterogeneous nature of PHI fields in data across different parties. This variability makes rule-based sensitive variable identification systems that work on one database fail on another. To address this issue, our paper explores the use of machine learning algorithms to identify sensitive variables in structured data, thus…
Taxonomy
Topics: Data Quality and Management · Privacy-Preserving Technologies in Data · Big Data Technologies and Applications
