BRATsynthetic: Text De-identification using a Markov Chain Replacement Strategy for Surrogate Personal Identifying Information
John D. Osborne, Tobias O'Leary, Akhil Nadimpalli, Salma M. Aly, and Richard E. Kennedy

TL;DR
This paper introduces BRATsynthetic, a Markov chain-based method for de-identifying clinical text that significantly reduces PHI leakage compared to traditional strategies, enabling safer data sharing.
Contribution
It presents a novel Markov chain approach for text de-identification, outperforming existing methods in privacy preservation across multiple datasets.
Findings
The Markov chain strategy reduces document-level PHI leakage from 27.1% (Consistent strategy) to 0.1% at a 0.1% false negative error rate (FNER).
Outperforms consistent and random strategies in diverse clinical corpora.
Enables larger de-identified datasets at the same privacy risk level.
Abstract
Objective: Implement and assess personal health identifying information (PHI) substitution strategies and quantify their privacy-preserving benefits. Materials and Methods: We implement and assess 3 different "Hiding in Plain Sight" (HIPS) strategies for PHI replacement: a standard Consistent replacement strategy, a Random replacement strategy, and a novel Markov model-based strategy. We evaluate the privacy-preserving benefits of these strategies on a synthetic PHI distribution and on real clinical corpora from 2 different institutions using a range of false negative error rates (FNER). Results: Using FNER ranging from 0.1% to 5%, PHI leakage at the document level could be reduced from 27.1% to 0.1% (0.1% FNER) and from 94.2% to 57.7% (5% FNER) utilizing the Markov chain strategy versus the Consistent strategy on a corpus containing a diverse set of notes from the University of…
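The three replacement strategies named in the abstract can be contrasted with a minimal Python sketch. This is an illustration under stated assumptions, not the BRATsynthetic implementation: the surrogate pool, the function names, and the `p_stay` parameter are all hypothetical, and the abstract does not specify the actual Markov transition rule.

```python
import random

# Hypothetical surrogate pool; a real system would draw from large name lists.
SURROGATES = ["Alice Smith", "Maria Garcia", "Wei Chen", "Omar Haddad"]

def consistent(mentions, rng):
    """Map each original value to one surrogate for the whole document."""
    mapping = {}
    out = []
    for m in mentions:
        if m not in mapping:
            mapping[m] = rng.choice(SURROGATES)
        out.append(mapping[m])
    return out

def random_strategy(mentions, rng):
    """Draw an independent surrogate for every mention."""
    return [rng.choice(SURROGATES) for _ in mentions]

def markov(mentions, rng, p_stay=0.7):
    """Assumed transition rule: keep the current surrogate for an original
    value with probability p_stay, otherwise draw again from the pool
    (the redraw may happen to pick the same surrogate)."""
    current = {}
    out = []
    for m in mentions:
        if m not in current or rng.random() >= p_stay:
            current[m] = rng.choice(SURROGATES)
        out.append(current[m])
    return out

if __name__ == "__main__":
    rng = random.Random(0)
    doc = ["John Doe", "John Doe", "Jane Roe", "John Doe"]
    print(consistent(doc, rng))
    print(random_strategy(doc, rng))
    print(markov(doc, rng))
```

In this sketch, `p_stay=1.0` reproduces Consistent behavior, while `p_stay=0.0` redraws a surrogate at every mention, much like the Random strategy; intermediate values sit between the two.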
Taxonomy
Topics: Topic Modeling
