# A Markov Chain Replacement Strategy for Surrogate Identifiers: Minimizing Re-Identification Risk While Preserving Text Reuse

**Authors:** John D. Osborne, Andrew Trotter, Tobias O’Leary, Chris Coffee, Micah D. Cochran, Luis Mansilla-Gonzalez, Akhil Nadimpalli, Alex McAnnally, Abdulateef I. Almudaifer, Jeffrey R. Curtis, Salma M. Aly, Richard E. Kennedy

PMC · DOI: 10.3390/electronics14193945 · Electronics · 2025-10-21

## TL;DR

This paper introduces a Markov model strategy for replacing Personal Health Information (PHI) in clinical text with surrogate values, reducing re-identification risk while keeping the text usable for information extraction.

## Contribution

A novel Markov model strategy for PHI surrogate replacement that reduces re-identification risk more than the standard Consistent and Random substitution strategies.

## Key findings

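- The Markov strategy consistently leaks less PHI than the Consistent and Random substitution strategies, on both real clinical corpora and simulated PHI distributions.
- Modern deep learning information extraction methods perform similarly on corpora generated by all three strategies, but older machine learning techniques can suffer from the change in context.
- At a 0.1% false negative error rate (FNER), document-level PHI leakage falls from 27.1% with Consistent substitution to 0.1% with the Markov strategy, a relative reduction of about 99.6%.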
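The 99.6% figure can be checked against the document-level leakage values quoted in the abstract. The sketch below computes the relative reduction for both reported operating points; the function name `relative_reduction` is an illustrative choice, not something taken from the paper.

```python
def relative_reduction(before_pct: float, after_pct: float) -> float:
    """Return the relative reduction, in percent, from before_pct to after_pct."""
    if before_pct <= 0:
        raise ValueError("before_pct must be positive")
    return (before_pct - after_pct) / before_pct * 100.0


# Document-level leakage (Consistent vs. Markov) as quoted in the abstract.
low_fner = relative_reduction(27.1, 0.1)    # 0.1% FNER
high_fner = relative_reduction(94.2, 57.7)  # 0.5% FNER

print(round(low_fner, 1))   # 99.6
print(round(high_fner, 1))  # 38.7
```

The benefit is largest at the lowest error rate: the relative reduction is about 99.6% at 0.1% FNER but about 38.7% at 0.5% FNER.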

## Abstract

“Hiding in Plain Sight” (HIPS) strategies for Personal Health Information (PHI) replace PHI with surrogate values to hinder re-identification attempts. We evaluate three different HIPS strategies for PHI replacement: a standard Consistent replacement strategy, a Random replacement strategy, and a novel Markov model strategy. We evaluate the privacy-preserving benefits and relative utility for information extraction of these strategies on both a simulated PHI distribution and real clinical corpora from two different institutions, using a range of false negative error rates (FNER). The Markov strategy consistently outperformed the Consistent and Random substitution strategies, both on real data and in statistical simulations. Across FNERs ranging from 0.1% to 5%, the Markov strategy reduced document-level PHI leakage relative to the standard Consistent substitution strategy: from 27.1% to 0.1% at 0.1% FNER and from 94.2% to 57.7% at 0.5% FNER. Additionally, we assessed the generated corpora containing synthetic PHI for reuse with a variety of information extraction methods. Results indicate that modern deep learning methods perform similarly on all strategies, but older machine learning techniques can suffer from the change in context. Overall, a Markov surrogate generation strategy substantially reduces the chance of inadvertent PHI release.
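To make the three strategies concrete, here is a minimal Python sketch of how they might differ when replacing repeated mentions of the same PHI value within one document. This summary does not describe the paper's actual transition rule, so everything below is an assumption for illustration: the surrogate pool `SURROGATES`, the function names, and in particular the `p_stay` parameter (the probability of reusing the previous surrogate) are hypothetical and are not taken from the paper or from any released tool.

```python
import random

# Hypothetical surrogate pool; a real system would use much larger lists.
SURROGATES = ["Alice Smith", "Brian Jones", "Carla Reyes", "David Chen", "Elena Rossi"]


def consistent_strategy(mentions, rng):
    """Map each distinct original value to one fixed surrogate per document."""
    mapping = {}
    out = []
    for m in mentions:
        if m not in mapping:
            mapping[m] = rng.choice(SURROGATES)
        out.append(mapping[m])
    return out


def random_strategy(mentions, rng):
    """Draw an independent surrogate for every mention."""
    return [rng.choice(SURROGATES) for _ in mentions]


def markov_strategy(mentions, rng, p_stay=0.5):
    """Assumed two-outcome chain: keep the previous surrogate for a value
    with probability p_stay, otherwise switch to a different surrogate."""
    current = {}
    out = []
    for m in mentions:
        prev = current.get(m)
        if prev is None or rng.random() >= p_stay:
            # Exclude the previous surrogate so a switch is a real change.
            candidates = [s for s in SURROGATES if s != prev]
            current[m] = rng.choice(candidates)
        out.append(current[m])
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    mentions = ["John Doe"] * 6
    print(consistent_strategy(mentions, rng))
    print(random_strategy(mentions, rng))
    print(markov_strategy(mentions, rng, p_stay=0.7))
```

In this sketch, `p_stay=1.0` makes the Markov variant behave like Consistent replacement, while `p_stay=0.0` forces a change of surrogate at every mention, so the parameter interpolates between fully stable and constantly changing surrogates.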

## Full-text entities

- **Diseases:** PERSONAL (MESH:D010554), Opiate Use Disorder (MESH:D000437), Delirium (MESH:D003693), leak (MESH:D019559), FN (MESH:D017541), SYNTHETIC (OMIM:146820), OUD (MESH:D009293)
- **Chemicals:** BRATsynthetic (-)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12536513/full.md

## Figures

7 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12536513/full.md

## References

48 references — full list in the complete paper: https://tomesphere.com/paper/PMC12536513/full.md

---
Source: https://tomesphere.com/paper/PMC12536513