Scrubbing Sensitive PHI Data from Medical Records made Easy by SpaCy -- A Scalable Model Implementation Comparisons

Rashmi Jain; Dinah Samuel Anand; Vijayalakshmi Janakiraman

arXiv:1906.06968·cs.LG·June 18, 2019

Open Access

TL;DR

This paper compares various deep learning techniques for de-identifying sensitive PHI data in medical records, highlighting SpaCy's superior performance and efficiency in scalable implementation.

Contribution

The study evaluates multiple models for PHI de-identification and demonstrates that SpaCy offers a highly effective and scalable solution.

Findings

01. SpaCy outperforms other models in accuracy and speed

02. Deep learning models vary significantly in scalability

03. SpaCy's implementation is suitable for large-scale medical data processing

Abstract

De-identification of clinical records is an extremely important process that enables the use of the wealth of information they contain. Many techniques are available for this, but none of the published implementations has been evaluated for scalability, which is an important benchmark. We evaluated numerous deep learning techniques, such as BiLSTM-CNN, IDCNN, CRF, BiLSTM-CRF, and SpaCy, on both performance and efficiency. We propose that the SpaCy model implementation for scrubbing sensitive PHI data from medical records is both well performing and extremely efficient compared to other published models.
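
To make the scrubbing step concrete, here is a minimal sketch of how detected PHI spans might be replaced with placeholders. It is not the authors' implementation. The `scrub` helper, the `Span` type, and the `NAME` and `DATE` labels are all hypothetical; the spans are hard-coded here, and the assumption is that in practice they would come from a trained NER model, for example from the `start_char`, `end_char`, and `label_` attributes of each entity in a SpaCy `doc.ents`.

```python
from typing import List, Tuple

# Hypothetical span type: (start_char, end_char, label)
Span = Tuple[int, int, str]


def scrub(text: str, spans: List[Span]) -> str:
    """Replace each character span with a [LABEL] placeholder.

    Spans are processed left to right; a span that overlaps an
    earlier one is skipped so no text is replaced twice.
    """
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            continue  # overlaps a span that was already replaced
        pieces.append(text[cursor:start])
        pieces.append(f"[{label}]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


# Illustrative note and spans (made up for this sketch).
note = "Patient John Smith was admitted on 2019-06-18."
spans = [(8, 18, "NAME"), (35, 45, "DATE")]
print(scrub(note, spans))
```

Working on character offsets keeps the replacement step independent of whichever model produced the spans, so the same helper could sit behind any of the compared taggers.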

Taxonomy

Topics: Biomedical Text Mining and Ontologies · Topic Modeling · Semantic Web and Ontologies

Methods: Conditional Random Field