De-identification of medical records using conditional random fields and long short-term memory networks

Zhipeng Jiang; Chao Zhao; Bin He; Yi Guan; Jingchi Jiang

arXiv:1709.06901 · cs.CL · October 2, 2017 · 1 citation

TL;DR

This paper describes a CRF-based and an LSTM-based system for de-identifying psychiatric evaluation records in the CEGS N-GRID 2016 Shared Task 1. The LSTM-based system reached an i2b2 strict micro-F1 of 89.86%, higher than the CRF-based system.

Contribution

It presents two de-identification systems for clinical text: a CRF trained on manually extracted features, and a character-level bi-directional LSTM with a stacked decoding layer. It compares their performance on the shared-task data.

Findings

01

The LSTM-based system achieved an i2b2 strict micro-F1 of 89.86%.

02

The LSTM-based system scored a higher strict micro-F1 than the CRF-based system on PHI detection.

03

A pre-processing module performed sentence detection and tokenization before de-identification.

Abstract

The CEGS N-GRID 2016 Shared Task 1 in Clinical Natural Language Processing focuses on the de-identification of psychiatric evaluation records. This paper describes two participating systems of our team, based on conditional random fields (CRFs) and long short-term memory networks (LSTMs). A pre-processing module was introduced for sentence detection and tokenization before de-identification. For CRFs, manually extracted rich features were utilized to train the model. For LSTMs, a character-level bi-directional LSTM network was applied to represent tokens and classify tags for each token, following which a decoding layer was stacked to decode the most probable protected health information (PHI) terms. The LSTM-based system attained an i2b2 strict micro-F1 measure of 89.86%, which was higher than that of the CRF-based system.
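The "strict micro-F1" figure quoted in the abstract can be illustrated with a minimal sketch. It assumes that "strict" means a predicted PHI span counts as correct only when its document, character offsets, and PHI type all match a gold span exactly, and that "micro" means counts are pooled over all documents and types before computing precision and recall. The function name, the span tuple layout, and the toy data are hypothetical and are not taken from the paper or from the official evaluation script.

```python
from typing import Iterable, Tuple

# Hypothetical span layout: (doc_id, start_offset, end_offset, phi_type)
Span = Tuple[str, int, int, str]


def strict_micro_f1(gold: Iterable[Span], pred: Iterable[Span]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) with exact-match spans pooled over all documents."""
    gold_set, pred_set = set(gold), set(pred)
    # A true positive requires an identical document, offsets, and PHI type.
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data (made up for illustration only).
gold = [("doc1", 0, 8, "NAME"), ("doc1", 20, 30, "DATE"), ("doc2", 5, 12, "CITY")]
pred = [
    ("doc1", 0, 8, "NAME"),    # exact match
    ("doc1", 20, 29, "DATE"),  # off by one character: wrong under strict matching
    ("doc2", 5, 12, "CITY"),   # exact match
    ("doc2", 40, 44, "AGE"),   # spurious prediction
]
p, r, f = strict_micro_f1(gold, pred)
print(f"precision={p:.4f} recall={r:.4f} f1={f:.4f}")
```

In the toy data, the DATE prediction that misses the gold boundary by one character earns no credit, which is what separates strict matching from more lenient overlap-based scoring.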

Taxonomy

Topics: Topic Modeling · Machine Learning in Healthcare · Natural Language Processing Techniques

Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory