# De-identification In practice

**Authors:** Besat Kassaie

arXiv: 1701.03129 · 2017-01-13

## TL;DR

This paper explores de-identifying sensitive medical information in texts using NLP techniques, specifically word embeddings and LSTM neural networks, achieving promising initial results without manual feature engineering.

## Contribution

It demonstrates the feasibility of using word embeddings and LSTM models for de-identification in medical texts without manual feature extraction.

## Key findings

- Promising initial results in identifying sensitive data, despite a small medical dataset
- CBOW word vectors fed into an LSTM work without manually crafted features
- Larger-scale experiments with careful parameter tuning may improve results
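
To make the CBOW idea concrete, here is a minimal sketch of how context/target pairs can be formed and how context vectors are averaged before predicting the center word. The function names `cbow_pairs` and `average`, the window size, and the toy sentence are illustrative assumptions, not details taken from the paper.

```python
def cbow_pairs(tokens, window=2):
    """Return (context_words, target_word) pairs for CBOW-style training."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]        # up to `window` words before
        right = tokens[i + 1:i + 1 + window]       # up to `window` words after
        pairs.append((left + right, target))
    return pairs


def average(vectors):
    """Element-wise mean of equal-length vectors (CBOW's context representation)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


if __name__ == "__main__":
    # Hypothetical toy sentence; a real corpus would be far larger.
    sentence = ["patient", "was", "seen", "by", "dr", "smith"]
    for context, target in cbow_pairs(sentence, window=2):
        print(context, "->", target)
```

In CBOW the averaged context vector is used to predict the target word, and the learned input vectors become the word embeddings.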

## Abstract

We report our effort to identify sensitive information, a subset of the data items listed by HIPAA (Health Insurance Portability and Accountability Act), in medical text using recent advances in natural language processing and machine learning. We represent words as high-dimensional continuous vectors learned by a variant of Word2Vec called Continuous Bag Of Words (CBOW). We feed the word vectors into a simple neural network with a Long Short-Term Memory (LSTM) architecture. Without any manually crafted features, and even though our medical dataset is too small for training a neural network, we obtained promising results. These results encourage us to pursue a larger-scale version of the project with careful parameter tuning and other possible improvements.
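
The pipeline described in the abstract (word vectors in, LSTM over the sequence, one decision per token) can be sketched roughly as follows. This is an untrained, pure-Python forward pass for illustration only: the class name `TinyLSTMTagger`, the dimensions, the omission of bias terms, and the binary "sensitive or not" output are all assumptions, not the paper's actual model or label scheme.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class TinyLSTMTagger:
    """Untrained LSTM emitting one probability per token (illustration only)."""

    def __init__(self, embed_dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        self.hidden_dim = hidden_dim
        # One weight matrix per gate, applied to the concatenation [h_prev; x].
        # Bias terms are omitted to keep the sketch short (an assumption).
        self.gates = {
            name: random_matrix(hidden_dim, hidden_dim + embed_dim, rng)
            for name in ("input", "forget", "output", "candidate")
        }
        self.out = [rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)]

    def step(self, x, h, c):
        z = h + x  # list concatenation of previous hidden state and input
        i = [sigmoid(v) for v in matvec(self.gates["input"], z)]
        f = [sigmoid(v) for v in matvec(self.gates["forget"], z)]
        o = [sigmoid(v) for v in matvec(self.gates["output"], z)]
        g = [math.tanh(v) for v in matvec(self.gates["candidate"], z)]
        c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
        return h_new, c_new

    def tag(self, vectors):
        """Return one score in (0, 1) per input word vector."""
        h = [0.0] * self.hidden_dim
        c = [0.0] * self.hidden_dim
        probs = []
        for x in vectors:
            h, c = self.step(x, h, c)
            probs.append(sigmoid(sum(w * hj for w, hj in zip(self.out, h))))
        return probs


if __name__ == "__main__":
    # Hypothetical 3-dimensional embeddings standing in for CBOW vectors.
    embeddings = {
        "dr": [0.2, -0.1, 0.4],
        "smith": [0.5, 0.3, -0.2],
        "examined": [-0.3, 0.1, 0.0],
    }
    tokens = ["dr", "smith", "examined"]
    tagger = TinyLSTMTagger(embed_dim=3, hidden_dim=4)
    for token, p in zip(tokens, tagger.tag([embeddings[t] for t in tokens])):
        print(f"{token}: {p:.3f}")
```

With random weights the scores carry no meaning; in practice the gate and output weights would be learned from labeled text, and a framework such as PyTorch or TensorFlow would replace the hand-written cell.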

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1701.03129/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/1701.03129/full.md

## References

34 references — full list in the complete paper: https://tomesphere.com/paper/1701.03129/full.md

---
Source: https://tomesphere.com/paper/1701.03129