De-identification of Unstructured Clinical Texts from Sequence to Sequence Perspective

Md Monowar Anjum; Noman Mohammed; Xiaoqian Jiang

arXiv:2108.07971·cs.CL·September 13, 2021

TL;DR

This paper recasts de-identification of unstructured clinical text as a sequence-to-sequence learning problem. In early experiments the approach reaches 98.91% recall on the i2b2 dataset, which is comparable to current state-of-the-art de-identification models.

Contribution

It reformulates clinical text de-identification as a sequence-to-sequence learning task rather than a token classification task, drawing on recent sequence-to-sequence results for named entity recognition.
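
To make the sequence-to-sequence framing concrete, the sketch below converts a token-labelled sentence (the usual token classification view) into a source/target text pair that a sequence-to-sequence model could be trained on. The target format, the placeholder tags such as `<NAME>`, the label scheme, and the function name `to_seq2seq_pair` are all assumptions made for illustration; this page does not specify the exact encoding the authors use.

```python
from typing import List, Tuple


def to_seq2seq_pair(tokens: List[str], labels: List[str]) -> Tuple[str, str]:
    """Turn a token-labelled sentence into a (source, target) text pair.

    Assumed label scheme (hypothetical, for this sketch only): "O" marks a
    non-PHI token, any other value is a PHI category such as "NAME" or "DATE".
    """
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")

    target: List[str] = []
    prev = "O"
    for token, label in zip(tokens, labels):
        if label == "O":
            # Non-PHI tokens are copied to the target unchanged.
            target.append(token)
        elif label != prev:
            # First token of a PHI run: emit one placeholder for the whole run.
            # Note: adjacent runs of the same category collapse into one tag.
            target.append(f"<{label}>")
        prev = label
    return " ".join(tokens), " ".join(target)


tokens = ["Mr.", "John", "Smith", "was", "admitted", "on", "March", "3", "."]
labels = ["O", "NAME", "NAME", "O", "O", "O", "DATE", "DATE", "O"]

source, target = to_seq2seq_pair(tokens, labels)
print(source)  # Mr. John Smith was admitted on March 3 .
print(target)  # Mr. <NAME> was admitted on <DATE> .
```

Under this assumed encoding, a token classifier would predict the `labels` list directly, whereas a sequence-to-sequence model would read `source` and generate `target` as free text.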

Findings

01

Achieved 98.91% recall on the i2b2 dataset in early experiments

02

Comparable performance to state-of-the-art models

03

Proposed a new problem formulation for de-identification

Abstract

In this work, we propose a novel problem formulation for de-identification of unstructured clinical text. We formulate the de-identification problem as a sequence-to-sequence learning problem instead of a token classification problem. Our approach is inspired by the recent state-of-the-art performance of sequence-to-sequence learning models for named entity recognition. Early experiments with our proposed approach achieved a 98.91% recall on the i2b2 dataset. This performance is comparable to that of current state-of-the-art models for unstructured clinical text de-identification.
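
For readers unfamiliar with the metric, recall is the fraction of true identifiers that the system finds: TP / (TP + FN). The sketch below computes it over sets of annotated spans. The span representation `(start, end, category)`, the exact-match criterion, and the example offsets are assumptions for illustration; this page does not state whether the reported 98.91% was measured at the token level or the entity level.

```python
from typing import Set, Tuple

# A span is (start offset, end offset, PHI category); hypothetical representation.
Span = Tuple[int, int, str]


def recall(gold: Set[Span], predicted: Set[Span]) -> float:
    """Exact-match recall: share of gold PHI spans that were predicted."""
    if not gold:
        raise ValueError("recall is undefined when there are no gold spans")
    true_positives = len(gold & predicted)
    return true_positives / len(gold)


# Made-up example: four gold identifiers, three of them recovered,
# plus one spurious prediction (which affects precision, not recall).
gold = {(4, 14, "NAME"), (31, 38, "DATE"), (50, 62, "PHONE"), (70, 80, "LOCATION")}
predicted = {(4, 14, "NAME"), (31, 38, "DATE"), (50, 62, "PHONE"), (90, 95, "NAME")}

print(recall(gold, predicted))  # 0.75
```

Recall is commonly emphasized in de-identification because a missed identifier (a false negative) leaks protected information, while a false positive only removes some non-sensitive text.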
