Deduplication in a massive clinical note dataset

Sanjeev Shenoy; Tsung-Ting Kuo; Rodney Gabriel; Julian McAuley; Chun-Nan Hsu

arXiv:1704.05617·cs.DB·April 20, 2017·1 cites

Open Access

TL;DR

This paper introduces a scalable, accurate deduplication method for massive clinical note datasets using Minhashing, Locality Sensitive Hashing, and clustering techniques to efficiently identify and remove near duplicates.

Contribution

It presents a novel scalable deduplication approach combining Minhashing, Locality Sensitive Hashing, and clustering for large clinical datasets.

Findings

01. Effective detection of near duplicates in over 10 million notes
02. Scalable algorithm with improved speed and accuracy
03. Clustering enhances deduplication efficiency

Abstract

Duplication, whether exact or partial, is a common issue in many datasets. In clinical notes data, duplication (and near duplication) can arise for many reasons, such as the pervasive use of templates, copy-pasting, or notes being generated by automated procedures. A key challenge in removing such near duplicates is the size of such datasets; our own dataset consists of more than 10 million notes. Detecting and correcting such duplicates requires algorithms that are both accurate and highly scalable. We describe a solution based on Minhashing with Locality Sensitive Hashing. In this paper, we present the theory behind this method and a database-inspired approach to make the method scalable. We also present a clustering technique using disjoint sets to produce dense clusters, which speeds up our algorithm.
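To make the pipeline named in the abstract more concrete, the following is a minimal Python sketch of Minhashing, banded Locality Sensitive Hashing, and disjoint-set clustering. It is an illustration under stated assumptions, not the authors' implementation: the function names, the word 5-gram shingling, the use of BLAKE2b as the hash family, and the parameters (`num_hashes=64`, `bands=16`, `threshold=0.7`) are all hypothetical choices, and the paper's database-inspired scaling approach is not reproduced here.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=5):
    """Return the set of word k-grams of a note (k=5 is an assumed choice)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed, shingle):
    """Seeded 64-bit hash; BLAKE2b stands in for a random hash family."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(shingle_set, num_hashes=64):
    """One minimum per seeded hash function; equal slots estimate Jaccard."""
    return tuple(min(_hash(seed, s) for s in shingle_set)
                 for seed in range(num_hashes))


class DisjointSet:
    """Union-find with path halving, used to merge duplicate pairs."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[max(ra, rb)] = min(ra, rb)


def cluster_near_duplicates(notes, num_hashes=64, bands=16, threshold=0.7):
    """Group note indices whose estimated Jaccard similarity >= threshold."""
    rows = num_hashes // bands
    sigs = [minhash_signature(shingles(n), num_hashes) for n in notes]

    # LSH: notes sharing any full band of their signature become candidates.
    buckets = defaultdict(list)
    for idx, sig in enumerate(sigs):
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].append(idx)

    ds = DisjointSet(len(notes))
    for members in buckets.values():
        for i, j in combinations(members, 2):
            if ds.find(i) == ds.find(j):
                continue  # already in the same cluster, skip the comparison
            matches = sum(x == y for x, y in zip(sigs[i], sigs[j]))
            if matches / num_hashes >= threshold:
                ds.union(i, j)

    clusters = defaultdict(list)
    for idx in range(len(notes)):
        clusters[ds.find(idx)].append(idx)
    return sorted(clusters.values())
```

Calling `cluster_near_duplicates(notes)` on a list of note strings returns lists of indices, one list per cluster. In this sketch, candidate pairs that already share a disjoint-set root are not compared again, which is one plausible way the clustering step could reduce work; whether this matches the speed-up mechanism in the paper is an assumption.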

Taxonomy

Topics: Algorithms and Data Compression · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques