Deduplication in a massive clinical note dataset
Sanjeev Shenoy, Tsung-Ting Kuo, Rodney Gabriel, Julian McAuley, and Chun-Nan Hsu

TL;DR
This paper introduces a scalable, accurate deduplication method for massive clinical note datasets using Minhashing, Locality Sensitive Hashing, and clustering techniques to efficiently identify and remove near duplicates.
Contribution
It presents a novel scalable deduplication approach combining Minhashing, Locality Sensitive Hashing, and clustering for large clinical datasets.
Findings
Effective detection of near duplicates in over 10 million notes
Scalable algorithm with improved speed and accuracy
Clustering enhances deduplication efficiency
Abstract
Duplication, whether exact or partial, is a common issue in many datasets. In clinical notes data, duplication (and near duplication) can arise for many reasons, such as the pervasive use of templates, copy-pasting, or notes being generated by automated procedures. A key challenge in removing such near duplicates is the size of such datasets; our own dataset consists of more than 10 million notes. Detecting and correcting such duplicates requires algorithms that are both accurate and highly scalable. We describe a solution based on Minhashing with Locality Sensitive Hashing. In this paper, we present the theory behind this method and a database-inspired approach to make the method scalable. We also present a clustering technique using disjoint sets to produce dense clusters, which speeds up our algorithm.
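The pipeline the abstract describes (Minhash signatures, Locality Sensitive Hashing to find candidate pairs, disjoint sets to merge them into clusters) can be illustrated with a minimal in-memory sketch. This is not the authors' database-inspired implementation: the word-shingle width, the 64-hash signature, the 16-band split, the 0.7 similarity threshold, and every function name below are assumptions chosen for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

# All three constants are illustrative assumptions, not values from the paper.
NUM_HASHES = 64   # length of each Minhash signature
BANDS = 16        # number of LSH bands (rows per band = NUM_HASHES // BANDS)
SHINGLE = 5       # words per shingle


def shingles(text, k=SHINGLE):
    """Return the set of k-word shingles of a note."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed, token):
    """Deterministic 64-bit hash; each seed acts as a separate hash function."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(text):
    """Signature = minimum hash value over the shingle set, per hash function."""
    sh = shingles(text)
    return [min(_hash(seed, s) for s in sh) for seed in range(NUM_HASHES)]


class DisjointSet:
    """Union-find with path halving; the smaller index becomes the root."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[max(ra, rb)] = min(ra, rb)


def cluster_near_duplicates(notes, threshold=0.7):
    """Group note indices whose estimated Jaccard similarity meets threshold."""
    sigs = [minhash_signature(n) for n in notes]
    rows = NUM_HASHES // BANDS

    # LSH: notes sharing an identical band of the signature become candidates.
    buckets = defaultdict(list)
    for idx, sig in enumerate(sigs):
        for b in range(BANDS):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].append(idx)

    ds = DisjointSet(len(notes))
    for members in buckets.values():
        for i, j in combinations(members, 2):
            if ds.find(i) == ds.find(j):
                continue  # already clustered together, skip the comparison
            # Fraction of matching signature positions estimates Jaccard.
            est = sum(x == y for x, y in zip(sigs[i], sigs[j])) / NUM_HASHES
            if est >= threshold:
                ds.union(i, j)

    clusters = defaultdict(list)
    for idx in range(len(notes)):
        clusters[ds.find(idx)].append(idx)
    return sorted(clusters.values())
```

Calling `cluster_near_duplicates(notes)` on a list of note strings returns lists of note indices, one list per cluster. Skipping pairs that already share a root is one plausible way disjoint sets can reduce the number of pairwise comparisons; whether this matches the speed-up mechanism in the paper is an assumption.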
Taxonomy
Topics: Algorithms and Data Compression · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques
