SNIP: An Adaptation of Sorted Neighborhood Methods for Deduplicating Pedigree Data

Theodore Huang; Matthew Ploenzke; Danielle Braun

arXiv:2108.08773·stat.AP·August 20, 2021

TL;DR

This paper introduces SNIP, an unsupervised algorithm based on sorted neighborhood methods, to efficiently identify and remove duplicate pedigree records in large clinical datasets, improving data quality for hereditary disease research.

Contribution

The paper presents SNIP, a novel adaptation of sorted neighborhood algorithms specifically designed for complex pedigree data deduplication, with demonstrated effectiveness on real-world datasets.

Findings

01 SNIP accurately detects pedigree duplicates in simulated data.

02 Applied to real data, SNIP uncovered large clusters of duplicates; removing them eliminated 33% of pedigrees.

03 The method improves data quality for hereditary disease analysis.

Abstract

Pedigree data contain family history information that is used to analyze hereditary diseases. These clinical data sets may contain duplicate records due to the same family visiting a clinic multiple times or a clinician entering multiple versions of the family for testing purposes. Drawing inferences from the data, or using them for training or validation, without removing the duplicates could lead to invalid conclusions, and hence identifying the duplicates is essential. Since family structures can be complex, existing deduplication algorithms cannot be applied directly. We first motivate the importance of deduplication by examining the impact of pedigree duplicates on the training and validation of a familial risk prediction model. We then introduce an unsupervised algorithm, which we call SNIP (Sorted NeIghborhood for Pedigrees), that builds on the sorted neighborhood method to…
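For readers unfamiliar with the sorted neighborhood method that SNIP builds on, the following is a minimal sketch of the generic technique only, not of SNIP itself. The record fields (`n_relatives`, `proband_age`, `n_affected`), the sorting key, the matching rule, and the window size are all hypothetical choices made for illustration; the listing above does not specify how SNIP summarizes or compares pedigrees. The idea is to sort records by a key so that likely duplicates end up near each other, compare only records that fall within a small sliding window, and then merge the matched pairs into clusters.

```python
def sorted_neighborhood_pairs(records, key, is_match, window=3):
    """Generic sorted neighborhood pass: sort by key, compare within a window.

    Returns a set of (i, j) index pairs, i < j, flagged as duplicates.
    """
    # Sort record indices by the key so similar records become neighbors.
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        # Compare each record only with the next (window - 1) records.
        for j in order[pos + 1 : pos + window]:
            if is_match(records[i], records[j]):
                pairs.add((min(i, j), max(i, j)))
    return pairs


def cluster(n, pairs):
    """Merge matched pairs into duplicate clusters with union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical pedigree summaries (illustrative fields, not from the paper).
records = [
    {"n_relatives": 12, "proband_age": 45, "n_affected": 2},  # 0
    {"n_relatives": 8, "proband_age": 60, "n_affected": 1},   # 1
    {"n_relatives": 12, "proband_age": 46, "n_affected": 2},  # 2: same family, a year later
    {"n_relatives": 8, "proband_age": 33, "n_affected": 0},   # 3
    {"n_relatives": 12, "proband_age": 45, "n_affected": 2},  # 4: exact copy of 0
]


def key(r):
    # Assumed sorting key: family size first, then proband age.
    return (r["n_relatives"], r["proband_age"])


def is_match(a, b):
    # Assumed matching rule: same structure, proband age within one year.
    return (
        a["n_relatives"] == b["n_relatives"]
        and a["n_affected"] == b["n_affected"]
        and abs(a["proband_age"] - b["proband_age"]) <= 1
    )


pairs = sorted_neighborhood_pairs(records, key, is_match, window=3)
clusters = cluster(len(records), pairs)
print(clusters)  # [[0, 2, 4], [1], [3]]
```

With a window of size w over n sorted records, the pass makes at most n(w - 1) comparisons instead of the n(n - 1)/2 needed to compare every pair, which is what makes this family of methods attractive for large data sets. The trade-off is that duplicates the key sorts far apart are never compared, so the choice of key matters.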
