TL;DR
This paper introduces SNIP, an unsupervised algorithm based on sorted neighborhood methods, to efficiently identify and remove duplicate pedigree records in large clinical datasets, improving data quality for hereditary disease research.
Contribution
The paper presents SNIP, a novel adaptation of sorted neighborhood algorithms specifically designed for complex pedigree data deduplication, with demonstrated effectiveness on real-world datasets.
Findings
SNIP accurately detects pedigree duplicates in simulated data.
Applied to real data, SNIP uncovered large duplicate clusters; deduplication removed 33% of pedigrees.
The method improves data quality for hereditary disease analysis.
Abstract
Pedigree data contain family history information that is used to analyze hereditary diseases. These clinical data sets may contain duplicate records because the same family visits a clinic multiple times or a clinician enters multiple versions of a family for testing purposes. Drawing inferences from the data, or using them for training or validation, without removing the duplicates could lead to invalid conclusions, so identifying the duplicates is essential. Since family structures can be complex, existing deduplication algorithms cannot be applied directly. We first motivate the importance of deduplication by examining the impact of pedigree duplicates on the training and validation of a familial risk prediction model. We then introduce an unsupervised algorithm, which we call SNIP (Sorted NeIghborhood for Pedigrees), that builds on the sorted neighborhood method to…
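For readers unfamiliar with the sorted neighborhood method that SNIP builds on, here is a minimal sketch of the generic technique applied to flat records. It is not SNIP itself: the record fields (`n_members`, `proband_age`, `n_affected`), the sorting key, and the matching rule are hypothetical choices made only for illustration. The idea is to sort records by a key, compare each record only with its neighbors inside a sliding window, and merge matching pairs into clusters.

```python
def sorted_neighborhood(records, key, is_match, window=3):
    """Generic sorted neighborhood sketch: return clusters of record indices."""
    # Sort record indices by the blocking key so likely duplicates end up adjacent.
    order = sorted(range(len(records)), key=lambda i: key(records[i]))

    # Union-find structure used to merge matched pairs into clusters.
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    # Compare each record only with the next (window - 1) records in sorted order.
    for pos, a in enumerate(order):
        for b in order[pos + 1:pos + window]:
            if is_match(records[a], records[b]):
                parent[find(a)] = find(b)

    # Collect clusters and keep only those with more than one record.
    groups = {}
    for i in range(len(records)):
        groups.setdefault(find(i), []).append(i)
    return [g for g in groups.values() if len(g) > 1]


# Hypothetical toy "pedigree summaries"; real pedigrees are far richer.
records = [
    {"n_members": 5, "proband_age": 42, "n_affected": 2},
    {"n_members": 3, "proband_age": 30, "n_affected": 0},
    {"n_members": 5, "proband_age": 43, "n_affected": 2},  # repeat visit a year later
    {"n_members": 8, "proband_age": 55, "n_affected": 1},
    {"n_members": 5, "proband_age": 42, "n_affected": 2},  # exact re-entry of record 0
]


def key(r):
    # Assumed sorting key: family size, then proband age.
    return (r["n_members"], r["proband_age"])


def is_match(a, b):
    # Assumed matching rule: same structure counts, proband age within one year.
    return (
        a["n_members"] == b["n_members"]
        and a["n_affected"] == b["n_affected"]
        and abs(a["proband_age"] - b["proband_age"]) <= 1
    )


clusters = sorted_neighborhood(records, key, is_match, window=3)
print(clusters)
```

The window size trades recall for cost: a larger window catches duplicates whose keys sort further apart but requires more pairwise comparisons. According to the abstract, flat-record approaches like this cannot be applied directly to pedigrees because family structures can be complex, which is the gap SNIP addresses.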
