Exploring the Complexity of Real‐World Health Data Record Linkage—An Exemplary Study Linking Cancer Registry and Claims Data
Nadja Lendle, Bianca Kollhorst, Timm Intemann

TL;DR
This study explores why linking health data on quasi-identifiers fails and finds that informed, machine-learning-based algorithms trained on gold standard links give the best linkage quality, with gradient boosting reaching 77% precision and 81% recall.
Contribution
The study introduces informed linkage algorithms using gold standard links and compares machine learning methods for health data linkage.
Findings
Gradient boosting achieved the best performance with 77% precision and 81% recall.
33% of cancer registry patients could not be uniquely identified using quasi-identifiers.
Using unique identifiers from a subsample improves linkage quality for the entire dataset.
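The uniqueness finding above can be made concrete with a small sketch. It is not the authors' analysis: the records, the field names (`birth_year`, `sex`, `region`, `dx_quarter`) and the helper `share_not_unique` are hypothetical, chosen only to show how one might measure the share of records that share their quasi-identifier combination with at least one other record.

```python
from collections import Counter

# Hypothetical toy registry; field names and values are illustrative assumptions,
# not the quasi-identifiers used in the study.
registry = [
    {"id": "r1", "birth_year": 1950, "sex": "F", "region": "A", "dx_quarter": "2015Q1"},
    {"id": "r2", "birth_year": 1950, "sex": "F", "region": "A", "dx_quarter": "2016Q3"},
    {"id": "r3", "birth_year": 1962, "sex": "M", "region": "B", "dx_quarter": "2015Q2"},
    {"id": "r4", "birth_year": 1971, "sex": "F", "region": "C", "dx_quarter": "2017Q1"},
    {"id": "r5", "birth_year": 1971, "sex": "M", "region": "C", "dx_quarter": "2017Q1"},
]


def share_not_unique(records, fields):
    """Fraction of records whose quasi-identifier combination is not unique."""
    keys = [tuple(r[f] for f in fields) for r in records]
    counts = Counter(keys)
    ambiguous = sum(1 for k in keys if counts[k] > 1)
    return ambiguous / len(records)


# Coarse key: r1 and r2 collide, so 2 of 5 records are ambiguous.
coarse = share_not_unique(registry, ("birth_year", "sex", "region"))
# Adding a further (hypothetical) field separates them in this toy data.
fine = share_not_unique(registry, ("birth_year", "sex", "region", "dx_quarter"))
```

In this toy data the coarse key leaves two of five records ambiguous, and adding one more field resolves them. Real data need not behave this way, since additional fields can also introduce disagreement between sources.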
Abstract
Record linkage based on quasi‐identifiers remains an important approach as not every data source provides a comprehensive unique identifier. In this study, the reasons for the failure of a linkage based on quasi‐identifiers were examined. Furthermore, informed algorithms using information on gold standard links were developed to investigate the potentially achievable linkage quality based on quasi‐identifiers. The study population includes patients from an antidiabetic cohort from German claims and colorectal cancer patients from two German cancer registries. Linkage algorithms were applied using information on gold standard links. Informed linkage algorithms based on deterministic linkage, logistic regression, random forests, gradient boosting, and neural networks were derived and compared. Descriptive analyses were performed to identify reasons for the failure of linkage, such as…
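As a rough illustration of how linkage quality can be measured against gold standard links, the sketch below runs a plain exact-match deterministic linkage on quasi-identifiers and computes precision and recall. It assumes hypothetical records, field names and helper functions (`deterministic_link`, `precision_recall`) and is not the study's algorithm. An informed variant would replace the exact-match rule with a model, such as gradient boosting, trained on the gold standard links.

```python
from collections import defaultdict

# Hypothetical quasi-identifier fields; not those used in the study.
FIELDS = ("birth_year", "sex", "region")

claims = [
    {"id": "c1", "birth_year": 1950, "sex": "F", "region": "A"},
    {"id": "c2", "birth_year": 1950, "sex": "F", "region": "A"},  # different person, same key
    {"id": "c3", "birth_year": 1962, "sex": "M", "region": "B"},
    {"id": "c4", "birth_year": 1971, "sex": "F", "region": "C"},
]
registry = [
    {"id": "r1", "birth_year": 1950, "sex": "F", "region": "A"},
    {"id": "r3", "birth_year": 1962, "sex": "M", "region": "B"},
    {"id": "r4", "birth_year": 1971, "sex": "F", "region": "D"},  # region differs between sources
]
# Gold standard links, e.g. as known from a unique identifier in a subsample.
gold = {("c1", "r1"), ("c3", "r3"), ("c4", "r4")}


def deterministic_link(left, right, fields):
    """Link every pair of records that agrees exactly on all given fields."""
    index = defaultdict(list)
    for r in right:
        index[tuple(r[f] for f in fields)].append(r["id"])
    links = set()
    for l in left:
        for right_id in index.get(tuple(l[f] for f in fields), []):
            links.add((l["id"], right_id))
    return links


def precision_recall(predicted, gold_links):
    """Precision and recall of predicted links against gold standard links."""
    true_positives = len(predicted & gold_links)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold_links) if gold_links else 0.0
    return precision, recall


predicted = deterministic_link(claims, registry, FIELDS)
precision, recall = precision_recall(predicted, gold)
```

The toy data shows two typical failure modes: a false link where two people share the same quasi-identifier values (`c2` to `r1`), and a missed link where one field disagrees between the sources (`c4` to `r4`).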
Taxonomy
Topics: Data Quality and Management · Data-Driven Disease Surveillance · Ethics in Clinical Research
