Distributed Record Linkage in Healthcare Data with Apache Spark
Mohammad Heydari, Reza Sarshar, Mohammad Ali Soltanshahi

TL;DR
This paper presents a novel distributed record linkage model using Apache Spark's machine learning library to improve healthcare data integration, addressing data imbalance and validating model effectiveness.
Contribution
The study introduces a new distributed data-matching approach leveraging Spark MLlib, specifically handling data imbalance with SVM and Regression algorithms for healthcare data linkage.
Findings
The model effectively handles data imbalance.
Results show the model is neither over-fitted nor under-fitted.
The distributed approach improves healthcare data integration.
Abstract
Healthcare data is a valuable resource for research, analysis, and decision-making in the medical field. However, healthcare data is often fragmented and distributed across various sources, making it challenging to combine and analyze effectively. Record linkage, also known as data matching, is a crucial step in integrating and cleaning healthcare data to ensure data quality and accuracy. Apache Spark, a powerful open-source distributed big data processing framework, provides a robust platform for performing record linkage tasks with the aid of its machine learning library. In this study, we developed a new distributed data-matching model based on the Apache Spark Machine Learning library. To ensure the correct functioning of our model, the validation phase has been performed on the training data. The main challenge is data imbalance because a large amount of data is labeled false, and…
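The imbalance mentioned in the abstract follows from how record linkage is framed: every pair of records is a candidate, and only a small share of those pairs refer to the same patient. The sketch below illustrates this on a handful of made-up records and computes inverse-frequency class weights, one common way to compensate for a rare positive class. The records, field layout, and function names are hypothetical, and the weighting scheme is an assumption for illustration; the text shown here does not say which technique the authors used.

```python
from itertools import combinations

# Hypothetical toy records: (record_id, true_patient_id, name, birth_year).
# These are invented for illustration and are not from the paper's dataset.
records = [
    ("a1", 1, "Sara Ahmadi", 1980),
    ("b1", 1, "Sarah Ahmadi", 1980),
    ("a2", 2, "Ali Rezaei", 1975),
    ("b2", 2, "Ali Rezaie", 1975),
    ("a3", 3, "Maryam Karimi", 1990),
    ("a4", 4, "Hossein Moradi", 1962),
]


def label_pairs(rows):
    """Label every unordered record pair: 1 if both share a patient id, else 0."""
    return [
        (left[0], right[0], int(left[1] == right[1]))
        for left, right in combinations(rows, 2)
    ]


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Assumed scheme for illustration: the rarer a class, the larger its weight.
    """
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    n = len(labels)
    return {y: n / (len(counts) * c) for y, c in counts.items()}


pairs = label_pairs(records)
labels = [y for _, _, y in pairs]
weights = inverse_frequency_weights(labels)

print(f"candidate pairs: {len(pairs)}, true matches: {sum(labels)}")
print(f"class weights: {weights}")
```

Six records already yield 15 candidate pairs with only 2 true matches, and the gap widens quickly as the number of records grows. In Spark's DataFrame-based ML API, estimators such as `LinearSVC` and `LogisticRegression` accept a `weightCol` parameter, which is one place weights like these could be supplied; whether the authors did so is not stated in the text shown here.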
Taxonomy
Topics: Data Quality and Management · Artificial Intelligence in Healthcare
Methods: Support Vector Machine
