Distributed Record Linkage in Healthcare Data with Apache Spark

Mohammad Heydari; Reza Sarshar; Mohammad Ali Soltanshahi

arXiv:2404.07939 · cs.DC · April 12, 2024 · 1 citation

Open Access

TL;DR

This paper presents a novel distributed record linkage model using Apache Spark's machine learning library to improve healthcare data integration, addressing data imbalance and validating model effectiveness.

Contribution

The study introduces a new distributed data-matching approach leveraging Spark MLlib, specifically handling data imbalance with SVM and Regression algorithms for healthcare data linkage.
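
One common way to counter label imbalance when training a classifier is to weight each training row inversely to its class frequency. The sketch below illustrates that general idea only; this summary does not say how the authors handle imbalance, so the function name `balancing_weights` and the toy labels are hypothetical. In Spark MLlib, per-row weights like these could be supplied through the `weightCol` parameter of estimators such as `LinearSVC` or `LogisticRegression`.

```python
from collections import Counter


def balancing_weights(labels):
    """Return one weight per row so that every class has the same total weight.

    Assumption: inverse-frequency weighting, a generic technique that is
    not confirmed as the paper's method.
    """
    counts = Counter(labels)
    n_classes = len(counts)
    total = len(labels)
    # weight = total / (n_classes * class_count); rarer classes get larger weights
    per_class = {label: total / (n_classes * count) for label, count in counts.items()}
    return [per_class[label] for label in labels]


# Hypothetical candidate pairs: 1 = match, 0 = non-match (heavily imbalanced).
labels = [0] * 8 + [1] * 2
weights = balancing_weights(labels)
```

With these weights, the eight non-matches and the two matches each contribute the same total weight to the training loss, so the majority class no longer dominates the fit.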

Findings

01 Model effectively handles data imbalance.

02 Results show the model is neither over-fitted nor under-fitted.

03 Distributed approach improves healthcare data integration.

Abstract

Healthcare data is a valuable resource for research, analysis, and decision-making in the medical field. However, healthcare data is often fragmented and distributed across various sources, making it challenging to combine and analyze effectively. Record linkage, also known as data matching, is a crucial step in integrating and cleaning healthcare data to ensure data quality and accuracy. Apache Spark, a powerful open-source distributed big data processing framework, provides a robust platform for performing record linkage tasks with the aid of its machine learning library. In this study, we developed a new distributed data-matching model based on the Apache Spark Machine Learning library. To ensure the correct functioning of our model, the validation phase has been performed on the training data. The main challenge is data imbalance because a large amount of data is labeled false, and…
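
To make the record linkage step concrete, the following minimal sketch turns a pair of candidate records into a vector of per-field similarity scores, which a classifier can then label as match or non-match. The field names, the sample records, and the choice of `difflib.SequenceMatcher` as the similarity measure are assumptions for illustration; the abstract does not state which comparison features the authors use.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def compare(rec_a, rec_b, fields):
    """Build a feature vector with one similarity score per compared field."""
    return [similarity(rec_a[f], rec_b[f]) for f in fields]


# Hypothetical patient records from two different sources.
a = {"name": "Jon Smith", "dob": "1980-01-02"}
b = {"name": "John Smith", "dob": "1980-01-02"}

features = compare(a, b, ["name", "dob"])
```

Here the date of birth matches exactly and the names differ by one character, so both scores are high; a trained classifier would see such a vector as evidence that the two records refer to the same person.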

Taxonomy

Topics: Data Quality and Management · Artificial Intelligence in Healthcare

Methods: Support Vector Machine