# An oversampling-undersampling strategy for large-scale data linkage

**Authors:** Hossein Hassani, Mohammad Reza Entezarian, Sara Zaeimzadeh, Leila Marvian, Nadejda Komendantova

PMC · DOI: 10.3389/fdata.2025.1542483 · Frontiers in Big Data · 2025-04-23

## TL;DR

This paper proposes a combined oversampling-undersampling method that rebalances skewed class distributions to improve record linkage in large-scale data.

## Contribution

The approach combines oversampling of the minority class with undersampling of the majority classes to address class imbalance in large-scale record linkage.

## Key findings

- The strategy improves accuracy and efficiency in record linkage for imbalanced datasets.
- Sensitivity tests indicated that the strategy remains effective across varying training-test ratios and degrees of imbalance.
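
A sensitivity test of this kind can be pictured as a grid over training-test ratios and imbalance levels. The sketch below is illustrative only: the ratios, minority fractions, dataset size, and the helper names `make_labels` and `stratified_split` are all assumptions, not values or code from the paper. It builds only the grid of splits. A real study would train and score a linkage classifier in each cell.

```python
import random

def make_labels(n, minority_frac):
    """Hypothetical helper: n labels with roughly minority_frac ones (matches)."""
    n_min = max(1, round(n * minority_frac))
    return [1] * n_min + [0] * (n - n_min)

def stratified_split(labels, train_frac, seed=0):
    """Hypothetical helper: split indices per class so both parts keep the class ratio."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = round(len(idx) * train_frac)
        train_idx += idx[:cut]
        test_idx += idx[cut:]
    return train_idx, test_idx

# Assumed grid values, chosen only for illustration.
grid = []
for train_frac in (0.6, 0.7, 0.8):
    for minority_frac in (0.01, 0.05, 0.10):
        labels = make_labels(1000, minority_frac)
        train_idx, test_idx = stratified_split(labels, train_frac)
        n_min_train = sum(labels[i] for i in train_idx)
        grid.append((train_frac, minority_frac, len(train_idx), n_min_train))

for row in grid:
    print(row)
```

Each row records one cell of the grid: the training fraction, the minority fraction, the training-set size, and the number of minority examples that landed in the training set.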

## Abstract

Effective record linkage in big data, particularly in imbalanced datasets, is a critical yet highly challenging task due to the inherent complexity involved. This article uses an oversampling-undersampling strategy to address linkage imbalances, enabling more accurate and efficient record linkage within large-scale datasets. The strategy increases the number of minority-class instances and reduces the dominance of the majority classes, yielding a more balanced dataset for training and testing. Sensitivity testing was carried out by varying the training-test ratio and the degree of imbalance.
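
The rebalancing idea in the abstract can be sketched with plain random resampling. This is a minimal illustration, not the authors' procedure: the function name `rebalance`, the `target_per_class` parameter, and the toy data are assumptions. It also assumes that true matches form the minority class among candidate record pairs, and it uses random duplication in place of whatever oversampler the paper applies.

```python
import random
from collections import Counter

def rebalance(items, labels, target_per_class, seed=0):
    """Resample every class to exactly target_per_class items.

    Classes larger than the target are undersampled without replacement;
    smaller classes keep all originals and are padded with random duplicates.
    """
    rng = random.Random(seed)
    by_class = {}
    for item, label in zip(items, labels):
        by_class.setdefault(label, []).append(item)

    out_items, out_labels = [], []
    for label, members in by_class.items():
        if len(members) >= target_per_class:
            chosen = rng.sample(members, target_per_class)   # undersample
        else:
            extra = rng.choices(members, k=target_per_class - len(members))
            chosen = members + extra                         # oversample
        out_items.extend(chosen)
        out_labels.extend([label] * target_per_class)
    return out_items, out_labels

# Toy candidate pairs (assumed): 1 = match (minority), 0 = non-match (majority).
labels = [1] * 5 + [0] * 95
pairs = list(range(100))
bal_pairs, bal_labels = rebalance(pairs, labels, target_per_class=20)
print(Counter(bal_labels))
```

Resampling would normally be applied to the training split only, so that the test split keeps the original class distribution.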

## Full-text entities

- **Diseases:** Cancer (MESH:D009369), NB (MESH:D000074021)
- **Methods:** SMOTE (Synthetic Minority Over-sampling Technique)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12055850/full.md

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12055850/full.md

## References

22 references — full list in the complete paper: https://tomesphere.com/paper/PMC12055850/full.md

---
Source: https://tomesphere.com/paper/PMC12055850