# Data preparation method for machine learning-based breast cancer risk prediction: A Cuban case study

**Authors:** Jose Manuel Valencia-Moreno, Everardo Gutierrez-Lopez, Jose Angel Gonzalez-Fraga, Rodolfo Alan Martinez Rodriguez, Olivia Denisse Victoria Mejia, Alma Alejandra Soberano Serrano

PMC · DOI: 10.1016/j.mex.2025.103688 · MethodsX · 2025-10-28

## TL;DR

This paper introduces a dataset of breast cancer risk factors from Cuban women and a reproducible preprocessing method that prepares it for machine learning-based risk prediction in public health.

## Contribution

A reproducible data preparation method for breast cancer risk prediction tailored to a Cuban population.

## Key findings

- A dataset of 1697 Cuban women's breast cancer risk factors was collected and processed.
- The preprocessing method ensured data quality and compatibility with machine learning techniques.
- Prediction models showed consistent performance across multiple metrics after preprocessing.
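
As a rough illustration of what a quality-control and variable-enrichment pass can look like, the sketch below filters a handful of records and derives one extra variable. It is not the authors' pipeline: the `Record` fields, the validity rules (age bounds, menarche before current age), and the derived `years_since_menarche` variable are all hypothetical assumptions chosen for the example, and the sample values used with it are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Record:
    """One hypothetical participant row; field names are illustrative only."""
    record_id: int
    age: Optional[int]
    menarche_age: Optional[int]


def quality_control(records):
    """Split records into (clean, rejected); rejected holds (id, reason) pairs."""
    clean, rejected, seen = [], [], set()
    for r in records:
        if r.record_id in seen:
            # A repeated identifier is treated as a duplicate entry.
            rejected.append((r.record_id, "duplicate id"))
            continue
        seen.add(r.record_id)
        if r.age is None or r.menarche_age is None:
            rejected.append((r.record_id, "missing value"))
        elif not 0 < r.age <= 120:  # assumed plausibility bounds
            rejected.append((r.record_id, "age out of range"))
        elif r.menarche_age >= r.age:  # assumed consistency rule
            rejected.append((r.record_id, "inconsistent ages"))
        else:
            clean.append(r)
    return clean, rejected


def enrich(record):
    """Return the record as a dict plus one derived (hypothetical) variable."""
    return {
        "record_id": record.record_id,
        "age": record.age,
        "menarche_age": record.menarche_age,
        "years_since_menarche": record.age - record.menarche_age,
    }
```

Keeping the rejection reason next to each dropped identifier is one simple way to make the filtering step traceable, which is the property the summary emphasizes.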

## Abstract

This article presents a dataset of breast cancer risk factors collected from 1697 Cuban women between 2001 and 2018, intended to support the development and validation of predictive models of breast cancer risk in public health. A reproducible methodology for quality control and variable enrichment was implemented to ensure data integrity and compatibility with machine learning techniques.

- Reproducible preprocessing methodology to ensure data quality and traceability.
- Open breast cancer risk factor dataset for epidemiological studies and risk assessment using machine learning.
- Consistent prediction model performance across multiple metrics after data preprocessing.


## Linked entities

- **Diseases:** breast cancer (MONDO:0004989)

## Full-text entities

- **Diseases:** breast cancer (MESH:D001943)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12637264/full.md

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12637264/full.md

## References

20 references — full list in the complete paper: https://tomesphere.com/paper/PMC12637264/full.md

---
Source: https://tomesphere.com/paper/PMC12637264