# Machine Learning Techniques Used for the Identification of Sociodemographic Factors Associated With Cancer: Systematic Literature Review

**Authors:** Liz González-Infante, Gaston Marquez, Solange Parra-Soto, Mónica Cardona-Valencia, Carla Taramasco

PMC · DOI: 10.2196/79187 · 2026-01-28

## TL;DR

This paper reviews how machine learning is used to study the link between sociodemographic factors and cancer outcomes, highlighting gaps and opportunities for more equitable cancer care.

## Contribution

The paper systematically reviews ML applications in identifying sociodemographic factors linked to cancer outcomes, emphasizing methodological trends and limitations.

## Key findings

- Most studies used supervised ML techniques like random forest and extreme gradient boosting.
- Common sociodemographic variables included age, gender, education, income, and geographic location.
- External validation and integration of clinical data remain limited in current research.

## Abstract

Cancer remains one of the foremost global causes of mortality, with nearly 10 million deaths recorded in 2020. As incidence rates rise, there is growing interest in leveraging machine learning (ML) to enhance prediction, diagnosis, and treatment strategies. Despite these advancements, insufficient attention has been directed toward the integration of sociodemographic variables, which are crucial determinants of health equity, into ML models in oncology.

This review aims to investigate how ML techniques have been used to identify patterns of predictive association between sociodemographic factors and cancer-related outcomes. Specifically, it seeks to map current research by detailing the types of algorithms used, the sociodemographic variables examined, and the validation methodologies applied.

We conducted a systematic literature review in accordance with the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines. Searches were executed across 6 databases, focusing on primary studies that used ML to investigate the association between sociodemographic characteristics and cancer-related outcomes. The search strategy was informed by the PICO (population, intervention, comparison, and outcome) framework, and a set of predefined inclusion criteria was used to screen the studies. The methodological quality of each included paper was assessed.

Of the 328 records examined, 19 satisfied the inclusion criteria. The majority of studies used supervised ML techniques, with random forest and extreme gradient boosting being the most commonly used. Frequently analyzed variables included age, sex, education level, income, and geographic location. Cross-validation was the predominant method for evaluating model performance. Nevertheless, the integration of clinical and sociodemographic data was limited, and efforts toward external validation were infrequent.

ML holds significant potential for discerning patterns associated with the social determinants of cancer. Nevertheless, research in this domain remains fragmented and inconsistent. Future investigations should prioritize the integration of contextual factors, enhance model transparency, and bolster external validation. These measures are crucial for the development of more equitable, generalizable, and actionable ML applications in cancer care.
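The abstract names cross-validation as the predominant evaluation method but gives no implementation detail. The following is a minimal sketch of k-fold cross-validation using only the Python standard library, not code from the review or from any included study. All names (`k_fold_indices`, `cross_validate`, `fit_majority`, `predict_majority`) are hypothetical, the data are synthetic, and the majority-class learner is only a placeholder for a model such as random forest.

```python
import random
from collections import Counter


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle row indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(features, labels, fit, predict, k=5, seed=0):
    """Return the held-out accuracy obtained on each of the k folds."""
    scores = []
    for held_out in k_fold_indices(len(labels), k, seed):
        held = set(held_out)
        train = [i for i in range(len(labels)) if i not in held]
        # Fit on the k-1 training folds only; the held-out fold stays unseen.
        model = fit([features[i] for i in train], [labels[i] for i in train])
        correct = sum(predict(model, features[i]) == labels[i] for i in held_out)
        scores.append(correct / len(held_out))
    return scores


# Placeholder learner (assumption): always predicts the most common training label.
def fit_majority(train_features, train_labels):
    return Counter(train_labels).most_common(1)[0][0]


def predict_majority(model, row):
    return model


# Synthetic toy rows (age, income bracket); not data from any reviewed study.
toy_features = [(30 + i, i % 3) for i in range(20)]
toy_labels = [1 if i % 4 == 0 else 0 for i in range(20)]

fold_scores = cross_validate(
    toy_features, toy_labels, fit_majority, predict_majority, k=5
)
print(fold_scores)
```

Every fold here is drawn from the same dataset, so the scores describe internal validation only. External validation, which the review reports as infrequent, would instead score the fitted model on an independently collected dataset.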

## Linked entities

- **Diseases:** cancer (MONDO:0004992)

## Full-text entities

- **Diseases:** deaths (MESH:D003643), Cancer (MESH:D009369)

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12851563/full.md

---
Source: https://tomesphere.com/paper/PMC12851563