# Predicting and identifying correlates of inequalities in breast cancer screening uptake using national level data from India

**Authors:** Aleena Tanveer, Raja Hashim Ali, Jitendra Majhi, Moumita Mukherjee

PMC · DOI: 10.3389/frai.2025.1729796 · Frontiers in Artificial Intelligence · 2026-01-20

## TL;DR

This study uses machine learning to identify factors contributing to low breast cancer screening rates in India, highlighting socioeconomic and structural inequalities.

## Contribution

The study's contribution is a novel use of machine learning to predict breast cancer screening uptake in India and to decompose the inequities in that uptake.

## Key findings

- Screening uptake is remarkably low (0.9%) with significant disparities across economic, educational, and social gradients.
- Random Forest and XGBoost achieved the highest predictive accuracy (96%, AUROC = 0.99), while Decision Tree generalized most stably under 10-fold cross-validation (mean AUROC = 0.995).
- Education, autonomy, and community health worker interactions were key factors explaining variability in screening uptake.

## Abstract

Despite national screening initiatives, coverage of breast cancer screening is low, and late-stage diagnosis remains a major contributor to mortality among Indian women. Accurate, precise, and actionable prediction of socioeconomic and structural inequities in screening uptake is critical for formulating equitable cancer control policies. This study aimed to apply machine learning to predict determinants of screening uptake, estimate inequalities in uptake and their concentration indices, and identify contributing factors to inequity using concentration index decomposition across economic, educational, and caste gradients.

The study used cross-sectional National Family Health Survey (NFHS-5) 2019–2021 data covering 68,526 women aged 30–49 years. Levesque's framework of healthcare access guided the selection of explanatory covariates across its approachability, acceptability, affordability, availability, and appropriateness dimensions. We trained three single learners (Logistic Regression (LR), Naïve Bayes (NB), and Decision Tree (DT)) and two ensemble learners (Random Forest (RF) and XGBoost (XGB)) on balanced weighted data. Given the risk of overfitting after the synthetic minority oversampling technique (SMOTE), predictive performance was validated using 10-fold cross-validation. Five evaluation metrics were compared to select the learner that best predicted screening uptake. Inequality was measured using conventional and algorithm-based concentration indices. It was then decomposed using algorithm-based feature importance and feature-specific inequality scores to estimate each factor's contribution to the economic, educational, and caste gradients in screening access.
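The abstract does not say whether oversampling was applied before or inside the cross-validation folds. One common safeguard against the overfitting risk it mentions is to rebalance only the training part of each fold, so that held-out rows are never duplicated or synthesized. The sketch below illustrates that idea under stated assumptions: it uses plain random duplication as a stand-in for SMOTE, the function names (`stratified_folds`, `oversample_minority`, `cv_splits`) are hypothetical, and the toy labels are invented. It is not the authors' pipeline.

```python
import random
from collections import Counter


def stratified_folds(labels, k, seed=0):
    """Split row indices into k folds, spreading each class evenly across folds."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for pos, i in enumerate(idxs):
            folds[pos % k].append(i)
    return folds


def oversample_minority(train_idx, labels, rng):
    """Duplicate minority-class rows at random until all classes are equal in size.

    Assumption: random duplication stands in for SMOTE, which would instead
    synthesize new minority rows by interpolating between neighbours.
    """
    by_class = {}
    for i in train_idx:
        by_class.setdefault(labels[i], []).append(i)
    target = max(len(idxs) for idxs in by_class.values())
    balanced = []
    for idxs in by_class.values():
        balanced.extend(idxs)
        balanced.extend(rng.choices(idxs, k=target - len(idxs)))
    return balanced


def cv_splits(labels, k=10, seed=0):
    """Yield (balanced_train_idx, untouched_test_idx) for each of the k folds."""
    rng = random.Random(seed)
    folds = stratified_folds(labels, k, seed)
    for f in range(k):
        test_idx = folds[f]
        train_idx = [i for g in range(k) if g != f for i in folds[g]]
        # Rebalance the training rows only; the test fold keeps its real prevalence.
        yield oversample_minority(train_idx, labels, rng), test_idx


# Invented toy labels: 20 screened (1) and 180 unscreened (0) women.
toy_labels = [1] * 20 + [0] * 180
for train_idx, test_idx in cv_splits(toy_labels, k=10, seed=1):
    train_counts = Counter(toy_labels[i] for i in train_idx)
    test_counts = Counter(toy_labels[i] for i in test_idx)
    print(dict(train_counts), dict(test_counts))
```

Each printed pair shows a balanced training set next to a test fold that still reflects the original class imbalance, which is the property that keeps the validation scores honest.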

Screening uptake in India is remarkably low (0.9%), with clear economic, educational, and social disparities. Although Random Forest and XGBoost achieved higher predictive accuracy (96%) and discrimination (AUROC = 0.99), Decision Tree generalized more stably (mean AUROC = 0.995) under 10-fold validation. Feature importance results indicate that education, autonomy, interactions with community health workers, and provincial and spatial features explain most of the variability. Proximity, transport availability, hesitancy in unaccompanied care seeking, and financial constraints were access barriers that contributed little to the variation in screening uptake. Concentration index estimates reflect a pro-rich (0.1, p < 0.001), pro-educated (0.182, p < 0.001), and pro-marginalized social gradient (−0.011, p < 0.05). Tree-based decomposition predicts that higher affordability and education deepen pro-rich and pro-educated inequalities but can be effective policy instruments for mitigating social position-based disparities if their contributions can be increased. Access-related barriers intensified inequality across all gradients, whereas factors that enable access flattened the gradients.
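To make the sign convention of these estimates concrete, a conventional concentration index can be computed with the standard covariance formula $C = 2\,\mathrm{cov}(h, r)/\mu$, where $h$ is the outcome (here, screened or not), $r$ is each woman's fractional rank in the socioeconomic distribution, and $\mu$ is the mean of $h$. A positive value means the outcome is concentrated among higher-ranked (for example, wealthier) women; a negative value means the opposite. The sketch below is a minimal unweighted version on invented data. The function names are hypothetical, survey weights are ignored, and it does not reproduce the authors' algorithm-based index or their significance tests.

```python
def fractional_ranks(values):
    """Return each item's fractional rank in (0, 1); tied values share their mean rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        while end + 1 < n and values[order[end + 1]] == values[order[start]]:
            end += 1
        # Sorted positions start..end are tied: average their (position + 0.5) / n ranks.
        shared = ((start + end) / 2 + 0.5) / n
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def concentration_index(outcome, ses):
    """Unweighted concentration index: 2 * cov(outcome, fractional rank) / mean(outcome)."""
    n = len(outcome)
    mean_outcome = sum(outcome) / n
    if mean_outcome == 0:
        raise ValueError("concentration index is undefined when the mean outcome is zero")
    ranks = fractional_ranks(ses)
    # The mean fractional rank is always 0.5, including when ties share a rank.
    cov = sum((h - mean_outcome) * (r - 0.5) for h, r in zip(outcome, ranks)) / n
    return 2 * cov / mean_outcome


# Invented toy data: ten women ordered by wealth; only the three wealthiest are screened.
wealth = list(range(1, 11))
screened = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
print(concentration_index(screened, wealth))        # positive: pro-rich
print(concentration_index(screened[::-1], wealth))  # negative: pro-poor
```

Ranking the same outcome by education or by caste position instead of wealth gives the index for the corresponding gradient.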

Machine learning models can improve decision making by predicting inequity in breast cancer screening uptake with greater accuracy and precision and by revealing the gradients and access barriers that shape uptake in India. The more explainable ML-based predictions suggest that several measures are likely to narrow economic and educational disparities in screening coverage: financial protection, spatial accessibility to health centers, access to education, autonomy, more contact with community health workers, and community-based awareness programs targeting poor, less educated, socially disadvantaged middle-aged women. Social gradients require deeper investigation.

## Linked entities

- **Diseases:** breast cancer (MONDO:0004989)

## Full-text entities

- **Genes:** SHROOM4 (shroom family member 4) [NCBI Gene 57477] {aka MRXSSDS, SHAP, shrm4}
- **Diseases:** Cancer (MESH:D009369), CI (MESH:C567712), Breast cancer (MESH:D001943), anxiety (MESH:D001007), NP-NCD (MESH:D000073296), STI (MESH:D012749), ML (MESH:C537366), cervical cancer (MESH:D002583), death (MESH:D003643)
- **Species:** Human immunodeficiency virus 1 (no rank) [taxon 11676], Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12820423/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12820423/full.md

## References

49 references — full list in the complete paper: https://tomesphere.com/paper/PMC12820423/full.md

---
Source: https://tomesphere.com/paper/PMC12820423