# Generation of machine-learning derived cancer vulnerability indicator to determine the spatial burden of cancer outcomes

**Authors:** Kou Kou, Jessica Cameron, Paramita Dasgupta, Hao Chen, Peter D. Baade, Godwin Banafo Akrong

PMC · DOI: 10.1371/journal.pone.0319539 · PLOS One · 2026-02-20

## TL;DR

This study builds a lung cancer vulnerability index (LcVI) from area-level data to help explain geographic variation in lung cancer incidence.

## Contribution

A novel non-parametric dimensionality reduction approach generates a cancer vulnerability index from high-dimensional data.

## Key findings

- The top predictors for lung cancer incidence were diabetes prevalence and adequate fruit intake.
- The LcVI explained 57% of the variation in cancer incidence rates across geographic areas.
- Areas with incidence rates below average had significantly lower LcVI scores than those with average rates.

## Abstract

Due to the difficulty of obtaining population-based individual-level data, ecological studies are often used to explore factors related to geographic variations in health outcomes. This study proposes a novel framework to identify area-level predictors of spatial variations in lung cancer outcomes and generate a lung cancer vulnerability index (LcVI) based on these predictors.

Data on 11,313 persons diagnosed with invasive lung cancer in Queensland, Australia (2016–2019) were sourced from the population-based Queensland Cancer Register. Bayesian spatial models estimated smoothed standardised incidence ratios (SIRs) for 519 geographic areas. Area-level variables (n = 911) were extracted from multiple data collections. Random forest models were fitted to identify important predictors for lung cancer incidence rates. A novel non-parametric dimensionality reduction approach incorporating the final random forest model results was developed to generate the LcVI, which ranged from 0 to 10.
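As a rough illustration of the standardised incidence ratio mentioned above, the sketch below computes a raw (unsmoothed) SIR by indirect standardisation: observed cases divided by the cases expected if the area experienced the reference population's age-specific rates. It is not the Bayesian spatial smoothing used in the paper. The function names, age bands, rates, and counts are all hypothetical values chosen for the example.

```python
from typing import Mapping


def expected_cases(area_population: Mapping[str, float],
                   reference_rates: Mapping[str, float]) -> float:
    """Cases expected if the area had the reference age-specific rates."""
    return sum(area_population[age] * reference_rates[age]
               for age in area_population)


def raw_sir(observed: float, expected: float) -> float:
    """Raw standardised incidence ratio: observed / expected."""
    if expected <= 0:
        raise ValueError("expected cases must be positive")
    return observed / expected


# Hypothetical toy inputs (not taken from the paper).
reference_rates = {"0-49": 0.0001, "50-69": 0.001, "70+": 0.004}  # cases per person
area_population = {"0-49": 6000, "50-69": 3000, "70+": 1000}      # persons

expected = expected_cases(area_population, reference_rates)
sir = raw_sir(observed=9, expected=expected)

# An SIR above 1 means more cases than expected under the reference rates.
print(f"expected = {expected:.2f}, SIR = {sir:.2f}")
```

Raw SIRs like this are unstable for areas with small populations, which is the usual motivation for smoothing them with a spatial model.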

Eight variables were identified as predictors for lung cancer incidence with the top two being the prevalence of diabetes and adequate fruit intake. Areas having incidence rates below the Queensland average had significantly lower LcVI than those with average incidence rates (mean difference = 2.80, 95% CI: 2.34–3.25, p < 0.001) while areas with above average incidence rates had significantly higher LcVI than those with average incidence (mean difference = 2.70, 95% CI: 2.20–3.19, p < 0.001). The LcVI was strongly associated with the continuous SIR, explaining 57% of the variation (R² = 0.57, p < 0.001).
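To make the "variation explained" figure concrete, the following minimal sketch fits an ordinary least-squares line of SIR on an index score and reports R², the share of variance in the SIR accounted for by the fitted line. It assumes a simple linear regression, which the abstract does not spell out, and the `lcvi` and `sir` values are invented toy data, not results from the study.

```python
from typing import Sequence


def r_squared(x: Sequence[float], y: Sequence[float]) -> float:
    """R^2 of the least-squares line y = intercept + slope * x."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0:
        raise ValueError("x must not be constant")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual and total sums of squares.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    if ss_tot == 0:
        raise ValueError("y must not be constant")
    return 1.0 - ss_res / ss_tot


# Hypothetical toy data: index scores (0-10) and SIRs for five areas.
lcvi = [1.0, 3.0, 5.0, 7.0, 9.0]
sir = [0.7, 0.9, 1.0, 1.2, 1.3]

r2 = r_squared(lcvi, sir)
print(f"R^2 = {r2:.2f}")
```

Under this reading, an R² of 0.57 would mean that a straight-line relationship with the index accounts for 57% of the between-area variance in the SIR, leaving 43% unexplained.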

This novel approach identified a small number of important predictors for lung cancer incidence from a high-dimensional dataset. The lung cancer vulnerability index partially explained the geographic variations, potentially offering insights into underlying drivers. As an ecological analysis, these associations reflect relationships at the population level. Future research incorporating individual-level data is needed to confirm whether the area-level associations observed here hold true for individuals.

## Linked entities

- **Diseases:** lung cancer (MONDO:0005138), diabetes (MONDO:0005015)

## Full-text entities

- **Diseases:** invasive cancers (MESH:D009362), skin cancer (MESH:D012878), Lung Cancer (MESH:D008175), diabetes (MESH:D003920), Cancer (MESH:D009369)
- **Species:** Homo sapiens (human, species) [taxon 9606], Nicotiana tabacum (American tobacco, species) [taxon 4097]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12923003/full.md

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12923003/full.md

## References

52 references — full list in the complete paper: https://tomesphere.com/paper/PMC12923003/full.md

---
Source: https://tomesphere.com/paper/PMC12923003