# Training Set Augmentation and Harmonization Enables Radiomic Models to Detect Early Onset of Lung Cancer

**Authors:** Claire Huchthausen, Menglin Shi, Gabriel L.A. Sousa, James Larner, Einsley Janowski, Jonathan Colen, Krishni Wijesooriya

PMC · DOI: 10.21203/rs.3.rs-7350820/v1 · Research Square · 2025-09-29

## TL;DR

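Augmenting the training set with later-development pulmonary nodules, and harmonizing radiomic features with a covariate that distinguishes the training datasets, raised test ROC-AUC for early lung cancer detection from CT scans from near chance to 0.72.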

## Contribution

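The study combines training set augmentation using later-development pulmonary nodules with biology-aware harmonization, which corrects radiomic features for acquisition effects while accounting for biological differences between the training datasets.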

## Key findings

- Models trained on augmented data harmonized with a covariate distinguishing the early-development, benign augmentation, and malignant augmentation datasets reached a test ROC-AUC of 0.72 [0.67–0.76], a significant improvement (DeLong test, adjusted p ≤ 0.05).
- Models trained on augmented data harmonized without biological distinction failed to improve.
- Harmonizing each dataset separately also gave a significant improvement, with a slightly lower test ROC-AUC of 0.69 [0.63–0.74].
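The ROC-AUC values and bracketed intervals above can be illustrated with a minimal sketch. It computes ROC-AUC as the probability that a randomly chosen malignant case scores higher than a randomly chosen benign one, and attaches a percentile bootstrap interval. The function names `roc_auc` and `bootstrap_auc_ci` are hypothetical, and the bootstrap is an assumption: this summary does not state how the paper derived its intervals.

```python
import random


def roc_auc(labels, scores):
    """ROC-AUC as P(score of a positive > score of a negative); ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for ROC-AUC (an assumed method, not the paper's)."""
    rng = random.Random(seed)
    n = len(labels)
    aucs = []
    while len(aucs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain only one class
            aucs.append(roc_auc(ys, [scores[i] for i in idx]))
    aucs.sort()
    lower = aucs[int((alpha / 2) * n_boot)]
    upper = aucs[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Labels are 1 for malignant and 0 for benign; scores are model outputs.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

The pairwise loop is quadratic in the number of cases, which is acceptable for cohorts of a few hundred nodules. Comparing two models' ROC-AUCs on the same test set, as the DeLong test does, needs a paired procedure that this sketch does not cover.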

## Abstract

Radiomics-based machine learning models have the potential to detect lung cancer at inception from CT scans and transform patient outcomes. Low malignancy rates in early-development pulmonary nodules (PNs) and variable image acquisition hinder development of clinically applicable radiomics-based early detection models. To address these challenges, we augmented training using later-development PNs and harmonized for acquisition effects. We first trained machine learning models to predict PN malignancy using radiomic features from scans of early-development benign and malignant PNs (n = 187) harmonized using ComBat. Observing near-chance performance, we augmented training with later-development benign and malignant PNs (n = 225). We evaluated whether harmonization must incorporate biological differences that impact acquisition effects in added training data. To correct features for variability in four acquisition parameters, we compared: 1) harmonization without biological distinction, 2) harmonizing with a covariate distinguishing early-development, benign augmentation, and malignant augmentation training datasets, 3) harmonizing each dataset separately. Models trained using augmented data harmonized without biological distinction failed to improve. Models trained on augmented data harmonized with a covariate (ROC-AUC 0.72 [0.67–0.76]) or separately (ROC-AUC 0.69 [0.63–0.74]) achieved significantly higher test ROC-AUC (DeLong test, adjusted p ≤ 0.05). Our findings lay groundwork for clinically viable radiomics tools harnessing routine screening imaging for lung cancer early detection.
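The three harmonization strategies compared in the abstract can be sketched for a single radiomic feature as follows. This is a deliberately simplified location-scale adjustment, not ComBat itself: it omits ComBat's empirical Bayes shrinkage and handles the covariate by a one-pass removal of cohort means instead of a joint regression. All function names, and the use of one batch label per scan in place of the paper's four acquisition parameters, are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev, pvariance


def _groups(keys):
    """Map each distinct key to the list of positions where it occurs."""
    idx = defaultdict(list)
    for i, k in enumerate(keys):
        idx[k].append(i)
    return idx


def align_batches(values, batches):
    """Strategy 1: shift each batch to the grand mean and rescale it to the
    pooled within-batch standard deviation, ignoring biology."""
    groups = _groups(batches)
    grand_mean = mean(values)
    within_var = sum(
        len(idx) * pvariance([values[i] for i in idx]) for idx in groups.values()
    ) / len(values)
    target_sd = within_var ** 0.5
    out = list(values)
    for idx in groups.values():
        batch = [values[i] for i in idx]
        b_mean, b_sd = mean(batch), pstdev(batch)
        scale = target_sd / b_sd if b_sd > 0 else 1.0
        for i in idx:
            out[i] = (values[i] - b_mean) * scale + grand_mean
    return out


def align_with_covariate(values, batches, cohort):
    """Strategy 2: remove cohort means, align the residuals across batches,
    then restore the cohort means so biological differences are kept."""
    cohort_mean = {
        c: mean(values[i] for i in idx) for c, idx in _groups(cohort).items()
    }
    resid = [v - cohort_mean[c] for v, c in zip(values, cohort)]
    adjusted = align_batches(resid, batches)
    return [a + cohort_mean[c] for a, c in zip(adjusted, cohort)]


def align_separately(values, batches, cohort):
    """Strategy 3: run the batch alignment independently inside each cohort."""
    out = list(values)
    for idx in _groups(cohort).values():
        sub = align_batches([values[i] for i in idx], [batches[i] for i in idx])
        for i, v in zip(idx, sub):
            out[i] = v
    return out
```

Here `cohort` would take one of three labels per scan (early-development, benign augmentation, malignant augmentation) and `batches` an acquisition setting. When a cohort and a batch are fully confounded, none of these simplified adjustments can separate the two effects.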

## Linked entities

- **Diseases:** lung cancer (MONDO:0005138)

## Full-text entities

- **Diseases:** Lung Cancer (MESH:D008175), PNs (MESH:D055613), PN malignancy (MESH:C565820), malignancy (MESH:D009369)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12622160/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12622160/full.md

## References

37 references — full list in the complete paper: https://tomesphere.com/paper/PMC12622160/full.md

---
Source: https://tomesphere.com/paper/PMC12622160