# Machine learning-based prediction of invasiveness in lung adenocarcinoma presenting as ground-glass nodules using radiomics and clinical CT features

**Authors:** Mingzhi Lin, Longqian Li, Yiming Hui, Bin Li, Yue Li, ChongRui Li, Zhizhong Zheng, Zhuowen Yang

PMC · DOI: 10.1186/s12885-025-14983-3 · 2025-11-03

## TL;DR

This study applies machine learning to CT radiomics and clinical CT features to predict the invasiveness of lung adenocarcinoma presenting as ground-glass nodules, with the aim of supporting preoperative decision-making.

## Contribution

A novel machine learning framework combining radiomics and clinical CT features to predict lung adenocarcinoma invasiveness in ground-glass nodules.

## Key findings

- The Random Forest model achieved an AUC of 0.854 in the training cohort, 0.769 in the test cohort, and 0.778 in the external validation cohort for predicting invasiveness.
- PCA-derived radiomic components and clinical CT features were key predictors in the best-performing model.
- The model outperformed the clinical-only models and the LASSO-based radiomics model in predictive ability.

## Abstract

Lung adenocarcinoma (LA), the predominant histological subtype of lung cancer, frequently manifests as ground-glass nodules (GGNs) on computed tomography. Preoperative discrimination of invasiveness, which is critical for guiding surgical and therapeutic decisions, remains challenging due to subjective radiological assessment and the limited sensitivity of conventional methods. This multicenter study aimed to develop a robust, non-invasive predictive framework integrating radiomics and clinical CT features using machine learning (ML) to stratify GGN-associated LA invasiveness.

A retrospective dual-cohort analysis was conducted on 357 patients with pathologically confirmed LA. The primary cohort (n = 312) was randomly divided into a training cohort (n = 249) and a test cohort (n = 63) at an 8:2 ratio. The external validation cohort consisted of 45 patients. Radiomics features (n = 1129) were extracted from high-resolution CT (HRCT), and clinical CT features (n = 16) were evaluated by blinded radiologists. Principal component analysis (PCA) and the least absolute shrinkage and selection operator (LASSO) were each used for dimensionality reduction of the radiomics features, and five ML algorithms (XGBoost, SVM, Random Forest, Logistic Regression, LightGBM) were trained to predict invasiveness (low: minimally invasive adenocarcinoma/Grade 1 invasive adenocarcinoma; high: Grade 2/3 invasive adenocarcinoma). Model performance was assessed using the area under the receiver operating characteristic curve (AUC), sensitivity, specificity, and decision curve analysis. Calibration curves were plotted, and SHapley Additive exPlanations (SHAP) was used to interpret the predictive models.
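The evaluation metrics named above can be illustrated with a minimal pure-Python sketch. It is not the authors' code: the scores and labels are made-up toy values, the function names are hypothetical, and the net-benefit formula is the standard decision curve analysis definition (true positives per patient minus false positives per patient weighted by the threshold odds), assumed here rather than taken from the paper.

```python
def auc(scores, labels):
    """AUC as the probability that a random positive case scores higher
    than a random negative case (ties count as half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def confusion(scores, labels, threshold):
    """Count TP, FP, TN, FN when predicting 'high invasiveness' for score >= threshold."""
    tp = fp = tn = fn = 0
    for s, y in zip(scores, labels):
        predicted_high = s >= threshold
        if predicted_high and y == 1:
            tp += 1
        elif predicted_high and y == 0:
            fp += 1
        elif not predicted_high and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def sensitivity_specificity(scores, labels, threshold):
    tp, fp, tn, fn = confusion(scores, labels, threshold)
    return tp / (tp + fn), tn / (tn + fp)


def net_benefit(scores, labels, threshold):
    """Decision curve analysis net benefit at one threshold probability."""
    tp, fp, _, _ = confusion(scores, labels, threshold)
    n = len(labels)
    return tp / n - (fp / n) * (threshold / (1.0 - threshold))


# Toy example only (1 = high invasiveness, 0 = low invasiveness).
toy_labels = [1, 1, 1, 0, 0, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.7, 0.1]

toy_auc = auc(toy_scores, toy_labels)
toy_sens, toy_spec = sensitivity_specificity(toy_scores, toy_labels, 0.5)
toy_nb = net_benefit(toy_scores, toy_labels, 0.5)
```

Sweeping `threshold` over a range of values and plotting `net_benefit` against it gives the decision curve; comparing it with the "treat all" and "treat none" strategies shows over which thresholds a model would add clinical value.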

Among the models evaluated, the Random Forest (RF) model built on clinical CT features and PCA-reduced radiomics performed best, with AUCs of 0.854 in the training cohort, 0.769 in the test cohort, and 0.778 in the external validation cohort. Key predictive features included PCA-derived radiomic components and clinical CT features. The Clinical CT Features-PCA Radiomics RF model significantly outperformed the clinical-only models and the Clinical CT Features-LASSO Radiomics model, showing superior predictive ability.
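A model of this shape, PCA-reduced radiomics concatenated with clinical CT features and fed to a Random Forest, might be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the study's implementation: the data are synthetic random numbers that only mirror the feature counts in the abstract, and the number of principal components (10), the number of trees (200), and the standardization step are assumptions, since the abstract does not report them.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT the study's data); shapes follow the abstract.
n_patients, n_radiomics, n_clinical = 312, 1129, 16
X_rad = rng.normal(size=(n_patients, n_radiomics))
X_clin = rng.normal(size=(n_patients, n_clinical))
signal = X_rad[:, :5].sum(axis=1) + X_clin[:, 0] + rng.normal(size=n_patients)
y = (signal > 0).astype(int)  # 1 = high invasiveness, 0 = low (toy labels)
X = np.hstack([X_rad, X_clin])

rad_cols = list(range(n_radiomics))
clin_cols = list(range(n_radiomics, n_radiomics + n_clinical))

# Radiomics branch: standardize, then reduce with PCA (10 components assumed).
radiomics_branch = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=10, random_state=0)),
])

# Clinical CT features pass through unchanged and are concatenated
# with the principal components.
features = ColumnTransformer([
    ("radiomics", radiomics_branch, rad_cols),
    ("clinical", "passthrough", clin_cols),
])

model = Pipeline([
    ("features", features),
    ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
])

# 8:2 split of the primary cohort, as described in the abstract.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Scaling and PCA are fitted on the training cohort only, inside the pipeline.
model.fit(X_train, y_train)
test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
```

Wrapping the scaler and PCA inside the pipeline keeps the test cohort out of the dimensionality-reduction fit, which avoids leaking test information into the reported AUC. Swapping the PCA step for a LASSO-based selector would give the comparison model described above.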

Integration of radiomics and clinical CT features via ML, particularly RF, enables accurate preoperative prediction of LA invasiveness in GGNs. This approach enhances objectivity over conventional radiological assessment and may optimize personalized treatment strategies. Further validation in larger, prospective cohorts is warranted to confirm clinical utility.

The online version contains supplementary material available at 10.1186/s12885-025-14983-3.

## Linked entities

- **Diseases:** lung adenocarcinoma (MONDO:0005061), lung cancer (MONDO:0005138)

## Full-text entities

- **Diseases:** lung adenocarcinoma (MESH:D000077192)

## Figures

7 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12581264/full.md

---
Source: https://tomesphere.com/paper/PMC12581264