# Machine Learning Model for Predicting Pathological Invasiveness of Pulmonary Ground‐Glass Nodules Based on AI‐Extracted Radiomic Features

**Authors:** Guozhen Yang, Yuanheng Huang, Huiguo Chen, Weibin Wu, Yonghui Wu, Kai Zhang, Xiaojun Li, Jiannan Xu, Jian Zhang

PMC · DOI: 10.1111/1759-7714.70128 · Thoracic Cancer · 2025-07-31

## TL;DR

This study developed a machine learning model that uses AI-extracted CT radiomic features to predict whether pulmonary ground-glass nodules are invasive, supporting preoperative surgical decisions.

## Contribution

A simplified machine learning model using AI-extracted radiomic features for preoperative risk stratification of pulmonary ground-glass nodules.

## Key findings

- The Gradient Boosting Machine model achieved an AUC of 0.965 in external validation for predicting nodule invasiveness.
- Median CT value and skewness were identified as the most influential predictors of invasiveness.
- The model demonstrated strong accuracy (88.1%) and F1 score (0.87) in external validation.

## Abstract

With the widespread adoption of low‐dose CT screening, the detection of pulmonary ground‐glass nodules (GGNs) has risen markedly, presenting diagnostic challenges in distinguishing preinvasive lesions from invasive adenocarcinomas (IAC). This study aimed to develop a machine learning (ML)–based model using artificial intelligence (AI)‐extracted CT radiomic features to predict the invasiveness of GGNs.

A retrospective cohort of 285 patients (148 with preinvasive lesions, 137 with IAC) from the Lingnan Campus was divided into training and internal validation sets (8:2). An independent cohort of 210 patients (118 with preinvasive lesions, 92 with IAC) from the Tianhe Campus served as external validation. Nineteen radiomic features were extracted and filtered using Boruta and LASSO algorithms. Seven ML classifiers were evaluated using AUC‐ROC, decision curve analysis (DCA), and SHAP interpretability.
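The 8:2 split and AUC-based evaluation described above can be illustrated with a minimal standard-library sketch. It is not the authors' pipeline: the summary does not state the software used, and Boruta, LASSO, and the seven classifiers are omitted. The helper names `stratified_split` and `roc_auc`, the seed, and the toy scores are assumptions for illustration; only the class counts (148 preinvasive, 137 IAC) come from the abstract.

```python
import random


def stratified_split(labels, train_frac=0.8, seed=0):
    """Split indices into train/validation, keeping class proportions.

    Hypothetical helper: the paper does not say whether its 8:2 split
    was stratified; stratification is an assumption made here.
    """
    rng = random.Random(seed)
    train_idx, valid_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = round(train_frac * len(idx))
        train_idx.extend(idx[:cut])
        valid_idx.extend(idx[cut:])
    return sorted(train_idx), sorted(valid_idx)


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive case scores higher
    than a random negative case (ties count as one half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Class counts from the abstract: 0 = preinvasive, 1 = IAC.
labels = [0] * 148 + [1] * 137
train_idx, valid_idx = stratified_split(labels)

# Toy scores (not study data) to show the AUC computation.
toy_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The pairwise definition of AUC is quadratic in the number of cases, which is acceptable for cohorts of a few hundred patients; a rank-based formula would be preferable for larger data.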

Median CT value, skewness, 3D long‐axis diameter, and transverse diameter were ultimately selected for model construction. Among all classifiers, the Gradient Boosting Machine (GBM) model achieved the best performance (AUC = 0.965 training, 0.908 internal validation, and 0.965 external validation). It demonstrated strong accuracy (88.1%), specificity (80.7%), and F1 score (0.87) in the external validation cohort. The GBM model demonstrated superior net clinical benefit. SHAP analysis identified median CT value and skewness as the most influential predictors.
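For readers less familiar with the reported metrics, the sketch below shows how accuracy, specificity, and F1 score follow from confusion-matrix counts, and how the net benefit plotted in a decision curve is commonly defined at a given threshold probability. The function names and the counts are hypothetical and are not taken from the study; the net-benefit formula is the widely used standard definition, assumed here rather than confirmed by the summary.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, specificity, and F1 from confusion counts.

    Assumes every denominator is nonzero (both classes present and at
    least one positive prediction).
    """
    total = tp + fp + tn + fn
    sensitivity = tp / (tp + fn)   # recall for the invasive class
    specificity = tn / (tn + fp)   # true-negative rate
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


def net_benefit(tp, fp, n, threshold):
    """Net benefit at a threshold probability, as used in decision
    curve analysis: TP/n - FP/n * pt / (1 - pt)."""
    if not 0 < threshold < 1:
        raise ValueError("threshold must lie strictly between 0 and 1")
    return tp / n - (fp / n) * threshold / (1 - threshold)


# Hypothetical counts for illustration only (not the paper's results).
metrics = classification_metrics(tp=40, fp=10, tn=40, fn=10)
nb_half = net_benefit(tp=40, fp=10, n=100, threshold=0.5)
nb_low = net_benefit(tp=40, fp=10, n=100, threshold=0.2)
```

A decision curve repeats the `net_benefit` calculation over a range of thresholds and compares the model against the "treat all" and "treat none" strategies.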

This study presents a simplified ML model using AI‐extracted radiomic features, which has strong predictive performance and biological interpretability for preoperative risk stratification of GGNs. By leveraging median CT value, skewness, 3D long‐axis diameter, and transverse diameter, the model enables accurate and noninvasive differentiation between IAC and indolent lesions, supporting precise surgical planning.

A retrospective cohort of 495 patients with pathologically confirmed GGNs (AAH/AIS/MIA: 266; IAC: 229) from two centers was analyzed. AI‐extracted CT radiomic features (n = 19) were processed via a two‐stage feature selection (Boruta algorithm + LASSO regression). Seven ML models were trained/validated using an 8:2 internal split and external cohort (n = 210).

Four features were selected: median CT value, skewness, 3D long diameter, and transverse diameter. The Gradient Boosting Machine (GBM) model achieved superior performance:

- Training cohort (n = 285): AUC 0.965 (95% CI: 0.944–0.985)
- Internal validation (n = 114): AUC 0.908 (95% CI: 0.824–0.992)
- External validation (n = 210): AUC 0.965 (95% CI: 0.945–0.984)

SHAP analysis confirmed median CT value as the top predictor (p < 0.001). Decision curve analysis demonstrated clinical utility.

An AI‐derived radiomic model accurately stratifies GGN invasiveness using four CT features. Its high performance in multi‐center validation supports integration into preoperative workflows to personalize surgical management.

## Linked entities

- **Diseases:** adenocarcinomas (MONDO:0004970)

## Full-text entities

- **Genes:** SHROOM4 (shroom family member 4) [NCBI Gene 57477] {aka MRXSSDS, SHAP, shrm4}, CEACAM3 (CEA cell adhesion molecule 3) [NCBI Gene 1084] {aka CD66D, CEA, CGM1, CGM1a, W264, W282}, KRT19 (keratin 19) [NCBI Gene 3880] {aka CK19, K19, K1CS}, SERPINB3 (serpin family B member 3) [NCBI Gene 6317] {aka HsT1196, SCC, SCCA-1, SCCA-PD, SCCA1, SSCA1}, ENO2 (enolase 2) [NCBI Gene 2026] {aka HEL-S-279, NSE}
- **Diseases:** calcifications (MESH:D002114), adenomatous precursor lesions (MESH:D011125), Lung (MESH:D008171), GGNs (MESH:C000721427), AI (MESH:C538142), pulmonary nodule (MESH:D055613), thoracic tumors (MESH:D013899), lung cancer (MESH:D008175), AAH (MESH:D004714), necrosis (MESH:D009336), GBM (MESH:D000141), malignancy (MESH:D009369), IAC (MESH:D000230), fibrosis (MESH:D005355), nodule (MESH:D016606), AIS (MESH:D065311)
- **Chemicals:** paraffin (MESH:D010232)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12313823/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12313823/full.md

## References

39 references — full list in the complete paper: https://tomesphere.com/paper/PMC12313823/full.md

---
Source: https://tomesphere.com/paper/PMC12313823