# Risk factors and prediction of distant metastasis (DM) of colon adenocarcinoma: a logistic regression and machine learning study based on surveillance, epidemiology, and end results (SEER) database

**Authors:** Qiang Guo, Junyun Li, Zhe Wei, Jingjing Xu, Shaojun Duan, Jianfeng Li, Yaxi Liu

PMC · DOI: 10.1186/s12885-025-14329-z · BMC Cancer · 2025-07-01

## TL;DR

This study uses logistic regression to identify risk factors for distant metastasis in colon adenocarcinoma and compares 12 machine learning models for predicting it, training on SEER data and validating externally on a cohort from Jincheng People’s Hospital.

## Contribution

The study builds and externally validates machine learning models that predict distant metastasis in colon adenocarcinoma from pathological and laboratory variables rather than from imaging.

## Key findings

- Logistic regression identified 8 independent risk factors for distant metastasis in colon adenocarcinoma.
- The LR model achieved an AUC of 0.892 on the test set and 0.868 on the external validation set.
- Of the 12 machine learning models compared, LR showed the best predictive efficacy on both the test set and the external validation set.
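
AUC, sensitivity, and specificity are the metrics this paper reports, and they can be computed from true labels and predicted probabilities. The following is a minimal sketch, not the authors' code: the function names, the toy labels and scores, and the 0.5 decision threshold are all assumptions made for illustration.

```python
def sensitivity_specificity(y_true, y_score, threshold=0.5):
    """Return (sensitivity, specificity) at a given probability threshold."""
    tp = fn = tn = fp = 0
    for y, s in zip(y_true, y_score):
        pred = 1 if s >= threshold else 0
        if y == 1 and pred == 1:
            tp += 1
        elif y == 1 and pred == 0:
            fn += 1
        elif y == 0 and pred == 0:
            tn += 1
        else:
            fp += 1
    return tp / (tp + fn), tn / (tn + fp)


def auc(y_true, y_score):
    """AUC as the probability that a random positive case outscores a
    random negative case, with ties counted as one half."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 1 = distant metastasis, 0 = none.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_score = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1]

sens, spec = sensitivity_specificity(y_true, y_score)
roc_auc = auc(y_true, y_score)
```

Sensitivity and specificity depend on the chosen threshold, while AUC summarizes ranking quality across all thresholds, which is why the two kinds of figure are usually reported together.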

## Abstract

Given the limitations of traditional imaging examinations in detecting distant metastasis (DM) (e.g., low sensitivity), this study aims to identify pathological and laboratory risk factors and to establish models that predict distant metastasis in colon adenocarcinoma (CA) patients.

CA patients diagnosed between 2018 and 2021 were retrieved from SEER. Logistic regression was used to find independent risk factors (IRFs) of DM, and 12 models were established and evaluated on the training set and test set (7:3) of the retrieved patients: BNB (Bernoulli naïve Bayes), DT (Decision tree), GBC (Gradient Boosting Classifier), GNB (Gaussian naïve Bayes), KNN (K-nearest neighbor), LDA (Linear Discriminant Analysis), LR (Logistic regression), MLP (Multi-layer perceptron classifier), MNB (Multinomial naïve Bayes), QDA (Quadratic discriminant analysis), RFC (Random forest classifier) and SVC (Support vector machine). Additionally, CA patient data were collected from Jincheng People’s Hospital (JCPH) as an external validation set for the prediction efficacy of the models.
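
A 7:3 train/test split of the kind described here can be sketched as follows. The abstract does not say whether the split was stratified by DM status, so the stratification, the fixed seed, and the function name `stratified_split` are assumptions for illustration. The labels are synthetic, not SEER data.

```python
import random


def stratified_split(labels, train_frac=0.7, seed=0):
    """Split row indices into train/test parts, keeping the class ratio
    roughly equal in both parts."""
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)

    train, test = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)                      # randomize within each class
        cut = round(len(idxs) * train_frac)    # 70% of this class to train
        train.extend(idxs[:cut])
        test.extend(idxs[cut:])
    return sorted(train), sorted(test)


# Synthetic labels (hypothetical 10% prevalence): 1 = DM, 0 = no DM.
labels = [1] * 100 + [0] * 900

train_idx, test_idx = stratified_split(labels)
train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
```

Stratifying keeps the DM prevalence the same in both parts, so test-set metrics are not distorted by a chance imbalance between the splits.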

In total, 7,000 CA patients were retrieved from SEER and 83 from JCPH. Eight IRFs were identified: age 60–79 (OR = 0.589, 95% CI: 0.391–0.887) and age > 80 (OR = 0.456, 95% CI: 0.287–0.722); primary site – cecum (OR = 1.305, 95% CI: 1.023–1.664); TNM stage – T3 (OR = 8.869, 95% CI: 2.151–36.569) and T4 (OR = 15.912, 95% CI: 3.839–65.955); TNM stage – N1 (OR = 3.853, 95% CI: 2.919–5.087) and N2 (OR = 8.480, 95% CI: 6.322–11.374); number of regional nodes examined > 12 (OR = 0.439, 95% CI: 0.326–0.591); tumor deposits (OR = 1.989, 95% CI: 1.639–2.414); carcinoembryonic antigen (CEA) level (OR = 4.552, 95% CI: 3.747–5.530); and perineural invasion (OR = 1.352, 95% CI: 1.112–1.643). LR showed the best predictive efficacy on both the test set (AUC = 0.892, sensitivity = 0.825, specificity = 0.801) and the external validation set (AUC = 0.868, sensitivity = 1.000, specificity = 0.727).
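
An odds ratio from logistic regression is an exponentiated coefficient, and a Wald 95% CI is symmetric on the log scale, so a reported OR should sit near the geometric mean of its CI bounds. The sketch below checks this for the CEA estimate and backs out the implied coefficient and standard error. It assumes the intervals are Wald intervals with z = 1.96, which the abstract does not state, and the helper name `log_odds_summary` is hypothetical.

```python
import math


def log_odds_summary(odds_ratio, ci_low, ci_high, z=1.96):
    """Return (coefficient, standard error, geometric mean of CI bounds)
    implied by an odds ratio and its CI, assuming a Wald interval."""
    beta = math.log(odds_ratio)                           # log-odds coefficient
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z)  # CI width on log scale
    geo_mean = math.sqrt(ci_low * ci_high)                # should be close to the OR
    return beta, se, geo_mean


# CEA level as reported in the abstract: OR = 4.552, 95% CI 3.747-5.530.
beta, se, geo_mean = log_odds_summary(4.552, 3.747, 5.530)
```

The same check applies to the other estimates. An OR below 1, such as the one for more than 12 regional nodes examined, corresponds to a negative coefficient, that is, lower odds of DM.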

Machine learning is a promising way to assist in detecting DM in CA patients.

## Linked entities

- **Diseases:** colon adenocarcinoma (MONDO:0002271)

## Full-text entities

- **Diseases:** CA (MESH:D003110), DM (MESH:D009362)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12211135/full.md

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12211135/full.md

## References

6 references — full list in the complete paper: https://tomesphere.com/paper/PMC12211135/full.md

---
Source: https://tomesphere.com/paper/PMC12211135