# Gene Expression-Based Colorectal Cancer Prediction Using Machine Learning and SHAP Analysis

**Authors:** Yulai Yin, Zhen Yang, Xueqing Li, Shuo Gong, Chen Xu

PMC · DOI: 10.3390/genes17010114 · 2026-01-20

## TL;DR

This study developed a ten-gene machine learning model that uses gene expression data to distinguish colorectal cancer from normal samples (validation AUC 0.9601), and it identified key genes that could aid early diagnosis.

## Contribution

The study builds a ten-gene CRC diagnostic model with XGBoost and reports strong predictive performance on both the training set (AUC 0.9875) and the validation set (AUC 0.9601).

## Key findings

- A genetic diagnostic model using ten genes achieved an AUC of 0.9875 in training and 0.9601 in validation.
- Among the nine machine learning algorithms compared, XGBoost achieved the highest AUC (0.990).
- SHAP analysis identified IFITM1 and DBNDD1 as the most influential genes in the model.
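The AUC figures above can be read as the probability that a randomly chosen tumour sample receives a higher model score than a randomly chosen normal sample. The sketch below computes that quantity directly from pairwise comparisons. It is only an illustration: the function name `roc_auc` and the toy labels and scores are assumptions made up for this example, not data or code from the paper.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as P(score of a positive > score of a negative); ties count as 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data: 1 = tumour, 0 = normal; scores are made-up model outputs.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
toy_auc = roc_auc(toy_labels, toy_scores)
print(f"AUC = {toy_auc:.3f}")  # 8 of 9 tumour/normal pairs are ranked correctly
```

The pairwise loop is quadratic in the number of samples, which is fine for a demonstration; library implementations typically use a sort-based equivalent.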

## Abstract

Objective: To develop and validate a genetic diagnostic model for colorectal cancer (CRC).

Methods: First, differentially expressed genes (DEGs) between colorectal cancer and normal groups were screened using the TCGA database. Subsequently, a two-sample Mendelian randomization analysis was performed using the eQTL genomic data from the IEU OpenGWAS database and colorectal cancer outcomes from the R12 Finnish database to identify associated genes. The genes identified by both methods were used to develop and validate the CRC genetic diagnostic model with nine machine learning algorithms: Lasso Regression, XGBoost, Gradient Boosting Machine (GBM), Generalized Linear Model (GLM), Neural Network (NN), Support Vector Machine (SVM), k-Nearest Neighbors (KNN), Random Forest (RF), and Decision Tree (DT).

Results: A total of 3716 DEGs were identified from the TCGA database, while 121 genes were associated with CRC based on the eQTL Mendelian randomization analysis. The intersection of these two gene sets yielded 27 genes. Among the nine machine learning methods, XGBoost achieved the highest AUC (0.990). The top five genes ranked by XGBoost (RIF1, GDPD5, DBNDD1, RCCD1, and CLDN5), along with the five most significantly differentially expressed genes in the GSE87211 dataset (ASCL2, IFITM3, IFITM1, SMPDL3A, and SUCLG2), were selected to construct the final CRC genetic diagnostic model. ROC curve analysis gave an AUC (95% CI) of 0.9875 (0.9737–0.9875) for the training set and 0.9601 (0.9145–0.9601) for the validation set, indicating strong predictive performance. SHAP model interpretation further identified IFITM1 and DBNDD1 as the most influential genes in the XGBoost model, with both making positive contributions to the model's predictions.

Conclusions: The gene expression profile in colorectal cancer is characterized by enhanced cell proliferation, elevated metabolic activity, and immune evasion. A genetic diagnostic model constructed from ten genes (RIF1, GDPD5, DBNDD1, RCCD1, CLDN5, ASCL2, IFITM3, IFITM1, SMPDL3A, and SUCLG2) demonstrates strong predictive performance. This model holds significant potential for the early diagnosis and intervention of colorectal cancer, contributing to the implementation of third-tier prevention strategies.
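SHAP attributes a model's prediction to its input features using Shapley values. As a rough illustration of the idea behind that interpretation step, the sketch below computes exact Shapley values for a tiny made-up scoring function by averaging each feature's marginal contribution over all feature orderings. Everything here is an assumption for illustration: `toy_model`, the feature names `gene_a`, `gene_b`, `gene_c`, and the all-zero baseline are hypothetical, and this brute-force approach is not the tree-specific algorithm typically used to explain XGBoost models.

```python
from itertools import permutations
from typing import Callable, Dict


def shapley_values(
    model: Callable[[Dict[str, float]], float],
    instance: Dict[str, float],
    baseline: Dict[str, float],
) -> Dict[str, float]:
    """Exact Shapley values by enumerating every feature ordering.

    Features absent from the coalition are held at their baseline value.
    Cost grows factorially, so this is only feasible for a few features.
    """
    features = list(instance)
    totals = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        current = dict(baseline)
        previous = model(current)
        for f in order:
            current[f] = instance[f]  # switch feature f from baseline to its real value
            updated = model(current)
            totals[f] += updated - previous  # marginal contribution in this ordering
            previous = updated
    return {f: total / len(orderings) for f, total in totals.items()}


def toy_model(x: Dict[str, float]) -> float:
    # Hypothetical score with one interaction term; coefficients are made up.
    return 2.0 * x["gene_a"] + 1.0 * x["gene_b"] + 1.0 * x["gene_a"] * x["gene_c"]


baseline = {"gene_a": 0.0, "gene_b": 0.0, "gene_c": 0.0}
instance = {"gene_a": 1.0, "gene_b": 1.0, "gene_c": 1.0}
phi = shapley_values(toy_model, instance, baseline)
print(phi)
```

The attributions sum to the difference between the prediction for the instance and the prediction for the baseline, which is the property that lets per-gene contributions be read as positive or negative pushes on the model output. The interaction term is split evenly between the two features involved.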

## Linked entities

- **Genes:** RIF1 (replication timing regulatory factor 1) [NCBI Gene 55183], GDPD5 (glycerophosphodiester phosphodiesterase domain containing 5) [NCBI Gene 81544], DBNDD1 (dysbindin domain containing 1) [NCBI Gene 79007], RCCD1 (RCC1 domain containing 1) [NCBI Gene 91433], CLDN5 (claudin 5) [NCBI Gene 7122], ASCL2 (achaete-scute family bHLH transcription factor 2) [NCBI Gene 430], IFITM3 (interferon induced transmembrane protein 3) [NCBI Gene 10410], IFITM1 (interferon induced transmembrane protein 1) [NCBI Gene 8519], SMPDL3A (sphingomyelin phosphodiesterase acid like 3A) [NCBI Gene 10924], SUCLG2 (succinate-CoA ligase GDP-forming subunit beta) [NCBI Gene 8801]
- **Diseases:** colorectal cancer (MONDO:0005575)

## Full-text entities

- **Genes:** RIF1 (replication timing regulatory factor 1) [NCBI Gene 55183], ASCL2 (achaete-scute family bHLH transcription factor 2) [NCBI Gene 430] {aka ASH2, HASH2, MASH2, bHLHa45}, SUCLG2 (succinate-CoA ligase GDP-forming subunit beta) [NCBI Gene 8801] {aka G-SCS, GBETA, GTPSCS}, SMPDL3A (sphingomyelin phosphodiesterase acid like 3A) [NCBI Gene 10924] {aka ASM3A, ASML3a, yR36GH4.1}, IFITM3 (interferon induced transmembrane protein 3) [NCBI Gene 10410] {aka 1-8U, DSPA2b, IP15}, GDPD5 (glycerophosphodiester phosphodiesterase domain containing 5) [NCBI Gene 81544] {aka GDE2, PP1665}, DBNDD1 (dysbindin domain containing 1) [NCBI Gene 79007], IFITM1 (interferon induced transmembrane protein 1) [NCBI Gene 8519] {aka 9-27, CD225, DSPA2a, IFI17, LEU13}, RCCD1 (RCC1 domain containing 1) [NCBI Gene 91433], CLDN5 (claudin 5) [NCBI Gene 7122] {aka AWAL, BEC1, CPETRL1, TMDVCF, TMVCF}
- **Diseases:** CRC (MESH:D015179)

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12841458/full.md

---
Source: https://tomesphere.com/paper/PMC12841458