Identification and validation of butyrate metabolism-related biomarkers for colorectal cancer diagnosis
Miao Yu, Qian Chen, Tianhe Gu, Yiping Lu

TL;DR
This study identifies six biomarkers related to butyrate metabolism that could help diagnose colorectal cancer and improve understanding of its metabolic changes.
Contribution
The study introduces six validated butyrate metabolism-related biomarkers for colorectal cancer diagnosis using multi-omics and experimental validation.
Findings
Sixty-three differentially expressed butyrate metabolism-related genes were identified and enriched in CRC-related pathways.
Six hub biomarkers (CCND1, CXCL8, MMP3, MYC, TIMP1, and VEGFA) were validated using PPI, LASSO, and ROC analyses.
qRT-PCR confirmed upregulation of CCND1, CXCL8, MYC, and VEGFA in CRC cell lines.
Abstract
Unlike normal colon cells with butyrate acid as the main energy source, cancerous colon cells are more inclined to use glucose. However, the mechanisms of the investigation into the modulatory role of butyrate metabolism within the pathophysiology of colorectal cancer (CRC) remains insufficiently explored. The study analyzed four datasets (The Cancer Genome Atlas (TCGA)-COAD, TCGA-READ, GSE41258, and GSE39582) and gene sets related to butyrate metabolism-related genes (BMGs) in an integrated manner. Differentially expressed BMGs (DE-BMGs) were screened by overlapping BMGs, TCGA-DEGs between CRC and normal groups, and Gene Expression Omnibus (GEO)-differentially expressed genes (DEGs) between CRC and normal groups and were subjected to enrichment analysis. Hub genes were then screened via protein–protein interaction (PPI) network analysis. Biomarker selection was improved by applying…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
Figure 6
Figure 7
Figure 8
Figure 9- —Beijing Hospital Administration Center grants Beijing Municipal Hospital Research and Cultivation Program
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsColorectal Cancer Treatments and Studies · Colorectal Cancer Surgical Treatments · Cancer, Hypoxia, and Metabolism
Introduction
Colorectal cancer (CRC) is a significant global health burden, accounting for approximately 9.2% of all cancer-related deaths worldwide, with five- and 10-year survival rates of 65% and 58%, respectively (Bray et al., 2024). There is a rapid escalation in the incidence and mortality of CRC within the Chinese population (Morgan et al., 2023). Early detection remains a significant challenge due to the inherent limitations of current diagnostic methods (Zygulska & Pierzchalski, 2022). Epidemiological investigations have demonstrated a correlation between the rise in CRC incidence and progression and increased consumption of a meat-centric diet and a concurrent reduction in dietary fiber intake (Vernia et al., 2021). After digestion and absorption, the fibers flow to the colon and are fermented by intestinal microbes. Changes in food structure can affect the intestine by of gastrointestinal microorganisms affect CRC occurrence (Elbaylek & Ammor, 2024).
Butyrate, a short-chain fatty acid (SCFA) derived from the microbial fermentation of dietary fiber (Panebianco et al., 2022), is a crucial metabolite with protective roles against gastrointestinal diseases, particularly CRC (Guo et al., 2020). It serves as the primary energy source for normal colonocytes, reinforces intestinal barrier integrity, and exerts potent anti-inflammatory effects (Chen et al., 2020). Notably, a metabolic shift occurs in CRC, where cancerous cells preferentially utilize glucose over butyrate, leading to its accumulation (Avuthu & Guda, 2022). This shift may paradoxically enhance butyrate’s tumor-suppressive functions, including the inhibition of cell proliferation and the induction of apoptosis, thereby contributing to its protective role against CRC (Zhou et al., 2022).
Despite extensive research on butyrate’s general anticancer properties, a critical knowledge gap persists in systematically delineating the specific gene networks through which butyrate metabolism influences CRC pathogenesis. Previous transcriptomic studies have primarily focused on either general differentially expressed genes or broader microbiome composition, without specifically targeting the core gene regulatory machinery of butyrate metabolism. This limitation has hindered the development of precise biomarker signatures reflecting this metabolically crucial pathway.
To address this critical gap, our study employed an integrated bioinformatics approach to identify and characterize a novel butyrate metabolism-related gene (BMG) signature in CRC. Unlike previous transcriptomic analyses, we specifically interrogated genes within butyrate metabolic pathways using data from The Cancer Genome Atlas (TCGA). Through protein–protein interaction (PPI) network analysis and ingenuity pathway analysis (IPA), we identified a robust cohort of six candidate diagnostic genes. Furthermore, we constructed a competing endogenous RNA network to elucidate potential regulatory mechanisms and performed drug-target prediction analysis. This comprehensive strategy provides unprecedented insights into the gene regulatory networks influenced by butyrate metabolism in CRC, offering new perspectives for understanding its complex role in tumorigenesis.
Materials and Methods
Data source
The compilation of this study’s TCGA-CRC dataset entailed the amalgamation of transcriptomic and clinical data pertaining to colorectal (TCGA-COAD) and rectal (TCGA-READ) cancers, as procured from the TCGA repository (https://xenabrowser.net) on February 10, 2023. This dataset is comprehensive of RNA sequencing data from 638 CRC cases, with an extended subset of 717 instances encompassing survival and clinical particulars, in addition to 51 adjacent normal tissue samples from CRC patients. Concurrently, datasets GSE41258 and GSE39582, incorporating 186 and 443 CRC specimens, respectively, alongside 54 normal colon tissues from colon tumor patients and 19 non-neoplastic colorectal mucosal tissues from CRC patients, were retrieved from the Gene Expression Omnibus (GEO) (https://www.ncbi.nlm.nih.gov/geo/) on February 10, 2023. Subsequently, a cohort of 882 candidate BMGs were identified (Table S1), selected based on a relevance score of ≥5 in the GeneCards database (https://www.genecards.org/). This scoring system utilizes an algorithm that integrates multiple data sources, weighing factors such as gene–disease association evidence, expression patterns, and functional annotations to quantify relevance to the query term “butyrate metabolism”. The selected genes were further corroborated by antecedent research findings (Chuanbing et al., 2022). Portions of this text were previously published as part of a preprint (https://www.researchsquare.com/article/rs-3679613/v1).
Identification of differentially expressed BMGs (DE-BMGs)
Differentially expressed genes (DEGs) between CRC and normal samples were discerned within the TCGA-CRC and GSE41258 datasets (Wan et al., 2024), with an emphasis on an adjusted P-value of < 0.05 and a log_2_| fold change of > 1. This differential analysis was executed using the DESeq2 package (v 3.5.0.3) (Wu et al., 2021) and the limma package (v 3.54.1) (Jiang et al., 2024). Visual representations of the differential analysis results were generated through volcano plots and heatmaps, facilitated by the ggplot2 package (v 3.3.6) (Ito & Murphy, 2013) and the pheatmap package (v 1.0.12), respectively. The identification process of differentially expressed BMGs (DE-BMGs) was finally determined by assessing the overlap between BMGs, TCGA-DEGs, and GEO-DEGs using Venny 2.1.
Functional enrichment analysis of DE-BMGs
The Kyoto Encyclopedia of Genes and Genomes (KEGG), a database resource of utilities for understanding higher functions and biological systems from molecular-level information, especially large molecule (Kanehisa et al., 2025) screening criterion. The GO and KEGG analysis were executed employing the clusterProfiler package (v 4.2.2) (Kanehisa et al., 2023). The significance threshold for both GO and KEGG analyses was set at an adjusted P-value of < 0.05. The P-values were corrected using the Benjamini–Hochberg procedure.
Protein–protein interaction and least absolute shrinkage and selection operator analyses
PPI networks are composed of proteins that interact with each other to participate in various aspects of life processes such as biological signal transmission, gene expression regulation, energy and substance metabolism, and cell cycle regulation. A PPI network was constructed utilizing the DE-BMGs using the STRING database (https://string-db.org/) with a set interaction score threshold of ≥0.4. The PPI network’s graphical representation was facilitated by Cytoscape (v 3.9.0) (Zeng et al., 2022). Furthermore, the cytoHubba plugin was employed to conduct maximal clique centrality (MCC) analysis, and the top 10 ranked genes were identified as hub genes. Based on these identified hub genes, a least absolute shrinkage and selection operator (LASSO) analysis with 10-fold cross-validation was conducted using the cv.glmnet function from the glmnet package (family = binomial, alpha = 1, type.measure = “auc”). The optimal model was determined at the point where the model error was minimized. Feature genes were subsequently selected according to the optimal lambda value. For the evaluation of the feature genes’ diagnostic capabilities, the pROC package (v 1.18.0) (Beyene & El Ghouch, 2022) was employed to calculate the area under the curve (AUC) of receiver operating characteristic (ROC) curves, with an AUC value exceeding 0.7 indicating robust biomarker potential.
Expression-level validation in datasets
Biomarker expression profiles were garnered from the TCGA-CRC, GSE41258, and GSE39582 datasets. Comparative analysis of these profiles was conducted using the Wilcoxon rank-sum test (P < 0.05), and the results were visually presented using the ggpubr package (v 0.5.0) (Wang et al., 2023b).
Construction of the nomogram
In the TCGA-CRC dataset, a nomogram comprising biomarkers was constructed using the rms package (v 6.3-0) to predict the probability of developing CRC. The distribution of data was predefined using the datadist function from the rms package to facilitate subsequent model fitting and nomogram construction. A logistic regression model was developed using the lrm function. The model was fitted using the dataset data with the parameters x = TRUE and y = TRUE to retain the design matrix and response variable for downstream analysis. The established logistic regression model was subsequently converted into a nomogram using the nomogram function to provide a visual predictive tool. The inverse logit transformation function (plogis) was specified via the funlabel parameter to convert the linear predictor values into diagnostic probabilities. This probability axis was labeled “Risk of Disease” using the funlabel parameter. The display of the linear predictor scale was suppressed (lp = FALSE) to focus interpretation on probability. Custom tick marks for the probability axis were defined using the fun.at parameter to specify clinically relevant probability intervals. The model’s internal validation was assessed using bootstrap resampling with 1,000 repetitions via the validate function, and a calibration curve was plotted to compare predicted probabilities against observed outcomes.
Survival and clinical features analysis
Classifying the TCGA-CRC dataset into groups with high and low expression levels was based on the median expression values of the biomarkers. Subsequently, Kaplan–Meier (K–M) survival curves were generated with a two-sided log-rank test used by the survminer package (v 0.4.9) to compare overall survival differences between the two groups (P < 0.05). Clinical data, including age, gender, stage, and pathological classifications (pathologic T, pathologic N, and pathologic M), were extracted from the TCGA-CRC dataset to facilitate survival analysis. The distributional differences in these clinical parameters between the groups with divergent biomarker expression levels were assessed utilizing the tableone package (v 0.13.2) and chi-square test (P < 0.05). A multivariate Cox proportional hazards (PH) analysis was performed using the survival package, incorporating age, gender, tumor stage, pathologic_M, pathologic_N, pathologic_T, and the identified biomarkers. Variables with hazard ratios (HRs) significantly different from 1 and P-values < 0.05 were considered statistically significant. Prior to the multivariate analysis, the PH assumption was verified for all included variables, with a P-value > 0.05 indicating no violation of the assumption.
Ingenuity pathway analysis
IPA was performed on the DE-BMGs (Mahajan, Sarkar & Mondal, 2024). Then, the pathways that were significantly associated with the biomarkers were visualized (|Z-score| > 2).
Construction of a ceRNA regulatory network
To further understand the upstream regulatory mechanism of biomarkers, we constructed a competitive endogenous RNA (ceRNA) regulatory network. Differentially expressed long noncoding RNAs (DE-lncRNAs) and microRNAs (DE-miRNAs) between CRC and normal groups were discerned through the DESeq2 package (v 1.36.0) (Zhang et al., 2024), utilizing data from the TCGA-CRC dataset, with a stringent selection criterion of an adjusted P-value <0.05. The miRWalk database (http://mirwalk.umm.uni-heidelberg.de/) (Zeng et al., 2022) was deployed to predict miRNAs targeting the identified biomarkers. Co-expression patterns of the DE-miRNAs with predicted miRNAs were established by intersecting both sets. Additionally, lncRNAs potentially regulating these co-expressed miRNAs were identified through the miRanda algorithm (http://www.microrna.org/). These co-DE-lncRNAs were further scrutinized for their overlapping presence with the identified lncRNAs. Relationships between mRNAs, miRNAs, and lncRNAs, particularly where the expression of miRNAs is inversely correlated with that of the biomarkers and lncRNAs, were elucidated. This intricate network was visualized using the Cytoscape software (v 3.9.0) (Bhola et al., 2025), leading to the construction of an integrated lncRNA-miRNA-mRNA network predicated on these relational dynamics.
Construction of the biomarkers–drug interaction network
To develop biomarker-targeted drugs and provide reference for targeted therapy of CRC, pharmacological agents directed at the biomarkers were identified using the DGIdb database (https://dgidb.org/). A comprehensive network mapping key gene–drug interactions was then established and rendered using Cytoscape software (version 3.9.0) (Leng et al., 2022).
The quantitative real-time polymerase chain reaction
Initially, cell lines were propagated, with three replicates of each line being utilized. Specifically, NCM-460 (normal colonic epithelial cells) was purchased from ORiCells Biotechnology, and LOVO, HCT116, LS174T, and LS513 (colon cancer cells) were purchased from Procell. All cell lines were authenticated by short tandem repeat profiling and confirmed to be free from mycoplasma contamination before experimental use. The NCM-460 cell line was resuscitated in DMEM complete culture medium. The HCT116 cell line was resuscitated in McCOY’s 5A complete culture medium. The LOVO cell line was resuscitated in F12K complete culture medium. The LS174T and LS513 cell lines were resuscitated in 1640 complete culture medium. All the media contained 10% fetal bovine serum and 1% penicillin-streptomycin. After the cell density reached 90%, the cells were passaged at a ratio of 1:3 in T25 flasks. All the cells were cultured in an incubator at 37 °C with 5% CO_2_. The cell culture medium was aspirated and discarded, and an appropriate amount of 0.25% trypsin was added for digestion. When most of the cells became round and separated from each other, an appropriate amount of medium containing fetal bovine serum was added to terminate the digestion, and a single-cell suspension was prepared. Subsequently, cell samples were lysed with TRIZOL reagent, and the total RNA was isolated as per the protocols provided by the manufacturer. RNA concentration was quantified using the NanoPhotometer N50. Following quantification, reverse transcription was performed to synthesize cDNA with the SureScript First strand cDNA synthesis kit. The quantitative real-time (qRT-PCR) reactions included three µL of the reverse transcription product, five µL of 2xUniversal Blue SYBR Green qPCR Master Mix, and one µL each of forward and reverse primers, with all primer sequences detailed in Table 1. Primer specificity was confirmed via melt-curve analysis following amplification. GAPDH was utilized as the internal control, and gene expression levels were calculated using the 2^−ΔΔCT^ method (Talwar et al., 2023). Graphpad Prism 5 was employed for graphical representation and P-value computations.
Table 1: The information of primer sequence for quantitative real-time PCR (qRT-PCR).
Statistical analysis
All bioinformatics analyses were conducted using the R programming language. The Wilcoxon rank-sum test was utilized to differentiate the datasets between distinct groups.
Results
The 63 DE-BMGs were strongly connected with immunological and metabolic processes
First, 14,571 DEGs associated with TCGA-CRC were identified, comprising 10,241 upregulated and 4,330 downregulated genes (Figs. 1A–1B, Table S2). Concurrently, analysis of 596 GEO-DEGs between CRC and normal groups resulted in the identification of 241 upregulated and 382 downregulated genes (Figs. 1C–1D, Table S3). It is important to note that the number of DEGs identified in the TCGA dataset was substantially higher than that from the GEO dataset. This discrepancy is expected and commonly observed, primarily attributable to differences in their technological platforms: the RNA sequencing technology employed by TCGA offers higher sensitivity and a broader dynamic range compared to the microarray technology used in the GEO, enabling the detection of more low-abundance transcripts.
Screening and functional enrichment of differentially expressed- butyrate metabolism related genes (DE-BMGs).(A) The volcano map of The Cancer Genome Atlas-differentially expressed genes (TCGA-DEGs) in normal and colorectal cancer (CRC). The red dot represents up-regulated DEG, the green dot represents down-regulated DEG, and the black genes are not statistically significant. (B) The heatmap of TCGA-DEGs in normal and CRC. Red is up-regulated DEG, blue is down-regulated DEG. (C) The volcano map and (D) heatmap of Gene Expression Omnibus (GEO)-DEGs in normal and CRC. (E) The Venn diagram of common genes by overlapping genes up-regulated in TCGA-DEGs, genes up-regulated in GSE-DEGs, and BMGs. (F) The Venn diagram of common genes by intersecting genes down-regulated in TCGA-DEGs and genes down-regulated in GSE-DEGs and BMGs. (G) Gene ontology (GO) enrichment analysis of DE-BMGs. (H) Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analysis of DE-BMGs.
Subsequently, 63 DE-BMGs were distinguished through the intersection of BMGs, TCGA-DEGs, and GEO-DEGs (Figs. 1E–1F). The enrichment analysis revealed that these DE-BMGs are involved in 526 biological process entries, 24 cellular component entries, and 48 molecular function entries, as well as four KEGG pathways. These GO terms suggest that DE-BMGs are significantly involved in functions related to signaling receptor activator activity and receptor ligand activity, among others (Fig. 1G, Table S4). The KEGG pathway analysis further indicated their involvement in processes such as bile secretion and the peroxisome proliferator-activated receptor (PPAR) signaling pathway (Fig. 1H, Table S5).
CCND1, CXCL8, MMP3, MYC, TIMP1, and VEGFA emerged to be butyrate metabolism-related biomarkers in CRC
A PPI network was meticulously engineered, comprising 63 nodes and 223 interconnecting edges, wherein key regulatory genes such as VEGFA, MYC, MMP9, CXCL8, CCND1, TIMP1, SERPINE1, PLAU, SPP1, and MMP3 were distinguished as hub genes (Fig. 2A). Following a rigorous LASSO regression analysis, a subset of six salient feature genes—CCND1, CXCL8, MMP3, MYC, TIMP1, and VEGFA—were elucidated (Figs. 2B, 2C). The diagnostic prowess of these feature genes was substantiated by ROC curve analyses, which yielded AUC metrics surpassing the 0.85 threshold (CCND1 = 0.979, CXCL8 = 0.888, MMP3 = 0.933, MYC = 0.958, TIMP1 = 0.951, and VEGFA = 0.955), denoting exceptional predictive validity (Fig. 2D). Subsequent evaluative measures utilizing the GSE39582 dataset corroborated the robustness of the feature genes’ diagnostic accuracy, evidenced by AUC values in excess of 0.80 (CCND1 = 0.985, CXCL8 = 0.879, MMP3 = 0.911, MYC = 0.968, TIMP1 = 0.993, and VEGFA = 0.812) (Fig. 2E). Thereupon, these feature genes were definitively recognized as biomarkers (CCND1, CXCL8, MMP3, MYC, TIMP1, and VEGFA) for ensuing investigational pursuits.
Identification of biomarkers.(A) PPI network of hub gene based on MCC algorithm. The darker the color, the higher the MCC score of the gene. (B) The variation characteristic of the coefficient of variables in the Least absolute shrinkage and selection operator (LASSO) regression model based on the DE-MRFGs. (C) Cross-validation for tuning parameter selection in the LASSO regression analysis. The horizontal coordinate is log(Lambda), and the ordinate represents the error of cross validation. (D) Receiver Operating Characteristic (ROC) curves for six biomarkers in predicting the diagnostic performance in the TCGA-CRC. (E) ROC curves for six biomarkers in predicting the diagnostic performance in the GSE39582 datasets. AUC, area under curve.
All biomarkers were considerably higher in CRC
The analytical outcomes from the gene expression study indicated that biomarker levels were significantly higher in the CRC cohort relative to the control group within the training dataset (P < 0.0001) (Fig. 3A). Furthermore, examination of the expression trajectories for these biomarkers within the GSE41258 and GSE39582 datasets revealed a consistent pattern congruent with the findings from the training set (Figs. 3B–3C). Figure 3D depicted a nomogram of six biomarkers that predict diagnostic effectiveness in CRC. The calibration curve underscored the nomogram’s exceptional precision (Fig. 3E).
Expression of six biomarkers in CRC and normal samples from different datasets.(A) TCGA-CRC. (B) GSE41258. (C) GSE39582. (D) Feature genes nomogram. Each factor corresponds to a score, and the scores of each factor are added together to correspond to the total points. (E) Calibration curves of the nomogram. The horizontal coordinate is the predicted probability, and the vertical coordinate is the actual probability. The 45-degree line represents the ideal prediction.
Greater expression of CXCL8 and MMP3 might predict greater survival than TIMP1
Survival analysis revealed discernible disparities in survival rates among the differentially expressed groups for CXCL8, MMP3, and TIMP1 (Figs. 4A–4C). Specifically, higher survival rates were associated with elevated expressions of CXCL8 and MMP3, whereas an increased expression of TIMP1 was linked to lower survival rates. No significant survival correlation was observed for the other biomarkers (Fig. S1). Further examination of clinical parameters showed notable variations in survival rates across different age groups, disease stages, and pathologic classifications T, N, and M (Figs. 4D–4H). These variations were statistically substantiated by the analysis of variance, the details of which are presented in Tables 2–7. Notably, MMP3 expression levels significantly varied by disease stage and pathologic-T status, while MYC expression was significantly distinct in pathologic-T (Table 4). Differential expression of TIMP1 was markedly evident in pathologic-M (Table 6), and variations in VEGFA expression were significantly associated with both disease stage and pathologic-N status (Table 7). The multivariate Cox regression analysis revealed that age (P < 0.05, HR = 2.739 [1.755–4.276]), tumor stage (P < 0.05, HR = 8.639 [1.766–42.265]), pathologic_M (P < 0.05, HR = 1[1–1]), pathologic_N (P < 0.05, HR = 0.221 [0.079–0.62]), and TIMP1 expression (P < 0.05, HR = 26.493 [3.003–233.714]) served as independent prognostic factors (Fig. 5A). All these variables satisfied the PH assumption (P > 0.05) (Fig. 5B).
Survival and clinical feature analysis of biomarkers.Kaplan–Meier (K–M) curves for the overall survival (OS) of patients in the high-expression group and low-expression group with biomarkers (A) CXCL8, (B) MMP3, and (C) TIMP1. K–M curves for OS of patients in the high-expression group and low-expression group with different clinical features, including (D) age, (E) pathologic M stage, (F) tumor stage, (G) pathologic N stage, and (H) pathologic T stage.
Table 2: Distribution of clinical features in high-low CCND1 expression group.
Table 3: Distribution of clinical features in CXCL8 high-low expression group.
Table 4: Distribution of clinical features in MMP3 high-low expression group.
Table 5: Distribution of clinical features in MYC high-low expression group.
Biomarkers were integral to pathways influencing the tumor microenvironment
IPA illuminated that these biomarkers were integral to pathways, notably those influencing the tumor microenvironment (Fig. 6A). Figure 6B illustrates a comprehensive molecular interaction network diagram of tumor microenvironment (TME) pathways, visualizing the intricate interplay of cells, molecules, and signaling cascades within the TME.
Table 6: Distribution of clinical features in MYC high-low expression group.
Table 7: Distribution of clinical features in VEGFA high-low expression group.
Complex regulatory relationships in the lncRNA–miRNA–mRNA network
A comprehensive analysis identified 346 differentially expressed microRNAs (DE-miRNAs) between the CRC and control groups, of which 141 miRNAs were upregulated, and 205 were downregulated in CRC (Table S6). Moreover, 4,079 DE-lncRNAs were discerned between the two groups (Table S7). Biomarker analysis yielded predictions of 1,264 miRNAs targeting specific genes. Subsequent integrative analyses selected 23 co-DE-miRNAs by mapping the targeting miRNAs with DE-miRNAs (Fig. 7A). Additionally, 460 lncRNAs targeting the co-DE-miRNAs were identified via the miRanda algorithm. From this cohort, 104 co-DE-lncRNAs were chosen based on their overlap with the targeting lncRNAs and DE-lncRNAs (Fig. 7B). An intricate lncRNA–miRNA–mRNA regulatory network was then constructed, which included relationships between predicted mRNAs and miRNAs, as well as miRNAs and lncRNAs (Fig. 7C). This network featured four biomarkers, four miRNAs (hsa-miR-1295a, hsa-miR-760, hsa-miR-346, and hsa-miR-1270), and 51 lncRNAs (SNHG12, AC133785.1, etc.). Specific mRNA–miRNA interactions such as CXCL8-hsa-miR-1295a were identified, as well as miRNA–lncRNA pairs like CXCL8-SLC9A3-AS1 (Table S8).
The multivariable Cox regression and proportional hazards assumption testing.(A) Independent prognostic factors. (B) All these variables satisfied the PH assumption (P > 0.05).
The ingenuity pathway analysis (IPA).(A) IPA map for significant classical pathway enrichment results (top 10). Orange indicates that the pathway is activated, blue indicates that the pathway is inhibited, and the darker the color, the more active the action. (B) Interactions among biomolecules in tumor microenvironment pathways. Red is the upregulated gene, green is the downregulated gene, blue molecules in the network are suppressed, and orange molecules are activated.
The competing endogenous RNA (ceRNA) regulatory network.(A) The Venn diagram of 23 co-DE-miRNAs by overlapping the predicted microRNAs (miRNAs) and DE-miRNAs. (B) The Venn diagram of 104 co-DE-lncRNAs by overlapping the predicted long noncoding RNAs (lncRNAs) and DE-lncRNAs. (C) Construction of the lncRNA–miRNA–mRNA network. Blue indicates biomarkers, yellow indicates lncRNAs, and pink indicates miRNAs.
By focusing on biomarkers, a range of medications might be able to treat CRC
Utilizing the DGIdb database, five biomarkers—CCND1, CXCL8, MMP3, MYC, and VEGFA—were identified as targets for various therapeutic drugs (Fig. 7). The associated network encompasses a range of drugs, including 20 for CCND1 such as PALBOCICLIB and BRICICLIB; 56 for CXCL8 such as ABX-IL8 and HUMAX-IL8; 12 targeting MMP3 like MARIMASTAT, PRINOMASTAT, and similar compounds; 40 for MYC comprising BIZELESIN and PROTOPORPHYRIN, among others; and 38 for VEGFA, including AFLIBERCEPT, ZALTRAP, and related drugs (Fig. 8).
Prediction of therapeutic agents of biomarkers.Network of targeted drug for five biomarkers.
Validation of biomarker expression by qRT-PCR produced results consistent with the public databases
To corroborate the expression levels of specific biomarkers, qRT-PCR analyses were conducted on five distinct cell line samples, differentiating between cancerous and non-cancerous tissues. It was observed that the expression levels of CCND1, CXCL8, MYC, and VEGFA were significantly higher in the cancer cell lines compared to the normal tissue samples (P < 0.05), a finding in alignment with existing public database records (Figs. 9A–9D). In contrast, the expressions of MMP3 and TIMP1 did not exhibit significant differences between the cancerous and non-cancerous groups (P > 0.05) (Figs. 9E–9F).
Expression analysis of biomarkers in the cancer and normal groups using the quantitative real-time PCR (qRT-PCR) method.(A) CCND1, (B) CXCL8, (C) MYC, (D) VEGFA, (E) MMP3, and (F) TIMP1.
Discussion
CRC stands as one of the predominant causes of mortality globally, posing a significant health risk. Butyrate, a saturated SCFA derived from the microbial fermentation of dietary fiber, exhibits potential in suppressing tumorigenesis (Hanus et al., 2021), yet its precise mechanisms in CRC require further elucidation. Our study employed bioinformatic approaches to identify six BMGs as candidate diagnostic biomarkers for CRC (Li et al., 2024). KEGG pathway enrichment analysis indicated the involvement of these genes in the PPAR signaling pathway, among others (Kim et al., 2024). PPAR signaling plays a critical role in cellular processes, and its dysregulation has been implicated in colon carcinogenesis (Pinna, 2023). This suggests a potential link between butyrate metabolism and CRC development through PPAR-related mechanisms (Hernandez-Quiles, Broekema & Kalkhoven, 2021). In the scope of our analysis, we discerned six critical genes (CCND1, CXCL8, MMP3, MYC, TIMP1, and VEGFA) related to butyrate metabolism, with substantive roles in CRC pathogenesis. Specifically, CyclinD1 (CCND1), a pivotal regulator of cell cycle progression, emerged as a gene of interest. A robust body of epidemiological research has evaluated the association between the polymorphism G870A within the CCND1 gene and CRC risk, indicating its potential significance in oncogenesis (Wang et al., 2023a). CXCL8 promotes tumor progression through PI3K/Akt/mTOR signaling activation and serves as a key regulator of TME dynamics by recruiting immunosuppressive cells and facilitating M2-like macrophage polarization (Pennel et al., 2022), thereby creating an immunotolerant niche (Chen et al., 2024). Our analysis corroborates previous reports of elevated matrix metalloproteinase (MMP) expression in cancerous tissues, with MMP overexpression correlating with advanced CRC features, including lymph node metastasis, higher TNM staging, and increased tumor differentiation (Liu et al., 2023). Previous studies have demonstrated that elevated blood MMP-1 levels are associated with poorer five-year cancer-specific survival in colon cancer (HR = 2.0, 95% CI [1.1–3.9]), a trend that persists at 10-year follow-up (Jonsson et al., 2021). Furthermore, variations in the expression of the c-MYC proto-oncogene (MYC) and its related pathways have been observed in tumor samples, with higher MYC levels frequently seen in older, overweight, or obese patients. Notably, an increase in the MYC gene copy number was detected in 35% of tumors. Detailed transcriptomic studies have suggested that early onset colorectal cancer may be stratified into distinct categories based on MYC expression patterns, which also seem to influence the likelihood of CRC in younger populations (Jonsson et al., 2021). TIMP1 protein shows elevated concentrations in colon cancer tissues and lymph node metastases, with Cox regression analyses identifying it as an independent prognostic factor for CRC outcomes. Therapeutic suppression of TIMP1 demonstrates potential by reducing tumor growth and dissemination while promoting apoptosis through FAK-PI3K/AKT and MAPK pathway modulation (Kehusmaa et al., 2025). VEGFA, another significantly upregulated gene in our analysis, contributes to TME formation by promoting angiogenesis and establishing abnormal vascular networks that exacerbate hypoxia and nutrient deprivation (Li et al., 2023).
IPA demonstrated that these biomarkers collectively contribute to the remodeling of the TME through multiple mechanisms. CXCL8 acts as a potent chemokine that recruits immunosuppressive cells and promotes angiogenesis, thereby fostering an immune-tolerant microenvironment. VEGFA enhances vascular permeability and endothelial cell proliferation, establishing a dysfunctional vascular network. The MMP3/TIMP1 axis regulates extracellular matrix remodeling, facilitating tumor invasion and metastasis. MYC drives the metabolic reprogramming of both tumor and stromal cells, while CCND1 promotes cell cycle progression and may influence TME composition through the secretion of inflammatory mediators.
Of the six candidate biomarkers investigated, our qRT-PCR validation in colon cancer cell lines (LOVO, HCT116, LS174T, and LS513) consistently confirmed significant upregulation of CCND1, CXCL8, MYC, and VEGFA compared to normal colonic epithelial cells (NCM-460). However, we acknowledge a critical inconsistency regarding MMP3 and TIMP1. While our bioinformatic analysis of clinical samples nominated them as potential biomarkers, our experimental validation in cell lines failed to demonstrate statistically significant differential expression between normal and malignant colonic cells. Several methodological considerations may explain this discrepancy. First, our bioinformatic analysis utilized large clinical datasets (e.g., TCGA) containing diverse tumor subtypes and stromal components, which may capture expression patterns not fully recapitulated in individual cell lines. Second, the tumor microenvironment in clinical specimens includes complex cellular interactions that might regulate MMP3 and TIMP1 expressions in ways not reproduced in monoculture systems. Finally, the roles of MMP3 and TIMP1 in CRC appear highly context dependent, potentially requiring specific tumor–stroma interactions or particular molecular subtypes that may not be represented in the selected cell lines. These findings highlight the importance of complementing bioinformatic predictions with experimental validation across multiple model systems to fully characterize biomarker behavior.
K–M analyses have identified significant prognostic differences associated with three biomarkers (CXCL8, MMP3, and TIMP1), with survival rates improving in correlation with high CXCL8 and MMP3 expression but declining with high TIMP1 expression. The observation that elevated expression levels of CXCL8 and MMP3 were correlated with improved survival outcomes in our analysis contrasts with numerous reports describing their pro-tumorigenic roles (Kastinen et al., 2023). This paradox could be explained by the context-dependent functions of these genes, where their effects may vary based on tumor stage, molecular subtypes, or the specific composition of the TME. Future studies with larger, well-stratified cohorts are needed to resolve these apparent contradictions.
The ceRNA hypothesis has emerged as a prominent framework for exploring novel RNA interaction mechanisms. In this context, lncRNAs are understood to function as ceRNA “sponges”, binding miRNAs and, thus, influencing the degradation of miRNA target mRNA (Salmena et al., 2011). We constructed a lncRNA–miRNA–mRNA interaction network that encompasses four biomarkers, four miRNAs, and 51 lncRNAs based on predicted regulatory relationships between mRNA–miRNA and miRNA–lncRNA. Specific regulatory pairs identified include CXCL8-hsa-miR-1295a, among others, for mRNA-miRNA and CXCL8-SLC9A3-AS1, among others, for miRNA-lncRNA. Currently, the regulatory interactions of CXCL8-hsa-miR-1295a and CXCL8-SLC9A3-AS1 in CRC have not been reported in the literature. Thus, these regulatory networks present promising avenues for future research of butyrate metabolism in CRC and potential therapeutic strategies in CRC treatment.
The DGIDB has identified 156 potential drugs that target five biomarkers (CCND1, CXCL8, MMP3, MYC, and VEGFA) implicated in CRC. These include, for instance, Bevacizumab, targeting VEGFA; Palbociclib, inhibiting CCND1; and Marimastat, which acts on MMP3. It is crucial to emphasize that these are in silico predictions requiring experimental validation. Among these, Cetuximab is noted for its action on the pathways of MYC, CXCL8, and CCND1 (Sakata & Larson, 2022). Bevacizumab is reported to be therapeutically active in CXCL8 and VEGFA pathways (Morris et al., 2023), while Fluorouracil affects both CCND1 and VEGFA pathways (Nindra, Shahnam & Mahon, 2022). These three drugs are acknowledged as effective in CRC treatment (Benson et al., 2024).
Our research has yielded additional intriguing insights. For instance, vitamin E has been found to modulate the CCND1 pathway. A meta-analysis examining the link between serum vitamin E levels and CRC risk in case-control studies found that CRC patients typically had lower serum vitamin E concentrations (Ajebli et al., 2024). Dietary factors, such as the consumption of red meat, have been associated with an elevated risk of CRC. Conversely, reducing red meat intake and increasing the consumption of foods rich in antioxidants and chemoprotective substances could substantially diminish CRC risk (Papier et al., 2025). Moreover, a higher intake of dietary fiber has a proven inverse relationship with CRC risk. Fiber stimulates butyrogenic bacteria within the gut microbiota, which produces butyrate, a substance with pronounced anti-cancer properties. On the other hand, a high-fat diet is linked to increased CRC risk due to its influence on bile acid metabolism; it promotes the microbial conversion of bile acids to the tumorigenic deoxycholic acid. Therefore, dietary interventions that increase fiber and reduce fat intake can significantly influence the gut microbiota, adjusting the levels of butyrate and deoxycholic acid in the colon and potentially serving as preventative measures against CRC in healthy individuals (Mahjourian et al., 2025).
Our study has several limitations that should be considered when interpreting the results. First, most findings rely on computational analysis of public datasets (TCGA, GEO), which may carry inherent biases in sample collection, sequencing platforms, and batch effects. Second, the identified biomarkers and pathways were not validated in animal models or extensive clinical patient cohorts beyond our limited qRT-PCR validation. Third, only a few genes were experimentally validated in CRC cell lines, which restricts the strength of translational conclusions. Fourth, although ceRNA networks and pathway enrichments were constructed, no direct mechanistic assays were performed to confirm regulatory interactions. Fifth, the identified drug–gene interactions are computational predictions and require pharmacological testing in vitro/in vivo to establish clinical relevance. Sixth, TCGA/GEO analyses rely on bulk RNA-seq snapshots, which cannot fully capture the temporal dynamics or cell-type specific effects of butyrate metabolism. Finally, the study focuses on host gene expression but does not directly include microbial abundance or butyrate metabolite measurements.
Conclusions
In summary, this study identifies a robust BMG signature (CCND1, CXCL8, MYC, and VEGFA) with significant implications for CRC diagnosis and TME regulation. While bioinformatic predictions nominated additional candidates, their context-dependent expression highlights the complexity of biomarker validation. Future work should focus on the functional validation of the proposed ceRNA networks in advanced models, coupled with clinical correlation in stratified patient cohorts. Integrating host gene expression with microbial and metabolite data will further elucidate the diet-microbe-host axis in CRC, potentially guiding novel preventive and therapeutic strategies.
Supplemental Information
10.7717/peerj.20942/supp-1Supplemental Information 1Differentially expressed BMGs (DE-BMGs)
10.7717/peerj.20942/supp-2Supplemental Information 2Identification results of differentially expressed genes (DEGs) associated with TCGA-CRCStatistics of upregulated and downregulated gene counts.
10.7717/peerj.20942/supp-3Supplemental Information 3Analysis results of differentially expressed genes (DEGs) between CRC and normal groups in the GEO datasetStatistics of upregulated and downregulated gene counts.
10.7717/peerj.20942/supp-4Supplemental Information 4Analysis of differentially expressed biomarker genes (DE-BMGs) with significant enrichment
10.7717/peerj.20942/supp-5Supplemental Information 5KEGG pathway analysis resultsBiological processes significantly associated with differentially expressed biomarker genes (DE-BMGs).
10.7717/peerj.20942/supp-6Supplemental Information 6Comprehensive analysis results of differentially expressed microRNAs (DE-miRNAs)
10.7717/peerj.20942/supp-7Supplemental Information 7Identification results of differentially expressed long non-coding RNAs (DE-lncRNAs) between the two groups
10.7717/peerj.20942/supp-8Supplemental Information 8Identification results of specific mRNA-miRNA and miRNA-lncRNA interactions
10.7717/peerj.20942/supp-9Supplemental Information 9Survival and clinical feature analysis of biomarkersA .Kaplan-Meier curves for the overall survival (OS) of patients in the high-expression group and low-expression group with biomarkers CCND1; B.MYC and VEGFA . C.Kaplan-Meier curves for OS of patients in the high-expression group and low-expression group with different clinical features, CCND1, MYC and VEGFA.
10.7717/peerj.20942/supp-10Supplemental Information 10Data processing
10.7717/peerj.20942/supp-11Supplemental Information 11MIQE checklist
10.7717/peerj.20942/supp-12Supplemental Information 12Raw data
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Ajebli M Meretsky CR Akdad M Amssayef A Hebi M 2024 The role of dietary vitamins and antioxidants in preventing colorectal cancer: a systematic review Cureus 16e 6427710.7759/cureus.6427739130946 PMC 11315617 · doi ↗ · pubmed ↗
- 2Avuthu N Guda C 2022 Meta-analysis of altered gut microbiota reveals microbial and metabolic biomarkers for colorectal cancer Microbiology Spectrum 10e 000132210.1128/spectrum.00013-2235766483 PMC 9431300 · doi ↗ · pubmed ↗
- 3Benson AB Venook AP Adam M Chang G Chen YJ Ciombor KK Cohen SA Cooper HS Deming D Garrido-Laguna I Grem JL Haste P Hecht JR Hoffe S Hunt S Hussan H Johung KL Joseph N Kirilcuk N Krishnamurthi S Malla M Maratt JK Messersmith WA Meyerhardt J Miller ED Mulcahy MF Nurkin S Overman MJ Parikh A Patel H Pedersen K Saltz L Schneider C Shibata D Shogan B Skibber JM Sofocleous CT Tavakkoli A Willett CG Wu C Gurski LA Snedeker J Jones F 2024 Colon cancer, Version 3.2024, NCCN clinical practice guidelines in oncology Journal of the National Comprehens · doi ↗ · pubmed ↗
- 4Beyene KM El Ghouch A 2022 Time-dependent ROC curve estimation for interval-censored data Biometrical Journal 641056107410.1002/bimj.20200038235523738 · doi ↗ · pubmed ↗
- 5Bhola N Bhardwaj S Bareja C Saluja D 2025 Integrative analysis of angiogenesis-related genes highlights prognostic indicators in colorectal cancer through single-cell sequencing and immune infiltration analysis Discover Oncology 16190510.1007/s 12672-025-03614-941105335 PMC 12534639 · doi ↗ · pubmed ↗
- 6Bray F Laversanne M Sung H Ferlay J Siegel RL Soerjomataram I Jemal A 2024 Global cancer statistics 2022: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries CA: A Cancer Journal for Clinicians 7422926310.3322/caac.2183438572751 · doi ↗ · pubmed ↗
- 7Chen D Jin D Huang S Wu J Xu M Liu T Dong W Liu X Wang S Zhong W Liu Y Jiang R Piao M Wang B Cao H 2020 Clostridium butyricum, a butyrate-producing probiotic, inhibits intestinal tumor development through modulating Wnt signaling and gut microbiota Cancer Letters 46945646710.1016/j.canlet.2019.11.01931734354 · doi ↗ · pubmed ↗
- 8Chen Y Zhang S Wu J Xu D Wei C Li F Xie G 2024 Exploring the link between serum uric acid and colorectal cancer: insights from genetic evidence and observational data Medicine 103e 4059110.1097/md.000000000004059139809176 PMC 11596604 · doi ↗ · pubmed ↗
