Identification of Potential Functional Modules and Diagnostic Genes for Crohn’s Disease Based on Weighted Gene Co-expression Network Analysis and LASSO Algorithm
Ruiquan Wang, Hongqi Zhao

TL;DR
This study identifies key genes and modules for diagnosing Crohn’s disease using network analysis and a machine learning algorithm, offering a promising diagnostic tool.
Contribution
A novel diagnostic model for Crohn’s disease using WGCNA and LASSO, with high accuracy and identified hub genes.
Findings
651 differentially expressed genes were linked to inflammation and leukocyte chemotaxis pathways.
The turquoise module was associated with macrophages M2 in CD patients.
A diagnostic model with eight hub genes achieved 0.94 AUC in training and 0.941 in validation sets.
Abstract
Accurate diagnosis of Crohn’s disease (CD) is paramount due to its resemblance to other inflammatory bowel diseases. Early and precise diagnosis plays a pivotal role in tailoring personalized treatments, thereby enhancing the quality of life for CD patients. Differential gene expression analysis was conducted to identify genes from the mRNA expression profiles of CD samples, followed by pathway enrichment analysis. The immune cell infiltration levels of each CD patient sample were assessed. Using weighted gene co-expression network analysis, key gene modules linked to CD were found. Hub gene identification was made easier by the construction of protein–protein interaction networks. Next, utilizing the Least Absolute Shrinkage and Selection Operator on the hub genes in the training set, a diagnostic model was created. The accuracy of the model was then confirmed using a different…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
Figure 6
Figure 7Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsInflammatory Bowel Disease
Introduction
Crohn’s disease (CD) is one of the inflammatory bowel diseases (IBD). Unlike other inflammatory diseases, CD is characterized by frequent relapses that are difficult to control, predominantly affecting patients between the ages of 15 and 35.^1^ Stimulation of the immune system induces partial destruction of the intestinal mucosa, resulting in symptoms such as diarrhea, rectal bleeding, weight loss, fever, abdominal pain, and abdominal cramps, leading to a decreased quality of life.^2, 3^
The diagnosis of CD is primarily based on clinical manifestations, histology, endoscopy, radiology, and/or biochemical examinations.^4^ Although a complete pathological examination can provide a more accurate diagnosis, it is challenging to perform biopsies on patients during disease flares. Furthermore, distinguishing CD from other intestinal diseases based on endoscopic examinations, CT scans, and histological features is difficult in clinical practice.^5,6^ Delayed diagnosis and treatment of CD due to difficulties in differentiation can lead to severe complications such as intestinal perforation and fistula.^7^ Early diagnosis is crucial to ensure timely medical intervention, control disease progression, and improve patients’ quality of life. Therefore, making the correct differential diagnosis for CD is of paramount importance.
Weighted gene co-expression network analysis (WGCNA) is applied to mine module information from gene expression data obtained from microarrays.^8^ It allows for gene clustering, forming modules based on similar gene expression patterns to explore the association between modules and specific features, such as patients’ clinical information.^8^ Currently, WGCNA has proven to be effective in identifying therapeutic targets and biomarkers in various diseases. Ai et al^9^ constructed a diagnostic model based on 10 genes using asthma-related features, which demonstrates good diagnostic performance. Hsueh et al^10^ developed an optimal diagnostic model for a hereditary neuromuscular disorder through the screening of clinical patients, achieving higher diagnostic rates compared to other studies, with detection rates exceeding 70% in certain cases. It is evident that WGCNA is a powerful tool that enables researchers to uncover hidden patterns and correlations in gene expression data, thereby elucidating gene regulatory mechanisms, identifying biomarkers, predicting gene functions, and providing novel insights and revelations for biological research and disease diagnosis. In this study, we aim to apply WGCNA to extract key genes linked to CD and construct a diagnostic model to assist clinicians in making accurate diagnoses of CD.
In this investigation, we analyzed gene expression data related to CD from Gene Expression Omnibus (GEO). Immunocytes related to CD were selected as epigenetic data. Utilizing WGCNA and Least Absolute Selection and Shrinkage Operator (LASSO) logistic regression, we aimed to screen out key genes and build a diagnostic model to provide clinical reference for CD diagnosis (Figure S1).
Materials and Methods
Data Download and Differentially Expressed Genes Identification
The GSE186582 dataset (annotation platform: GPL570) containing 196 CD samples and 25 control samples was downloaded from GEO (https://www.ncbi.nlm.nih.gov/)..) Differential analysis was done utilizing “limma” package (Version 3.54.2) with a threshold set at |log FC| > 1 and adjust_P < .05 to identify differentially expressed genes (DEGs). For microarray data, this study utilized the normalizeBetweenArrays function for normalization. Following normalization, linear model fitting and gene detection were performed using the lmFit and eBayes functions.
Kyoto Encyclopedia of Genes and Genomes and Gene Ontology Enrichment Analyses
Gene ontology (GO) enrichment analysis is a method of annotating gene products, providing insights into the biological processes, molecular functions, and cellular locations in which genes are enriched. Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analysis is a systematic analysis of gene products involved in cellular metabolic pathways. We utilized the “clusterProfiler” package (Version 4.6.2) to do GO and KEGG enrichment analyses on DEGs in CD (P < .05), identifying the most significantly enriched biological functions and signaling pathways.
CIBERSORT
Based on the concepts of linear support vector regression, CIBERSORT is a deconvolution technique applied to human immune cell subtype expression matrices. According to transcriptome data, CIBERSORT can estimate the composition and abundance of human immune cells. By analyzing the relative expression levels of genes in individual tissue samples based on gene expression profiles, CIBERSORT determines expression of 22 tumor-infiltrating immune cells in each sample. During this process, a multiple testing correction was performed 1000 times. Monte Carlo sampling is used to derive a P-value as a measure of confidence for each sample’s deconvolution. Samples with immune cell inference from CIBERSORT that have a P < .05 are statistically significant. Subsequently, statistical results from CIBERSORT are used to filter the epigenetic data for WGCNA analysis. Ethics Committee Approval and Informed Consent were not required as this study was based on publicly available data.
Weighted Gene Co-expression Network Analysis
“WGCNA” package (Version 1.72-1) was employed for further analysis of DEGs to build a scale-free co-expression network. A soft threshold of β = 9 was chosen to ensure a scale-free network. Each module’s gene information was retrieved, with a minimum gene count of 10. A hierarchical clustering dendrogram was produced using the blockwiseModules function. Branches of the clustering dendrogram were combined into distinct gene modules using the dynamic tree cut approach. Various colors were used to depict the resultant modules. Additionally, “pheatmap” package (Version 1.72-1) was employed to visualize correlations between epigenetic data and gene modules and to identify significant gene modules.
The Establishment of a Protein–Protein Interaction Network and Selection of Hub Genes
Genes were extracted from significant modules. The Search Tool for the Retrieval of Interacting Genes (STRING; http://string.embl.de/)) was employed to construct a protein–protein interaction (PPI) network. The minimum confidence score was set at 0.9 to obtain important genes. The hub genes were selected by utilizing 5 algorithms (Maximum Neighborhood Component (MNC), Maximal Clique Centrality (MCC), Density of Maximum Neighborhood Component (DMNC), Edge Percolated Component (EPC), and Degree) in the cytoHubba plugin of Cytoscape software.
The Screening and Validation of Diagnostic Biomarkers
The create Data Partition function from the caret package (Version 6.0-94) was used to divide the samples into training and validation sets at random in a 7:3 ratio. A LASSO logistic regression diagnostic model was created with hub genes for the training data. Model diagnostic ability was determined by plotting receiver operating characteristics (ROC) curve and computing area under the curve (AUC) value. Model was further validated using the validation set.
Results
Differentially Expressed Genes and Their Functional Enrichment in Crohn’s Disease
In order to understand the functions of DEGs related to CD and elucidate the pathogenesis of CD, we conducted enrichment analysis. Differential analysis was performed on 196 CD samples and 25 control samples from the GEO dataset, resulting in 651 DEGs (Supplementary Table 1) (Figure 1A). To further figure out the potential functions and pathways related to CD patients, GO and KEGG were carried out on these DEGs. These genes were predominantly enriched in biological functions such as leukocyte chemotaxis, apical part of cell, and extracellular matrix structural constituent, according to GO results (Figure 1B). They were most abundant in pathways including cytokine–cytokine receptor interaction, chemokine signaling pathway, IL-17 signaling pathway, TNF signaling pathway, and viral protein interaction with cytokine and cytokine receptor, as revealed by KEGG results (Figure 1C). The DEGs in CD are primarily involved in inflammation-related pathways.
The Calculation of Immune Cell Composition and Selection of Epigenetic Data in Crohn’s Disease Samples
The aberrant immune system activation is intimately linked to the pathophysiology of CD. Researching the connection between CD and immunity can provide us a theoretical framework for developing treatment approaches that target immune regulation by illuminating the immune system’s function in the initiation and progression of the illness. To this end, CIBERSORT algorithm was employed to test immune cell composition in CD samples. A heatmap was utilized to visualize the results (Figure 2A). As revealed by the proportion statistics, the proportions of macrophages M2, plasma cells, and T cells CD4 memory resting were higher in CD patients (Figure 2B). Differences in each immune cell type between the control and CD groups were analyzed. The results revealed that B cells memory, eosinophils, macrophages M2, mast cells resting, and T cells CD8 were considerably increased in the control group compared to CD patients. Macrophages M1, monocytes, plasma cells, neutrophils, and T cells CD4 memory activated were considerably lower in control group compared to CD patients (Figure 2C). Previous studies have pointed out that an increased abundance of macrophages M2 is linked to the resolution or improvement of chronic inflammation, being capable of alleviating the symptoms of IBD.^11^ Therefore, we selected macrophages M2 as the epigenetic data for further analysis in CD.
The Selection of Crohn’s Disease-Related Key Modules Using Weighted Gene Co-expression Network Analysis
To explore genes associated with CD, we conducted WGCNA analysis to find gene modules connected to the occurrence and course of CD. Further analysis was conducted on the aforementioned DEGs to construct a scale-free co-expression network (scale-free R > 0.8) with a soft thresholding power of 9 (Figure 3A and B). A clustering dendrogram was generated. Using the dynamic tree cut approach, the dendrogram’s branches were combined into 6 unique gene modules, each denoted by a different color (Figure 3C). The correlation coefficients were calculated to detect the correlation between each WGCNA module and immune cells. The ME turquoise module, which exhibited a relatively high correlation (0.46) with M2 macrophages, was selected as the focus module for subsequent analysis (Figure 3D). This module comprised 381 genes (Supplementary Table 2) that were used for further analysis.
The Construction of a Protein–Protein Interaction Network and Selection of Hub Genes
To further screen out the core genes in the modules, we performed PPI analysis. The 381 genes from the ME turquoise module were utilized to establish a PPI network with a confidence score higher than 0.9 (Figure 4A). Network was subsequently subjected to hub gene selection by utilizing cytoHubba plugin in Cytoscape. The hub genes were found by considering the intersection of the genes that were chosen using the following 5 criteria: MNC, MCC, DMNC, EPC, and Degree. A total of 13 hub genes (APOA1, APOA4, APOB, APOC3, CYP2B6, CYP2C8, CYP2C9, CYP2J2, EPHX2, FABP1, HSD3B1, LPL, and PPARGC1A) were obtained (Figure 4B).
The Construction and Validation of Diagnostic Model
The 13 hub genes obtained by screening may be critical in CD progression. To explore diagnostic efficacy of these genes for CD, we performed the following analysis. According to a 7:3 ratio, dataset was grouped into training and validation sets. After LASSO regression analysis on the training set’s 13 hub genes, 8 hub genes—APOA1, APOA4, CYP2C8, CYP2C9, CYP2J2, EPHX2, HSD3B1, and LPL—were identified for use in building the diagnostic model (Figure 5A and B). The model formula is as follows. Coefficients for model genes were generated by LASSO regression analysis (Supplementary Table 3).
Receiver operating characteristics curve analysis and calculation of the AUC were conducted in the training set (Figure 5C), demonstrating a diagnostic efficiency of 0.94. Model diagnostic ability was validated in validation set, yielding a diagnostic efficiency of 0.941 (Figure 5D). These results indicated that the diagnostic model containing immune-related key genes exhibits good diagnostic capability for CD patients. Furthermore, the 8 hub gene levels were analyzed in both the control group and CD samples, revealing considerably elevated expression in the control group compared to CD samples (Figure 5E), further emphasizing the significant value of the selected hub genes for CD diagnosis.
Discussion
Ngollo et al^12^ used dataset GSE186582 to identify genes associated with increased risk of postoperative recurrence in CD. In this study, the gene expression of patients in this dataset was used to identify the DEGs related to CD, and WGCNA and other methods were applied to screen 8 hub genes for constructing a CD diagnostic model, aiming to provide a new idea for the clinical diagnosis of CD.
We herein identified 651 DEGs associated with CD from the GEO dataset. The DEGs in CD are primarily involved in biological functions such as leukocyte chemotaxis. Chemokines and cytokines take crucial parts in regulating mucosal inflammation by facilitating migration of leukocytes to sites of inflammation, eventually resulting in tissue damage and destruction.^13^ Furthermore, CD is mainly involved in inflammatory response pathways such as TNF signaling and IL-17 signaling. Since inflammation is greatly implicated in the pathogenesis of CD, the inhibition of it can alleviate symptoms and prevent further tissue damage.^14^ Research by Louis et al^15^ has shown a significant increase in TNF-α in CD patients. Additionally, inhibition of TNF-α can effectively alleviate intestinal inflammation.^16^ Anti-TNF agents are among the first biologics approved for treating IBD and have been widely used for remission and treatment of CD patients.^17^ Anti-TNF treatment appeared to significantly ameliorate the pro-inflammatory phenotypes of macrophages and adipose-derived stem cells (ADSCs) in CD-switched ADSCs.^18^ IL-17 is a cytokine with strong pro-inflammatory activity. Studies have indicated that IL-17 mainly accumulates in the submucosal and muscularis propria layers of CD patients, possibly being associated with changes in intestinal mucosal immunity and inflammatory responses.^19,20^ Ashton et al^21^ showed that IL17 signaling was enriched in monocytes and epithelial cells. We found that the occurrence of CD is primarily associated with inflammation. IL-17 signaling and TNF signaling may serve as effective targets for CD treatment.
We identified 8 hub genes (APOA1, APOA4, CYP2C8, CYP2C9, CYP2J2, EPHX2, HSD3B1, and LPL) using methods such as WGCNA, PPI network, and LASSO logistic regression for constructing a diagnostic model. APOA1 and APOA4 are apolipoproteins of the apolipoprotein A family that can promote lipid transport and metabolism,^22^ exerting anti-inflammatory effects. CYP2C8 and CYP2C9 are drug-metabolizing enzymes involved in inactivation of various drugs,^23^ serving as functional enzymes to play a role in the independent regulation of intestinal function.^24^ CYP2J2 is majorly expressed in the intestine and can regulate intestinal motility and/or intestinal fluid/electrolyte transport.^25^ Research has found that the absence of CYP2J2 in CD macrophages may contribute to the development of CD.^26^ Reisdorf et al^27^ revealed that repression of EPHX2 significantly reduces IL-1β, a pro-inflammatory factor, in CD. HSD3B1 is a key member of the steroid hormone family.^28^ He et al^29^ constructed a CD prognosis model based on 4 genes, including HSD3B1, achieving good prognostic performance. Bo et al^30^ demonstrated that LPL is significantly upregulated in CD and may take a pivotal part in the pathogenesis of CD. Based on the above, the hub genes used in this study for constructing the diagnostic model are essential for the pathophysiology and development of CD, contributing to the clinical diagnosis of CD. Our results also demonstrated that the diagnostic model established with the selected hub genes has good diagnostic performance.
Although this study developed a new CD diagnostic model based on the GEO database, there are limitations. Our sample size was small and there was a mismatch between CD patients and controls. In the future, we will conduct external studies and validation in other independent cohorts of CD patients to evaluate generalization and robustness of this new CD diagnostic model. Secondly, we will construct an animal model and perform a large number of cellular experiments (such as the function of genes in the pathway) to further investigate the specific role of Hub genes in CD. In order to provide more in-depth and comprehensive support for future CD diagnosis research, a comprehensive analysis was conducted by combining experimental data and clinical data.
Supplementary Materials
Supplementary Material
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Seyedian SS Nokhostin F Malamir MD . A review of the diagnosis, prevention, and treatment methods of inflammatory bowel disease. J Med Life. 2019;12(2):113 122. (10.25122/jml-2018-0075)31406511 PMC 6685307 · doi ↗ · pubmed ↗
- 2Veauthier B Hornecker JR . Crohn’s disease: diagnosis and management. Am Fam Phys. 2018;98(11):661 669.30485038 · pubmed ↗
- 3Saez A Herrero-Fernandez B Gomez-Bris R Sánchez-Martinez H Gonzalez-Granado JM . Pathophysiology of inflammatory bowel disease: innate immune system. Int J Mol Sci. 2023;24(2). (10.3390/ijms 24021526)PMC 986349036675038 · doi ↗ · pubmed ↗
- 4Panés J Bouzas R Chaparro M , et al. Systematic review: the use of ultrasonography, computed tomography and magnetic resonance imaging for the diagnosis, assessment of activity and abdominal complications of Crohn’s disease. Aliment Pharmacol Ther. 2011;34(2):125 145. (10.1111/j.1365-2036.2011.04710.x)21615440 · doi ↗ · pubmed ↗
- 5Kedia S Das P Madhusudhan KS , et al. Differentiating Crohn’s disease from intestinal tuberculosis. World J Gastroenterol. 2019;25(4):418 432. (10.3748/wjg.v 25.i 4.418)30700939 PMC 6350172 · doi ↗ · pubmed ↗
- 6Horn MP Peter AM Righini Grunder F , et al. PR 3-ANCA and panel diagnostics in pediatric inflammatory bowel disease to distinguish ulcerative colitis from Crohn’s disease. PLOS ONE. 2018;13(12):e 0208974. (10.1371/journal.pone.0208974)30557305 PMC 6296712 · doi ↗ · pubmed ↗
- 7Jung Y Hwangbo Y Yoon SM , et al. Predictive factors for differentiating between Crohn’s disease and intestinal tuberculosis in koreans. Am J Gastroenterol. 2016;111(8):1156 1164. (10.1038/ajg.2016.212)27296940 · doi ↗ · pubmed ↗
- 8Langfelder P Horvath S . WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics. 2008;9:559. (10.1186/1471-2105-9-559)19114008 PMC 2631488 · doi ↗ · pubmed ↗
