Construction and validation of a novel diagnostic model for esophageal squamous cell carcinoma: an integrated analysis of multi-omics data
Yiyuan Cui, Sicong Li, Zhibin Wu, Yue Jin, Jingjie Yu, Yufan Chen, Jiayang Chen, Jinyuan Chang, Yijing Yan, Xinyu Li, Nuo Li, Shengjuan Hu, Chenxin Zhu, Li Feng

TL;DR
Researchers developed a new diagnostic model for esophageal squamous cell carcinoma using multi-omics data and identified five key genes for early detection.
Contribution
A novel diagnostic model for ESCC using integrated multi-omics data and identification of five robust biomarkers.
Findings
A diagnostic model with five genes achieved high accuracy (AUCs of 0.99 and 0.97) in training and validation sets.
SORBS2 was identified as a tumor-suppressor gene predominantly expressed in myofibroblasts in ESCC tissue.
Single-cell RNA analysis revealed myofibroblasts as the main source of SORBS2 expression in tumor tissue.
Abstract
Esophageal squamous cell carcinoma (ESCC), highly prevalent in China, has a limited number of ideal genes for early diagnosis, highlighting the need for the development of novel biomarkers to improve detection capabilities. The purpose of this study is to develop and validate a new genetic diagnostic model for ESCC. Publicly available bulk RNA-seq datasets (GSE23400, GSE17351, GSE20347) were merged to identify differentially expressed genes (DEGs) between ESCC and adjacent normal tissues. Weighted gene co-expression network analysis (WGCNA) and protein-protein interaction (PPI) were performed to identify hub genes associated with ESCC. We identified the intersecting genes between the DEGs and those within the ESCC-related module identified by WGCNA. We subsequently refined these intersecting genes via LASSO regression and then constructed a diagnostic model for ESCC using multivariate…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
Figure 6
Figure 7
Figure 8
Figure 9
Figure 10
Figure 11
Figure 12| Gene | B | Wald | OR (95%CI) | P value |
|---|---|---|---|---|
| (Intercept) | -24.46 | 5.83 | 0 (0~0.002) | 0.01 |
|
| 2.06 | 4.87 | 7.82 (1.44~56.07) | 0.03 |
|
| 0.74 | 0.46 | 2.095 (0.28~20.95) | 0.50 |
|
| 2.38 | 9.28 | 10.76 (2.45~54.29) | <0.01 |
|
| 0.57 | 0.72 | 1.76 (0.47~6.76) | 0.40 |
|
| -2.10 | 6.34 | 0.12 (0.02~0.61) | 0.01 |
| Metric | Training set (N = 252) | TCGA External validation set (N = 91) | TCGA External validation set (N = 91) | |
|---|---|---|---|---|
| Asian (N = 25) | White (N = 57) | |||
| AUC | 0.99 (0.98-1.00) | 0.98 (0.94-1.00) | 0.95 (0.85–1.00) | 0.98 (0.95–1.00) |
| Accuracy | 0.95 (0.95-0.95) | 0.87 (0.87-0.87) | 0.84 (0.83–0.85) | 0.93 (0.93–0.93) |
| Sensitivity | 0.94 (0.90-0.98) | 0.85 (0.77-0.93) | 0.81 (0.64–0.98) | 0.92 (0.85–1.00) |
| Specificity | 0.95 (0.92-0.99) | 1.00 (1.00-1.00) | 1.00 (1.00–1.00) | 1.00 (1.00–1.00) |
- —National Key Research and Development Program of China10.13039/501100012166
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsEsophageal Cancer Research and Treatment · Ferroptosis and cancer prognosis · Single-cell and spatial transcriptomics
Introduction
1
Esophageal cancer is a malignant tumor that poses a serious threat to human health, ranking seventh globally in incidence and sixth in mortality (1). According to global cancer statistics for 2020, China accounted for 53.70% of new esophageal cancer cases and 55.35% of deaths worldwide (2). Notably, esophageal squamous cell carcinoma (ESCC) represents over 95% of cases in China, in stark contrast to Western countries where esophageal adenocarcinoma (EAC) predominates. Unfortunately, ESCC patients often exhibit no obvious symptoms in the early stages, and most are diagnosed only after experiencing dysphagia or metastatic symptoms (3). The majority of diagnosed ESCC patients are in the middle or late stages of the disease, with an overall five-year survival rate of only 6%–15% (4). If detected early and treated promptly, patients’ survival rates would significantly improve (2).
At present, reliable early-stage tumor markers and diagnostic modalities for ESCC remain markedly insufficient (5). Conventional tumor markers—such as squamous cell carcinoma antigen, carcinoembryonic antigen, and cytokeratin 19 fragment—are insufficiently specific or sensitive to serve as standalone diagnostic biomarkers for ESCC (6, 7). MicroRNAs show potential for early ESCC detection, but existing diagnostic models remain limited. For instance, miR-146a exhibits a specificity of only 68.6% and a sensitivity of 85.7% in discriminating ESCC (8), whereas serum miR-1246 offers 71.3% sensitivity and 73.9% specificity (9). Therefore, it is particularly important to develop new diagnostic methods to achieve early detection of ESCC.
Most tumor diagnostic models are constructed primarily with data from the TCGA database, which is overwhelmingly composed of samples from European descent (10). However, the high-risk population for ESCC predominantly resides in Asia, particularly China and Japan (11). Only relying on TCGA data to develop an ESCC diagnostic model introduces notable limitations. The GSE23400, GSE17351, and GSE20347 datasets predominantly comprise samples from Chinese and Japanese populations (12–14). Constructing ESCC diagnostic models using these datasets and identifying tumor suppressor genes could offer valuable insights for ESCC prevention and therapeutic strategies.
In this study, we screened five genes significantly associated with ESCC based on the GEO database, constructed a diagnostic model and nomogram, supported by experimental evidence showing differential expression patterns of these genes between tumor and normal tissues in animal models, complemented by human ESCC tissue immunofluorescence analysis of protective factors. In addition, by integrating single-cell RNA analysis tools, we delved deeply into the biological processes that genes might be involved in, providing potential insights for the clinical treatment of ESCC patients.
Materials and Methods
2
Study design
2.1
The workflow of our study was shown in Figure 1. To obtain the ESCC sample, we retrieved and integrated three GEO datasets (GSE23400, GSE17351, and GSE20347) after batch effect removal. We performed DEG to identify differentially expressed genes between ESCC and adjacent normal tissues. WGCNA was employed to delineate gene modules most strongly associated with ESCC, and the intersection of WGCNA and DEGs refined the ESCC-related gene set. To identify the key pathways and biological processes, we conducted the GSEA for the merged dataset, DEGs and WGCNA. Potential biomarkers were identified using LASSO regression, followed by the construction of an ESCC diagnostic model via logistic regression and external validation in the TCGA ESCC dataset. To characterize the immune microenvironment of the tumor versus adjacent normal tissues, the immune infiltration analysis was conducted. The influence of the protective gene SORBS2 on the immune microenvironment was specifically examined, and GSEA further explored its functional roles. To validate the differential expression of candidate biomarkers between normal and tumor tissues, we established a Mice ESCC Model and carried out the WB experiment. Single-cell RNA analysis was performed to explore the cell type of tumor tissues and the potential functions of the protective gene. Finally, immunofluorescence staining of human ESCC specimens was carried out to confirm SORBS2 protein expression and localization.
The workflow of the study. WGCNA, weighted gene co-expression network analysis; DEGs, differentially expressed genes; GSEA, Gene Set Enrichment Analysis; GO, Gene Ontology; KEGG, Kyoto Encyclopedia of Genes and Genomes; ESCC, esophageal squamous cell carcinoma; LASSO, the least absolute shrinkage and selection operator; BID, BH3-interacting domain death agonist; CBX3, chromobox protein homolog 3; ECT2, epithelial cell transforming sequence 2; KIF14, kinesin family member 14; SORBS2, sorbin and SH3 domain-containing protein 2; TCGA, The Cancer Genome Atlas.
Data acquisition and pre-processing
2.2
Bulk transcriptome datasets
2.2.1
All datasets in this study were downloaded from the GEO database (https://www.ncbi.nlm.nih.gov/geo/). The GSE23400 dataset includes a total of 53 adjacent normal esophageal tissues and 53 ESCC tissues. The GSE17351 dataset comprises a total of 5 normal and 5 ESCC tissues. In the GSE20347 dataset, there are a total of 17 adjacent normal tissues and 17 ESCC tissues.
Elimination of batch effects and data merging
2.2.2
The Robust Multiarray Average (RMA) technique was applied to correct background noise, normalize signals, and calculate expression values (15, 16). The inSilicoMerging package (v1.34.0) within the R (v4.3.1) software environment was utilized to integrate the three datasets above (17). Following this, the strategy was implemented to remove batch effects, leading to a consistent expression matrix for both normal and esophageal tissues (18).
Identification of ESCC-relevant genes
2.3
Differentially expressed genes
2.3.1
We evaluated the distribution of gene expression levels in different samples in GSE23400, GSE17351 and GSE20347. Principal component analysis (PCA) and sample hierarchical clustering tree were applied to explore the intergroup difference and intragroup sample duplications. Abnormal samples were removed from subsequent analysis. The limma package (v3.56.0) was used for screening the DEGs with the cutoff criteria of P < 0.05 and |log2FC|>0.585, which was further visualized by pheatmap (v1.0.12) and ggplot2 (v3.5.1) packages in the heatmap and volcano map.
Gene ontology and kyoto encyclopedia of genes and genomes enrichment analysis on DEGs
2.3.2
Gene Ontology (GO) analysis was performed utilizing the clusterProfiler R package (v4.9.0). For the functional analysis, parameters were configured with a significance level of P < 0.05, leading to the identification of molecular functions (MFs), biological processes (BPs), and cellular components (CCs). Furthermore, the clusterProfiler package (v4.9.0) was employed to set a P < 0.05 for functional analysis. This methodology enabled the identification of enriched signaling pathways associated with genes that may serve as potential therapeutic targets, as revealed through KEGG functional enrichment analysis (19).
Weighted gene co-expression analysis
2.3.3
The WGCNA R package (v1.72-2) was utilized to create a co-expression network encompassing all genes from both ESCC and normal tissue samples. For further analysis, genes with variances of up to 50% were selected. The co-expression matrix was generated by determining the Pearson correlation coefficient between genes. Subsequently, an adjacency matrix was formulated using the equation a_mn_ = |c_mn_|^β^, where a_mn_ denotes the adjacency between genes m and n, c_mn_ is the Pearson correlation coefficient, and β represents the soft threshold power. This weighted adjacency matrix was then converted into a topological overlap measure matrix to assess the genes’ network connectivity. The clustering dendrogram of the matrix was generated through the application of average linkage hierarchical clustering. A minimum gene module size of 30 was set to identify the appropriate module, and a threshold of 0.25 was established for merging similar modules.
Gene set enrichment analysis on protective gene against the ESCC
2.3.4
In this research, GSEA was executed using the Molecular Signatures Database (MSigDB) Collection (c2.all.v7.0.entrez.gmt) via the clusterProfiler package (v4.9.0). The aim was to identify significant differences in pathways between ESCC or normal tissues. The distinction between ESCC or normal tissues served as the phenotypic label for the analysis, with the number of permutations configured to 1000. The remaining parameters were left at their default settings. Pathways with an adjusted P < 0.05 and an FDR < 0.25 were deemed to be significantly enriched.
Immune cell infiltration analysis for genes protective against ESCC
2.3.5
CIBERSORT is a computational tool capable of inferring gene expression profiles and calculating the relative abundance of various cell types within a heterogeneous cell population based on gene expression data (https://cibersortx.stanford.edu). We utilized CIBERSORT to assess the relative proportions of 22 distinct immune cell types. Furthermore, we applied the two-sample t-test to explore differences in immune cell infiltration between the normal and ESCC groups, and between the low-expression SORBS2 group and the high-expression SORBS2 group.
Construction and performance assessment of a novel diagnostic model
2.4
Sample size calculation
2.4.1
To achieve a power of 80% with a two-sided α of 0.05 and an assumed odds ratio (OR) of 2.1, we calculated the required sample size using the powerMediation package (v0.3.4) in R (v4.5.0). Assuming a 1:1 ratio between the ESCC group and the normal control group, the total sample size required was calculated to be 245 individuals.
Construction and performance assessment of a novel diagnostic model
2.4.2
We developed a new diagnostic model using gene expression data from 126 ESCC tissue samples and their adjacent normal tissues in the merged GEO cohort, comprising a total of 252 samples across both groups. Initially, we employed the LASSO technique from the glmnet package (v4.1-9) to select genes closely associated with ESCC to prevent model overfitting, with 50-fold cross-validation. Subsequently, a diagnostic model was constructed using multivariate logistic regression based on the biomarkers identified by LASSO, and it was graphically represented as a nomogram using the rms package (v6.3.0). To quantify over-fitting, we performed 1,000 bootstrap resamples within the training set (n = 173) and calculated optimism-corrected R². Additionally, we extracted all of the ESCC samples, a total of 91 samples, from the TCGA database to serve as an external validation set. Ethnicity information for TCGA-ESCC samples was extracted from clinical metadata, with self-reported race categorized as asian, white, black or african American, or unknown. We assessed the model’s predictive performance on the training and external validation sets through ROC curve analysis, with evaluation metrics including AUC value, precision, specificity, sensitivity, positive predictive value (PPV), and negative predictive value (NPV). We also used calibration curve analysis to assess the consistency between predicted probabilities and the actual frequency of ESCC occurrence, and employed decision curve analysis (DCA) to evaluate the clinical utility of the nomogram.
Validation of the differential expression of marker genes in mouse ESCC tissues
2.5
ESCC cell culture
2.5.1
The mouse ESCC cell line AKR were purchased from Cellverse Co., Ltd. Cells were cultured in DMEM supplemented with 10% fetal bovine serum and 1% penicillin–streptomycin until a confluent monolayer formed. When cells were grown at a density suitable for passage, they were trypsinized and passaged at a 1:3 ratio, and incubated under conditions of 37 °C and 5% CO_2_. After centrifugation, 4 × 10^6 AKR cells were resuspended in a mixture of PBS and Matrigel (4:1 ratio), where PBS was purchased from Beyotime Biotechnology, China, and Matrigel from Corning, USA.
Construction of an ESCC mouse model
2.5.2
All animal experiments conducted in this study received approval from the Animal Experiment Ethics Committee of Cancer Hospital, Chinese Academy of Medical Sciences (Approval No. NCC2024A559). Eight 5-week-old male C57BL/6 mice (Beijing HFK Bioscience Co., Ltd) were acclimated for 7 days and then randomly divided into two groups of four: the ESCC group and the normal group. In the ESCC group, 1×10^6 AKR cells were injected subcutaneously into the right axillary region of each mouse to establish a mouse model of ESCC. Tumor size was measured from the third day. Tumor volume was calculated using the formula: V = 1/2 × (length × width^2) and tumor weight was recorded during necropsy. On the 10th day post-inoculation, mice were euthanized in a CO_2_ chamber. Tumor tissues were immediately collected from ESCC mice, and normal esophageal tissue was harvested from controls.
Western blotting validation of the 5 biomarkers expression
2.6
Flash-frozen ESCC tumors or adjacent normal epithelium (20 mg, n = 4 per group) were minced on ice and lysed in 150 µL ice-cold RIPA buffer (CST #9806) containing protease and phosphatase inhibitors plus 3 mm steel beads. Tissue was homogenized at 60 Hz for 5 cycles and centrifuged at 12–000 g, 4 °C for 20 min. Supernatants were quantified with a BCA kit (Sigma-Aldrich, V900933) and adjusted to 2 µg/µL with lysis buffer. Equal protein (20 µg) was mixed with 5 × Laemmli buffer, denatured at 100 °C for 15 min, loaded onto 10% SDS-PAGE gels (with 5% stacking gel), and electrophoresed at 120 V until bromophenol blue reached the gel bottom. Proteins were transferred to 0.22 µm PVDF membranes via wet transfer, blocked with 5% skim milk in TBST for 1 h at room temperature (RT), and incubated overnight at 4 °C with primary antibodies: anti-BID (1:1000, ab317809, Abcam), anti-CBX3 (1:1000, ab213167, Abcam), anti-ECT2 (1:2000, ab236502, Abcam), anti-KIF14 (1:1000, PA5-68145, ThermoFisher), anti-SORBS2 (1:500, PA5-67869, ThermoFisher). After three 5-min TBST washes, membranes were incubated with HRP-conjugated goat anti-rabbit IgG (1:5000, Zhengneng Biotech, 511103) for 1 h at RT. Signals were developed with ECL (Millipore, WBKLS0500) and imaged using a Liuyi Instruments chemiluminescence imager (WD-9423C). Band densities were quantified in ImageJ, normalized to β-actin, and plotted as mean ± SEM using GraphPad Prism.
Single-cell RNA sequencing data acquisition and analysis
2.7
Public scRNA - seq datasets (GSE196756) were retrieved from GEO and processed using Seurat (v4.0.4). Raw reads were aligned to GRCh38 using Cell Ranger. Cells under the following conditions were not included in this study (1): Cells outside the 1,000–35,000 detected-gene counts were discarded (2); Cells with <200 or >7,500 detected RNA counts were removed (3); Any cell whose mitochondrial transcripts exceeded 10% of its total gene counts was filtered out. Batch effects were corrected using Harmony. Dimensionality reduction was performed via PCA and UMAP. Clustering used Louvain algorithm (resolution = 0.2). Cell types were annotated by canonical markers. Differential expression analysis (Wilcoxon test, p < 0.05, |log2FC|>0.25) identified DEGs between clusters. Cell - cell communication was predicted by cellchat.
Immunofluorescence assay
2.8
The acquisition of all human tissues in this study was approved by the Ethics Committee of Cancer Hospital, Chinese Academy of Medical Sciences (Approval No. 23/510-4253). Formalin-fixed paraffin-embedded (FFPE) sections (5 µm) were dewaxed in xylene (3 × 5 min), rehydrated through graded ethanol (100%→95%→80%→70%; 3 min each), and microwave-heated in 10 mM citrate buffer (pH 6.0, 95 °C, 15 min). After cooling to RT, sections were blocked with 5% bovine serum albumin (BSA/PBS) for 30 min at 25 °C. Co-incubation with primary antibodies—transgelin/SM22 Polyclonal antibody (1:200, 10493-1-AP, proteintech) and SORBS2 Polyclonal antibody (1:500, 24643-1-AP, proteintech)—proceeded overnight at 4 °C in a humidified chamber. Following triple PBS washes (5 min each), sections were incubated with Alexa Fluor 488-conjugated goat anti-mouse IgG (1:500, A-11001, Invitrogen) and Alexa Fluor 594-conjugated goat anti-rabbit IgG (1:500, A-11012, Invitrogen) for 1.5 h at RT in the dark. After extensive PBS washing, slides were mounted with DAPI-containing anti-fade medium (ProLong™ Gold, P36930, Thermo Fisher), cured overnight at 4 °C in darkness, and imaged using a Zeiss LSM 900 confocal microscope with Plan-Apochromat 63×/1.40 Oil objective. Sequential scanning (405/488/594 nm lasers) prevented spectral crosstalk. Quantitative analysis of fluorescence intensity was performed on ≥5 randomly selected fields per sample using ImageJ (v1.53k, NIH) with region-of-interest (ROI) measurements after background subtraction.
Statistical analysis
2.9
All bioinformatics analyses were performed in R v4.3.1. Key packages included: limma v3.56.0 for differential expression analysis; WGCNA v1.72–2 for co-expression network construction; ggplot2 v3.5.1 for data visualization; inSilicoMerging v1.34.0 for batch effect correction. Experimental data were presented as mean ± standard deviation (SD). For comparing data between two groups, a t-test was applied, while for comparisons among multiple groups, one-way analysis of variance (ANOVA) was used. Statistical analysis and graphing were conducted using GraphPad Prism 10 software. A P-value < 0.05 indicates a statistically significant difference.
Results
3
DEGs revealed 113 significantly up-regulated and 173 significantly down-regulated genes
3.1
As shown in Figures 2A, B, the sample distributions across datasets exhibited notable differences before batch effect removal, as evidenced by box plots, which underscored the presence of batch effects. Post-removal, the data distributions across datasets became more consistent. Through fold change (FC) and P-value filtering (|log2FC|>0.585 and P<0.05), a total of 113 genes were found to be significantly upregulated and 173 genes significantly downregulated in squamous carcinoma tissues (Figures 2C, D).
Integrative analysis of DEGs. (A, B) The box plots of the merged database. (C) Volcano map of DEGs. (D) Heatmap of DEGs.
GSEA for the DEGs and merged dataset revealed key pathways and biological processes
3.2
We conducted GO and KEGG Enrichment Analysis on the genes of DEGs. In terms of BP, they were mainly enriched in the cell differentiation, cell adhesion, and positive regulation of G2/M transition of mitotic cell cycle (Figure 3A). In terms of MF, they were mainly enriched in the cytoskeletal protein binding, ubiquitin-like protein ligase binding and growth factor binding (Figure 3B). In terms of CC, they were mainly enriched in the extracellular matrix component, extracellular space, and extracellular matrix component (Figure 3C). In terms of KEGG, they were mainly enriched in the Platelet activation, Cysteine and methionine metabolism, and TGF-beta signaling pathway (Figure 3D).
Gene set enrichment analysis (GSEA) for the DEGs and merged dataset. (A) Bubble plot for BP on the genes of DEGs. (B) Bubble plot for MF on the genes of DEGs. (C) Bubble plot for CC on the genes of DEGs. (D) Bubble plot for KEGG enrichment analysis on the genes of DEGs. (E) Enrichment plot for BP on the merged dataset. (F) Enrichment plot for MF on the merged dataset. (G) Enrichment plot for CC on the merged dataset. (H) Enrichment plot for KEGG on the merged dataset.
Then, the merged dataset was subjected to GSEA (v4.1.0). In terms of BP, the signal transduction by p53 class mediator, DNA-dependent DNA replication, and mRNA transport were all markedly up-regulated in ESCC (Figure 3E). In terms of MF, the snoRNA binding, dopamine receptor binding, and threonine-type peptidase activity were all markedly up-regulated in ESCC (Figure 3F). In terms of CC, the host cellular component, condensed chromosome, and nuclear pore were all markedly up-regulated in ESCC (Figure 3G). In terms of KEGG, the DNA replication, cell cycle, and spliceosome were all markedly up-regulated in ESCC (Figure 3H).
The WGCNA analysis revealed a blue module most related to ESCC, comprising 13 genes
3.3
The WGCNA package was utilized to establish gene co-expression networks. A scatter plot depicting the fit coefficients for the scale-free topology model was generated (Figure 4A), with the red pentagram marking the initial point surpassing the scale-free topology model fit index R²>0.85, associated with a soft threshold β of 4, and an R² value of 0.88. Following the transformation of the correlation matrix into an adjacency matrix using this soft threshold β, a topological overlap matrix and a gene hierarchical clustering dendrogram were assembled. The distances among each module were depicted using a co-expression cluster dendrogram (Figure 4B), leading to the identification of 5 gene modules, which are illustrated as clustering dendrograms of genes highlighting the correlations among different models (Figure 4C). As depicted in Figure 4D, the blue module demonstrated the highest correlation with esophageal squamous cell carcinoma (correlation index: 0.82, P<0.001). This gene model encompasses 13 genes, including ANP32E, ATAD2, BID, CBX3, CCNB1, DTL, ECT2, GMPS, KIF14, MCM10, NDC1, NETO2, and SORBS2.
The blue module exhibited the strongest correlation with ESCC. (A) The scale-free fit index for soft-thresholding powers. (B) Coexpression cluster dendrogram. (C) Clustering dendrograms of genes. (D) A heatmap showing the correlation between the gene module and ESCC.
GSEA for the genes in the blue module which were DEGs
3.4
We intersected the genes in the blue module identified by WGCNA with the related genes in the DEGs and obtained 13 genes (Figure 5A). Based on the 13 genes that overlapped between the identified modules and DEGs, we performed PPI network analysis to assess their interactions (Figure 5B).
Integrative analysis of the blue module. (A) Venn diagram plotting the intersection of WGCNA with DEG. (B) The protein–protein interaction (PPI) plot of the genes in the blue module. (C) Bubble plot for BP on the genes in the Blue Module. (D) Bubble plot for MF on the genes in the Blue Module. (E) Bubble plot for CC on the genes in the Blue Module. (F) Bubble plot for CC on the genes in the Blue Module.
Subsequently, we conducted GSEA on the genes of the blue module identified by WGCNA. In terms of BP, they were mainly enriched in the cell cycle, G1/S transition of mitotic cell cycle, and mitotic cell cycle (Figure 5E). In terms of MF, they were mainly enriched in the enzyme binding, cytoskeletal adaptor activity and death receptor bingding (Figure 5F). In terms of CC, they were mainly enriched in the cytoskeleton, nuclear membrane, and non-membrane-bounded organelle (Figure 5G). In terms of KEGG, they were mainly enriched in the p53 signaling pathway, cell cycle, and sphingolipid signaling pathway (Figure 5H).
Construction of ESCC diagnostic model and external validation
3.5
The 13 genes in the blue blocks were selected as potential predictors for LASSO regression analysis, and the LASSO coefficient path plot was shown in Figure 6A. There are a total of 5 non-zero coefficient genes, including BID (coefficient 0.37), CBX3 (coefficient 0.03), ECT2 (coefficient 0.42), KIF14 (coefficient 0.38), and SORBS2 (coefficient -0.12). The coefficients in the LASSO regression model were visualized in Figure 6B. The ESCC diagnostic model was developed using these five genes and analyzed through multivariate logistic regression. In the context of multivariate logistic regression analysis, SORBS2 emerged as a statistically significant predictor of ESCC (Table 1). Our predictive model was visualized as a user-friendly nomogram (Figure 6C).
Diagnostic model construction and evaluation. (A, B) Construction of the diagnostic model based on the LASSO algorithm. (C) Nomogram based on multivariate logistic regression analysis. (D, E) ROC curves for training and external validation sets. (F) Calibration curve of the training set. (G) Calibration curve of the external validation set (TCGA). (H, I) Decision Curve Analysis (DCA) plots for training and external validation sets.
We used TCGA as an external validation to evaluate the performance of the diagnostic model. The AUC values for the training and test sets were 0.99 (0.98, 1.00) and 0.98 (0.94, 1.00) (Figures 6D, E), and the performance of the diagnostic model is presented in Table 2, indicating that the model had good accuracy. The calibration curve suggests acceptable model calibration, with good agreement between actual frequencies and predicted probabilities (Figures 6F, G). The R²in the training set was 0.724; the bootstrap-internal validation yielded an R² of 0.6663, while the external validation set (TCGA) achieved an R²of 0.702. Ultimately, we generated DCA curves to demonstrate the practical utility of our model in a clinical setting (Figures 6H, I). Stratified analysis by ancestry indicated consistent discrimination in Asian (n=25; AUC = 0.95, R² = 0.85, Brier = 0.04) and stable performance in the White subgroup (n=57; AUC = 0.98, R² = 0.84, Brier = 0.04). See Table 2 for details. Due to the small number of Black or African-American cases (n = 2) and the potentially heterogeneous composition of the Unknown/Mixed group (n = 7), performance metrics for these subgroups are not presented in Table 2 to avoid unreliable estimates and potential bias.
Experimental verification of signature gene expression in mice ESCC model
3.6
To confirm the expression differences between ESCC and normal tissues of five genes, we established a Mice ESCC Model (Figure 7A). We found that the ESCC model mice didn’t have a significant increase in body weight compared to normal mice (Figure 7B). The tumor growth trend of the ESCC model is shown in Figure 7C. Comparison of normal esophageal tissue and ESCC tissue is shown in Figure 7D. The HE staining revealed disrupted tissue architecture in the ESCC mouse model, characterized by the size and shape of the cells are extremely irregular, and the cell arrangement is disordered compared to normal mice (Figure 7E). Subsequently, we performed Western blotting to evaluate the protein levels, which showed a significant increase in BID, CBX3, ECT2, KIF14 protein expression, and a decrease in SORBS2 protein expression in ESCC tissues compared to the adjacent normal esophageal tissue (Figures 7F, G).
*Experimental verification of signature gene expression in mouse ESCC model. (A) The modeling process of Mice ESCC Model. (B) The body weight of mouse ESCC model and mouse normal model. (C) Tumor size of ESCC group, (D) Comparison of normal esophageal tissue and ESCC tissue. (E) The HE assays of mouse ESCC model and normal model. Magnification: 100x. Scale bar: 200μm (F) Western Blot of BID, CBX3, ECT2, KIF14, SORBS2 expression in ESCC model and normal model. (G) The gray scale ratio of the five genes expressed in ESCC model and normal model. ***P < 0.0001.
Immune cell infiltration analysis revealed that impaired anti-tumor immunity was correlated with SORBS2 low expression in the immune microenvironment of ESCC
3.7
We analyzed the IME of ESCC using the CIBERSORT algorithm, which profiles 22 distinct immune cell types. The results showed that out of these 22 cell types, a total of 11 microenvironment cells exhibited significant differences between the normal and ESCC groups (Figure 8A). Notably, four lymphocyte populations—memory B cells, CD8+ T cells, follicular helper T cells, and resting mast cells—demonstrated significantly attenuated infiltration within IME relative to normal counterparts. This immunophenotypic landscape indicated profound impairment of anti-tumor immunity. In addition, we analyzed the correlation between SORBS2 expression and immune cell infiltration in the IME of ESCC, revealing that memory B cells, CD8+ T cells, follicular helper T cells, and resting mast cells attenuated infiltration within the low-expression SORBS2 group relative to the high-expression SORBS2 group (Figure 8B). Therefore, low expression of SORBS2 in the IME is associated with the impaired anti-tumor immune function of ESCC.
Immune infiltration analysis. (A) Expression of different immune cells in ESCC and normal groups. (B) Expression of different immune cells in high-expression and low-expression SORBS2 groups.
GSEA revealed the function of SORBS2 in immune response and cellular signaling pathways
3.8
To explore the functional significance of SORBS2, we conducted a GSEA. GO enrichment analysis revealing that BP of SORBS2 was predominantly involved in humoral immune response mediated by circulating immunoglobulin, immune response−regulating cell surface receptor signaling pathway involved in phagocytosis, and immune response−activating cell surface receptor signaling pathway (Figure 9A). In terms of CC, SORBS2 was enriched in the ribonucleoprotein complex, T cell receptor complex, and receptor complex (Figure 9B), while at the MF, SORBS2 was associated with structural constituent of ribosome, immune receptor activity and signaling receptor activity (Figure 9C). Utilizing KEGG enrichment analysis, SORBS2 demonstrated significant relationships with cell adhesion molecules, spliceosome and neuroactive ligand−receptor interaction pathways (Figure 9D).
GSEA for SORBS2. (A) BP. (B) CC. (C) MF. (D) KEGG.
Single-cell analysis revealed predominant SORBS2 expression in myofibroblast and vascular endothelial cell clusters
3.9
Six samples from the GSE196756 database underwent single-cell sequencing analysis, revealing a positive correlation between nCount RNA and nFeature RNA (Figure 10A). The variance plot identified 250 genes across all cells, with 2000 highly variable genes marked in red and the top 10 genes labeled (Figure 10B). Figures 10C, D illustrated that the cells were classified into 10 clusters. Figure 10E demonstrated that ten identified marker genes were used to show in different clusters, while the heatmap displays the top 5 marker genes for each cell type (Figure 10F). Based on key marker gene expression, these clusters were classified as B cells, cancer cells, CD8 T cells, DC cells, fibroblasts, macrophages, myofibroblasts, and vascular endothelial cells (Figures 10G, H). Figure 10I illustrates the distribution of five genes across cell types. Figures 10J-L indicate predominant SORBS2 expression in clusters 8 and 9, particularly within myofibroblast and vascular endothelial cell populations.
Analysis of single-cell RNA sequencing data of six esophageal samples. (A) The correlation for nCount_RNA between percent. Mt and nFeature_RNA. (B) Identification of highly variable genes achieved through batch removal post-count. (C, D) t-SNE or UMAP plots to identify each cell cluster in esophageal cancer. (E) Ten identified marker genes were used to show in different clusters. (F) Heatmap to show marker genes of ten cell clusters. (G, H) t-SNE or UMAP plots to identify each cell type in esophageal cancer. (I) Heatmap to show the expression of five hub genes in eight cell types. (J-L) Feature and violin plots to show the distribution of SORBS2 in various cell types.
Inferences of cell–cell communication by cellchat revealed the signaling pathways involving CDH5 and PECAM1 in vascular endothelial cells and myofibroblasts
3.10
The circular plots of the cellular interaction strength between each cell type were shown in Figure 11A. There was a high correlation between vascular endothelial cells and myofibroblasts (Figure 11B). We further investigated the signaling network of vascular endothelial cells and myofibroblast-related molecules, including CDH5 and PECAM1. In the CDH5 signaling network, vascular endothelial cells and myofibroblasts are the main signal transmitters and have a close relationship with cancer cells (Figures 11C, D). The PECAM1 signaling network also shows that vascular endothelium and myofibroblasts have a close relationship with cancer cells (Figures 11E, F). In addition, we investigated immune-related molecular signaling pathways, including PD-L1 and VISTA. The results showed that Myofibroblasts have a relatively close relationship with cancer cells in the PD-L1 signaling network and mainly interact significantly with cancer cells and CD8+ T cells in the VISTA signaling network (Figures 11G-J).
Inferences of cell–cell communication by cellchat. (A) CellChat network depicting interaction strengths. (B) Heatmap showing the correlation between different cell types. (C, D) The communication between different cell types is mediated by CDH5 signaling pathways. (E, F) The communication between different cell types is mediated by PECAM1 signaling pathways. (G, H) The communication between different cell types is mediated by PD-L1 signaling pathways. (I, J) The communication between different cell types is mediated by VISTA signaling pathways.
Human ESCC tissue immunofluorescence review SORBS2 a lower expression in ESCC tumor tissue
3.11
As it is shown in Figure 12A, DAPI (blue) marked all nucleated cells, TAGLN (green) identified myofibroblasts, and SORBS2 (red) highlighted cells expressing this tumor-suppressive protein. The expression of SORBS2 is higher in adjacent normal tissues compared to ESCC tissues. Moreover, in adjacent normal tissues, SORBS2 shows stronger co-localization with myofibroblasts than in ESCC (Figure 12B).
*Human ESCC Tissue Immunofluorescence. (A) The expression of DAPI (blue, nucleus), TAGLN (green), and SORBS2 (red) in normal and ESCC cells. The merged image (Merge) displays the co-localization of TAGLN, SORBS2, and DAPI. Magnification: 400x. Scale bar: 100μm. (B) Bar graphs show the relative area percentage of TAGLN and SORBS2 in cells, as well as the co-localization. RGB: RED+GREEN+BLUE. RGB/TAGLN: the proportion of cells expressing TAGLN among all cells expressing all three colors; RGB/Nucleus: the proportion of nucleated cells among all cells expressing all three colors. ***P < 0.0001.
Discussion
4
In this study, we integrated DEG and WGCNA analyses to identify 13 key differentially expressed genes most strongly associated with ESCC, including ANP32E, ATAD2, BID, CBX3, CCNB1, DTL, ECT2, GMPS, KIF14, MCM10, NDC1, NETO2, and SORBS2. We screened out five genes with non-zero gene coefficients, including BID, CBX3, ECT2, KIF4, and SORBS2, through the LASSO algorithm. BID acted as a crucial link between death receptor-initiated signals and the core mitochondrial pathway of apoptosis, adept at orchestrating a suite of cellular functions, including apoptosis, survival, and proliferation of tumor cells (20). BID was now on the radar for its apoptotic superpowers over gastric, liver, and esophageal cancer cells (21, 22). In the case of esophageal cancer cells, pro-apoptotic BCL-2 family proteins, including BID, translocated to the mitochondrial outer membrane, sparking the release of pro-apoptotic factors and precipitating cellular suicide (23). Chromebox protein homolog 3 (CBX3), a member of the heterochromatin-associated protein 1 (HP1) family, has been detected as overexpressed in EC (24). Overexpression of CBX3 enhances cell proliferation and migration while suppressing apoptosis in ESCC via the JAK2/STAT3 signaling pathway (25). Conversely, CBX3 inhibition activated the P53/P21 pathway, impairing the self-renewal capacity of ESCC stem cells (26). The expression of ECT2 (Epithelial Cell Transforming Sequence 2) was upregulated in ESCC, and its elevated expression has been correlated with enhanced proliferation, colony formation, migration, and invasion capabilities of ESCC cells, as well as a reduction in apoptotic rates (27–29). Additionally, a significant correlation existed between ECT2 expression levels and patient prognosis. Higher expression levels were associated with worse prognoses (30). KIF14(Kinesin Family Member 14), as a microtubule motor protein, with its expressed at levels that were closely associated with the phenotype of cancer cells (31, 32). KIF14 was one of the most significant central genes implicated in the development of ESCC through hypermethylation (33). A decrease in its expression level could not only inhibit the proliferation, invasion, migration, and angiogenesis of ESCC cells (34), but also might enhance the sensitivity of tumors to chemotherapy drugs (35). SORBS2 was an RNA-binding protein that could inhibit the migration and invasion of hepatocellular carcinoma and colorectal cancer cells (36–38), indicating that SORBS2 might have a protective effect on malignant tumors of the digestive system. This finding was consistent with our research conclusion that SORBS2 was a protective factor for ESCC.
Next, we constructed a predictive model using multivariate logistic regression. This model demonstrated significant efficacy in both the training and external validation sets, with an area under the curve (AUC) value exceeding 0.9. This indicated that our model possessed outstanding predictive capability. The R² values were 0.724 in the training set, 0.6663 in bootstrap-internal validation, and 0.702 in the external TCGA cohort. This agreement indicated minimal overfitting and strong generalizability of the model. We then created a diagnostic nomogram to illustrate the diagnostic model. The calibration curve showed good consistency between the actual frequency and the predicted probability, suggesting that the diagnostic model was highly accurate. Additionally, SORBS2 was the only protective factor for ESCC and a statistically significant predictive factor, with an odds ratio (OR) of 0.12.
In the ESCC mouse model, our study revealed that BID, CBX3, ECT2, and KIF14 expression levels were significantly higher in ESCC tissues than in adjacent normal esophageal tissues, while SORBS2 expression was significantly downregulated.
Finally, we performed a single-cell RNA analysis. The results showed that SORBS2 was predominantly expressed in myofibroblast and vascular endothelial cell populations. CDH5 and PECAM1 were both related to vascular endothelial cells and myofibroblasts. Their signaling networks have shown that vascular endothelial cells and myofibroblasts were closely associated with cancer cells. A recent study has shown that SORBS2 reduced the degradation of the extracellular matrix, angiogenesis, and the invasion and metastasis of tumors in ESCC by upregulating TIMP and inhibiting MMP activity (39). Additionally, in lung adenocarcinoma, MMP9 was upregulated while CDH5 and PECAM1 were downregulated, resulting in vascular abnormalities (40). Therefore, low expression of SORBS2 was associated with altered activity of the CDH5 and PECAM1 signaling pathways, suggesting its potential role in angiogenesis and stromal remodeling in the tumor microenvironment, which might contribute to attenuated tumor invasion and metastasis. In addition, myofibroblasts had a strong interaction relationship with cancer cells and CD8^+^ T cells in the PD-L1 signaling network and the VISTA signaling network. As we all know, PD-L1 and VISTA were immune checkpoint molecules that could interact with cell surface receptors, regulate immune responses, and play an important role in tumor immune escape (41, 42). Studies have shown that myofibroblasts could overexpress PD-L1 and VISTA in cancer tissue (43–45). The results of immune infiltration analysis and GSEA indicated that high expression of SORBS2 was correlated with a less immunosuppressive tumor microenvironment and was enriched in immune response–activating cell surface receptor signaling pathways, suggesting its potential involvement in anti-tumor immune regulation.
However, our study still had some limitations. First, our data came from publicly available datasets, which might have sample selection and ethnographic bias. Across the two largest ancestry subgroups (Asian and White), we successfully validated the model’s predictive efficacy in the Asian and White populations within the TCGA database, indicating stable model predictive performance; we would conduct large, multi-center cohorts to confirm generalizability across all ancestries. Secondly, we have not conducted further mechanistic investigations to elucidate how these five genes drive the initiation and progression of ESCC. To further verify the role of SORBS2 in promoting antitumor immunity, we will conduct in vitro and in vivo experiments. Specifically, we will knock out or overexpress the SORBS2 gene to observe the changes in cellular behavior. In mouse tumor models, we will knock out or overexpress SORBS2 to investigate the changes in tumor growth and immune cell infiltration. In the future, we will launch a prospective cohort study to externally validate the predictive performance of the model in a large, independent population. For clinical translation, the five-gene signature would be detected in endoscopic biopsy specimens via a practical RT-qPCR assay.
Conclusion
5
Our study developed a highly accurate ESCC diagnostic model in the Asian population. Additionally, our findings suggested that high expression of SORBS2 was associated with a tumor microenvironment that favored anti-tumor immunity, potentially involving immune response-related cell-surface receptor signaling pathways.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Sung H Ferlay J Siegel RL Laversanne M Soerjomataram I Jemal A . Global cancer statistics 2020: globocan estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA: A Cancer J Clin. (2021) 71:209–49. doi: 10.3322/caac.21660, PMID: 33538338 · doi ↗ · pubmed ↗
- 2Cao W Chen HD Yu YW Li N Chen WQ . Changing profiles of cancer burden worldwide and in China: A secondary analysis of the global cancer statistics 2020. Chin Med J. (2021) 134:783–91. doi: 10.1097/cm 9.0000000000001474, PMID: 33734139 PMC 8104205 · doi ↗ · pubmed ↗
- 3Sun G Yang Y Liu J Gao Z Xu T Chai J . Cancer stem cells in esophageal squamous cell carcinoma. Pathol Res Pract. (2022) 237:154043. doi: 10.1016/j.prp.2022.154043, PMID: 35926434 · doi ↗ · pubmed ↗
- 4Zeng H Chen W Zheng R Zhang S Ji JS Zou X . Changing cancer survival in China during 2003-15: A pooled analysis of 17 population-based cancer registries. Lancet Global Health. (2018) 6:e 555–e 67. doi: 10.1016/s 2214-109x(18)30127-x, PMID: 29653628 · doi ↗ · pubmed ↗
- 5Soares-Lima SC Gonzaga IM Camuzi D Nicolau-Neto P Vieira da Silva R Guaraldi S . Il 6 and bcl 3 expression are potential biomarkers in esophageal squamous cell carcinoma. Front Oncol. (2021) 11:722417. doi: 10.3389/fonc.2021.722417, PMID: 34422669 PMC 8371528 · doi ↗ · pubmed ↗
- 6Zhang J Zhu Z Liu Y Jin X Xu Z Yu Q . Diagnostic value of multiple tumor markers for patients with esophageal carcinoma. Plo S One. (2015) 10:e 0116951. doi: 10.1371/journal.pone.0116951, PMID: 25693076 PMC 4333286 · doi ↗ · pubmed ↗
- 7Mroczko B Kozłowski M Groblewska M Łukaszewicz M Nikliński J Jelski W . The diagnostic value of the measurement of matrix metalloproteinase 9 (Mmp-9), squamous cell cancer antigen (Scc) and carcinoembryonic antigen (Cea) in the sera of esophageal cancer patients. Clin Chim Acta Int J Clin Chem. (2008) 389:61–6. doi: 10.1016/j.cca.2007.11.023, PMID: 18155162 · doi ↗ · pubmed ↗
- 8Wang C Guan S Liu F Chen X Han L Wang D . Prognostic and diagnostic potential of mir-146a in oesophageal squamous cell carcinoma. Br J Cancer. (2016) 114:290–7. doi: 10.1038/bjc.2015.463, PMID: 26794279 PMC 4742585 · doi ↗ · pubmed ↗
