Precision Biomarker Identification in Gynecological Cancers Using Coexpression Networks and Attention-Based LSTM in Healthcare 4.0
Sakib Sarker, Emon Ahammed, Md. Faruk Hosen, Mohammad Badrul Alam Miah, Mohammad Amanul Islam, Deepak Ghimire, Youngbae Hwang, A. S. M. Sanwar Hosen

TL;DR
This study identifies four key genes that could help in early diagnosis and treatment of cervical and ovarian cancers using advanced machine learning and network analysis.
Contribution
The novel integration of attention-based LSTM and coexpression networks to identify precision biomarkers in gynecological cancers.
Findings
FOXM1 and MCM3 are upregulated in cancer tissues, while SH3BP5 and PAPSS2 are downregulated.
The four hub genes show strong prognostic significance with high AUC scores in ROC analysis.
Molecular docking confirms significant binding affinity between the genes and FDA-approved drug compounds.
Abstract
Background: Cervical cancer (CC) and ovarian cancer (OC) are among the most prevalent and lethal gynecological malignancies in women, necessitating the identification of reliable biomarkers for early diagnosis and prognosis. Methods: This study integrates bioinformatics and Healthcare 4.0 to identify key biomarkers associated with these cancers. Differentially expressed genes (DEGs) were identified from two microarray datasets. mRMR followed by SVM-RFE was applied to the identified DEGs to extract the most significant ML-based DEGs (MDEGs). The predictive ability of the selected gene subsets was further evaluated via multiple classifiers, where attention-based long short-term memory (AttLSTM) consistently achieved the best performance across both datasets. In parallel, WGCNA was conducted to identify coexpression-associated genes (CAGs) from significant modules in each dataset. A PPI…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
Figure 6
Figure 7
Figure 8
Figure 9
Figure 10
Figure 11
Figure 12
Figure 13- —Grand Information Technology Research Support Program, Republic of Korea
- —Chungbuk National University BK21 program (2025)
- —Regional Innovation System & Education (RISE) program
- —Woosong University Academic Research Fund 2026, Korea
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsFOXO transcription factor regulation · Ferroptosis and cancer prognosis · Gene expression and cancer classification
1. Introduction
Cervical cancer (CC) ranks as the fourth most prevalent malignancy affecting women globally [1] and remains one of the most prevalent female malignancies. Its development is mainly driven by persistent infection with perilous human papillomavirus (HPV) [2]. An estimated 600,000 new cases and 342,000 deaths occurred worldwide in 2020 [3]. While its incidence has decreased in the United States, it remains a leading cause of cancer-related mortality in developing countries [1]. Ovarian cancer (OC) is one of the most frequently diagnosed gynecological cancers and remains a leading cause of death related to gynecological malignancies worldwide. In 2018, approximately 250,000 new cases and 160,000 deaths were reported [4]. Several risk parameters are linked with OC, including a family medical record of breast cancer or OC, obesity, smoking, early menstruation, late menopause, and not giving birth. Because of its subtle manifestation, more than 70% of OC cases are diagnosed in later stages [5]. Both cervical and ovarian cancers affect the female reproductive system and are associated with high mortality rates [6]. Owing to the high mortality associated with these cancers, biomarkers can serve an indispensable role in their early diagnosis and prognosis.
Recently, gene expression analysis has become a key approach for identifying important genes associated with various cancers. Yang et al. [7] identified six HGs (IGFBP5, KRT1, IVL, CCNA2, AURKA, and TOP2A) through a protein–protein interaction (PPI) network. They reported upregulated expression of CCNA2, AURKA, and TOP2A, along with downregulated expression of IGFBP5 in cervical cancer samples. Karunakara et al. [8] conducted an in-depth analysis of the miR-497/195 cluster to assess its prognostic value and role in CC. They identified ten HGs from a PPIN and found that the upregulated expression of three genes (RECK, ATD5, and BCL2) and the downregulated expression of HIST1H3H, RCAN3, and OSBPL3 influenced the overall survival of CC patients. Honar et al. [9] performed a comprehensive analysis of high-grade primary tumors (HGPTs) and identified the top ten HGs from their PPIN. They also identified key transcription factors (FOXM1, NFYA, TCF7L2, VDR, and SIN3A) associated with HGPTs. Additionally, they highlighted NUSAP1, CDCA5, CKS2, PRC1, and CEP55 as potential biomarkers. Shihao et al. [10] implemented a coexpression network analysis alongside differential expression analysis. They identified three HGs (CLDN3, IRF6, and PRSS8) from the PPI network and reported that IRF6 was strongly associated with OC development and progression.
Although cervical and ovarian cancers have been widely studied, there are still opportunities for further research. Most previous studies have focused primarily on identifying hub genes within PPI networks. However, discovering essential genes or reliable biomarkers remains a significant challenge in genetic data analysis. The challenge arises from the sparse nature of high-dimensional gene expression data, which complicates effective analysis and leads to the phenomenon known as the “curse of dimensionality” [11]. Additionally, these datasets often contain redundant information, since genes with identical expression patterns introduce redundancy [12]. Recent developments in Healthcare 4.0, the fourth revolution in healthcare, offer transformative potential for addressing these challenges. Healthcare 4.0 integrates pioneering technologies such as artificial intelligence (AI), big data analytics, the Internet of Things (IoT), cloud computing, and smart interconnected medical devices to enable real-time personalized healthcare [13]. By leveraging Healthcare 4.0 principles, the identification of crucial gene expression biomarkers for various cancers can be significantly enhanced through sophisticated machine learning and bioinformatics techniques. Utilizing these opportunities, integrated machine learning (ML) and bioinformatics approaches [14,15,16] have gained increasing adoption in recent years. Among these methods, the effectiveness of the mRMR method lies in its ability to precisely select the most relevant genes linked to disease while effectively reducing redundancy among genes [12,17]. Furthermore, SVM-RFE enhances feature selection by identifying the most insightful subset of genes, improving the robustness of biomarker discovery [18].
In this study, two microarray datasets were utilized, one for each disease. The analysis was conducted from two perspectives: ML-based DEG identification and WGCNA-based gene identification. For ML-based DEGs, we first identified DEGs from both datasets, followed by a two-step gene selection process. High-dimensional gene expression data contain thousands of features but limited samples, increasing noise, redundancy, and overfitting risk. To address this, a two-step feature selection strategy combining mRMR and SVM-RFE was applied. mRMR first selects genes highly relevant to the phenotype while minimizing redundancy, and SVM-RFE further refines this set by iteratively removing less discriminative features, yielding a compact and informative gene signature considered as ML-based DEGs (MDEGs). To model complex nonlinear gene interactions, an AttLSTM network was employed. The LSTM captures feature dependencies, while the attention mechanism emphasizes the most informative genes, improving representation learning and classification robustness. Together, this framework enhances biomarker discovery by reducing dimensional noise and enabling more effective learning of gene interaction patterns. Simultaneously, coexpression networks were constructed via WGCNA, and significant coexpression-associated genes (CAGs) were extracted from key modules. Functional annotation was performed using clusterProfiler. A PPIN was then constructed using the genes common to MDEGs and CAGs. The identified hub genes (HGs) were further validated by evaluating their discriminative power through receiver operating characteristic (ROC) curves and analyzing their mRNA expression levels in tumor samples. Additionally, we conducted comprehensive gene regulatory analyses as well as molecular docking to analyze the binding affinity of the HGs and FDA-approved drug compound for the treatment of cervical and ovarian cancer. The complete framework is presented in Figure 1 and system Algorithm 1. Algorithm 1: Proposed Bioinformatics and Machine Learning-Based Biomarker Identification FrameworkInput: Microarray datasets for cervical and ovarian cancer.Output: Identified key candidate biomarkers and their drug–target interaction profiles.**1: Differential Expression Analysis:**2: Identify DEGs using the limma package.**3: Two-Step Gene Selection via Machine Learning:**4: Step 2.1: mRMR-based Gene Subset Selection****5: Apply mRMR to minimize redundancy and maximize relevance among DEGs.6: Select optimal gene subset α′.7: Step 2.2: Gene Subset Refinement Using SVM-RFE****8: Train an SVM classifier on α′ with class labels.9: Compute feature ranking weights via recursive feature elimination (RFE).10: Iteratively remove least significant genes.11: Obtain final optimal gene subset α (MDEGs).**12: Co-Expression Network Construction via WGCNA:**13: Build co-expression network using WGCNA.14: Identify gene modules significantly linked with cancer phenotypes.15: Extract module genes as Co-Expression-Associated Genes (CAGs).16: Identify intersecting genes between MDEGs and CAGs.**17: Functional Enrichment and PPIN Analysis:**18: Perform GO and KEGG enrichment using clusterProfiler.19: Construct PPI network via NetworkAnalyst and analyze in Cytoscape.20: Identify hub genes (HGs) via degree centrality.**21: Validation of Identified Hub Genes:**22: Evaluate discriminative power through ROC analysis.23: Analyze expression patterns using GEPIA.24: Perform gene regulatory network analysis using NetworkAnalyst.**25: Molecular Docking:**26: Perform molecular docking with an FDA-approved drug.
2. Materials and Methods
2.1. Background Study
In biomarker discovery research, many earlier studies primarily relied on conventional bioinformatics and systems biology approaches. Although machine learning methods have recently gained popularity, most investigations employ shallow feature selection techniques that may overlook highly informative genes. In contrast, the present study implements a distinct two-step gene selection strategy to enhance discriminative gene prioritization, followed by an attention-based LSTM model to capture complex, nonlinear relationships among gene features. Additionally, coexpression analysis was incorporated to identify biologically relevant gene modules. A detailed comparison with previous studies is presented in Table 1.
2.2. Microarray Data Acquisition
Two gene expression microarray datasets were curated from the publicly available Gene Expression Omnibus (GEO) database. The GSE63514 [22] dataset was collected for cervical cancer, whereas the GSE26712 [23,24] dataset was collected for ovarian cancer. The GSE26712 dataset is based on the GPL96 platform, while the GSE63514 dataset is founded on the GPL570 platform. Additional details about these datasets are outlined in Table 2.
2.3. Data Preprocessing and Screening of DEGs
Differential expression (DE) analysis is a molecular biology technique used to uncover genes with significantly altered expression levels among various sample groups, such as healthy and diseased tissues. The primary aim of DE is to pinpoint genes linked to specific biological conditions, helping to reveal underlying mechanisms, disease associations, and potential responses to treatment by providing insights into gene regulation and related biological processes [25]. Batch effects arising from differences in experimental conditions, platforms, or sample processing can introduce non-biological variation that may confound downstream analyses. To minimize such technical bias and improve data comparability, batch correction was performed using the ComBat function implemented in the sva toolkit in R. This criterion applies an empirical Bayes approach to adjust for systematic batch-related variation while preserving true biological differences between sample groups [26]. The datasets underwent transformation and quantile normalization before DE analysis was conducted. We conducted DE analysis via the limma package [27] in R to compare the expression values among normal and tumor samples and identify DEGs. In cases where several probes corresponded to a single gene symbol, the probe exhibiting the lowest adjusted p-value was designated to represent the gene in the analysis. A set of DEGs, was created by identifying DEGs using the cut-off: adj.p-value < 0.05 and .
2.4. ML-Based Gene Subset Selection
In the era of Healthcare 4.0 [28], utilizing emerging analytical approaches such as artificial intelligence and big data analytics may enable precision diagnosis through the integration of high-dimensional genomic data. However, the vast volume and complexity of gene expression datasets pose significant challenges, which impede accurate biomarker identification. Feature selection is a critical step to enhance the interpretability and predictive power of models by isolating the most relevant and non-redundant genes. Efficient feature selection methods contribute directly to the discovery of reliable biomarkers, facilitating early diagnosis and personalized treatment strategies in line with the goals of Healthcare 4.0 [17].
2.4.1. mRMR-Based Gene Subset Selection
Genomic expression datasets contain thousands of genes, leading to the curse of dimensionality. ML-based feature selection helps reduce this complexity by identifying the most significant genes [29]. Empirical Bayes-moderated tests with FDR adjustments are commonly used to identify DEGs. However, this approach does not eliminate redundancy among pinpointed genes. To address this, the mRMR approach selects features by minimizing redundancy and maximizing relevance, improving gene subset quality [30]. As a highly robust feature selection technique in ML, the mRMR algorithm has gained widespread application in multi-omics study in recent times [12].
In Round-1 of gene subset selection, mRMR was applied to the identified DEGs ( ) in the respective datasets. This method was applied to select an optimal subset of DEGs that balances high relevance to the cancer phenotype with low redundancy among themselves. The mathematical representation for the mRMR technique is as follows [12,31,32]:
Relevance of each gene to the target variable c (cancer status) was quantified via mutual information:
where denotes the mutual information between gene and class label c.
Redundancy between pairs of genes and was measured similarly by mutual information:
mRMR optimizes the selection of a gene subset S by maximizing the difference between average relevance and average redundancy:
This criterion ensures the chosen genes collectively provide maximal information about the phenotype while minimizing overlapping information among them. The genes were ranked according to their mRMR scores, and the gene subset , with the top-ranked genes was passed to the subsequent SVM-RFE step.
2.4.2. Gene Subset Creation via SVM-RFE
The support vector machine (SVM) classifier is a supervised machine learning method that aims to find an optimal hyperplane in a high-dimensional space to maximize the margin between classes [33]. The SVM classifier separates two classes by finding a hyperplane , defined as [33,34]:
where x is the input feature vector, is the Lagrange multiplier, is the class label ( or ), is the kernel function (linear, polynomial, Gaussian, etc.), and b is the bias term.
Support vector machine-recursive feature elimination (SVM-RFE) [34,35] was applied to the gene subset to identify a new, more significant gene subset , which was classified as MDEGs. To prevent information leakage, SVM-RFE–based gene subset selection was performed within a fully nested stratified cross-validation framework. For each training fold, mRMR ranking followed by SVM-RFE was applied exclusively to the training data, and the resulting gene subset was then used to evaluate classification performance on the corresponding unseen test fold. The final gene subset , obtained from this two-step feature selection, was then used as input for multiple classifiers, including XGBoost, convolutional neural network (CNN), AdaBoost, LightGBM, and attention-based long short-term memory (AttLSTM), to evaluate classification performance. To benchmark the effectiveness of the proposed two-step strategy, it was compared against single-step feature selection algorithms, including LASSO, ANOVA, mutual information gain (MiG), and extra trees.
2.5. WGCNA and Module Analysis
Weighted gene coexpression network analysis (WGCNA) is a systems biology approach that analyzes gene expression data to identify patterns of correlation among genes across microarray or RNA-seq samples. It builds a gene coexpression network, detects modules of closely related genes, and links these modules to clinical traits or external characteristics. The main objective of WGCNA is to discover biologically relevant gene groups and potential hub genes that could act as biomarkers or therapeutic targets [36]. Both of our microarray datasets (GSE26712 and GSE63514) were utilized for our analysis and performed coexpression analysis via the WGCNA library in R. Initially, phenotypic data for the study samples were retrieved from the NCBI GEO database using the GEOquery package. To ensure data quality, we used the goodSamplesGenes() function from WGCNA to screen out genes and samples with excessive missing values or outliers. The outlier samples were further detected through hierarchical clustering via the hclust() function and were excluded to prevent distortion of the analysis.
Next, the pickSoftThreshold() procedure was utilized to derive the efficient soft-thresholding power ( ) for constructing the adjacency matrix. Soft-thresholding powers ranging from 1 to 50 (incrementing by 1 initially, then by 2 from 12 onward) were tested. The optimal power was selected on the basis of the criterion . Using the selected soft-thresholding power, we constructed the adjacency matrix and converted it into a topological overlap matrix (TOM). The TOM dissimilarity measure (1 − TOM) was computed, followed by hierarchical clustering with the average linkage method to cluster genes with concordant expression profiles into distinct coexpression modules. Each module was assigned a unique color. Then, gene co-expression architecture was inferred via the blockwiseModules() method, setting the following parameters: power = best threshold power, minModuleSize = 30, deepSplit = 1, TOMType = “unsigned”, mergeCutHeight = 0.25, and randomSeed = 1234. To visualize the clustering results, we generated dendrograms with module colors before and after merging via the plotDendroAndColors() function.
We then computed module eigengenes (MEs) to reflect the overall measure of expression profiles of genes within each module. These eigengenes were used to assess module-trait relationships, where we analyzed the associations between MEs and clinical traits (healthy controls vs. cases). The strength of these associations was visualized in a heatmap, with correlation coefficients indicating their significance. Finally, significant coexpression-associated genes (CAGs) were identified from key modules for subsequent investigations.
2.6. Pathway Enrichment Analysis
This analysis delineates biological pathways significantly correlated with genes exhibiting differential expression or segregated sample clusters. It supports biomarker discovery by highlighting genes that are overrepresented in critical pathways and aids validation by uncovering their functional roles and interactions at the protein level [25]. We performed Gene Ontology (GO) [37] and Kyoto Encyclopedia of Genes and Genomes (KEGG) [38] pathway enrichment analyses on the common MDEGs and CAGs between the two datasets individually via the clusterProfiler [39] package in R and ClueGO [40]. The GO annotations were stratified into three distinct functional classes: biological process (BioProc), cellular component (CellCom), and molecular function (MoleFunc). A relevance threshold of was used to identify statistically notable enrichments.
2.7. PPIN Construction and HG Identification
A PPIN offers a systematic framework for uncovering disease-associated genes by assessing interactions between functionally related genes [41]. Common genes between MDEGs and CAGs were identified, and a PPIN was constructed using NetworkAnalyst. The network was then visualized and examined in Cytoscape v3.0. HGs were recognized by applying the degree centrality in the cytoHubba [42] feature, genes with at least 10 interactions were selected.
2.8. Validation of the Identified HGs
2.8.1. Diagnostic Efficacy Evaluation via ROC Analysis
For the validation of the identified HGs, their diagnostic potential in distinguishing normal and cancer samples was assessed. To evaluate their discriminative power, we applied leave-one-out cross-validation (LOOCV) protocol with logistic regression (LR) on both the GSE63514 and GSE26712 datasets. Logistic regression, a widely used method for binary classification, was chosen beacuse of its effectiveness in modeling the relationship between gene expression levels and disease status [43]. In addition, the diagnostic performance of the identified hub genes was evaluated using two independent GEO datasets (GSE9750 and GSE37582).
2.8.2. mRNA Expression Levels and Survival Analysis
GEPIA serves as an online interactive interface designed for rapid and customizable analysis of gene expression data derived from TCGA and GTEx cohorts [44]. To validate the identified HGs, GEPIA was employed to interpret their expression levels in CC and OC tissues relative to adjacent healthy control samples. The analysis was visualized using box plots, with expression values represented as log2(TPM + 1). Significant differential expression was determined using the criteria and . To validate the prognostic efficiency of the unveiled HGs and to examine their association with patient outcomes, survival analysis was performed using the TCGA-CESC dataset. The prognostic efficacy of each hub gene was evaluated by stratifying CESC patients into high and low-risk groups on the basis of median gene expression level.
2.9. Gene Regulatory Networks Analysis
Gene regulatory networks (GRNs) define the regulatory relationships between genes, providing insights into their biological roles, including molecular functions and broader functional activities [45]. In this study, we analyzed two types of GRNs: transcription factor (TF)–gene (TFG) interactions and gene–miRNA interactions. TFG interactions illustrate how genes collaborate within regulatory networks and pathways [46]. To analyze TF–gene interactions for the common genes, we used NetworkAnalyst, leveraging the ENCODE database integrated within the platform to establish the TFG network. Similarly, gene—miRNA interactions were also analyzed via NetworkAnalyst.
2.10. Molecular Docking Analysis
Molecular docking (MolDock) was conducted to evaluate the couplings between potential biomarkers and drug commonly used in cervical and ovarian cancer treatment. The 3D frameworks of the corresponding proteins were obtained from the Protein Data Bank (PDB). The molecular structures of proteins were formulated using AutoDock Tools (version 1.5.7) by removing water molecules, incorporating polar hydrogens, and converting them into PDBQT format for docking. For the drug molecule, we retrieved the 3D structures of the Olaparib drug compounds (collected from NCBI) from PubChem in SDF format. The molecules were visualized and adapted into the appropriate configuration using PyMOL (version 3.1.3.1). The binding interactions between the drug and the target proteins were analyzed using MolDock, with visualization performed in PyMOL to reveal critical insights into their potential interactions.
3. Results
3.1. Screening of DEGs
Two microarray datasets were collected from the GEO with the accession IDs GSE63514 for cervical cancer and GSE26712 for ovarian cancer. DE analysis was performed using the limma package in R, applying thresholds of an adj.p-value < 0.05 and to identify DEGs. From the GSE63514 dataset, a total of 1699 DEGs were identified, including 1122 upregulated (upDEGs) and 577 downregulated genes (downDEGs). Similarly, in the GSE26712 dataset, we identified 2013 DEGs, comprising 723 upDEGs and 1290 downDEGs. The corresponding volcano plots and heatmaps illustrating these results are presented in Figure 2.
3.2. Two-Step Gene Selection
In Round 1 of our two-step gene selection approach, mRMR was employed as the primary feature selection method. This step aimed to reduce redundancy by eliminating genes with similar expression patterns while retaining those most relevant to the sample types. By applying mRMR, the top 1200 DEGs from both datasets were selected, ensuring a refined gene subset for Round 2.
In the second round, SVM-RFE was implemented to further optimize the gene subset obtained from Round 1. Through this process, we selected the most substantial gene subset with the top 1000 genes, and refined the feature set for improved classification performance. These genes were considered ML-based DEGs (MDEGs).
As shown in Table 3, the classifiers trained on the gene subsets obtained through two-step feature selection (mRMR+SVM-RFE) achieved strong predictive performance on both datasets. For the cervical cancer dataset (GSE63514), AttLSTM outperformed all other models with 96.15% accuracy, 0.9904 AUC, and a nearly perfect balance between precision (0.9523) and recall (1.000), followed by LightGBM and XGBoost, with accuracies above 88%. For the ovarian cancer dataset (GSE26712), the performance of all classifiers was generally greater, with accuracies exceeding 97% and AUC values close to 1.0. Notably, AttLSTM again achieves the best results, with 99.28% accuracy, 0.9982 F1, and perfect precision and recall, while CNN and XGBoost also yield strong outcomes.
Table 4 presents the comparative performance of mRMR+SVM-RFE against several single-step feature selection methods, including LASSO, ANOVA, MiG, and extra trees, on both datasets. For GSE63514, although all single-step methods achieved noticeable results with accuracies above 90%, mRMR+SVM-RFE provided the best outcome with 96.15% accuracy, 0.9904 AUC, and the highest F1 score (0.9756), alongside perfect recall (1.000). Similarly, for GSE26712, whereas ANOVA and LASSO performed strongly with accuracies of 97.48% and 96.72%, respectively, and near-perfect AUC values, mRMR+SVM-RFE again outperformed all the other methods, achieving 99.28% accuracy, 1.000 AUC, and nearly flawless balance across all the metrics (F1 score, recall, and precision all above 0.998). These results highlight the efficacy of the two-stage gene selection strategy in identifying highly discriminative biomarkers and demonstrate the superior classification ability of AttLSTM over conventional machine learning models.
3.3. WGCNA-Based Gene Selection
This analysis was begun by identifying and removing outliers via hierarchical clustering with the hclust() function. Next, we uncovered the appropriate soft thresholding power ( ) for both datasets, a critical step in constructing a scale-free network for coexpression analysis. As shown in Figure 3, the optimal soft threshold was for GSE63514 and for GSE26712. Next, the respective coexpression networks were constructed with the corresponding dendrograms presented in Figure 4. In the GSE63514 dataset, 16 modules were identified, whereas 6 distinct modules were found in the GSE26712 dataset. After analyzing the module-trait relationships shown in Figure 5, the yellow module in GSE63514 and the brown module in GSE26712 were found to be significantly associated with the trait of interest. The yellow module in GSE63514 contained 1248 genes. The brown module in GSE26712 comprised 925 genes. These genes were considered coexpression-associated genes (CAGs). We then identified the genes common to these CAGs and the MDEGs identified earlier. A total of 8 common genes were found, which were further analyzed for their potential significance.
3.4. Functional Enrichment Analysis
Enrichment analysis was conducted separately on the common MDEGs and CAGs identified in both datasets via the clusterProfiler toolkit in R. This analysis included KEGG pathway and GO enrichment to explore the biological significance of these genes. The enrichment analysis results for MDEGs and CAGs are illustrated in Figure 6. Within the BioProc category shown in Figure 6a, the common MDEGs were significantly associated with mitotic cell cycle phase transition, DNA replication, adaptation to low oxygen conditions, the cellular response to reduced oxygen availability, and control of oxygen homeostasis. As demonstrated in Figure 6b, in the CellCom category, these genes were enriched in chromosomal regions, nuclear chromosomes, blood microparticles, spindle poles, and IgM immunoglobulin complexes. Moreover, in the MoleFunc category (Figure 6c), the key enriched terms included cadherin binding, ubiquitin-like protein ligase binding, ubiquitin protein ligase binding, GTPase binding, and small GTPase binding. In the BioProc category shown in Figure 6d, the most enriched terms for CAGs included nuclear chromosome segregation, positive regulation of the cell cycle process, mitotic cell cycle phase transition, chromosome segregation, and DNA replication. The CellCom enrichment (Figure 6e) highlighted the localization of structural components such as the chromosomal region, chromosome centromeric region, spindle, condensed chromosome, and nuclear chromosome, reinforcing their involvement in mitotic processes. With respect to MoleFunc presented in Figure 6f, the CAGs were significantly associated with helicase activity, single-stranded DNA helicase activity, ATP-dependent activity acting on DNA, catalytic activity acting on DNA, and tubulin binding.
KEGG pathway enrichment analysis of the common MDEGs and CAGs revealed strong associations with DNA replication and cell cycle regulation. Common MDEGs (Figure 7a) were enriched mainly in DNA replication–related processes, including prereplicative complex assembly, DNA polymerase loading, Cdc45 association, and Cyclin A ubiquitination. In contrast, common CAGs (Figure 7b) were enriched in diverse biological modules, such as homologous recombination and DNA repair, cyclin/CDK-mediated cell cycle regulation, and centromere/kinetochore assembly.
3.5. PPIN Construction and HG Identification
A PPIN was constructed for the eight common genes shared between MDEGs and CAGs (presented in Figure 8a) via the NetworkAnalyst database. The network was subsequently visualized and analyzed in Cytoscape. It comprised 168 nodes and 176 edges illustrated in Figure 8b, representing the complex interactions among these genes. The top ten HGs were via the degree method in the cytoHubba feature (presented in Figure 8c). From these, genes exhibiting a minimum of ten interactions were selected for further analysis. This analysis led to the identification of four HGs—FOXM1, MCM3, SH3BP5, and PAPSS2—which exhibited the highest connectivity.
3.6. Validation of Identified HGs
3.6.1. Diagnostic Efficacy Evaluation via ROC Analysis
To validate the diagnostic capability of the identified HGs FOXM1, MCM3, SH3BP5, and PAPSS2, their diagnostic performance was assessed by calculating AUC scores from the ROC curves generated via logistic regression on both cervical and ovarian cancer datasets. As shown in Figure 9, the HGs attained an AUC of 0.902 (95% CI: 0.853–0.941) in the GSE63514 dataset. Similarly, in the GSE26712 dataset, the identified HGs attained an AUC score of 0.995 (95% CI: 0.983–1.000), indicating higher classification accuracy. In the independent test datasets, the unveiled hub genes achieved satisfactory performance, with AUC values of 0.935 (95% CI: 0.876–0.984) and 0.897 (95% CI: 0.845–0.939). These findings indicate the substantial discriminatory power of the pinpointed HGs in effectively distinguishing normal and cancer samples.
3.6.2. mRNA Expression Level and Survival Analysis
The expression levels of the identified HGs (FOXM1, MCM3, SH3BP5, and PAPSS2) were analyzed in cervical cancer and ovarian cancer using TCGA data. The box plots illustrate significant differential expression patterns between normal (N) and tumor (T) samples. FOXM1 (Figure 10a) and MCM3 (Figure 10b) were expressed at significantly higher levels in tumor tissues than in normal tissues, suggesting their potential involvement in tumor progression and proliferation. In contrast, SH3BP5 (Figure 10c) and PAPSS2 (Figure 10d) displayed significantly lower expression in tumor samples, indicating their possible roles as tumor suppressors. In the final phase of validation of the identified hub genes (HGs), survival analysis was employed. As illustrated in Figure 11, the identified HGs, particularly FOXM1 and SH3BP5, exhibited a significant association with poor survival outcomes.
3.7. Gene Regulatory Network Analysis
GRN analysis of the identified HGs was conducted via the NetworkAnalyst. Figure 12 illustrates the gene–miRNA and TFG interaction networks of the identified HGs. The gene–miRNA interaction network (Figure 12a) consists of 90 nodes and 89 edges, depicting the regulatory correlations between key hub genes (FOXM1, SH3BP5, and MCM3) and their associated miRNAs. In this network, nodes in red color represent hub genes, blue nodes indicate miRNAs, and edges signify regulatory interactions, highlighting potential post-transcriptional regulation. The TFG interaction network (Figure 12b) comprises 66 nodes and 68 edges, showcasing interactions between hub transcription factors (FOXM1, PAPSS2, and MCM3) and their target genes. Here, blue nodes represent TFs, green nodes indicate target genes, and edges illustrate regulatory connections.
3.8. Molecular Docking Analysis
Molecular docking (MolDock) exploration was used to examine the interactions between the unveiled HGs and an FDA-approved drug for cervical and ovarian cancer: Olaparib. Binding affinities were evaluated via predefined thresholds, where binding energies below −4.25 kcal/mol suggest potential interactions, those below −5.0 kcal/mol represent moderate binding affinity, and values lower than −7.0 kcal/mol indicate strong binding interactions [47]. The detailed MolDock results are presented in Table 5. MolDock analysis revealed strong binding interactions between the identified HGs (FOXM1, MCM3, SH3BP5, and PAPSS2) and the FDA-approved drug. As shown in Table 5, all the binding energy values were below the threshold of −4.25 kcal/mol, indicating potential binding activity. Notably, Olaparib exhibited stronger binding affinities with all four target proteins, with the lowest binding energy observed for MCM3 (−11.2 kcal/mol), followed by FOXM1 (−9.5 kcal/mol). Similarly, PAPSS2 and SH3BP5 also significantly interact with Olaparib, with binding energies of −8.5 kcal/mol and −8.7 kcal/mol, respectively.
The results suggest that Olaparib has a greater binding affinity for the selected HGs, indicating its potential as a more effective therapeutic option for these proteins in cervical and ovarian cancer. The MolDock configurations are illustrated in Figure 13.
3.9. Discussion
Cervical and ovarian cancers are among the most lethal malignancies affecting the female reproductive system, underscoring the strong demand of pinpointing key biomarkers for early risk prediction and diagnosis. In this study, DEGs were uncovered from two microarray datasets, GSE63514 and GSE26712. Then, the two-step gene selection strategy employed in this study, integrating mRMR followed by SVM-RFE, proved highly effective in refining gene subsets that achieved superior classification accuracy across multiple models and datasets. This approach aligns well with the principles of Healthcare 4.0, which emphasizes precision, connectivity, and the intelligent use of big data for personalized medicine. Feature selection is especially critical in this framework because it mitigates the curse of dimensionality inherent in high-throughput genomic data, reducing redundancy and improving data interpretability. The mRMR method minimizes redundancy while maximizing gene relevance, preparing an optimal input for the subsequent SVM-RFE step, which further refines the subset by eliminating less informative features based on classification performance [12,17].
Classification plays a significant role in evaluating the discriminative power of the refined gene set [29]. In this regard, the attention mechanism can play a vital role to classify effectively. Attention mechanisms provide insight into which genes or features the model focuses on while making predictions. This interpretability is critical in biomedical applications to understand the biological relevance of key biomarkers [17,48,49]. By dynamically allocating weights to the most informative and relevant genes, attention helps distinguish subtle differences in gene expression patterns that may be pivotal for cancer prognosis and diagnosis. Moreover, the attention mechanism enhances model robustness by capturing complex, long-range dependencies within high-dimensional gene expression data, allowing for personalized feature interactions that differ among patients. This dynamic weighting not only improves predictive accuracy but also facilitates the identification of novel biomarkers by highlighting influential genes within the dataset [17,50], thereby aligning well with the goals of precision medicine in the Healthcare 4.0 paradigm. In our study, we used several classifiers, including XGBoost, AdaBoost, LightGBM, CNN, and AttLSTM, where AttLSTM outperformed other classifiers by effectively leveraging the attention mechanism to capture critical gene interactions and achieve superior discriminative power. To capture complex, nonlinear relationships among gene features, the AttLSTM network was utilized in our study, wherein the recurrent LSTM units learn contextual dependencies in the data and the attention mechanism prioritizes the most informative features during training, thereby enhancing the model’s representational capacity and classification performance. Using our proposed two-step feature selection method combined with an AttLSTM model, we identified a refined set of 1000 MDEGs.
Concurrently, WGCNA was implemented to identify coexpression-associated genes (CAGs). In the GSE63514 dataset, 16 modules were detected, with the yellow module containing 1248 CAGs being the most significant. Similarly, in the GSE26712 dataset, six modules were identified, among which the brown module, comprising 925 CAGs, showed the strongest association with the trait of interest. Enrichment analysis, including GO and KEGG pathway analyses, was conducted separately on the common MDEGs and CAGs. Both groups exhibited significant enrichment in various functional categories, illustrated in Figure 6 and Figure 7. Eight common genes were subsequently identified between all MDEGs and CAGs, as shown in Figure 8a.
A PPIN was then established via the NetworkAnalyst data repository and further visualized and analyzed in Cytoscape (Figure 8b). From this network, the top ten HGs were uncovered employing the degree centrality offered by the cytoHubba plugin, depicted in Figure 8c. Among these, genes with at least ten interactions were selected, resulting in the identification of four key HGs: FOXM1, MCM3, SH3BP5, and PAPSS2.
Forkhead Box M1 (FOXM1) is a critical regulator of the cell cycle and a member of the Forkhead transcription factor family, playing a vital role in embryonic development, cell proliferation, and cancer progression. As a proliferation-associated transcription factor, FOXM1 is present in all dividing cell types, including tumor cell lines, and has been closely linked to multiple solid tumors [51,52]. Its oncogenic properties allow it to influence a broad range of biological processes, emphasizing its significant contribution to tumor initiation and advancement [53]. It has been linked to several other cancers, including hepatocellular carcinoma (HCC) [53], small cell lung cancer [52], colorectal cancer [51], breast cancer [54], pancreatic cancer and esophageal cancer [55], and lung adenocarcinoma [56]. FOXM1 was also found to be overexpressed in cervical [57] and ovarian cancers [58], suggesting its involvement in their progression and development.
Minichromosome maintenance proteins (MCMs) are a group of six homologous proteins (MCM2-7) that play crucial roles in maintaining minichromosomes, initiating DNA replication, and regulating cell proliferation. Among them, MCM3 not only is cruicial for the initiation of DNA replication but also serves as a key marker for cellular proliferation [59,60]. MCM3 has been shown to be excessively expressed in cervical [61] and ovarian cancers [62], as well as in other malignancies such as HCC [60,63,64], NSCLC [59], and colorectal cancer [65]. Consistently, our study also revealed elevated expression of MCM3 in cervical and ovarian cancer samples compared to adjacent normal tissues. Both FOXM1 and MCM3 were significantly upregulated in cancer samples, whereas SH3BP5 and PAPSS2 were markedly downregulated in cancer samples compared with adjacent normal tissues. These HGs also achieved significant AUC scores in both the GSE63514 and GSE26712 datasets. Gene regulatory network (GRN) analysis provided critical insights into the regulatory mechanisms governing the identified hub genes. The gene–miRNA interaction network revealed key transcriptional regulatory elements, with FOXM1, SH3BP5, and MCM3 being significantly regulated by multiple miRNAs. Similarly, the TFG interaction network highlighted essential regulatory interactions, where hub TFs such as FOXM1, PAPSS2, and MCM3 were observed to control multiple target genes. Finally, these genes exhibited potential binding affinity with an FDA-approved drug compound, Olaparib. These observations imply that FOXM1, MCM3, SH3BP5, and PAPSS2 may serve as key biomarkers for early prognosis and diagnosis, as well as potential therapeutic targets for cervical and ovarian cancer.
4. Conclusions
In this study, key genes associated with cervical and ovarian cancer were pinpointed by integrating bioinformatics and machine learning approaches. Through differential expression analysis, ML-based feature selection, and WGCNA, we identified FOXM1, MCM3, SH3BP5, and PAPSS2 as potential biomarkers. These genes exhibited significant diagnostic potential, achieving high AUC scores during validation and displaying distinct expression patterns in cancer samples. Furthermore, drug–gene interaction analysis revealed their potential as therapeutic targets, demonstrating their binding affinity with an FDA-approved drug. Overall, these findings may enhance understanding of the molecular mechanisms driving cervical and ovarian cancer, highlighting promising biomarkers for early diagnosis and targeted therapy. Future research can utilize larger datasets and incorporate advanced machine learning techniques to further validate and refine these findings.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Elias M.H. Das S. Abdul Hamid N. Candidate genes and pathways in cervical cancer: A systematic review and integrated bioinformatic analysis Cancers 20231585310.3390/cancers 1503085336765810 PMC 9913780 · doi ↗ · pubmed ↗
- 2Xue H. Sun Z. Wu W. Du D. Liao S. Identification of hub genes as potential prognostic biomarkers in cervical cancer using comprehensive bioinformatics analysis and validation studies Cancer Manag. Res.20211311713110.2147/CMAR.S 28298933447084 PMC 7802793 · doi ↗ · pubmed ↗
- 3Sudha B. Poornima A. Suganya K. Swathi K. Kumar N.S. Sumathi S. Chellapandi P. Identification of hub genes and role of CDKN 2A as a biomarker in cervical cancer: An in-silico approach Hum. Gene 20223320104810.1016/j.humgen.2022.201048 · doi ↗
- 4Zhao Y. Pi J. Liu L. Yan W. Ma S. Hong L. Identification of the hub genes associated with the prognosis of ovarian cancer patients via integrated bioinformatics analysis and experimental validation Cancer Manag. Res.20211370772110.2147/CMAR.S 28252933542655 PMC 7851396 · doi ↗ · pubmed ↗
- 5Yang D. He Y. Wu B. Deng Y. Wang N. Li M. Liu Y. Integrated bioinformatics analysis for the screening of hub genes and therapeutic drugs in ovarian cancer J. Ovarian Res.20201311810.1186/s 13048-020-0613-2PMC 698607531987036 · doi ↗ · pubmed ↗
- 6Azangou-Khyavy M. Ghasemi E. Rezaei N. Khanali J. Kolahi A.A. Malekpour M.R. Heidari-Foroozan M. Nasserinejad M. Mohammadi E. Abbasi-Kangevari M. Global, regional, and national quality of care index of cervical and ovarian cancer: A systematic analysis for the global burden of disease study 1990–2019 BMC Women’s Health 2024246910.1186/s 12905-024-02884-938273304 PMC 10809627 · doi ↗ · pubmed ↗
- 7Yang X. Zhou M. Luan Y. Li K. Wang Y. Yang X. Identification of key genes associated with cervical cancer based on bioinformatics analysis BMC Cancer 20242489710.1186/s 12885-024-12658-z 39060960 PMC 11282596 · doi ↗ · pubmed ↗
- 8Karunakara S.H. Eswaran S. Mallya S. Suresh P.S. Chakrabarty S. Kabekkodu S.P. Analysis of mi R-497/195 cluster identifies new therapeutic targets in cervical cancer BMC Res. Notes 20241721710.1186/s 13104-024-06876-839095857 PMC 11297691 · doi ↗ · pubmed ↗
