Integrative analysis of efferocytosis- and invasion-related genes as potential biomarkers and therapeutic targets in breast cancer

Jing Yang; Rong Zhang; Lamei Sun; Cong Wang; Ming Feng; Bin Su; Lixin Jiang

PMC · DOI:10.1007/s12672-025-03346-w·August 5, 2025

Integrative analysis of efferocytosis- and invasion-related genes as potential biomarkers and therapeutic targets in breast cancer

Jing Yang, Rong Zhang, Lamei Sun, Cong Wang, Ming Feng, Bin Su, Lixin Jiang

PDF

Open Access

TL;DR

This study identifies genes related to efferocytosis and invasion that could serve as biomarkers and therapeutic targets for breast cancer.

Contribution

A novel prognostic risk model using efferocytosis- and invasion-related genes is developed for breast cancer.

Findings

01

32 efferocytosis- and invasion-related genes, including ANO6 and PLGRKT, showed differential expression in breast cancer.

02

A risk score formula using ANO6 and PLGRKT predicted survival outcomes with significant differences between high- and low-risk groups.

03

ANO6 and PLGRKT were linked to biological processes like bleb assembly and phagocytosis regulation.

Abstract

Breast cancer remains a primary source of cancer-related mortality among females worldwide. This investigation sought to evaluate the distinctive expression patterns of genes linked to efferocytosis and invasion in breast cancer and their prognostic implications. Through bioinformatics analyses, a robust prognostic risk modelwas developed. Breast cancer datasets were procured and processed from The Cancer Genome Atlas (TCGA), Gene Expression Omnibus (GEO), and GeneCards databases. Differential expression analysis was executed utilizing DESeq2, identifying genes with|logFC| >1 and p-value < 0.05. Efferocytosis-related genes and invasion-related genes were curated from GeneCards and PubMed, resulting in 127 overlapping genes. A prognostic risk model was developed utilizing univariate Cox regression, Least Absolute Shrinkage and Selection Operator regression, and multivariate Cox…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Genes6

ANO6 PLGRKT KEAP1 BRCA1 TP53 NFE2L2

Proteins6

Species1

Homo sapiens(human · species)

Diseases2

breast cancer Cancer

Figures11

Click any figure to enlarge with its caption.

Differential Expression Validation and ROC Curve Analysis. A Model Genes comparison visualization between High Risk and Low Risk cohorts within BRCA specimens from TCGA-BRCA dataset. B, C Model Genes *ANO6* (B), *PLGRKT* (C) ROC analysis in TCGA-BRCA dataset specimens. * denotes p value < 0.001, indicating substantial statistical significance. AUC exceeding 0.5 suggests molecular expression’s propensity toward event occurrence. AUC proximity to 1 indicates enhanced diagnostic efficacy, with values between 0.7–0.9 demonstrating notable precision. Orange designates High Risk cohort while light b

Differential Expression Validation and ROC Curve Analysis. A Model Gene comparison plots between High Risk and Low Risk categories in BRCA specimens from Combined GEO Datasets. B, C ROC analyses of Model Genes *ANO6* (B), *PLGRKT* (C) in BRCA specimens from integrated GEO Datasets (Combined Datasets). denotes p value < 0.01, indicating substantial statistical significance; * represents p value < 0.001, demonstrating exceptional statistical significance. AUC values exceeding 0.5 suggest molecular expression’s promotion of events, with values approaching 1 indicating superior diagnostic outcomes

Differential Gene Expression Analysis. A Volcano plot of DEGs analysis between breast cancer (BRCA) cohort and Control (Control) cohort in the Breast Cancer Dataset (TCGA-BRCA). B DEG and EIRGs in the Breast Cancer dataset (TCGA-BRCA), Venn diagram of all genes in the Combined GEO Datasets. C Heat map of EIRDEGs in the breast cancer dataset (TCGA-BRCA). In the heat map grouping, blue is the Control (Control) cohort, and yellow is the BRCA cohort. In the heat map, red denotes elevated expression, and blue indicates reduced expression

Cox Regression Analysis. A Forest Plot of 2 effemination and invasion associated differentially expressed genes (EIRDEGs) in univariate Cox regression model. B, C PRM plot (B) and variable trajectory plot (C) from the LASSO regression model

Prognostic analysis. A Temporal ROC curves for BRCA specimens in the breast cancer dataset (TCGA-BRCA). B Prognostic KM curves comparing high and low RiskScore cohorts and OS of BRCA specimens. C, D Forest Plot depicting RiskScore and clinical parameters in univariate Cox regression model (C) and multivariate Cox regression model (D). E Nomogram illustrating RiskScore and clinical parameters in univariate and multivariate Cox regression model. AUC values exceeding 0.5 suggest molecular expression’s inclination toward event occurrence. AUC values approaching 1 indicate enhanced diagnostic effec

Prognostic Analysis. A–C 3-year (A), 4-year (B), and 5-year (C) Calibration curves of the BRCA PRM. D–F Prognosis of BRCA risk model 3 years (D), 4 years (E), 5 years (F) decision curve analysis (DCA)

GO enrichment analysis for model genes. A The scatter visualization of GO enrichment analysis findings for Model Genes illustrated: BP, CC, MF, with GO terms on the vertical axis. B–D Model Genes (BP (B), CC (C), MF (D)) gene ontology (GO) enrichment analysis network illustration. Pink vertices represent entries, blue vertices indicate molecules, while connections demonstrate entry-molecule relationships. Within the scatter plot, circle dimensions reflect gene quantities, while color gradients indicate p-value magnitude. Deeper blue signifies lower p-values, while redder hues represent larger

GSEA for Risk Cohort. A Volcano plot of DEGs analysis between High Risk cohort and Low Risk cohort in BRCA samples in the Breast Cancer Dataset (TCGA-BRCA). B logFC ranked Top10 up-regulated and Top10 down-regulated DEGs heatmap of BRCA samples in High Risk cohort and Low Risk cohort. C Display of 4 biological function bubble plots from GSEA of Breast cancer dataset (TCGA-BRCA). D–G GSEA depicted that genes in the breast cancer dataset (TCGA-BRCA) were markedly enriched in KEAP1 NFE2L2 Pathway (D), Nuclear Events Mediated By NFE2L2 (E), KEAP1 NFE2L2 pathway (D), and nuclear events mediated by

GSVA Analysis. A, B Heat map (A) and comparative visualization (B) of gene set variant analysis (GSVA) findings between High Risk and Low Risk cohorts from TCGA-BRCA dataset specimens. * denotes p value < 0.001, indicating substantial statistical significance. Orange indicates the High Risk cohort while blue signifies the Low Risk cohort. The selection parameters for gene set variation analysis (GSVA) utilized p-value < 0.05, with Benjamini-Hochberg (BH) adjustment for p-value correction. In the heat map, blue indicates minimal enrichment while red signifies substantial enrichment

PPI Network Analysis. GeneMANIA website forecasts interaction networks of functionally comparable Genes relative to two Model Genes. Within the illustration, circles represent Model Genes and their functionally similar counterparts, while line colors signify their interconnected functional relationships

Funding1

—Scientific research project of Wuxi Municipal Health Commission

Keywords

Breast cancerBioinformaticsEfferocytosisInvasionBiomarkerDifferentially expressed genes

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsPhagocytosis and Immune Regulation · Immune cells in cancer · Epigenetics and DNA Methylation

Full text

Introduction

Breast cancer (BRCA) remains a principal source of cancer-related mortality among females worldwide, recording roughly 2.31 million newly diagnosed instances and 665,684 fatalities in 2022. The global burden is expected to rise further, markedly impacting patients’ quality of life and survival rates [1, 2]. Standard treatments, such as surgery, radiation, chemotherapy, and hormone therapy, have considerably improved survival outcomes [3]. However, these therapies often cause substantial side effects and exhibit limited effectiveness across all patient populations, underscoring the need for more precise and personalized therapeutic approaches.

An increasing body of research highlights the role of efferocytosis- and invasion-related genes in cancer. Efferocytosis refers to the process by which phagocytic cells clear and degrade apoptotic cells, a pathway capable of enabling tumor-associated macrophages (TAMs) to shift toward the M2 phenotype, thereby promoting tumor progression [4]. A study highlights that the efferocytosis-related gene (IL33) influences BRCA progression by regulating cytotoxicity and metabolic pathways in CD8 + T cells. IL33, identified as a key prognostic marker, suppresses tumor growth and enhances immune-mediated tumor destruction, underscoring the dual role of efferocytosis in both promoting immunosuppression and revealing therapeutic vulnerabilities [5]. Invasion-related genes are integral to cancer cell metastasis, a key factor in cancer-related mortality [5, 6]. Another study reveals that Collagen I triggers a force-dependent Yap activation loop in basal-like breast cancer cells, independent of K14 expression. Yap drives transcriptional programs that enhance Collagen I alignment and tension, creating a feed-forward cycle to sustain collective invasion. Active Yap is critical for leader cell selection and invasion in 3D models and mouse mammary fat pads, highlighting its role as a central mechanotransducer in metastasis [6]. Dysregulation of these genes has been linked to poor prognosis in diverse cancers, encompassing colorectal, lung, and ovarian cancers [7–10]. However, their specific roles and clinical implications in BRCA remain poorly defined, necessitating further investigation.

This study employed a bioinformatics approach to examine the differential expression of efferocytosis- and invasion-related genes in BRCA and their prognostic relevance. TCGA, GEO, and GeneCards datasets were employed to ascertain differentially expressed genes (DEGs) between BRCA and control specimens. A prognostic risk model (PRM) was established utilizing Cox and LASSO regression analyses, and its effectiveness was confirmed through Receiver Operating Characteristic (ROC) curves and Kaplan-Meier (KM) survival analysis. Functional enrichment analyses (FEAs) were implemented to uncover the molecular processes linked to these genes, and protein-protein interaction (PPI) networks were constructed to examine their interrelationships.

Our results demonstrated that the identified genes are critical in BRCA advancement and clinical outcomes. The PRM exhibited strong precision in forecasting patient outcomes, indicating its potential clinical utility. FEA suggested that these genes are implicated in key biological processes (BPs) and pathways, including cell cycle regulation, apoptosis, and immune response, all essential for cancer development and progression [11–13]. The PPI network analysis further emphasized the complex interactions among these genes, offering potential insights into novel therapeutic interventions.

Methodologies and materials

Data collection

The BRCA information was procured through The Cancer Genome Atlas (TCGA) portal (https://portal.gdc.cancer.gov/) [14] utilizing the TCGAbiolinks R package (Version 2.30.0) [15]. Following removing cases with incomplete clinical records, the analysis incorporated 1062 BRCA specimens and 110 non-tumor specimens. The information underwent standardization to FPKM (Fragments Per Kilobase per Million) format, while associated clinical records were extracted from the UCSC Xena platform (https://xena.ucsc.edu/) [16]. Table 1 provides detailed information.

Table 1. Baseline table with BRCA patients characteristicsCharacteristicsOverall MStage, n (%) M0892 (84%)M1&MX170 (16%) NStage, n (%) N0513 (48.3%)N1357 (33.6%)N2116 (10.9%)N376 (7.2%) TStage, n (%) T1268 (25.2%)T2623 (58.7%)T3138 (13%)T433 (3.1%)

Additionally, the BRCA datasets GSE42568 [17] and GSE29431 were downloaded from the GEO database (https://www.ncbi.nlm.nih.gov/geo/) [18] utilizing the R package GEOquery (Version 2.70.0) [19]. The analyzed specimens within these datasets originated from Homo sapiens breast tissue. Comprehensive dataset characteristics are depicted in Table 2. The GSE42568 dataset, based on the GPL570 platform, included 104 BRCA samples and 17 control samples, while the GSE29431 dataset, also utilizing the GPL570 platform, contained 54 BRCA specimens and 12 control specimens. All specimens were incorporated into the analysis.

Table 2GEO microarray chip informationGSE42568 GSE29431PlatformGPL570 GPL570SpeciesHomo sapiensHomo sapiensTissueBreastBreastSamples in BRCA group10454Samples in control group1712ReferencePMID: 23,740,839\

The R package sva initially addressed batch effects within BRCA datasets GSE42568 and GSE29431, yielding merged GEO datasets. To verify batch effect elimination efficacy, distribution boxplots illustrated expression values prior to and following batch effect adjustment. Furthermore, PCA plots demonstrated low-dimensional feature distribution before and after batch effect elimination. The distribution boxplot and PCA visualizations confirmed successful batch effect removal from the BRCA datasets. Detailed information is depicted in Figure S1.

The GeneCards database (https://www.genecards.org/) [20], an extensive resource for human gene information, was employed to ascertain efferocytosis- and invasion-related genes (EIRGs). Using the keywords “Efferocytosis” and “Invasion,” 55 efferocytosis-related genes (ERGs) were initially obtained by filtering for “Protein Coding” genes with a relevance score > 2. Similarly, 2233 invasion-related genes (IRGs) were identified using the same criteria. Additionally, a search for “Efferocytosis” in PubMed (https://pubmed.ncbi.nlm.nih.gov/) [21–23] resulted in 324 ERGs, while searching for “Invasion” yielded 1,029 IRGs [24–26]. After merging and removing duplicates, 326 ERGs and 3,042 IRGs were obtained. The overlap between these collections produced 127 EIRGs. Detailed information is depicted in Table S1.

The sva R package (Version 3.50.0) [27] was utilized to eliminate batch variations across the GSE42568 and GSE29431 datasets, generating the Merged GEO datasets. This consolidated information encompassed 158 BRCA specimens and 29 normal specimens. The unified GEO information underwent normalization through the limma R package (Version 3.58.1) [28], with probe annotations being standardized. To assess the batch correction efficiency, principal component analysis (PCA) [29] was executed on expression matrices before and after batch adjustment. PCA functions as a dimension reduction approach that captures essential characteristics from multidimensional information, converting it to lower dimensions and displaying these elements in 2D or 3D representations for verification.

BRCA-associated efferocytosis- and invasion-associated DEGs

Specimens from the TCGA-BRCA database were split between BRCA and reference populations. Gene expression variance examination was performed utilizing DESeq2 R software [30] (Version 1.42.0) to evaluate both cohorts. Differentially expressed genes were selected using|logFC| >1 criteria and p-value < 0.05. Genes exhibiting logFC > 1 and p-value < 0.05 were designated elevated DEGs, whereas those showing logFC < −1 and p-value < 0.05 were classified as reduced DEGs. The analytical outcomes were depicted in a volcano visualization created with ggplot2 R software (Version 3.4.4).

To identify efferocytosis- and invasion-related DEGs (EIRDEGs), the DEGs from the TCGA-BRCA dataset (|logFC| >1, p-value < 0.05) were intersected with EIRGs. A Venn diagram was constructed to identify the overlap, and the resulting EIRDEGs were illustrated via a heatmap developed by the R package R package (Version 1.0.12).

Development of BRCA PRM

To construct the PRM in TCGA, univariate and multivariate Cox regression analyses were executed employing the R package survival (Version 3.5-7) [31], incorporating clinical information to evaluate the prognostic impact of EIRDEGs. To determine whether these EIRDEGs were independent prognostic factors, univariate Cox regression analysis was first executed. Variables with a p-value < 0.1 were subsequently incorporated into the multivariate Cox regression analysis.

LASSO regression analysis was subsequently executed utilizing the glmnet R package [32] (Version 4.1-8) with the parameter family = “cox” based on the EIRDEGs identified in the univariate Cox regression analysis. Ten iterations were implemented in the process. This methodology facilitated PRM gene selection. LASSO regression, which extends beyond traditional linear regression, incorporates a penalization component (lambda × absolute value of slope) to minimize model overadaptation and strengthen its broad applicability. The analytical outcomes were depicted through both the PRM illustration and the variable progression graph. The ultimate risk calculation utilized this equation, derived from LASSO regression coefficients:

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:\text{r}\text{i}\text{s}\text{k}Score\:=\:\sum\:_{i}Coefficient\:\left({gene}_{i}\right)\text{*}mRNA\:Expression\:\left({gene}_{i}\right)$$\end{document}

Prognostic analysis of BRCA PRM

To evaluate the PRM’s effectiveness, determine optimal models, eliminate inferior alternatives, and establish ideal thresholds within individual models, a time-dependent Receiver Operating Characteristic (ROC) curve [33] was implemented. The survivalROC R package (Version 1.0.3.1) facilitated the creation of time-dependent ROC curves and Area Under the Curve (AUC) calculations, utilizing risk scores and overall survival (OS) data from the TCGA-BRCA dataset’s BRCA specimens. This evaluation focused on predicting survival outcomes at 3-, 4-, and 5-year intervals. AUC values spanned from 0.5 to 1, with higher values signifying enhanced diagnostic capability. Values exceeding 0.5 suggested that molecular expression facilitated event occurrence; ranges of 0.5–0.7 indicated minimal accuracy, 0.7–0.9 suggested moderate accuracy, while values above 0.9 demonstrated high accuracy. The survival R package [31] (Version 3.5-7) was employed for KM curve [34] analysis to examine OS differences between high and low-risk cohorts, with the KM curve generated from risk score data.

The outcomes of univariate and multivariate Cox regression examinations were depicted through a forest plot, demonstrating risk score expression and clinical parameter patterns in both analyses. A Nomogram [35], which portrays the correlations among various independent variables through separate line segments, was established utilizing the rms R package (Version 6.7-1), derived from multivariate Cox regression outcomes, emphasizing the connections between risk scores and clinical parameters.

A Calibration Curve illustrated the framework’s predictive precision by contrasting actual outcomes against model-projected probabilities across various scenarios. The calibration examination evaluated the PRM’s precision and differentiation capabilities using multivariate Cox risk calculations. Furthermore, Decision Curve Analysis (DCA) [36], a practical approach for assessing clinical prediction frameworks, diagnostic evaluations, and molecular indicators, was implemented through the ggDCA R package (Version 1.1) to generate DCA visualizations, providing additional validation of the BRCA PRM’s precision and discriminatory power.

Gene ontology (GO) and pathway (KEGG) enrichment analysis

GO analysis [37] represents a prevalent method for extensive functional enrichment investigations, incorporating BP, Cellular Component (CC), and Molecular Function (MF). The KEGG [38] functions as a leading repository, offering extensive data on genomes, biological pathways, diseases, and drugs. For GO and KEGG enrichment analysis, we used ANO6 and PLGRKT as input gene sets. These two genes are differentially expressed genes (EIRDEGs) associated with efferocytosis and invasion identified by LASSO regression. To ensure the accuracy and biological relevance of the enrichment analysis results, we selected a screened gene set as the background genes. The background gene set was obtained from the R package “org.Hs.eg.db”, which provides relatively comprehensive gene function annotation information. GO and KEGG pathway enrichment evaluations were conducted on the screened Model Genes utilizing the clusterProfiler R package [39] (Version 4.10.0), implementing a p-value < 0.05 selection standard.

Gene set enrichment analysis (GSEA)

The BRCA dataset (TCGA-BRCA) underwent stratification into high-risk and low-risk cohorts utilizing median risk scores. Gene expression variance analysis utilized DESeq2 (Version 1.42.0), with findings displayed through volcano plots generated via ggplot2 R package (Version 3.4.4), emphasizing genes meeting|logFC| >2 and p-value < 0.05 criteria. The pheatmap R package (Version 1.0.12) generated heat maps displaying expression levels of genes satisfying|logFC| >2 and p-value < 0.05 parameters. GSEA examination encompassed all genes within BRCA specimens utilizing clusterProfiler R package (Version 4.10.0) [39]. GSEA configurations incorporated a 2020 seed, establishing gene set parameters between 10 and 500 genes. Gene sets were acquired from the Molecular Signatures Database (MSigDB), specifically utilizing c2.cp.all.v2022.1.hs.symbols.gmt data. GSEA selection utilized a p-value < 0.05 threshold.

Gene set variation analysis (GSVA)

GSVA [40] functions as an unsupervised, distribution-free technique that quantifies gene set enrichment across specimens by transforming gene expression data into sample-specific pathway scores. This methodology determines whether particular pathways exhibit differential enrichment among various specimens. Gene sets were obtained from MSigDB [41], specifically utilizing the c2.cp.v2023.2.hs.symbols.gmt file. GSVA implementation utilized the GSVA R package (Version 1.50.0) on the complete gene collection from the TCGA-BRCA dataset’s BRCA specimens, emphasizing pathway enrichment distinctions between elevated and reduced risk categories. GSVA selection criteria employed a p-value threshold of 0.05, implementing Benjamini-Hochberg (BH) adjustment for p-value correction.

PPI network

The PPI network encompasses interacting proteins that execute essential roles in BPs, encompassing signal transduction, genetic expression control, metabolic processes, and cell cycle modulation. A comprehensive examination of protein interactions within biological systems is essential for comprehending protein roles, biological signaling mechanisms, and metabolic pathways under various physiological states, encompassing pathological conditions. This investigation also helps illuminate protein functional associations. The GeneMANIA database [42] (https://genemania.org/) generates functional gene hypotheses, examines gene collections, and ranks genes for subsequent functional investigation. When given query genes, GeneMANIA discovers functionally comparable genes by analyzing extensive genomics and proteomics information, assigning weights to datasets based on query gene predictive value. Furthermore, GeneMANIA forecasts gene functions by identifying genes likely to share functional characteristics with query genes utilizing their interactions. The PPI network was established by identifying functionally similar genes to the Model Genes through the GeneMANIA platform.

Differential expression verification and ROC curve analysis

BRCA samples from the TCGA-BRCA dataset were split into high-risk and low-risk cohorts utilizing the median value of the risk score procured from the prognosis model. To examine the distinctions in Model Gene expression among these cohorts, a comparative visualization was generated utilizing the Model Gene’s expression levels. The diagnostic effectiveness of Model Gene expression was assessed through ROC curve visualization and AUC calculation utilizing the R package pROC [43] (Version 1.18.5). The AUC metric typically extends from 0.5 to 1, with elevated values denoting superior diagnostic capability. An AUC surpassing 0.5 suggests that molecular expression might facilitate event occurrence, with values of 0.5–0.7 denoting minimal accuracy, 0.7–0.9 suggesting intermediate accuracy, and above 0.9 representing superior accuracy.

Risk scores for BRCA specimens from consolidated GEO datasets were ascertained through risk coefficient calculations. BRCA specimens were categorized into high-risk and low-risk cohorts utilizing median risk score values. To examine Model Gene expression variations between these categories, a comparative expression visualization was developed. The diagnostic utility of Model Gene expression underwent evaluation through ROC curve visualization and AUC calculation utilizing the R package pROC (Version 1.18.5).

Statistical analysis

R software (Version 4.3.0) facilitated data manipulation and examination. When evaluating continuous variables across two cohorts, the independent Student’s t-test determined statistical significance for normally distributed data, except where noted otherwise. Non-normally distributed variables underwent Mann-Whitney U test (Wilcoxon Rank Sum test) analysis. The Kruskal-Wallis test examined differences among three or more cohorts. Correlation coefficients between molecular variables were established through Spearman correlation analysis. Statistical evaluations employed two-sided testing unless specified differently, with significance established at p < 0.05.

Results

Technology roadmap

The process of integrative analysis of EIRDEGs is shown in Fig. 1.

Fig. 1. Flow chart for the analysis of EIRDEGs

BRCA-related efferocytosis- and invasion-related DEGs

Subsequently, the BRCA dataset (TCGA-BRCA) was categorized into BRCA and Control cohorts. Gene expression distinctions between these cohorts were examined. This investigation identified two DEG collections: 7,860 DEGs satisfied the parameters of|logFC| >1 and p-value < 0.05. Within these, 4,130 genes exhibited increased expression (logFC > 1 and p-value < 0.05), while 11,990 genes showed decreased expression (logFC < −1 and p-value < 0.05). The findings were illustrated through a volcano plot (Fig. 2A).

To detect EIRDEGs, all DEGs exhibiting|logFC| >1 and p-value < 0.05 were cross-referenced with EIRGs. A Venn diagram (Fig. 2B) was plotted to compare the DEGs with the EIRGs in the Combined GEO Datasets, revealing 32 EIRDEGs. Details regarding these genes appear in Table S2. Following this, the expression differences of the EIRDEGs in the TCGA-BRCA dataset were examined, and a heatmap was developed to visually present the outcomes (Fig. 2C).

Fig. 2. Differential Gene Expression Analysis. A Volcano plot of DEGs analysis between breast cancer (BRCA) cohort and Control (Control) cohort in the Breast Cancer Dataset (TCGA-BRCA). B DEG and EIRGs in the Breast Cancer dataset (TCGA-BRCA), Venn diagram of all genes in the Combined GEO Datasets. C Heat map of EIRDEGs in the breast cancer dataset (TCGA-BRCA). In the heat map grouping, blue is the Control (Control) cohort, and yellow is the BRCA cohort. In the heat map, red denotes elevated expression, and blue indicates reduced expression

Construction of BRCA PRM

For BRCA PRM development, univariate Cox regression analysis was executed on the 32 EIRDEGs. Variables with a p-value < 0.05 in the univariate analysis were visualized utilizing a Forest plot (Fig. 3A). The results identified two statistically significant EIRDEGs (p-value < 0.05), denoted as ANO6 and PLGRKT. To further examine the prognostic value of these genes in BRCA, LASSO regression analysis was conducted, and a LASSO regression model was developed. The LASSO model and variable trajectory plot (Fig. 3B-C) were generated for visualization. The LASSO regression model included two genes: ANO6 and PLGRKT. The risk score was computed utilizing the following formula: risk score = ANO6 × (0.328) + PLGRKT × (− 0.277).

Fig. 3. Cox Regression Analysis. A Forest Plot of 2 effemination and invasion associated differentially expressed genes (EIRDEGs) in univariate Cox regression model. B,** C** PRM plot (B) and variable trajectory plot (C) from the LASSO regression model

Prognostic analysis of BRCA PRM

Time-dependent ROC curves (Fig. 4A) were generated for BRCA specimens within the TCGA-BRCA dataset. Findings demonstrated that the BRCA PRM displayed minimal precision (0.7 > AUC > 0.5) at 3, 4, and 5 years. Furthermore, KM curve examination (Fig. 4B) utilizing median OS categorization for BRCA specimens in the risk score-Combined Breast Cancer Dataset (TCGA-BRCA) exhibited substantial statistical distinctions between elevated and reduced risk categories (p < 0.01).

Following this, singular Cox regression examination incorporated risk metrics and clinical indicators to evaluate their overall survival (OS) prognostic significance in BRCA specimens. Parameters yielding p-values < 0.10 underwent multiple Cox regression examinations. Both singular and multiple Cox regression outcomes were illustrated through Forest plots (Fig. 4C, D) and detailed in Table 3. Multiple Cox regression examinations identified risk metrics, Age, NStage subcategories (N1, N2, and N3), and TStage subcategory (T4) as substantially significant indicators (p < 0.01).

Table 3. Results of Cox analysisCharacteristicsTotal(N)Univariate analysisMultivariate analysisHR (95% CI)P valueHR (95% CI)P valueAge10621.032 (1.019–1.046)< 0.0011.037 (1.023–1.051)< 0.001MStage1062M0892ReferenceReferenceM1&MX1701.655 (1.055–2.598)0.0281.154 (0.717–1.860)0.555NStage1062N0513ReferenceReferenceN13571.919 (1.304–2.825)< 0.0011.909 (1.269–2.869)0.002N21162.479 (1.459–4.212)< 0.0012.504 (1.426–4.395)0.001N3764.128 (2.283–7.461)< 0.0013.046 (1.578–5.880)< 0.001TStage1062T1268ReferenceReferenceT26231.515 (0.977–2.350)0.0641.405 (0.890–2.219)0.145T31381.785 (1.028–3.099)0.0401.304 (0.726–2.342)0.375T4334.488 (2.292–8.788)< 0.0012.261 (1.101–4.644)0.026Risk.Score10622.805 (1.524–5.163)< 0.0012.943 (1.521–5.694)0.001HR hazard ratio

To enhance understanding of the risk model’s prognostic significance in BRCA, a Nomogram was developed incorporating both univariate and multivariate Cox regression analytical outcomes. This framework demonstrates the correlation between risk scores and four clinical parameters in BRCA specimens (Fig. 4E). The assessment indicated the prognostic significance of risk score and Age substantially exceeded other clinical parameters. In contrast, MStage, NStage, and TStage exhibited considerably lower utility in the PRM compared to the risk score.

Fig. 4. Prognostic analysis. A Temporal ROC curves for BRCA specimens in the breast cancer dataset (TCGA-BRCA). B Prognostic KM curves comparing high and low RiskScore cohorts and OS of BRCA specimens. C,** D** Forest Plot depicting RiskScore and clinical parameters in univariate Cox regression model (C) and multivariate Cox regression model (D). E Nomogram illustrating RiskScore and clinical parameters in univariate and multivariate Cox regression model. AUC values exceeding 0.5 suggest molecular expression’s inclination toward event occurrence. AUC values approaching 1 indicate enhanced diagnostic effectiveness, while values between 0.5–0.7 demonstrate limited precision. P values below 0.01 were deemed highly statistically significant

Additionally, a calibration analysis was conducted for the 3-year, 4-year, and 5-year prognosis of the BRCA risk model, with Calibration Curves plotted (Fig. 5A–C). The horizontal axis of the Calibration Curve depicts model-predicted survival likelihood, while the vertical axis shows the actual survival probability. The lines representing the predicted survival probabilities at different time points closely aligned with the ideal gray line, indicating a better prediction at these time points. The outcomes demonstrated that the 5-year prediction had the best clinical performance for the BRCA risk model. Finally, DCA was utilized to examine the clinical effectiveness of the PRM at 3 years, 4 years, and 5 years (Fig. 5D–F). When the model’s line remained consistently above both the “All Positive” and “All Negative” lines within a certain range, a larger range was associated with a greater net benefit, reflecting better model performance. The analysis revealed that the LASSO regression model’s clinical prediction efficacy ranked: 5 years > 4 years > 3 years.

Fig. 5. Prognostic Analysis. A–C 3-year (A), 4-year (B), and 5-year (C) Calibration curves of the BRCA PRM. D–F Prognosis of BRCA risk model 3 years (D), 4 years (E), 5 years (F) decision curve analysis (DCA)

GO and KEGG enrichment analyses

GO and KEGG pathway enrichment analyses were executed to further investigate the relationships between BPs, CCs, MFs, and biological pathways (KEGG) for the two Model Genes in BRCA. The detailed outcomes from the GO and KEGG enrichment analyses are depicted in Table 4. The examination demonstrated that the two Model Genes in BRCA exhibited enrichment predominantly in BPs encompassing bleb assembly, positive modulation of phagocytosis, positive modulation of membrane invagination, modulation of plasminogen activation, and phagocytosis. Regarding cellular components, the Model Genes exhibited enrichment in activities such as voltage-gated chloride channel activity, phospholipid scramblase activity, voltage-gated anion channel activity, and intracellular calcium-activated chloride channel activity. Regarding molecular functions, they were associated with the chloride channel complex, tertiary granule membrane, and specific granule membrane. Notably, no significant KEGG pathway enrichment was observed in this analysis. Bubble plots illustrated the GO enrichment analysis outcomes (Fig. 6A).

Table 4. Result of GO and KEGG enrichment analysis for model genesONTOLOGYIDDescriptionGeneRatioBgRatiopvaluep.adjustqvalueBPGO:0032060bleb assembly1/210/18,8000.0010635750.0181728310.000164908BPGO:0060100Positive regulation of phagocytosis, engulfment1/213/18,8000.0013825370.0181728310.000164908BPGO:1,905,155Positive regulation of membrane invagination1/213/18,8000.0013825370.0181728310.000164908BPGO:0010755Regulation of plasminogen activation1/215/18,8000.001595150.0181728310.000164908BPGO:0060099Regulation of phagocytosis, engulfment1/215/18,8000.001595150.0181728310.000164908MFGO:0005247Voltage-gated chloride channel activity1/211/18,4100.0011946780.0086000470.000822971MFGO:0017128Phospholipid scramblase activity1/214/18,4100.0015203760.0086000470.000822971MFGO:0008308Voltage-gated anion channel activity1/217/18,4100.001846020.0086000470.000822971MFGO:0005229Intracellular calcium activated chloride channel activity1/218/18,4100.0019545560.0086000470.000822971MFGO:0061778Intracellular chloride channel activity1/218/18,4100.0019545560.0086000470.000822971CCGO:0034707Chloride channel complex1/252/19,5940.0053008390.027801673CCGO:0070821Tertiary granule membrane1/273/19,5940.007437570.027801673CCGO:0035579Specific granule membrane1/291/19,5940.0092672240.027801673CCGO:0042581Specific granule1/2160/19,5940.0162652640.030006336CCGO:0070820Tertiary granule1/2164/19,5940.0166701870.030006336GO gene ontology, BP biological process, CC cellular component, MF molecular function

Additionally, network maps for BPs, CCs, and MFs were created based on the GO enrichment analysis (Fig. 6B–D). These maps illustrate the relationships between the corresponding annotations and molecules, where nodes with larger entries represent a higher number of associated molecules.

Fig. 6GO enrichment analysis for model genes. A The scatter visualization of GO enrichment analysis findings for Model Genes illustrated: BP, CC, MF, with GO terms on the vertical axis. B–D Model Genes (BP (B), CC (C), MF (D)) gene ontology (GO) enrichment analysis network illustration. Pink vertices represent entries, blue vertices indicate molecules, while connections demonstrate entry-molecule relationships. Within the scatter plot, circle dimensions reflect gene quantities, while color gradients indicate p-value magnitude. Deeper blue signifies lower p-values, while redder hues represent larger p-values. GO enrichment analysis employed a p value < 0.05 selection criterion

GSEA for high- and low-risk cohorts

To perform a differential analysis of BRCA specimens in the TCGA-BRCA dataset, BRCA samples were stratified into high-risk and low-risk cohorts utilizing the median risk score from the BRCA Risk model. DEGs between these cohorts were ascertained, with the following results: 1,138 DEGs satisfied the parameters of|logFC| >2 and p-value < 0.05. Within this set, 1129 genes exhibited increased expression (logFC > 2 and p-value < 0.05), while 9 genes showed decreased expression (logFC < −2 and p-value < 0.05). A volcano plot illustrated these comparative findings (Fig. 7A). Furthermore, the heatmap displaying the logFC of the leading 10 increased and decreased genes (Fig. 7B).

To investigate gene expression influence on BRCA, GSEA was conducted on TCGA-BRCA BRCA specimens to explore connections between gene expression and various BPs, cellular components, and molecular functions (Fig. 7C). GSEA findings are listed in Table 5. The investigation demonstrated significant gene enrichment in multiple essential pathways, encompassing the KEAP1-NFE2L2 pathway (Fig. 7D), Nuclear Events Mediated by NFE2L2 (Fig. 7E), NFE2L2 pathway (Fig. 7E), TP53 Regulation of Metabolic Genes (Fig. 7F), and the TNFR2 non-canonical NF-κB pathway (Fig. 7G), among others, highlighting their potential roles in BRCA biology and signaling pathways.

Fig. 7GSEA for Risk Cohort. A Volcano plot of DEGs analysis between High Risk cohort and Low Risk cohort in BRCA samples in the Breast Cancer Dataset (TCGA-BRCA). B logFC ranked Top10 up-regulated and Top10 down-regulated DEGs heatmap of BRCA samples in High Risk cohort and Low Risk cohort. C Display of 4 biological function bubble plots from GSEA of Breast cancer dataset (TCGA-BRCA). D–G GSEA depicted that genes in the breast cancer dataset (TCGA-BRCA) were markedly enriched in KEAP1 NFE2L2 Pathway (D), Nuclear Events Mediated By NFE2L2 (E), KEAP1 NFE2L2 pathway (D), and nuclear events mediated by NFE2L2 (E). TP53 Regulates Metabolic Genes (F), Tnfr2 non-canonical NF-κB Pathway (G). The heatmap utilizes orange for High Risk cohort and blue for Low Risk cohort designation. Expression levels are depicted in red (high) and blue (low). Bubble dimensions correspond to enriched gene quantities, while bubble coloration reflects P-value magnitude. Blue shades indicate smaller P-values, while redder shades signify larger values. GSEA employed a p value < 0.05 threshold

Table 5. Results of GSEA for risk groupIDsetSizeEnrichment scoreNESpvaluep.adjustqvalueREACTOME_KEAP1_NFE2L2_PATHWAY102− 0.298389771− 1.4210978960.0027943190.0469060140.042660272REACTOME_NUCLEAR_EVENTS_MEDIATED_BY_NFE2L279− 0.360936382− 1.6486180160.0007496490.0165876910.01508624REACTOME_TNFR2_NON_CANONICAL_NF_KB_PATHWAY100− 0.373704464− 1.7621461747.32298E−050.0024183030.002199408REACTOME_DECTIN_1_MEDIATED_NONCANONICAL_NF_KB_SIGNALING62− 0.502013035− 2.2360029131.14814E−067.55293E−056.86927E−05REACTOME_P130CAS_LINKAGE_TO_MAPK_SIGNALING_FOR_INTEGRINS150.7392680141.7074876230.0032115440.0517675380.047081751REACTOME_GRB2_SOS_PROVIDES_LINKAGE_TO_MAPK_SIGNALING_FOR_INTEGRINS150.723387011.6708072610.0056308030.0774314980.070422714REACTOME_SIGNALING_BY_NOTCH482− 0.360259982− 1.7028107690.0004484070.0109142360.009926323REACTOME_NEGATIVE_REGULATION_OF_NOTCH4_SIGNALING54− 0.54947764− 2.3809691346.7347E−074.82125E−054.38485E−05REACTOME_HEDGEHOG_OFF_STATE112− 0.335724046− 1.6184556690.0001903140.0055145760.005015419REACTOME_HEDGEHOG_LIGAND_BIOGENESIS65− 0.473871235− 2.1303899851.42935E−050.0006690440.000608485REACTOME_TP53_REGULATES_METABOLIC_GENES84− 0.36537644− 1.724501980.0001855750.0054939360.004996647REACTOME_KEAP1_NFE2L2_PATHWAY102− 0.298389771− 1.4210978960.0027943190.0469060140.042660272REACTOME_NUCLEAR_EVENTS_MEDIATED_BY_NFE2L279− 0.360936382− 1.6486180160.0007496490.0165876910.01508624

GSVA for high- and low-risk cohorts

To examine distinctions between elevated and reduced risk categories in TCGA-BRCA dataset specimens, GSVA implementation utilized the c2.cp.v2023.2.Hs.symbols.gmt gene collection. The examination encompassed all BRCA dataset genes, with findings presented in Table 6. Post-GSVA identification focused on the top 20 pathways exhibiting p-value < 0.05 and maximum absolute logFC values. These pathways underwent differential expression analysis between elevated and reduced risk categories, with visualization through a heatmap (Fig. 8A). The heatmap effectively demonstrates pathway expression variations between risk categories. For validation purposes, Mann-Whitney U testing provided statistical verification, with findings depicted in a group comparison chart (Fig. 8B). The analysis revealed statistically significant pathways, confirming expression differences between high-risk and low-risk cohorts(p-values < 0.05).

Table 6. Results of GSVA for risk groupPathwaylogFCAveExprtP.Valueadj.P.ValBKEGG MEDICUS VARIANT MUTATION CAUSED ABERRANT TDP43 TO ELECTRON TRANSFER IN COMPLEX I− 0.649960922− 0.024804031− 16.060943873.16E−522.83E−49107.9126842KEGG MEDICUS REFERENCE ELECTRON TRANSFER IN COMPLEX I− 0.647630454− 0.024581824− 15.926865551.80E−519.21E−49106.1863574KEGG MEDICUS VARIANT MUTATION CAUSED ABERRANT SNCA TO ELECTRON TRANSFER IN COMPLEX I− 0.640529328− 0.024078342− 16.048817453.70E−522.83E−49107.7561845KEGG MEDICUS VARIANT MUTATION INACTIVATED PINK1 TO ELECTRON TRANSFER IN COMPLEX I− 0.637070915− 0.023625245− 15.860163174.27E−511.87E−48105.3308615WP MITOCHONDRIAL COMPLEX IV ASSEMBLY− 0.619678527− 0.035739229− 15.316537724.45E−481.36E−4598.44287446REACTOME COMPLEX I BIOGENESIS− 0.617311196− 0.026612511− 16.26534752.18E−533.35E−50110.5614932WP MITOCHONDRIAL COMPLEX I ASSEMBLY MODEL OXPHOS SYSTEM− 0.616000229− 0.028324822− 16.567858074.04E−551.24E−51114.5186333WP OXIDATIVE PHOSPHORYLATION− 0.614585594− 0.028012153− 15.527186553.07E−491.04E−46101.0938647KEGG MEDICUS REFERENCE MITOCHONDRIAL COMPLEX UCP1 IN THERMOGENESIS− 0.611982814− 0.026983059− 15.977628459.33E−525.72E−49106.8389073WP MITOCHONDRIAL COMPLEX III ASSEMBLY− 0.611972531− 0.031458591− 14.59703233.45E−447.06E−4289.56616686KEGG MEDICUS REFERENCE ELECTRON TRANSFER IN COMPLEX III− 0.601838416− 0.02861327− 12.505578981.29E−331.02E−3165.45926497BIOCARTA SM PATHWAY− 0.601521532− 0.034550359− 12.403141293.97E−332.83E−3164.34818533KEGG MEDICUS VARIANT MUTATION CAUSED ABERRANT ABETA TO ELECTRON TRANSFER IN COMPLEX I− 0.60038374− 0.027739762− 15.835135295.90E−512.26E−48105.0104396REACTOME FORMATION OF ATP BY CHEMIOSMOTIC COUPLING− 0.59626833− 0.028984448− 13.026222773.87E−363.59E−3471.21067852KEGG MEDICUS REFERENCE ELECTRON TRANSFER IN COMPLEX IV− 0.589107255− 0.043329756− 13.166250697.87E−378.32E−3572.78675572WP ELECTRON TRANSPORT CHAIN OXPHOS SYSTEM IN MITOCHONDRIA− 0.577611407− 0.025009271− 14.899981098.23E−461.80E−4393.26953991WP IRON SULFUR CLUSTER BIOGENESIS− 0.573733757− 0.049065231− 14.580805154.21E−448.07E−4289.36923004REACTOME MITOCHONDRIAL IRON SULFUR CLUSTER BIOGENESIS− 0.570711009− 0.047993282− 14.445585652.20E−433.54E− 4187.73389703REACTOME MITOCHONDRIAL TRANSLATION− 0.566449945− 0.021626643− 15.204317151.83E−475.11E−4597.04007782KEGG MEDICUS ENV FACTOR ARSENIC TO ELECTRON TRANSFER IN COMPLEX IV− 0.563809073− 0.032644963− 13.053724662.83E−362.71E−3471.51925948

Fig. 8GSVA Analysis. A,** B** Heat map (A) and comparative visualization (B) of gene set variant analysis (GSVA) findings between High Risk and Low Risk cohorts from TCGA-BRCA dataset specimens. *** denotes p value < 0.001, indicating substantial statistical significance. Orange indicates the High Risk cohort while blue signifies the Low Risk cohort. The selection parameters for gene set variation analysis (GSVA) utilized p-value < 0.05, with Benjamini-Hochberg (BH) adjustment for p-value correction. In the heat map, blue indicates minimal enrichment while red signifies substantial enrichment

PPI network

Furthermore, a PPI network of the two Model Genes and their functionally similar genes was constructed utilizing the GeneMANIA website (Fig. 9). The network visually represents the co-expression links between the two Model Genes and their similar genes, with different colors in the lines indicating different types of relationships, such as co-expression or shared protein domains. The network revealed that the two Model Genes were functionally related to 20 similar proteins, which are detailed in Table S3.

Fig. 9PPI Network Analysis. GeneMANIA website forecasts interaction networks of functionally comparable Genes relative to two Model Genes. Within the illustration, circles represent Model Genes and their functionally similar counterparts, while line colors signify their interconnected functional relationships

Differential expression verification and ROC curve analysis between high- and low-risk cohorts

TCGA-BRCA dataset specimens were stratified into high-risk and low-risk cohorts utilizing median risk values from the BRCA prognostic framework. To investigate model gene expression variations in these specimens, a comparative evaluation (Fig. 10A) was implemented, revealing substantial distinctions in expression patterns of two model genes, A and B, across high-risk and low-risk cohorts. The investigation confirmed that model genes ANO6 and PLGRKT exhibited notably distinct expression patterns between these cohorts (p < 0.001). Additionally, ROC analyses were conducted to evaluate model gene predictive capabilities (Fig. 10B, C). These analyses demonstrated that model genes ANO6 and PLGRKT expression levels in the TCGA-BRCA dataset effectively distinguished between high-risk and low-risk cohorts, yielding AUC values between 0.7 and 0.9, indicating robust classification effectiveness.

Fig. 10. Differential Expression Validation and ROC Curve Analysis. A Model Genes comparison visualization between High Risk and Low Risk cohorts within BRCA specimens from TCGA-BRCA dataset. B,** C** Model Genes ANO6 (B), PLGRKT (C) ROC analysis in TCGA-BRCA dataset specimens. *** denotes p value < 0.001, indicating substantial statistical significance. AUC exceeding 0.5 suggests molecular expression’s propensity toward event occurrence. AUC proximity to 1 indicates enhanced diagnostic efficacy, with values between 0.7–0.9 demonstrating notable precision. Orange designates High Risk cohort while light blue represents Low Risk cohort. TPR, True Positive Rate; FPR, False Positive Rate

Differential expression verification and ROC curve analysis of the integrated GEO dataset

Risk scores for BRCA specimens from consolidated GEO datasets utilized risk coefficients derived from the BRCA prognostic framework. Specimens were classified into high-risk and low-risk categories utilizing median risk expression values. A comparative examination (Fig. 11A) of model gene expression patterns revealed notable variations between high-risk and low-risk categories. The assessment demonstrated that model genes ANO6 and PLGRKT exhibited substantially elevated expression in high-risk specimens compared to low-risk counterparts (p < 0.001). The pROC R package facilitated ROC analysis to assess the predictive capabilities of genes ANO6 and PLGRKT (Fig. 11B, C). Results indicated that ANO6 expression in the consolidated GEO dataset showed minimal classification precision (0.5 ≤ AUC < 0.7) in distinguishing high-risk and low-risk categories, while PLGRKT demonstrated intermediate classification precision (0.7 ≤ AUC < 0.9).

Fig. 11. Differential Expression Validation and ROC Curve Analysis. A Model Gene comparison plots between High Risk and Low Risk categories in BRCA specimens from Combined GEO Datasets. B,** C** ROC analyses of Model Genes ANO6 (B), PLGRKT (C) in BRCA specimens from integrated GEO Datasets (Combined Datasets). ** denotes p value < 0.01, indicating substantial statistical significance; *** represents p value < 0.001, demonstrating exceptional statistical significance. AUC values exceeding 0.5 suggest molecular expression’s promotion of events, with values approaching 1 indicating superior diagnostic outcomes. AUC demonstrates minimal precision from 0.5–0.7 and intermediate precision from 0.7–0.9. Orange indicates the High Risk category, while blue represents the Low Risk category

Discussion

BRCA represents a primary source of cancer-related morbidity and mortality among females globally. It is a heterogeneous disease comprising various subtypes, each with distinct biological and clinical characteristics. The high prevalence and substantial impact of BRCA on patients’ quality of life and survival underscore the critical need for ongoing research and improved therapeutic strategies.

By systematically integrating efferocytosis- and invasion-related genes, this study identified ANO6 and PLGRKT as dual-process-driven prognostic biomarkers in breast cancer. The two-gene PRM, constructed via LASSO regression, demonstrated moderate predictive accuracy in the TCGA cohort (3-year AUC: 0.605; 5-year AUC: 0.626), with significant survival stratification between high- and low-risk groups (KM log-rank P < 0.001). External validation in merged GEO datasets confirmed the robustness of PLGRKT (AUC = 0.931), though ANO6 exhibited reduced generalizability (AUC = 0.629), suggesting subtype-specific biological variability [46]. The antagonistic risk weights of ANO6 (positive coefficient) and PLGRKT (negative coefficient) reflect their potential functional interplay in balancing tumor microenvironment dynamics, providing a novel molecular lens for deciphering breast cancer heterogeneity.

Functional enrichment analyses revealed that ANO6 and PLGRKT converge on phagocytosis regulation and membrane remodeling. ANO6, a calcium-activated chloride channel with phospholipid scramblase activity, likely facilitates efferocytosis by exposing phosphatidylserine (PS) on the outer membrane—a critical “eat-me” signal for tumor-associated macrophages (TAMs) [47]. Concurrently, its ion flux regulation may promote cytoskeletal reorganization, enabling invasive pseudopod formation [48]. In contrast, PLGRKT, a plasminogen receptor, appears to inhibit ECM degradation by sequestering plasminogen from urokinase-mediated activation [49]. Collectively, high-risk BRCA is characterized by compromised antioxidant capacity, metabolic dysregulation, and suppressed inflammatory signaling, whose synergistic effects may accelerate malignant progression and poor clinical outcomes. Therapeutic strategies targeting NRF2 activation, restoring TP53-mediated metabolic control, or modulating NF-κB signaling could offer novel avenues for high-risk patient management [50].

The PPI network further linked ANO6 to calcium signaling regulators (e.g., SLC4A1) and PLGRKT to integrin-mediated adhesion molecules (e.g., ITGB1) [48]. These interactions suggest a coordinated mechanism: ANO6-mediated ion dyshomeostasis primes early metastatic niche formation, while PLGRKT loss exacerbates ECM stiffness and inflammatory signaling, collectively fueling tumor progression [51]. This spatiotemporal synergy underscores the dual-process model’s biological plausibility, where efferocytosis evasion and stromal co-option operate as complementary hallmarks of aggressive disease.

Multivariate analysis identifies lymph node metastasis (N stage), risk score, age, and T4 tumors as independent prognostic factors in high-risk BRCA patients, while M stage and T2/T3 tumors showed no significant impact. These findings provide critical insights for refining risk stratification and guiding personalized therapeutic strategies. Decision curve analysis (DCA) confirmed its clinical utility, showing net benefit advantages over TNM staging at 5-year predictions. For early-stage patients lacking conventional high-risk features, the model could refine adjuvant therapy selection—high-risk subgroups may benefit from NF-κB inhibitors (e.g., bortezomib) or oxidative stress-targeting agents [50], as suggested by GSVA-identified pathway enrichments. Importantly, PLGRKT’s stable diagnostic performance across cohorts positions it as a priority biomarker for clinical assay development [52]. Unlike conventional single-mechanism models (e.g., MMP-based invasion signatures [48] or efferocytosis-focused markers like GAS6/MFG-E8 [47]), this study pioneers a dual-pathway integration strategy. While Oncotype DX^®^ and similar multi-gene panels achieve comparable accuracy [53], their “black-box” algorithms lack mechanistic interpretability. Here, ANO6 and PLGRKT offer direct therapeutic targeting potential: ANO6 inhibitors (e.g., CaCCinh-A01) could disrupt efferocytosis-immune crosstalk [54], whereas PLGRKT agonists might restore ECM homeostasis [49]. Compared to EMT-driven models (e.g., ZEB1/TWIST1 [48]), our PRM uniquely captures microenvironmental interplay, explaining its enhanced performance in HER2 + subtypes [55]. These advances align with emerging trends in precision oncology [56], where microenvironment-aware biomarkers are prioritized for combination therapy design.

In summary, this study revealed the clinical application value of efferocytosis- and invasion-related genes in the prognostic assessment of breast cancer through a multi-omics integration strategy, and provided a theoretical basis for the development of novel targeted therapeutic strategies. Subsequent studies can verify the function of key genes by CRISPR screening and other technologies, and explore their synergistic effects with existing therapeutic regimens (e.g., CDK4/6 inhibitors), to promote the development of precision therapy for breast cancer.

Limitations of the study

The limitations of this study are fourfold: First, breast cancer is highly heterogeneous in terms of molecular characteristics, patient demographics, and geographic distribution. Although our study incorporated large-scale TCGA and GEO datasets with rigorous correction for batch effects, the model construction relied on retrospective public databases (TCGA/GEO), which are at risk of selection bias, especially since Asian populations accounted for less than 12% of the samples. This may affect the model’s generalization ability to non-European and American populations, limiting the generalizability to global populations. Second, the biological function validation of ANO6/PLGRKT is limited to bioinformatics speculation and lacks validation in organoid models or transgenic animal experiments. Wet-laboratory experiments could provide more direct evidence to support the biological mechanism of differential expression of genes associated with cell expulsion and invasion in BRCA. Third, GeneCards integrates gene-related information mainly from existing literature and databases. However, its screening results may be affected by current research bias and may overlook genes with low expression levels, insufficient investigation, or incomplete functional classification. Such limitations may exclude biologically important genes that have been implicated in the pathogenesis of breast cancer. Finally, the lack of clinical validation (e.g., prospective cohort studies or clinical trials) necessitates further confirmation of the predictive power of PRM in real-world settings.

Future research suggests in-depth exploration in four dimensions: (1) Database level: Implementation of a multi-database integration strategy using platforms such as MSigDB, STRING, and Reactome to expand the ethnic and geographic diversity of the study cohort, refine gene selection algorithms by network-based analyses, and enhance the robustness of the results by cross-platform validation methods. (2) Experimental validation level: Combining knockdown of ANO6/PLGRKT by CRISPR-Cas9 with 3D tumor sphere invasion assay and live cell imaging to dynamically resolve its effect on efferocytosis efficiency. (3) Clinical translation level: Prospective collection of samples from breast cancer patients (n > 200) treated with CDK4/6 inhibitors, and use of digital droplet PCR to monitor both efferocytosis efficiency and gene expression dynamics before and after treatment, and establishment of a prediction model for drug response. (4) Technological innovation level: Integration of spatial transcriptome (10x Visium) and multiple ion beam imaging (MIBI-TOF) to map the spatial expression of ANO6/PLGRKT at tumor-immune boundaries, and to reveal its mechanism of remodeling microenvironments that regulate immune escape.

Conclusions

In conclusion, this investigation effectively distinguished differential efferocytosis- and invasion-associated genes in BRCA and established a PRM with substantial predictive capability. The framework exhibited robust precision and distinction in forecasting patient outcomes, confirmed through diverse statistical evaluations. FEAs demonstrated that the model-associated genes participate in essential BPs, cellular structures, and molecular activities, highlighting their significance in BRCA advancement. The PPI network illuminated the gene interconnections, offering an understanding of their combined influence on the condition. Subsequent investigations should emphasize clinical validation and laboratory experimentation to enhance the model’s practical medical application. Synthesizing these discoveries could support the advancement of individualized and efficient therapeutic approaches for BRCA patients.

Supplementary Information

Supplementary Material 1

Supplementary Material 2

Supplementary Material 3

Supplementary Material 4

Supplementary Material 5

Bibliography4

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Stelzer G, Rosen N, Plaschkes I, Zimmerman S, Twik M, Fishilevich S, Stein T, Nudel R, Lieder I, Mazor Y, Kaplan S, Dahary D, Warshawsky D, Guan-Golan Y, Kohn A, Rappaport N, Safran M. Lancet D. The Gene Cards suite: from gene data mining to disease genome sequence analyses. Curr Protocols Bioinformat. 2016;54.10.1002/cpbi.527322403 · doi ↗ · pubmed ↗
2Jing Y, Pei LY-SZT-CTY-QY, Liu Y-N, Cai D-J, Yu S-G, Zhao L. Integrating bulk RNA and single-cell sequencing data unveils efferocytosis patterns and Ce RNA network in ischemic stroke. Transl Stroke Res. 2024.10.1007/s 12975-024-01255-838678526 · doi ↗ · pubmed ↗
3Deng X, Thompson Jeffrey A. An R package for survival-based gene set enrichment analysis. Res Square. rs.3.rs-3367968 2023.10.7717/peerj.19489 PMC 1225816040661900 · doi ↗ · pubmed ↗
4Franz M, Rodriguez H, Lopes C, Zuberi K, Montojo JGB, Morris Q. Gene MANIA update 2018. Nucleic Acids Res. 2018;46.10.1093/nar/gky 311PMC 603081529912392 · doi ↗ · pubmed ↗