Deep Neural Network Integrated into Network-Based Stratification (D3NS): A Method to Uncover Cancer Subtypes from Somatic Mutations

Matteo Valerio; Alessandro Inno; Alberto Zambelli; Laura Cortesi; Domenica Lorusso; Valeria Viassolo; Matteo Verzè; Fabrizio Nicolis; Stefania Gori

PMC · DOI:10.3390/cancers16162845·August 14, 2024

Deep Neural Network Integrated into Network-Based Stratification (D3NS): A Method to Uncover Cancer Subtypes from Somatic Mutations

Matteo Valerio, Alessandro Inno, Alberto Zambelli, Laura Cortesi, Domenica Lorusso, Valeria Viassolo, Matteo Verzè, Fabrizio Nicolis, Stefania Gori

PDF

Open Access

TL;DR

This paper introduces D3NS, a machine learning method that classifies cancer subtypes using genetic data and gene networks to improve diagnosis and treatment.

Contribution

The novel framework D3NS integrates deep neural networks with gene interaction networks to identify cancer subtypes linked to clinical outcomes.

Findings

01

D3NS identifies tumor subtypes with different survival rates and clinical associations.

02

The method was successfully applied to bladder, ovarian, and kidney cancer data from The Cancer Genome Atlas.

03

Results show potential for D3NS as a clinical tool for cancer stratification.

Abstract

Cancer develops through a complex process involving genetic changes that can lead to uncontrolled cell growth and tumor formation. This research focuses on developing an advanced approach to classify tumors into meaningful subgroups based on somatic mutations. Using machine learning techniques, specifically a deep neural network, and integrating genetic data with known gene interaction networks, we propose a framework for tumor stratification, called D3NS (deep neural network integrated into network-based stratification). This framework identifies patient subtypes predictive for survival and significantly associated with several clinical outcomes (tumor stage, grade and treatment response). We applied D3NS to real-world data from the Cancer Genome Atlas for bladder, ovarian, and kidney cancers. The results demonstrate the potential of this approach to improve cancer stratification,…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Diseases5

bladder cancer ovarian cancer kidney cancer Cancer bladder, ovarian, and kidney cancers

Figures8

Click any figure to enlarge with its caption.

Keywords

machine learningdeep neural networkautoencodersomatic mutationscancer subtypes

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsCancer Genomics and Diagnostics · Ferroptosis and cancer prognosis · Radiomics and Machine Learning in Medical Imaging

Full text

1. Introduction

Cancer is a complex disease that arises from a multistep process involving genetic and epigenetic changes, the downregulation of gene expressions, and chromosomal instability. The accumulation of somatic mutations increases with age and can cause alterations in cell growth and functions and in tissue morphologies, promoting cancer development and progression [1,2,3].

Tumors are heterogeneous diseases with variable clinical outcomes. At the molecular level, patients with similar histological and clinical characteristics often show very different genomic aberrations [4,5].

One of the main challenges in cancer informatics is the stratification of tumors into clinically and biologically significant subgroups based on the similarity of molecular profiles. This analysis involves high computational and statistical complexities, requiring increasingly sophisticated algorithms to handle a high number of variables.

Somatic mutation data have unique characteristics, such as high dimensionality and sparsity (prevalence of zero values), distinguishing them from other types of genomic data, such as gene expressions, where for almost every gene, a continuous value is assigned.

In this context, various algorithms based on modern machine learning (ML) techniques have been developed.

Some of these algorithms can be grouped under the category of network-based stratification [6], which integrates somatic mutation profiles with the knowledge of a gene interaction network to classify patients into subtypes. Briefly, the information from each mutation is diffused in the surrounding space defined by the network, propagating the signal to other functionally related genes. This propagation process helps to reduce the sparsity of the data, making them more suitable for stratification. Tumor subtypes are then identified by applying unsupervised ML algorithms, such as NMF (non-negative matrix factorization) and its variants, K-means and DBSCAN (density-based spatial clustering of applications with noise) [6,7,8,9,10].

Other approaches directly use sparse data with unsupervised algorithms derived from artificial neural networks (ANNs), such as autoencoders [11,12].

In these applications, the purpose of the autoencoder is to learn compressed representations of the input data, with which subgroups of patients can then be identified using unsupervised techniques.

With these considerations, in this paper, we propose D3NS, an algorithm for the stratification of tumors based on somatic mutations, which combines the advantages of different techniques: the knowledge contained in a molecular network, the use of an autoencoder to reduce dimensionality and the technique of consensus clustering with K-means to perform a robust stratification of patients into subgroups.

D3NS was applied to three somatic mutation datasets cataloged in The Cancer Genome Atlas (TCGA) for bladder, ovarian and kidney cancers. The algorithm was evaluated for its ability to identify subtypes of cancer significantly associated with survival and major clinical outcomes, such as tumor stage, grade or response to therapy.

2. Materials and Methods

2.1. Overview of D3NS

D3NS receives as input the set of somatic mutations of a cohort of patients and a network that describes the interactions between genes.

For each patient, somatic mutation data are represented as a binary vector, where each gene is assigned a value of “1” to indicate the presence of a mutation (regardless of the number of mutations) or “0” to indicate its absence.

The set of all the patients (organized in rows) and the set of all the possible mutated genes (arranged in columns) define a binary mutation matrix (MM), Figure 1a, with dimensions $[eqn]$ ( $[eqn]$ ), from which, with the integration of a gene network, the algorithm generates a new representation of the mutations useful for achieving better stratification.

This process is divided into the following three phases:

Network smoothing: consists of projecting (mapping), for each patient, the binary mutation profile contained in the MM onto a gene interaction network, Figure 1b. Subsequently, the network propagation process [13] is applied to spread the influence of each mutation to the surrounding space related to it. The resulting matrix, the network-smoothed matrix (NSM), will have continuous values and a much lower sparsity compared to that of the initial MM, Figure 1c;
Dimensionality reduction: the NSM is provided as input to the autoencoder, which generates its compressed representation, Figure 1d. The result is a matrix with continuous values, an encoded matrix (EM), with dimensions $[eqn]$ ( $[eqn]$ ). The number of features is a parameter of the autoencoder, which defines the number of essential features extracted from the mutations, i.e., the dimension of the latent space where mutations are mapped;
Stratification: the consensus clustering technique [14] with the K-means algorithm is applied to the EM, Figure 1e, to stratify patients into subtypes with a variable number of clusters (k = 2 ÷ 6), evaluating their associations with clinical outcomes.

2.2. Somatic Mutations and Clinical Data

Somatic mutation information and the related clinical data were retrieved from the public repository cBioPortal for cancer genomics [15] using the pyBioPortal library, specifically created in Python 3 to simplify and automate acquisition, integration, and processing operations.

Three patient cohorts were considered from studies conducted based on specific histological cancer subtypes: muscle-invasive bladder cancer (BLCA) [16], high-grade serous ovarian adenocarcinoma (OVCA) [4], and kidney renal clear cell carcinoma (KIRC) [17]. These studies are cataloged within the TCGA, a well-known research project started in 2006 by the National Cancer Institute and the National Human Genome Research Institute, which has created a database containing a wide variety of cancer data from more than 20,000 samples across 33 types of cancer.

For each dataset, patients with recorded somatic mutations were included. A summary of the distribution of variants is shown in Supplementary Figure S1, and nonsense mutations were excluded.

Table 1 shows the compositions of the three patient cohorts, with the number of genes and the sparsity characterizing the MMs.

Regarding the clinical data, survival times were downloaded in addition to available variables for each study, such as sex, age at diagnosis, tumor stage and grade.

2.3. Gene Interaction Networks

The mutations contained in the MM were projected onto a network of gene interactions, the information of which is stored in a database. To assess the effect of the network on the algorithm, three public databases were used: STRING v12.0 [18], HumanNet v3 [19], and Mentha [20].

All the databases considered provide a “confidence score” to define the degree of interaction between the pairs of genes; in this work, only the 10% most confident interactions were retained [6]. After filtering, all the networks were used in an unweighted and undirected mode.

Table 2 summarizes the characteristics of the networks considered in this analysis (filtered values are shown in parentheses).

2.4. Network Propagation

After mapping the mutation profile of each patient onto the interaction network, the propagation process, which spreads the mutation signal through the network, is applied.

The propagation process taken as a reference is the one presented in the HotNet2 algorithm [1], called Random Walk with Restart, described by the following Equation (1):

[eqn]

where F0 is the initial binary MM, F_t_ is the $[eqn]$ mutation matrix at iteration t, W is a normalized version of the adjacency matrix [13] of the considered network, and α is a tuning parameter with a value between 0 and 1, which controls the length of the diffusion paths along the network and was set at 0.7 [6].

This equation is solved iteratively for different values of t until the convergence, defined by the norm $[eqn]$ , is reached. At the end of this process, the obtained F_t_ represents the NSM.

In this work, a simplified version of the previously described propagation process was considered, which still uses the same equation but is solved only once as follows (2):

[eqn]

This solution was tested by evaluating the performance compared to that of the iterative version [9] without observing significant differences both in terms of patient stratification and in the association of the generated subtypes with clinical characteristics.

2.5. Autoencoder for Dimensionality Reduction

The NSM obtained from network propagation is provided as input to the autoencoder, an unsupervised ML algorithm belonging to the category of ANNs. It consists of two fundamental parts: an encoder and a decoder, as shown in Figure 1d.

During the training phase, the encoder compresses the input data to a “latent space”, obtaining a set of essential features. Subsequently, the decoder attempts to reconstruct the original data from these features.

The primary objective for training the autoencoder is, therefore, not only the accurate reconstruction of the input data but also, above all, the learning of a compact and meaningful representation of the data.

This compression process enables effective dimensionality reduction, allowing for the essential information contained in the data to be represented in a lower-dimensional space.

What distinguishes autoencoders from other dimensionality reduction techniques, such as principal component analysis [21], is their ability to capture complex and nonlinear relationships in the data. Autoencoders are able to leverage nonlinear activation functions and deep neural structures to learn richer and more detailed data representations.

In the simplest case, an autoencoder may have a structure composed of a single hidden layer (between the input and output layers), which number of neurons represents the dimension of the latent space, corresponding to the number of essential features. However, to address more complex problems, it is possible to use architectures typical of deep learning frameworks with multiple hidden layers, Figure 1d.

Each layer of an ANN is characterized by an activation function that determines the behavior of the neurons composing it, giving nonlinear characteristics, to the neural network.

In a given layer (excluding the input layer), each neuron produces an output signal that is dependent on the weighted sum of the signals from the neurons in the previous layer, combined with an activation function that determines the nonlinear behavior, influencing the neural network’s learning process.

In this work, ReLU, rectified linear unit, functions were used for all the hidden layers and the sigmoid function for the output layer, defined respectively by Equations (3) and (4) as follows:

[eqn]

[eqn]

The training process of the autoencoder is based on the objective for minimizing the reconstruction error between the input data and the generated output data, using a loss equation (L_rec_) that measures their difference. The loss equation is defined by the mean squared error, as shown in Equation (5):

[eqn]

where n is the number of patients, x is the input data (NSM), and $[eqn]$ is the output data (NSM reconstructed).

By minimizing L_rec_ during training, the autoencoder tries to generate an output that best approximates the original input, capturing the essential features in the data.

The minimization of L_rec_, as in all ANNs, involves the use of optimization techniques, such as backpropagation and gradient descent [22], to iteratively calculate and update the network parameters until reaching an acceptable value of L_rec_ or a certain number of epochs (where an epoch refers to a single iteration of the training process through the entire training set).

Given the complexity and high dimensionality of somatic mutation data, in this paper, an autoencoder with a deep learning architecture was implemented. Its structure comprises three hidden layers composed of 500, 100, and 500 neurons, Figure 1d.

The 100 neurons of the intermediate layer correspond to the dimension of the latent space in which the essential features, which constitute the compact representation of the input data, are defined.

At the end of the training, the mutational profile of each patient is described by these 100 new features, which will compose the EM ( $[eqn]$ ), to be used in the clustering phase.

The implementation of the autoencoder was realized in Python (3.10.12), using the Keras module within the Tensorflow library (2.8.2).

For the training of the autoencoder, the following configuration was set:

The input data were split into a training set and a validation set, with a ratio of 90/10, in order to evaluate the algorithm’s performance and prevent overfitting;
The Adam algorithm [23] was used to optimize the minimization process of L_rec_, setting a learning rate of 0.0001. The learning rate represents the size of the parameter update step of the autoencoder in the procedure for seeking the minimum L_rec_;
A batch size of 32 was set, useful for accelerating training; it defines the size of the number of samples (patients) processed by the algorithm before updating the parameters;
Training was conducted for a maximum of 150 epochs.

2.6. K-Means Consensus Clustering

After appropriately scaling the features contained in the EM, the consensus clustering technique [14] combined with the K-means algorithm [24] is applied for the clustering phase to identify the subtypes.

This technique, well-known in cancer research [25], is based on the repetition of sampling and clustering, allowing for the assessment of subtypes’ stabilities with respect to the sampling variability, increasing confidence in their real validity.

The algorithm starts by subsampling a proportion of patients and features from the EM. Each subsampling is then divided into a maximum of k groups using the K-means algorithm. After repeating this process for a specific number of iterations, pairwise consensus values are computed. These values, which indicate the proportion of times two patients were grouped together, are stored in a consensus matrix (CM) for each k value considered.

Finally, to assign each patient to a specific subtype/cluster, hierarchical agglomerative consensus clustering is applied using the distance between the consensus values.

The CM is a square matrix ( $[eqn]$ ) having patients in both rows and columns, which consensus values range from 0 (for patients never grouped in the same subtype) to 1 (for patients always grouped in the same subtype).

To assess the quality of the identified clusters, it is possible to use heatmaps, graphic representations of the values contained in the CMs, in which a continuous color scale is associated with the range 0–1, such that the value 0 corresponds to white and the value 1 to blue, Figure 1e.

By arranging the values in the CMs so that patients belonging to the same subtype are adjacent, it is possible to obtain, in the ideal case of perfect consensus, heatmaps composed of blue blocks arranged along the diagonal (the identified subtypes) on a white background.

In this work, K-means consensus clustering was implemented using the R package (4.3.2) Consensus Cluster Plus (1.66.0) [26], setting a maximum value for the k clusters to be evaluated equal to 6. The number of repetitions of the K-means algorithm was set at 1000, and for each run, 80% of the patients and 80% of the features were sampled.

2.7. Statistical Analysis for Clinical Data

The Kaplan–Meier method and the log-rank test were used to assess the differences between the overall survival (OS) probabilities among the identified patient subtypes.

In both the univariate and multivariate analyses, a Cox proportional hazard model was used to estimate hazard ratios (HRs) and 95% confidence intervals for the subtypes in relation to the OS after verifying the assumption of the proportional hazards.

In particular, for the multivariate analysis, the Cox model was initially built by including all the clinical variables that, in the univariate analysis, reached significance, with at least a p-value of <0.2. With the backward stepwise selection technique, the final model was obtained by removing the non-significant variables (p-value > 0.05). To evaluate the predictive power added by the identified subtypes, the baseline model, which includes only clinical covariates, was compared with the full model, which includes subtypes in addition to covariates. The likelihood-ratio test was used to compare the two models.

The associations between the subtypes and the available clinical variables were assessed using the Kruskal–Wallis rank sum test or the Wilcoxon rank sum test for continuous variables and Fisher’s exact test or Pearson’s chi-squared test for categorical variables.

For all the tests, a p-value of < 0.05 was considered as being statistically significant. The statistical analyses were performed using R statistical software, version 4.3.2 [27].

3. Results

The algorithm was tested on three cohorts of patients with bladder, ovarian or kidney cancers, considering three gene interaction networks: STRING v12.0, HumanNet v3, and Mentha.

For each tumor type and each molecular network, EMs were generated and used for patient stratification by applying K-means consensus clustering, considering a number of subtypes/clusters (k) ranging from 2 to 6.

In each of the three cancer types, the proposed algorithm was able to identify structurally robust subtypes, achieving similar results for the three molecular networks, as observed in the heatmaps related to the CMs (Figure 2 and Supplementary Figures S2–S4).

By varying k from 2 to 6, the associations between the identified subtypes and the available clinical variables, particularly survival, were assessed to determine their biological significance.

3.1. Bladder Cancer Data

In bladder cancer, each identified subtype was significantly associated with survival (log-rank test p-value < 0.05) for all the values of the k subtypes considered and with all the molecular networks (Figure 3a).

The subtypes most significantly associated with survival were obtained using the STRING network for k = 4 (log-rank test p-value < 0.0001) (Figure 3b).

Subtype 2 showed the worst prognosis, with a median overall survival of 23.4 months. For subtype 1, a median overall survival time of 86.8 months was observed. Subtypes 3 and 4 displayed the best survival (the median was not reached in the observation interval). The other networks provided similar results (Supplementary Figures S5–S7).

Because subtype 4 consisted of only two patients, setting subtype 3 as a reference in the univariate Cox model, subtype 2 had an HR of 3.60 (95% CI 1.75–7.38, p-value < 0.001), while subtype 1 had an HR of 1.99 (95% CI 0.93–4.25, p-value = 0.074).

In analyzing the additional clinical characteristics available from the TCGA dataset, significant associations were observed between the four subtypes and the tumor stage. Subtypes 2 and 1 have higher percentages of patients with tumor stage IV (35%) compared to the other subtypes with better survival rates (Table 3).

After evaluating the associations of the clinical variables with survival in the univariate analysis (Supplementary Table S1), a multivariate Cox proportional hazard model was built, including subtypes, age at diagnosis, and tumor stage (Table 4). The predictive power added by the four identified subtypes, when comparing the full model with the baseline model, was significant (likelihood-ratio test p-value = 0.0001).

Figure 4a provides an overview of the top 10 mutated genes in the entire population considered from the bladder cancer TCGA dataset (N = 412). Their distributions across the four subtypes are represented in Figure 4c. In the overall population, the most frequently mutated genes are TTN and TP53, present in 46.4% and 40.3% of the patients, respectively. A similar pattern is observed in each subtype, except for subtype 4, which did not show mutations in TP53; however, this subtype accounts for only two patients.

We observed significant differences (p-value < 0.0001) in the numbers of mutated genes per patient among the subtypes (Figure 4b and Supplementary Table S2). Subtypes 1 and 2 are characterized by lower numbers of mutated genes, with medians of 261 and 95, respectively, compared to subtypes 3 and 4, which displayed significantly higher medians of 572 and 2002, respectively.

3.2. Ovarian Cancer Data

In the initial analysis, the associations between the identified subtypes and survival were assessed, yielding significant results for all the values of the k clusters considered and with all the molecular networks (log-rank test p-value < 0.05), Figure 5a.

Considering three subtypes (k = 3) using the HumanNet network, the Kaplan–Meier curves obtained (log-rank test p-value = 0.0011), reported in Figure 5b, showed that patients with subtype 2 ovarian cancer had the most aggressive disease (with a median overall survival of 38 months) compared to the less aggressive subtype 3 (with a median overall survival of 66.6 months). An overview of the results obtained with the other networks is available in Supplementary Figures S8–S10.

Setting subtype 3 as the reference in the univariate Cox proportional hazard model, subtype 1 had an HR of 1.42 (95% CI 0.84–2.38, p-value = 0.19), while subtype 2 had an HR of 2.17 (95% CI 1.33–3.55, p-value = 0.002).

Among the identified subtypes and the other clinical variables available from the TCGA dataset (Table 5), significant associations were observed with the tumor stage, age at diagnosis, and response to platinum therapy after surgery. Patients in subtype 2, who have a lower survival rate, are all in stage III-IV (except for one case) compared to the other subtypes, which have higher percentages of patients in stage II. Regarding the responses to platinum therapy, patients in subtype 3 had the highest percentage of complete responses (86%) without any cases of progression or stable disease, which corresponds to the best survival rate compared to those of the other subtypes.

The associations between the clinical variables and survival are detailed in Supplementary Table S3.

In the multivariate analysis, the subtypes, age at diagnosis, and response to platinum therapy were significantly associated with survival, as shown in the final Cox model (Table 6). The predictive power added by the subtypes, compared to the baseline model, was significant (likelihood-ratio test p-value = 0.01).

Regarding somatic mutations, in the entire population considered from the ovarian cancer TCGA dataset (N = 316), TP53 dominates over all the other genes, with a prevalence of 86.4%, followed by TTN, BRCA1, and BRCA2 to a much lower extent (Figure 6a). BRCA1 and BRCA2 have been investigated in various studies highlighting their significant prognostic and predictive roles in both survival and sensitivity to platinum-based treatments. BRCA1 and BRCA2 have higher prevalences in subtypes 1 and 3 than in subtype 2 (Figure 6c).

Finally, regarding the distribution of the number of mutated genes per patient, significant differences were observed between the subtypes (p-value < 0.0001), with subtype 2 having the lowest number of mutated genes, characterized by a median of 28 mutations compared to 49 and 84 in subtypes 1 and 3, respectively (Figure 6b and Supplementary Table S4).

3.3. Kidney Cancer Data

In kidney cancer, for each value of the k cluster and for almost all the molecular networks considered, subtypes significantly associated with survival were identified (log-rank test p-value < 0.05), as shown in Figure 7a.

With the Mentha network for k = 2, the subtypes most significantly associated with survival were identified (log-rank test p-value = 0.0021), Figure 7b. Patients were classified into a high-risk group (subtype 1) or a low-risk group (subtype 2), with median overall survivals of 52.2 and 80.6 months, respectively, and an HR of 1.68 (95% CI 1.20–2.34, p-value = 0.002) for subtype 1 vs. subtype 2. For all the other results, refer to Supplementary Figures S11–S13.

In analyzing the clinical variables available from the TCGA (Table 7), a significant association was observed between the subtypes and age at diagnosis. Patients in subtype 2 had a median age of 59 years, whereas those in subtype 1 had a median age of 64 years. No significant differences were found between the subtypes for the other variables.

In the univariate analysis, all the covariates were significantly associated with survival, except for sex (Supplementary Table S5).

The identified subtypes were significant predictors of survival, adding predictive power to the baseline Cox model, which includes covariates only, such as the tumor stage, grade and age at diagnosis (likelihood-ratio test p-value = 0.02), Table 8.

Regarding somatic mutations, VHL has the highest prevalence (43.2%) in the entire population considered from the kidney cancer TCGA dataset (N = 424), followed by PBRM1 and MUC4, both slightly above 20% (Figure 8a). The distributions of the mutations across the subtypes are shown in Figure 8c.

Comparing the subtypes by the number of mutated genes per patient, a significant difference was observed (p-value < 0.0001), with subtype 1 having a median of 66 mutated genes compared to 38 in subtype 2 (Figure 8b and Supplementary Table S6).

4. Discussion

In the expanding landscape of genomic algorithms for tumor subtype characterization, the approach proposed in this study combines the strengths of various methodologies.

The proposed algorithm leverages prior knowledge from molecular networks, the power of deep neural networks in extracting hidden information from data, and the robustness of consensus clustering in revealing stable clusters within the data.

Most studies applying stratification algorithms mainly utilize continuous genomic data formats, such as gene expressions and other omics profiles. In contrast, D3NS identifies significant subtypes, both biologically and clinically, by analyzing somatic mutations encoded in binary format, categorizing each gene as mutated or not mutated.

The incorporation of molecular networks, with constantly updated databases, not only addresses the challenge of mutation sparsity but also enhances significant biological foundations essential for effective subtype classification.

The autoencoder, implemented with a deep learning architecture, mitigates high data dimensionality inherent in large gene datasets, yielding a compressed representation that retains comprehensive input information. This avoids the need for feature selection algorithms, which often restrict the informative range essential for stratification.

From a stratification perspective, the algorithm delivers well-defined clusters at the structural level, as shown by the heatmaps (Figure 2 and Supplementary Figures S2–S4), which are very similar to those obtained by other state-of-the-art algorithms for tumor subtype discovery based on somatic mutations [6,7,8,9,10].

Our results demonstrate D3NS’s capability in identifying predictive subtypes for survival (Figure 3, Figure 5 and Figure 7) and their associations with other clinical variables, such as tumor stage, grade or treatment response, confirming its validity across diverse molecular networks. Furthermore, integrating subtype information with clinical variables significantly enhances the predictive power for survival, suggesting these subtypes as molecular prognostic indicators.

This study considered datasets that include only the most common histological subtypes for each tumor localization. For bladder cancer, cases with muscle-invasive bladder cancer, both non-papillary and papillary without other histological characterizations, were included (Table 3); for ovarian cancer, the dataset focused on high-grade serous carcinoma [4], and for kidney cancer, the data refer to clear cell renal cell carcinoma [17]. This approach allowed for us to explore the molecular variability within a homogeneous group of cases, identifying subtypes with significantly different characteristics.

Several studies in serous ovarian cancer highlight the prognostic and predictive roles of BRCA1 and BRCA2 germlines and somatic mutations in survival and responses to platinum-based treatments [28,29,30,31,32,33,34]. Consistently, our findings on ovarian cancer reveal higher prevalences of BRCA1 and BRCA2 mutations in subtypes 1 and 3, which are associated with better survival and responses to treatments compared to subtype 2 (Figure 6c).

Among the top 10 mutated genes found in the kidney cancer dataset, PBRM1, BAP1, SETD2, and VHL have been implicated in tumor progression and poor prognoses [35]. As shown in Figure 8c, these mutated genes have a higher prevalence in subtype 1, correlating with worse prognoses.

Although this algorithm was tested on three datasets, its application may be extended to other tumor localizations and, within bladder, ovarian, and kidney cancers, to a wider variety of histological cancer subtypes through parameter adjustments to optimize the performance. These broader applications could potentially be used to explore the concordance between tumor subtypes identified using the algorithm and the more classical histological tumor stratification, providing further insights into the definition of tumor pathological mechanisms.

Although this research proposes a powerful methodology for tumor stratification, it has some limitations.

First, to ascertain the robustness and generalizability of the D3NS, future studies should validate its application to diverse real-world datasets of similar tumor types, including comparisons with other methods already available in the literature. Second, further analyses are needed to investigate the genetic compositions of the subtypes, given the significant difference observed in the numbers of mutated genes among them.

Additionally, this study investigated neither specific mutation classes (e.g., nonsense mutations) nor epigenetic alterations (e.g., DNA methylation and histone modifications), which potentially impact tumor suppressor genes.

Despite these limitations, we believe that the development of more sophisticated versions of the autoencoder and the use of other stratification approaches could help to better capture the biological complexities hidden in genomic data.

5. Conclusions

Identifying tumor subtypes is crucial for precise diagnoses and personalized therapies. Our study demonstrates that the D3NS algorithm is a valuable tool for tumor stratification in the context of precision medicine.

Integrating somatic gene panel testing with D3NS analysis can offer potential support in clinical settings, with opportunities for improvement through the selection of clinically relevant genes and appropriate gene interaction networks.

In conclusion, this approach can provide a base model in cancer research, adaptable for different types of cancer through necessary adjustments.

Bibliography35

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Leiserson M.D.M. Vandin F. Wu H.T. Dobson J.R. Eldridge J.V. Thomas J.L. Papoutsaki A. Kim Y. Niu B. Mc Lellan M. Pan-cancer network analysis identifies combinations of rare somatic mutations across pathways and protein complexes Nat. Genet.20154710611410.1038/ng.316825501392 PMC 4444046 · doi ↗ · pubmed ↗
2Jassim A. Rahrmann E.P. Simons B.D. Gilbertson R.J. Cancers make their own luck: Theories of cancer origins Nat. Rev. Cancer 20232371072410.1038/s 41568-023-00602-537488363 · doi ↗ · pubmed ↗
3Lawrence M.S. Stojanov P. Polak P. Kryukov G.V. Cibulskis K. Sivachenko A. Carter S.L. Stewart C. Mermel C.H. Roberts S.A. Mutational heterogeneity in cancer and the search for new cancer-associated genes Nature 201349921421810.1038/nature 1221323770567 PMC 3919509 · doi ↗ · pubmed ↗
4Cancer Genome Atlas Research Network Integrated genomic analyses of ovarian carcinoma Nature 201147460961510.1038/nature 1016621720365 PMC 3163504 · doi ↗ · pubmed ↗
5Cancer Genome Atlas Research Network Kandoth C. Schultz N. Cherniack A.D. Akbani R. Liu Y. Shen H. Robertson A.G. Pashtan I. Shen R. Integrated genomic characterization of endometrial carcinoma Nature 2013497677310.1038/nature 1211323636398 PMC 3704730 · doi ↗ · pubmed ↗
6Hofree M. Shen J.P. Carter H. Gross A. Ideker T. Network-based stratification of tumor mutations Nat. Methods 2013101108111510.1038/nmeth.265124037242 PMC 3866081 · doi ↗ · pubmed ↗
7Zhong X. Yang H. Zhao S. Shyr Y. Li B. Network-based stratification analysis of 13 major cancer types using mutations in panels of cancer genes BMC Genom.201516 S 710.1186/1471-2164-16-S 7-S 726099277 PMC 4474538 · doi ↗ · pubmed ↗
8He Z. Zhang J. Yuan X. Liu Z. Liu B. Tuo S. Liu Y. Network based stratification of major cancers by integrating somatic mutation and gene expression data P Lo S ONE 201712 e 017766210.1371/journal.pone.017766228520777 PMC 5433734 · doi ↗ · pubmed ↗