A contrastive adversarial encoder for multi-omics data integration

Ma Yinghua; Ahmad Khan; Yang Heng; Fiaz Gul Khan; Afnan Aldhahri; Iftikhar Ahmed Khan; Guanghui Liu

PMC · DOI: 10.1371/journal.pone.0333134 · October 16, 2025

Open Access

TL;DR

This paper introduces a new deep learning model called CAEncoder that improves cancer classification by better integrating multiple types of omics data.

Contribution

The novel CAEncoder combines a Vision Transformer and CycleGAN with a composite loss function for effective multi-omics data integration.

Findings

01. The CAEncoder achieved up to 93.33% classification accuracy on cancer types.

02. The model outperformed existing methods in both binary and multi-class cancer classification tasks.

03. The composite loss function successfully reduced redundancy and preserved global information in multi-omics data.

Abstract

Early and accurate cancer detection is crucial for effective treatment, prognosis, and the advancement of precision medicine. Analyzing omics data is vital in cancer research. While using a single type of omics data provides a limited perspective, integrating multiple omics modalities allows for a more comprehensive understanding of cancer. Current deep models struggle to achieve efficient dimensionality reduction while preserving global information and integrating multi-omics data. This often results in feature redundancy or information loss, overlooking the synergies among different modalities. This paper proposes a contrastive adversarial encoder (CAEncoder) for multi-omics data integration to address this challenge. The proposed model combines a Vision Transformer (ViT) and a CycleGAN, trained in an end-to-end contrastive manner. The ViT is the encoder, utilizing self-attention,…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Diseases1

cancer

Figures2

Fig 1: The block diagram of the proposed model. Here, X and X¯ represent the original and synthesized multi-omics data, while Z and Z¯ denote the latent reduced space. The acronyms SN, MLP, ViT, and FFN stand for Switch Normalization, Multi-Layer Perceptron, Vision Transformer, and Feedforward Network, respectively. The symbols D1 and D2 refer to Discriminator 1 and Discriminator 2, while G1 and G2 represent Generator 1 and Generator 2. The symbol ⊕ indicates component-wise addition.

Fig 2: Comparison of classification performance on various combinations of the multi-omics modalities.

Funding

Umm Al-Qura University (http://dx.doi.org/10.13039/501100006701)

Taxonomy

Topics: Cell Image Analysis Techniques · Computational Drug Discovery Methods · Bioinformatics and Genomic Networks

Full text

1 Introduction

Cancer continues to be one of the most critical health challenges globally. In 2022, nearly 20 million new cancer cases were reported worldwide, resulting in 9.7 million deaths related to the disease. It is estimated that approximately one in five people will be diagnosed with cancer at some point in their lives, with one in nine men and one in twelve women expected to die from it [1]. Early cancer detection has traditionally relied on conventional machine learning algorithms and single-omics data [2–4]. However, single-omics data often falls short of capturing essential information from various biological layers. In contrast, integrating multi-omics data with deep learning has shown significant improvements over single-omics modalities [5–11].

Current research focuses on integrating various omics modalities to extract combined information for a more effective analysis of this critical disease. Wang et al. [12] utilize a transformer with multi-head self-attention and graph convolutional networks (GCN) to integrate these multi-omics modalities. Their results indicate an accuracy of 83.0% for Alzheimer’s classification and 86.7% for breast cancer classification. Lan et al. [13] proposed an integration model called DeepKEGG, which leverages biological hierarchical modules in the local connections of nodes to improve interpretability. This model also includes a pathway self-attention mechanism to explore correlations between different samples. Additionally, Zheng et al. [14] introduced a method called GCFANet, which processes multimodal omics data through global and cross-modal feature aggregation, feature confidence learning, and a GCN branch. Experimental results demonstrate that this method effectively enhances the classification performance of multi-omics data [15]. Furthermore, Li et al. [16] introduced a novel end-to-end multi-omics Graph Neural Network (GNN) framework for cancer classification, utilizing heterogeneous multilayer graphs to integrate both intra-omics and inter-omics connections. For breast cancer subtype classification, Huang et al. [17] proposed a deep-learning framework called DSCCN. This method conducts differential analysis on multi-omics expression data to identify differentially expressed genes and employs sparse canonical correlation analysis to extract highly correlated features among these genes. These features are then trained separately using a multi-task deep learning neural network to predict breast cancer subtypes.

Zhu et al. [18] proposed a supervised deep learning method called the Geometric Graph Neural Network (GGNN), which integrates genomic geometric features and protein interaction pathway information into the deep learning model. The Denoised Multi-Omics Integration Framework [19] consists of two key components: a distribution-based feature denoising algorithm (FSD), aimed at reducing data dimensionality, and a multi-omics integration framework (AttentionMOI), designed for predicting cancer prognosis and identifying cancer subtypes. The reported results show that this model performed well across 15 cancers in the TCGA database. The moBRCA-net framework [20] addresses the challenge of high-dimensional data in breast cancer classification. By integrating multi-omics data, it utilizes a feature selection module and a self-attention module to capture the relative importance of each omics modality. Deep Centroid [21] addresses challenges in omics data classification, including high-dimensional data, limited sample sizes, and source bias. Yan et al. [22] developed a hierarchical multi-level Graph Neural Network (GNN) approach that utilizes multi-omics data, gene regulatory networks, and pathway information to extract discriminative features, thereby improving the accuracy of survival risk predictions. AUTOSurv [23] utilizes a specially designed Variational Autoencoder (VAE) for the dimensionality reduction of multi-omics data. This model has demonstrated strong performance in prognosis prediction across multiple independent datasets when compared to alternative strategies and machine learning methods. Guo et al. [24] utilize network embedding technology to integrate gene co-expression data, somatic mutation data, and clinical information. By combining the struc2vec model with the random survival forest (RSF) model, they successfully predicted both long-term and short-term survival outcomes for patients with lung adenocarcinoma (LUAD).

Multi-omics data integration models have demonstrated significant improvements in cancer analysis compared to single-omics models. However, these multi-omics models still face challenges in effectively capturing synergistic features from different modalities. This limitation undermines the full potential of data integration in cancer research. Additionally, multi-omics models often prioritize stronger modalities at the expense of weaker ones, which diminishes the benefits of joint learning and negatively impacts performance in downstream tasks. Furthermore, the imbalanced nature of the data affects the overall efficiency of these models.

This paper presents a novel multi-omics integration model for cancer classification. The framework includes two main components: 1) an encoder, which utilizes a transformer to map multi-omics data to a reduced latent space, and 2) a CycleGAN that provides feedback to the encoder, enabling it to learn discriminative features and enhance generalization. The model is trained in a supervised contrastive manner, which helps bring similar modalities closer together while distancing dissimilar ones in the latent space. By employing contrastive learning, the model mitigates the data imbalance, ensuring that all modalities are taken into account and thereby learning the synergies across them. Finally, the classification is performed in the latent space. The results indicate a significant improvement compared to current state-of-the-art methods.

2 Proposed model

The proposed model (see Fig 1) comprises two main modules: the encoder and the CycleGAN [25]. The encoder is the vision transformer (ViT) model [26], denoted as $E$, which maps the high-dimensional multi-omics modalities $x \in \mathbb{R}^{n_x}$ into a reduced latent space $z \in \mathbb{R}^{n_z}$, where $n_x$ is significantly greater than $n_z$. At the same time, the CycleGAN enhances the performance of the transformer encoder by providing feedback to integrate multi-omics information, extract discriminative features, and reduce dimensionality.

Fig 1: The block diagram of the proposed model. Here, X and X¯ represent the original and synthesized multi-omics data, while Z and Z¯ denote the latent reduced space. The acronyms SN, MLP, ViT, and FFN stand for Switch Normalization, Multi-Layer Perceptron, Vision Transformer, and Feedforward Network, respectively. The symbols D1 and D2 refer to Discriminator 1 and Discriminator 2, while G1 and G2 represent Generator 1 and Generator 2. The symbol ⊕ indicates component-wise addition.

Both modules are trained in an end-to-end manner, where the CycleGAN provides gradient-based feedback to the ViT encoder via adversarial and cycle-consistency losses. This joint optimization enables the encoder to learn not only from contrastive objectives but also from the reconstruction feedback, which enhances its robustness and generalization. During inference, only the ViT encoder is used to extract low-dimensional representations for downstream classification tasks.

2.1 ViT encoder

The encoder $E$ is composed of 8 blocks, each consisting of multi-head attention (MHA), Switch Normalization (SN), and a Multi-layer Perceptron (MLP). The multi-head self-attention projects multi-omics data into subspaces, calculates attention weights based on the significance of each position, and aggregates the outputs to produce the final attention output. The attention weight $[eqn]$ for a position $j$ relative to all positions $k$ in the $i$-th head is computed as:

[eqn]

where $[eqn]$ , $[eqn]$ , $X_i$ is the multi-omics modality, $[eqn]$ is the query weight matrix, and $[eqn]$ is the key weight matrix. The output of the $i$-th attention head is then given by:

[eqn]

where $[eqn]$ is the value matrix. Finally, the outputs from all attention heads are concatenated, $[eqn]$ , to create the final multi-head attention output.
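The attention computation described above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: it assumes the standard scaled dot-product form (softmax over query-key scores divided by the square root of the head dimension), and the array shapes, random weights, and function names are hypothetical.

```python
import numpy as np

def softmax(scores, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, w_q, w_k, w_v):
    """x: (n_tokens, d_model); w_q, w_k, w_v: (n_heads, d_model, d_head)."""
    d_head = w_q.shape[-1]
    # Per-head query, key, and value projections: (n_heads, n_tokens, d_head).
    q = np.einsum("td,hde->hte", x, w_q)
    k = np.einsum("td,hde->hte", x, w_k)
    v = np.einsum("td,hde->hte", x, w_v)
    # Attention weights of every position j over all positions k, per head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)
    heads = weights @ v
    # Concatenate the head outputs along the feature axis.
    return np.concatenate(list(heads), axis=-1), weights

rng = np.random.default_rng(0)
n_tokens, d_model, n_heads, d_head = 4, 16, 8, 2  # toy sizes, assumed
x = rng.normal(size=(n_tokens, d_model))
w_q, w_k, w_v = (rng.normal(size=(n_heads, d_model, d_head)) for _ in range(3))
out, attn = multi_head_self_attention(x, w_q, w_k, w_v)
```

Each row of `attn` sums to one, so every head output is a convex combination of the value vectors.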

The normalized output, denoted as $[eqn]$ , from the final multi-head attention layer is passed through a feedforward network (FFN) that consists of three layers. This network progressively transforms the output into a reduced latent vector $[eqn]$ . The first two layers of the FFN utilize the Leaky ReLU activation function, while the final layer is linear. Typically, transformers use a class token, denoted as CLS, for classification purposes; however, in this case, the CLS token is not employed because the goal is to map the high-dimensional data into a reduced latent space.

Here, Switch Normalization (SN) [27] is used to improve both the training stability and the expressive capability of the encoder. Unlike fixed normalization techniques, SN dynamically selects the most suitable normalization strategy based on the input characteristics and training conditions. In this model, SN combines Batch Normalization (BN) and Layer Normalization (LN) through trainable coefficients. Given an input feature x, the SN output is computed as:

[eqn]

where $[eqn]$ and $[eqn]$ are trainable parameters that adaptively balance the contributions of BN and LN during training. This dynamic adjustment allows SN to optimize the model’s adaptability to varying data distributions, enhancing training stability and generalization performance. The experiments demonstrate that SN effectively stabilizes training by mitigating the sensitivity to batch size variations and distribution shifts.
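As a rough illustration of how trainable coefficients can blend BN and LN, the sketch below mixes the batch-wise and feature-wise statistics with softmax-normalized weights before normalizing. It is an assumption-laden sketch: sharing one weight pair between mean and variance, the `logits` parameterization, and the function name are all hypothetical, and the paper's equation may instead mix the two normalized outputs directly.

```python
import numpy as np

def switch_norm(x, logits, gamma=1.0, beta=0.0, eps=1e-5):
    """x: (batch, features); logits: two raw importance scores for (BN, LN)."""
    logits = np.asarray(logits, dtype=float)
    # Softmax keeps the two mixing weights positive and summing to one.
    w = np.exp(logits - logits.max())
    w = w / w.sum()
    # Batch statistics (per feature) and layer statistics (per sample).
    mean_bn, var_bn = x.mean(axis=0, keepdims=True), x.var(axis=0, keepdims=True)
    mean_ln, var_ln = x.mean(axis=1, keepdims=True), x.var(axis=1, keepdims=True)
    # Blend the statistics; broadcasting yields (batch, features) arrays.
    mean = w[0] * mean_bn + w[1] * mean_ln
    var = w[0] * var_bn + w[1] * var_ln
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(6, 4))
out_bn = switch_norm(x, [50.0, -50.0])   # weights collapse onto BN
out_ln = switch_norm(x, [-50.0, 50.0])   # weights collapse onto LN
out_mix = switch_norm(x, [0.0, 0.0])     # equal blend of both
```

When the logits strongly favor one branch, the layer reduces to plain BN or plain LN, which is the sense in which the normalization strategy is selected during training.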

Moreover, although the primary objective of the ViT encoder is driven by contrastive loss, its parameters are also influenced by the adversarial and cycle-consistency losses propagated from the CycleGAN. Since both generators G1 and G2 operate on the latent representation $[eqn]$ , the feedback from their respective losses flows back to the encoder. This integrated training mechanism allows the ViT encoder to refine its feature extraction by leveraging both discriminative and generative signals.

Overall, the encoder E is composed of 8 Transformer blocks, each with 8 attention heads. To project the high-dimensional input into a representation suitable for Transformer encoding, a linear projection layer is first applied to map the input to a 1024-dimensional embedding space. Following the attention mechanism, a multi-layer perceptron (MLP) consisting of three fully connected layers is used, with output dimensions of 1024 $\rightarrow$ 512 $\rightarrow$ 256. The first two layers employ the Leaky ReLU activation function, and the final layer is linear.
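The projection and the three-layer head described above can be sketched as below. The Transformer blocks between the projection and the head are deliberately omitted, and the input width `n_x`, the Leaky ReLU slope, and the weight initialization are assumptions chosen only to make the example run.

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    # The slope value is an assumption; the source does not state it.
    return np.where(x > 0, x, slope * x)

def init_dense(rng, n_in, n_out):
    # He-style initialization, chosen here purely for illustration.
    return rng.normal(scale=np.sqrt(2.0 / n_in), size=(n_in, n_out)), np.zeros(n_out)

def ffn_head(h, layers):
    """Apply dense layers; Leaky ReLU on all but the last, which stays linear."""
    for i, (w, b) in enumerate(layers):
        h = h @ w + b
        if i < len(layers) - 1:
            h = leaky_relu(h)
    return h

rng = np.random.default_rng(0)
n_x = 300                                   # hypothetical input width
projection = init_dense(rng, n_x, 1024)     # linear map to the 1024-d embedding
head = [init_dense(rng, 1024, 1024),
        init_dense(rng, 1024, 512),
        init_dense(rng, 512, 256)]          # output widths 1024 -> 512 -> 256

x = rng.normal(size=(4, n_x))
embedding = x @ projection[0] + projection[1]
z = ffn_head(embedding, head)               # attention blocks omitted in this sketch
```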

2.2 CycleGAN architecture

The proposed CycleGAN architecture consists of two generators, G1 and G2, and two discriminators, D1 and D2. This framework enables bidirectional translation between high-dimensional multi-omics data and their corresponding low-dimensional latent representations.

Generator $G_1$ takes a latent representation z, obtained from the encoder E, and generates a synthetic multi-omics modality $\bar{x}$ that approximates the original high-dimensional data x. Generator $G_2$ takes the original multi-omics data x as input and reconstructs a latent representation $\bar{z}$ that should resemble the true latent vector $z$. Discriminator $D_1$ attempts to distinguish real high-dimensional data x from the generated data $\bar{x}$. Discriminator $D_2$ aims to differentiate between the true latent vector z and the reconstructed vector $\bar{z}$.

The objectives are as follows:

Generator G1 is trained to ensure that the generated data $\bar{x}$ is indistinguishable from the real multi-omics data x. Generator G2 is trained to ensure that the reconstructed latent representation $\bar{z}$ closely approximates the true latent vector $z$. Discriminator D1 is trained to assign a high score to real samples and a low score to generated ones:

[eqn]

Discriminator D2 is trained similarly to distinguish real and generated latent vectors:

[eqn]

As shown in Eqs 4 and 5, the discriminators aim to maximize the prediction scores for real samples while minimizing them for generated ones, thereby guiding the generators to produce realistic outputs.

Originally, CycleGAN was developed for image generation using 2D convolution. However, in our case, the input multi-omics data $[eqn]$ is a long vector. Therefore, we have adapted CycleGAN to utilize one-dimensional (1D) convolution.

1D convolutions operate along the feature vector, enabling the model to capture dependencies across different omics features. The 1D convolution operation can be expressed as:

[eqn]

where $[eqn]$ is the input feature vector, $[eqn]$ is the convolution kernel, b is the bias term, and k is the kernel size. Furthermore, the generators G1 and G2 are designed to map input feature vectors into output vectors using a combination of linear layers and 1D convolutions. The discriminators D1 and D2 also operate on feature vectors, ensuring that adversarial training remains robust and effective for structured data. These modifications allow CycleGAN to model bidirectional mappings between different multi-omics modalities while preserving the advantages of adversarial training.
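The 1D convolution over a feature vector can be illustrated with a few lines of NumPy. This sketch assumes a single channel, stride 1, and no padding ("valid" mode), none of which are stated in the source; like most deep learning layers, it computes a cross-correlation rather than a flipped-kernel convolution.

```python
import numpy as np

def conv1d(x, w, b=0.0):
    """Valid 1D cross-correlation: y[i] = sum_j w[j] * x[i + j] + b."""
    k = len(w)                      # kernel size
    n_out = len(x) - k + 1          # output length without padding
    return np.array([np.dot(w, x[i:i + k]) + b for i in range(n_out)])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # toy feature vector
w = np.array([1.0, 0.0, -1.0])            # kernel of size k = 3
y = conv1d(x, w, b=0.5)
```

Each output position mixes `k` neighboring features, which is how the layer captures local dependencies along the omics feature axis.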

Since all generated and reconstructed data in CycleGAN rely on the latent vectors z produced by the encoder E, the encoder is directly updated through the gradients of adversarial and cycle-consistency losses. This design effectively couples CycleGAN with the encoder, enhancing the encoder’s feature learning capability beyond what contrastive learning alone can provide.

The entire model, integrating the ViT encoder with CycleGAN, is optimized using a supervised contrastive learning approach. This contrastive mechanism enables us to bring similar points closer together in the latent space while pushing dissimilar points further apart. Additionally, it facilitates the understanding of synergies among different modalities, ultimately enhancing the performance of downstream classification tasks.

After transforming the high-dimensional multi-omics data $[eqn]$ into the corresponding latent space $[eqn]$ , the classification task is then conducted within the latent space $[eqn]$ .

2.3 Loss functions

The model is trained end-to-end by integrating three types of losses: contrastive loss, adversarial loss, and cycle consistency loss. The objective is to train the encoder $[eqn]$ using a contrastive adversarial approach, effectively mapping high-dimensional data into a compact latent space. This process ultimately enhances the performance of downstream classification tasks.

Contrastive Loss: The contrastive loss [28] optimizes the encoder E to reduce the distance between similar samples while increasing the distance between dissimilar samples within the latent space z.

[eqn]

where $[eqn]$ represents the encoder, $[eqn]$ represents anchor samples, $[eqn]$ represents positive samples, $[eqn]$ represents negative samples, N is the batch size, and m is the margin used in the contrastive loss.
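A margin-based loss over anchor, positive, and negative samples is commonly written in triplet form, and the sketch below uses that form as an assumption; the paper's exact expression may differ, for example in the choice of distance. The inputs are already-encoded latent vectors, and the function name is hypothetical.

```python
import numpy as np

def triplet_margin_loss(z_a, z_p, z_n, margin=1.0):
    """z_a, z_p, z_n: (N, n_z) latent vectors for anchor, positive, negative."""
    d_pos = np.sum((z_a - z_p) ** 2, axis=1)   # squared anchor-positive distance
    d_neg = np.sum((z_a - z_n) ** 2, axis=1)   # squared anchor-negative distance
    # Penalize triplets where the negative is not at least `margin` farther away.
    return float(np.mean(np.maximum(0.0, d_pos - d_neg + margin)))

z_a = np.array([[0.0, 0.0], [1.0, 1.0]])
z_p = np.array([[0.0, 1.0], [1.0, 1.0]])
z_n = np.array([[3.0, 0.0], [1.0, 2.0]])
loss = triplet_margin_loss(z_a, z_p, z_n, margin=2.0)
```

In this toy batch the first triplet already satisfies the margin and contributes nothing, while the second is too close to its negative and contributes 1, giving a mean of 0.5.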

Adversarial loss: The Hinge Loss [27] is used as an adversarial loss because it offers better stability and faster convergence compared to traditional cross-entropy loss. This approach is particularly effective for handling high-dimensional data, as it more effectively manages the adversarial dynamics between the generator and discriminator.

The generator G1 produces synthetic high-dimensional multi-omics data $[eqn]$ from low-dimensional latent vectors z and $[eqn]$ . Meanwhile, the discriminator D1 tries to differentiate between real high-dimensional multi-omics data x and the synthetic multi-omics data $[eqn]$ . Consequently, the adversarial loss can be defined as follows:

[eqn]

In a similar manner, the generator G2 transforms high-dimensional multi-omics data x into a low-dimensional latent vector $[eqn]$ , while the discriminator D2 aims to differentiate between z and $[eqn]$ .

[eqn]

The objective of the discriminator D1 is to minimize the output value for generated data, pushing it closer to –1. This process enables D1 to effectively distinguish between real high-dimensional multi-omics data x and the generated data $G_1(z)$.

[eqn]

Similarly, the discriminator D2 aims to minimize its output value for generated data, pushing it as close to –1 as possible. This process effectively differentiates between $[eqn]$ and $[eqn]$ .

[eqn]

where $[eqn]$ , $[eqn]$ , and $[eqn]$ represent the anchor, positive, and negative samples, respectively, for high-dimensional multi-omics data. Similarly, $[eqn]$ , $[eqn]$ , and $[eqn]$ denote the latent low-dimensional vectors corresponding to these samples. The symbols pX and pZ refer to the real distributions of high-dimensional and low-dimensional data, respectively. The notations $[eqn]$ and $[eqn]$ indicate the expectations over high-dimensional and low-dimensional data, respectively. Lastly, $[eqn]$ represents the standard form used in Hinge Loss, ensuring that the generator’s output is as close to 1 as possible, while the discriminator’s output is as close to –1 as possible.
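The hinge objectives described above can be sketched on raw discriminator scores. This is an illustrative form under stated assumptions: the discriminator term is the usual hinge GAN loss, and the generator term uses max(0, 1 - D(fake)) to match the description of pushing scores toward 1, whereas many hinge GAN implementations use the negated mean score instead. The function names and toy scores are hypothetical.

```python
import numpy as np

def discriminator_hinge_loss(real_scores, fake_scores):
    # Real samples are pushed above +1, generated samples below -1.
    real_term = np.mean(np.maximum(0.0, 1.0 - real_scores))
    fake_term = np.mean(np.maximum(0.0, 1.0 + fake_scores))
    return float(real_term + fake_term)

def generator_hinge_loss(fake_scores):
    # The generator is rewarded when the discriminator scores its output near +1.
    return float(np.mean(np.maximum(0.0, 1.0 - fake_scores)))

real_scores = np.array([2.0, 0.5])    # D applied to real data (toy values)
fake_scores = np.array([-2.0, 0.0])   # D applied to generated data (toy values)
d_loss = discriminator_hinge_loss(real_scores, fake_scores)
g_loss = generator_hinge_loss(fake_scores)
```

The same two functions apply to either pair, D1 with G1 on high-dimensional data or D2 with G2 on latent vectors.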

Cycle Consistency Loss: The cycle consistency loss is employed to ensure that the generator’s output can be accurately mapped back to the original input, thereby maintaining data consistency. This is especially crucial when working with high-dimensional multi-omics data, as it helps preserve the complex structure and biological significance of the data, preventing the generated high-dimensional output from losing its original characteristics. Furthermore, cycle consistency loss indirectly enhances the feature extraction capability and training stability of the Transformer model by minimizing feature loss and ensuring data coherence. The cycle consistency loss for the generator G1 is defined as follows:

[eqn]

The cycle consistency loss for generator G2 is given by:

[eqn]

where the L1 norm, denoted as $[eqn]$ , is used to compute the absolute error between the generated data and the original data.
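A minimal sketch of an L1 cycle-consistency term follows. The toy linear "generators" are hypothetical stand-ins for G1 and G2, and which round trip corresponds to which of the paper's two cycle terms is an assumption; the sketch only shows that a latent vector mapped to data space and back should incur near-zero L1 error.

```python
import numpy as np

def l1_cycle_loss(original, reconstructed):
    """Mean over the batch of the per-sample L1 distance."""
    return float(np.mean(np.sum(np.abs(original - reconstructed), axis=1)))

# Toy linear maps (hypothetical): g1 sends latent -> data, g2 sends data -> latent.
a = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])            # shape (n_z=2, n_x=3)
a_pinv = np.linalg.pinv(a)                 # shape (n_x=3, n_z=2)

def g1(z):
    return z @ a

def g2(x):
    return x @ a_pinv

z = np.array([[1.0, 2.0], [0.5, -1.0]])
loss_z = l1_cycle_loss(z, g2(g1(z)))       # latent -> data -> latent round trip

# Direct check of the loss on hand-picked values.
loss_toy = l1_cycle_loss(np.array([[1.0, 2.0], [3.0, 4.0]]),
                         np.array([[1.5, 2.0], [3.0, 3.0]]))
```

Because `g2` exactly inverts `g1` on the latent space here, `loss_z` is zero up to floating-point error; in training, the loss is what pushes the learned generators toward that behavior.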

Total Loss: The total loss is the weighted combination of the contrastive loss $[eqn]$ , the cycle losses $[eqn]$ and $[eqn]$ , and the adversarial losses $[eqn]$ and $[eqn]$ .

[eqn]

where α and β are weights that are used to balance the various losses. Therefore, the weights of the encoder are adjusted based on the total loss, $[eqn]$ , where W represents the trainable parameters of the encoder E. Contrastive learning improves the distinction between positive and negative samples, while the adversarial loss and cycle consistency loss of GAN provide feedback to the encoder E, enhancing feature extraction and resulting in better generalization.

In this integrated setup, the ViT encoder benefits not only from contrastive discrimination but also from reconstruction-based supervision, as the gradients from both the generators and discriminators in CycleGAN are backpropagated through the encoder. This unified feedback loop improves both representation quality and model robustness.

3 Experiments

This section describes the datasets, the training and hyper-parameter settings of the model, and the quantitative results in detail.

3.1 Datasets

To illustrate the effectiveness of the CAEncoder, we utilized three datasets drawn from TCGA [29] and ROSMAP. Dataset-1 [30] is sourced from the TCGA repository and is referred to as 4-BRCA. This dataset includes multi-omics data such as Copy Number Variation (CNV), mRNA, and Reverse Phase Protein Array (RPPA) data. It encompasses four subtypes of breast cancer: Basal-like, Her2-enriched, Luminal A, and Luminal B, with a total of 511 samples. Dataset-2 [13] combines Alzheimer’s binary classification data from ROSMAP and BRCA five-class data from TCGA. It includes various modalities such as mRNA, DNA methylation, and miRNA. This dataset contains 169 samples from Alzheimer’s disease (AD) patients and 182 samples from normal controls (NC), while the five-class BRCA dataset comprises 875 samples. Dataset-3 [31] consists of data from TCGA, covering four cancer types: Prostate Adenocarcinoma (PRAD) with 250 samples, Breast Invasive Carcinoma (BRCA) with 211 samples, Bladder Urothelial Carcinoma (BLCA) with 402 samples, and Liver Hepatocellular Carcinoma (LIHC) with 354 samples. It features three modalities: mRNA, Single Nucleotide Variants (SNV), and miRNA.

Data Preprocessing: To ensure the quality and consistency of multi-omics data, categorical variables in each omics type (e.g., copy number variation (CNV), mRNA, and reverse phase protein array (RPPA)) are converted into numerical variables. All features are normalized to have a mean of 0 and a standard deviation of 1, which helps maintain consistency across different omics datasets. Furthermore, we incorporate all features from each omics type into the model, allowing it to fully capture the complex biological relationships present in the multi-omics data. To address the inherent imbalance in multi-omics data across different cancer types, we constructed balanced sets of positive and negative sample pairs for each cancer type, with the proportions reflecting their respective sample sizes. The encoder is trained in a contrastive manner, aiming to group similar points closer together in the latent space while pushing dissimilar points farther apart. Contrastive learning requires the samples to be divided into three categories: anchor, positive, and negative.
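The normalization step described above (zero mean, unit standard deviation per feature) can be sketched as follows. Fitting the statistics on the training split only and guarding constant features are assumptions added for the example; the source does not say how these cases were handled.

```python
import numpy as np

def standardize(train, test, eps=1e-8):
    """Z-score each feature using statistics from the training split only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    # Constant features would divide by zero; leave them centered at 0 instead.
    std = np.where(std < eps, 1.0, std)
    return (train - mean) / std, (test - mean) / std

rng = np.random.default_rng(0)
train = rng.normal(loc=5.0, scale=3.0, size=(40, 6))   # toy omics matrix
train[:, 2] = 7.0                                       # a constant feature
test = rng.normal(loc=5.0, scale=3.0, size=(10, 6))
train_s, test_s = standardize(train, test)
```

Reusing the training statistics on the held-out fold avoids leaking test information into preprocessing during cross-validation.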

Anchor, positive and negative samples generation: We consider a multi-omics dataset comprising N samples, each containing data from M distinct modalities. For a given modality m (where $[eqn]$ ), the samples can be described as $[eqn]$ , where $[eqn]$ is a feature vector for the i-th sample and $y^i$ indicates its cancer subtype. An anchor sample is formed by combining feature vectors from m selected modalities of the same patient and subtype: $[eqn]$ . For instance, for the i-th patient classified as Luminal-A breast cancer, $[eqn]$ . A positive sample $x^p$ is created by selecting m modalities from different patients who share the same subtype as the anchor. It is expressed as $[eqn]$ , where $[eqn]$ . For example, if patients i, j, and k all have the Luminal-A subtype, the positive sample could be $[eqn]$ . Conversely, a negative sample $x^n$ is generated by selecting at least one modality from a patient with a different class label. It is similarly represented, but at least one of the labels $y^k$ must differ from that of the anchor. For instance, if $[eqn]$ , then a possible negative sample could be $[eqn]$ . Using $[eqn]$ for sample selection introduces variability and adds complexity to the training, often resulting in positive samples that are more dissimilar to the anchor despite sharing the same label. This approach enhances the model’s ability to discern subtle differences between cancer subtypes and improves robustness by exposing it to challenging examples, ultimately assisting in better generalization and reducing the risk of overfitting.
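The sampling scheme described above can be sketched with the standard library. The data layout (`modalities[m][i]` as the feature list of patient i in modality m), the function name, and the rule of filling the remaining negative slots from same-class patients are assumptions made for illustration; the only constraints taken from the text are that positives mix modalities from other patients of the same subtype and that negatives contain at least one modality from a different subtype.

```python
import random

def build_triplet(modalities, labels, anchor_idx, rng):
    """Return concatenated anchor/positive/negative vectors and their sources."""
    n_mod = len(modalities)
    anchor_label = labels[anchor_idx]
    same = [i for i, y in enumerate(labels) if y == anchor_label and i != anchor_idx]
    diff = [i for i, y in enumerate(labels) if y != anchor_label]

    # Anchor: every modality comes from the same patient.
    anchor_src = [anchor_idx] * n_mod
    # Positive: each modality is drawn from another patient with the same label.
    positive_src = [rng.choice(same) for _ in range(n_mod)]
    # Negative: start from same-class patients, then force at least one
    # modality to come from a patient with a different label.
    negative_src = [rng.choice(same + [anchor_idx]) for _ in range(n_mod)]
    negative_src[rng.randrange(n_mod)] = rng.choice(diff)

    def concat(src):
        return [v for m, i in enumerate(src) for v in modalities[m][i]]

    sources = (anchor_src, positive_src, negative_src)
    return concat(anchor_src), concat(positive_src), concat(negative_src), sources

labels = ["LumA", "LumA", "LumA", "Basal", "Basal", "Her2"]   # toy subtypes
# Three toy modalities, one feature each, encoded so the source is traceable.
modalities = [[[10.0 * m + i] for i in range(len(labels))] for m in range(3)]
rng = random.Random(0)
anchor, positive, negative, (a_src, p_src, n_src) = build_triplet(
    modalities, labels, anchor_idx=0, rng=rng)
```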

3.2 Training and hyper-parameters setting

To ensure reliable and statistically significant results, we adopted a 5-fold cross-validation protocol across all datasets. Each dataset was randomly split into five folds, with 80% used for training and 20% for testing in each iteration. This process was repeated five times using different random seeds to ensure generalizable performance. The encoder was trained using the Adam optimizer with a learning rate of $[eqn]$ . We applied gradient clipping (1.0) to prevent gradient explosion and incorporated L2 regularization (weight decay) to mitigate overfitting. The model was trained for 100 epochs with a batch size of 64.
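The 5-fold protocol can be sketched as a plain index generator. This version shuffles once per seed and does not stratify by class; whether the authors stratified is not stated, so that choice, along with the function name, is an assumption. The sample count of 511 is taken from Dataset-1.

```python
import numpy as np

def k_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each fold is the test set once."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), n_folds)
    for i in range(n_folds):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        yield train_idx, test_idx

splits = list(k_fold_indices(511, n_folds=5, seed=0))
fold_sizes = [len(test_idx) for _, test_idx in splits]
```

Repeating the call with a different `seed` gives the repeated runs mentioned above, each with roughly 80% of samples for training and 20% for testing.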

Experiments were conducted on a system with an Intel Core i7-10700K CPU at 3.80GHz, paired with an NVIDIA GeForce RTX 3080 GPU. It has 32GB of DDR4 RAM and a 1TB SSD for fast data processing. The operating system is Ubuntu 20.04 LTS, and TensorFlow 2.5.0 was used for deep learning, with all scripts executed in Python 3 for compatibility with the latest libraries.

Our proposed model consists of a Transformer encoder and a CycleGAN framework. The computational complexity of the Transformer encoder follows Vaswani et al. [32] and can be expressed as:

[eqn]

where $L$ is the number of layers, $n_t$ is the number of samples, and $d_t$ is the feature dimension.

The computational complexity of the CycleGAN framework can be analyzed based on the complexity of convolutional networks [25,33–35]. For a single convolutional layer, the complexity is given by:

[eqn]

where $n_c$ is the number of samples, $[eqn]$ is the convolutional kernel size, and $d_c$ is the feature dimension.

Overall, the combined computational complexity remains manageable for our dataset, allowing for efficient model training within a reasonable time frame.

3.3 Results

Classifier selection: The CAEncoder maps high-dimensional multi-omics data into a reduced latent space, making it challenging to evaluate the effectiveness of the proposed encoder directly. Therefore, we assess its performance based on downstream classification. To accomplish this, we experimented with various classifiers, including Random Forest (RF), k-Nearest Neighbors (k-NN), Decision Tree (DT), and Gradient Boosting Classifier (GBC), which were trained in the reduced latent space.

Table 1 presents the classification results of various classifiers operating in the reduced latent space. The experiments indicate that the Random Forest (RF) classifier outperforms other classifiers in effectively utilizing this latent representation, achieving the highest performance. This effectiveness arises from its ensemble learning strategy, which combines predictions from multiple decision trees. This approach reduces model variance, enhances robustness to noise, and minimizes the risk of overfitting. Furthermore, RF trains multiple sub-models on different subsets of features, thereby leveraging complementary information from various modalities to boost classification performance. Consequently, we have chosen RF as the final classifier for our model in the subsequent experiments.

Table 1: The performance of various classifiers on Dataset-1. Initially, the dataset is transformed into a reduced latent space z using the proposed encoder E, after which classification is performed in this reduced space.

The performance of CAEncoder across various modalities: Fig 2 displays the performance of the CAEncoder model on Dataset-1, showcasing various combinations of modalities: one-modality, two-modalities, and three-modalities. The CAEncoder model first transforms the high-dimensional data into a latent space, after which classification is conducted using a Random Forest (RF) algorithm with 15 estimators ( $[eqn]$ ). The figure illustrates that combining more modalities results in improved performance. This demonstrates that the proposed encoder effectively learns the synergies among different modalities, enhancing generalization.

Fig 2: Comparison of classification performance on various combinations of the multi-omics modalities.

Results on Dataset-1: Table 2 presents a comparison of the classification performance between our proposed CAEncoder model and traditional machine learning classifiers, including Support Vector Machine (SVM), Multilayer Perceptron (MLP), and Convolutional Neural Network (CNN), using Dataset-1. In this study, we combined three modalities—CNV, RPPA, and mRNA—for classification purposes. To reduce dimensionality and extract independent features, we applied Principal Component Analysis (PCA) before training the aforementioned classifiers in the reduced feature space. We determine the number of retained principal components by examining the cumulative explained variance ratio, as outlined in [36]. We select components that account for at least 95% of the total variance. This threshold is chosen to balance dimensionality reduction with information retention. Our goal is to reduce the dimensionality of the data for improved computational efficiency while preserving the essential information needed to maintain model performance. The results indicate that the proposed CAEncoder model outperforms the traditional machine learning approaches.
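The component-selection rule used for the PCA baselines (keep the smallest number of components whose cumulative explained variance reaches 95%) can be sketched with an SVD. The toy matrix and the function name are hypothetical, and the authors may have used a library implementation instead.

```python
import numpy as np

def n_components_for_variance(x, threshold=0.95):
    """Smallest number of principal components reaching the variance threshold."""
    centered = x - x.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    # Squared singular values are proportional to the variance per component.
    ratio = singular_values ** 2 / np.sum(singular_values ** 2)
    cumulative = np.cumsum(ratio)
    # First index where the cumulative ratio reaches the threshold, plus one.
    return int(np.searchsorted(cumulative, threshold) + 1), cumulative

# Toy data with orthogonal, zero-mean columns of very different scale.
x = np.array([[10.0, 0.0, 0.0], [-10.0, 0.0, 0.0],
              [0.0, 2.0, 0.0], [0.0, -2.0, 0.0],
              [0.0, 0.0, 1.0], [0.0, 0.0, -1.0]])
n_95, cumulative = n_components_for_variance(x, threshold=0.95)
n_99, _ = n_components_for_variance(x, threshold=0.99)
```

In this toy case the first component carries 200/210 of the variance, so one component already clears 95%, while 99% requires two.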

Table 2: Results on Dataset-1.

Additionally, Table 2 compares the CAEncoder model with the Multi-Omics Integration Method Based on Graph Convolutional Network for Cancer Subtype Analysis (MoGCN) [30]. The MoGCN is a deep learning model that also integrates the modalities CNV, RPPA, and mRNA from Dataset-1 for classification. The results for MoGCN [30] are not recreated here but have been cited from the original paper. The findings demonstrate that the CAEncoder model outperforms MoGCN in terms of ACC and F1 scores by 3.51% and 2.65%, respectively.
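For reference, the two reported metrics can be computed as in the sketch below. This section does not state how F1 is averaged over classes, so macro averaging is an assumption here, and the labels are toy values rather than results from the paper.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging is assumed)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy three-class example, not data from the paper.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```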

Results on Dataset-2: In this section, we compare the classification performance of the proposed CAEncoder model with several state-of-the-art deep learning approaches, including MOGONET [37], MODILM [38], HyperTMO [5], and MOCAT [31]. MOGONET [37] integrates multi-omics data using graph convolutional networks, enabling patient classification and biomarker identification. MODILM [38] enhances classification accuracy for complex diseases by synthesizing significant and complementary information from various single-omics datasets. HyperTMO [5] is a multi-omics integration framework specifically designed for patient classification. It utilizes a hypergraph convolutional network to construct hypergraph structures that represent associations between samples in single-omics data. Evidence extraction is performed via the hypergraph convolutional network, allowing for the integration of multi-omics information at an evidence level. The Multi-Omics Integration Framework with Auxiliary Classifiers-enhanced Autoencoders (MOCAT) [31] effectively leverages intra- and inter-omics information. It employs attention mechanisms combined with confidence learning to improve feature representation and ensure trustworthy predictions. The results from these approaches are cited directly from their respective papers and have not been regenerated. Table 3 presents the results of the proposed method alongside the state-of-the-art methods on Dataset-2. The results indicate that the proposed CAEncoder outperforms all the SOTA methods in terms of accuracy and F1 scores.

Table 3: Results on Dataset-2.

Results on Dataset-3: Table 4 compares the performance of CAEncoder and DeepKEGG [13] across all four cancer types included in Dataset-3. DeepKEGG is an interpretable multi-omics data integration method designed to predict cancer recurrence and identify biomarkers. It features a biological hierarchical module that establishes local connections between neuron nodes, enhancing the model’s interpretability by illustrating the relationships among genes, miRNAs, and pathways. Additionally, it includes a pathway self-attention module, which analyzes the correlations between different samples and generates potential pathway feature representations that improve the model’s prediction performance. The results indicate that CAEncoder outperforms DeepKEGG on all four cancer types.

Table 4: Results on Dataset-3.

3.4 Ablation study

The proposed model, CAEncoder, primarily consists of an encoder (a transformer) and CycleGAN, and it is trained using a contrastive approach. To assess the effectiveness of each component, we conducted ablation experiments (refer to Table 5).

Table 5: Results of the ablation study using Dataset-1.

In the first experiment, we kept both the encoder and the CycleGAN intact but removed the contrastive loss, resulting in a model variant we named CAEncoder_notCL. This experiment aimed to assess the contribution of contrastive learning. The results indicated that omitting contrastive learning substantially reduced the model’s performance.

In the second experiment, we removed the CycleGAN while retaining the other components, creating a model referred to as CAEncoder_RC. Accuracy and F1-score both dropped for this variant. Taken together with the first experiment, this indicates that each component of the proposed model contributes to classification performance.

These ablation experiments provide compelling evidence of how the CAEncoder learns more discriminative and generalizable representations. Specifically, the substantial performance drop observed in CAEncoder_notCL confirms the pivotal role of contrastive learning in enhancing feature separability within the latent space. Similarly, the reduced accuracy and F1-score in CAEncoder_RC highlight the contribution of CycleGAN in preserving modality-specific details and preventing the loss of global structural information. Together, these components synergistically improve the quality of learned representations, leading to better generalization on downstream classification tasks.
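The relationship between the full model and its two ablation variants can be summarized as a configuration sketch. The variant names follow the text; the `AblationVariant` class, its flags, and the loss-term labels are hypothetical placeholders and do not reproduce the paper's actual composite loss.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AblationVariant:
    name: str
    use_contrastive: bool  # contrastive loss enabled
    use_cyclegan: bool     # CycleGAN branch enabled


# Full model plus the two ablations described above.
VARIANTS = (
    AblationVariant("CAEncoder", use_contrastive=True, use_cyclegan=True),
    AblationVariant("CAEncoder_notCL", use_contrastive=False, use_cyclegan=True),
    AblationVariant("CAEncoder_RC", use_contrastive=True, use_cyclegan=False),
)


def active_loss_terms(variant):
    """List which (placeholder) loss terms a variant would train with."""
    terms = []
    if variant.use_contrastive:
        terms.append("contrastive")
    if variant.use_cyclegan:
        terms.append("cyclegan")
    return terms


for variant in VARIANTS:
    print(variant.name, active_loss_terms(variant))
```

Each ablation disables exactly one component, so any performance change relative to the full model can be attributed to that component.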

4 Conclusion

This study introduces a novel multi-omics integration model called CAEncoder for cancer classification. This framework effectively captures the synergies among various multi-omics data and comprehensively processes complex information. Through the feedback mechanisms of CycleGAN, CAEncoder learns to identify different distributions, resulting in improved generalization. The use of contrastive learning encourages the model to understand the relationships among different modalities, thereby enhancing data integration. The encoder maps high-dimensional data into a reduced latent space, where classification is subsequently performed. We evaluated the performance of the proposed model using various datasets, and the results demonstrate that it outperforms state-of-the-art methods. In the future, we plan to extend multi-omics data integration for biomarker detection and survival prediction using self-supervised learning. Furthermore, we recognize the importance of interpretability in deep learning models and plan to explore self-attention weight analysis and feature attribution methods to elucidate the contributions of different omics features to classification decisions, thereby enhancing the model’s transparency and interpretability.

Bibliography

1. Bray F, Laversanne M, Sung H, Ferlay J, Siegel RL, Soerjomataram I, et al. Global cancer statistics 2022: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin. 2024;74(3):229–63. doi: 10.3322/caac.21834. PMID: 38572751
2. Abbasi EY, Deng Z, Ali Q, Khan A, Shaikh A, Reshan MSA, et al. A machine learning and deep learning-based integrated multi-omics technique for leukemia prediction. Heliyon. 2024;10(3):e25369. doi: 10.1016/j.heliyon.2024.e25369. PMID: 38352790. PMCID: PMC10862685
3. Gokhale M, Mohanty SK, Ojha A. GeneViT: gene vision transformer with improved DeepInsight for cancer classification. Comput Biol Med. 2023;155:106643. doi: 10.1016/j.compbiomed.2023.106643. PMID: 36803792
4. Zhang T-H, Hasib MM, Chiu Y-C, Han Z-F, Jin Y-F, Flores M, et al. Transformer for Gene Expression Modeling (T-GEM): an interpretable deep learning model for gene expression-based phenotype predictions. Cancers (Basel). 2022;14(19):4763. doi: 10.3390/cancers14194763. PMID: 36230685. PMCID: PMC9562172
5. Wang H, Lin K, Zhang Q, Shi J, Song X, Wu J, et al. HyperTMO: a trusted multi-omics integration framework based on hypergraph convolutional network for patient classification. Bioinformatics. 2024;40(4):btae159. doi: 10.1093/bioinformatics/btae159. PMID: 38530977. PMCID: PMC11212491
6. Peelen M, Bagheriye L, Kwisthout J. Cancer subtype identification through integrating inter and intra dataset relationships in multi-omics data. IEEE Access. 2024;12:27768–83. doi: 10.1109/access.2024.3362647
7. Ren Y, Gao Y, Du W, Qiao W, Li W, Yang Q, et al. Classifying breast cancer using multi-view graph neural network based on multi-omics data. Front Genet. 2024;15:1363896. doi: 10.3389/fgene.2024.1363896. PMID: 38444760. PMCID: PMC10912483
8. Qattous H, Azzeh M, Ibrahim R, Abed Al-Ghafer I, Al Sorkhy M, Alkhateeb A. PaCMAP-embedded convolutional neural network for multi-omics data integration. Heliyon. 2023;10(1):e23195. doi: 10.1016/j.heliyon.2023.e23195. PMID: 38163104. PMCID: PMC10756978