A multi-task domain-adapted model to predict chemotherapy response from mutations in recurrently altered cancer genes

Aishwarya Jayagopal; Robert J. Walsh; Krishna Kumar Hariprasannan; Ragunathan Mariappan; Debabrata Mahapatra; Patrick William Jaynes; Diana Lim; David Shao Peng Tan; Tuan Zea Tan; Jason J. Pitt; Anand D. Jeyasekharan; Vaibhav Rajan

PMC · DOI:10.1016/j.isci.2025.111992·February 11, 2025

A multi-task domain-adapted model to predict chemotherapy response from mutations in recurrently altered cancer genes

Aishwarya Jayagopal, Robert J. Walsh, Krishna Kumar Hariprasannan, Ragunathan Mariappan, Debabrata Mahapatra, Patrick William Jaynes, Diana Lim, David Shao Peng Tan, Tuan Zea Tan, Jason J. Pitt, Anand D. Jeyasekharan, Vaibhav Rajan

PDF

Open Access

TL;DR

This paper introduces DruID, a machine learning model that predicts chemotherapy response using limited genetic data from clinical-grade sequencing.

Contribution

The novel contribution is a deep learning model for drug response prediction using restricted gene sets and clinical-grade sequencing data.

Findings

01

Existing drug response prediction models perform similarly with whole-exome and clinical-grade NGS data.

02

DruID outperforms state-of-the-art methods in predicting chemotherapy response using sparse mutation data.

03

DruID demonstrates robust performance on real-world clinical datasets for pan-cancer and site-specific cases.

Abstract

Next-generation sequencing (NGS) is increasingly utilized in oncological practice; however, only a minority of patients benefit from targeted therapy. Developing drug response prediction (DRP) models is important for the “untargetable” majority. Prior DRP models typically use whole-transcriptome and whole-exome sequencing data, which are clinically unavailable. We aim to develop a DRP model toward the repurposing of chemotherapy, requiring only information from clinical-grade NGS (cNGS) panels of restricted gene sets. Data sparsity and limited patient drug response information make this challenging. We firstly show that existing DRPs perform equally with whole-exome versus cNGS (∼300 genes) data. Drug IDentifier (DruID) is then described, a DRP model for restricted gene sets using transfer learning, variant annotations, domain-invariant representation learning, and multi-task learning.…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Species1

Homo sapiens(human · species)

Diseases1

cancer

Figures6

Click any figure to enlarge with its caption.

Performance comparison across input subsets of genes(A) Comparison of AUROC (area under receiver operating characteristics, top left) and AUPRC (area under precision recall curve, top right) scores of response prediction using CODE-AE with different input subsets of genes. Baseline AUPRC is the fraction of positive labeled test (patient, drug) pairs with respect to all test (patient, drug) pairs. Performance (bottom) is measured over 3 randomly chosen test splits (mean ± SEM). Significance is assessed by ANOVA (*p* > 0.05).(B) Confusion matrices for different input subsets of genes on 203 samp

DruID improves response prediction results(A) Overview of DruID: during training, DruID takes as input the set of mutations available from cell lines as well as from the clinical NGS (cNGS) sequencing reports of patients (pathogenic variants and variants of uncertain significance). The mutations from both cell lines and patients are passed through stage (I) variant annotation. ClinVar, Gene-to-Protein-to-Disease (GPD), and Annovar are used to obtain annotations for each mutation. GPD returns one of 3 categories—protein information unit (PIU), linker unit (LU), or non-coding unit (NCU) for each

Performance comparison with other omic inputs(A) Comparison of AUROC and AUPRC scores of response prediction for annotated mutations, copy-number variations (CNVs), and combination of the two. Performance was measured over 3 randomly chosen test splits (containing 90, 82, and 90 samples, respectively). Left: figure shows AUROC (left) and AUPRC (right) curves obtained after combining predictions on all 3 test splits. Right: figure shows mean AUROC (left) and AUPRC (right) measured over 3 test splits with standard error. DruID trained with annotated mutations only significantly performs better t

Validation on real-world datasets(A) Comparison of AUROC across 5-fluorouracil, irinotecan, and oxaliplatin in first-line treatment of a cohort of patients with stage IV colorectal cancer (left) and AUROC across cisplatin/carboplatin and paclitaxel in first-line non-surgical cohort of patients with ovarian cancer (right).(B) Comparison of drug-specific AUPRC across 5-fluorouracil, irinotecan, and oxaliplatin in first-line treatment of a cohort of patients with stage IV colorectal cancer. Overall AUPRC is obtained across all 3 drugs. AUPRC for each drug is better than the corresponding baseline

Keywords

Biocomputational methodGenomic analysisPharmacoinformaticsCancerMachine learning

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsComputational Drug Discovery Methods · Machine Learning in Materials Science · RNA Research and Splicing

Full text

Introduction

Precision oncology has shifted treatment paradigms in solid organ tumors over recent years, underpinned by widespread adoption of somatic next-generation sequencing (NGS) and increasing knowledge of molecular aberrations present within tumors. However, only a minority of patients undergoing NGS go on to receive biomarker-directed, or “matched,” treatment, which currently follows a single-gene, single-target approach with oncogenic drivers such as EGFR, NTRK, RET, and BRAF.1 There remains an unmet need to better tailor or repurpose treatment for the majority of patients who lack such genomic targets based on clinical-grade NGS (cNGS). Drug response prediction (DRP) models utilizing machine learning to predict therapeutic responses represent an appealing solution.

Numerous deep learning strategies have been published in recent years using available multi-omics data from cell line, patient-derived xenograft and patient datasets.2^,^3^,^4 Cancer cell lines provide the majority of ground truth drug response data for such endeavors.5^,^6^,^7 Several methods like Uno,8 GraphCDR,9 SubCDR,10 and CDRScan11 have been developed to predict drug response in cell lines; we refer the reader to previous surveys5^,^7^,^12 for details. However, DRP models trained on cell lines alone often translate poorly to patients.13^,^14^,^15 This is partly due to inherent biological differences, meaning cell lines do not accurately represent patient tumors. Cell lines are essentially a subpopulation of the primary tumor and do not exhibit heterogeneity seen in vivo. The absence of the tumor microenvironment (TME) and interactions with the host of stromal cells present in patients is also key.13^,^16 In addition, technical differences in response measurement in cell lines versus in patients, and differences in drug dosing between cell lines and patients will affect interpretation of results by a DRP model.

While omics data are increasingly available for many patients with cancer,17^,^18 drug response data for these patients remain scarce and limited to standard-of-care therapies only. To address such challenges, transfer learning approaches including domain adaptation have been developed to train DRP models from both cell lines and patients.2^,^15^,^19

Prior studies have used omics data from 4 categories: genomics (mutation, copy-number variation [CNV]), transcriptomics [gene expression microarrays, RNA sequencing], epigenomics [methylation], and proteomics [reverse-phase protein arrays]).3 While studies on cell lines have shown gene expression data to outperform mutations,20^,^21 recent studies on patients have also identified the relevance of mutations in determining survival outcomes.22 State-of-the-art transfer learning methods, which evaluated their models on patient data, have largely restricted their analysis to gene expression data.15^,^23 The genes selected in these methods are not captured based on their presence in cNGS panels, nor are the number of chosen genes comparable across cNGS and these methods. For example, CODE-AE2 used a set of 1,426 genes, which showed the most variation in gene expression values, and Velodrome15 used a set of 2,128 genes, which were chosen based on known molecular interactions among proteins. Even when these transfer learning methods2 used mutations or combinations of mutations and gene expression, they reported better performance with gene expression. Requiring transcriptomic input data represents a challenge in bringing these methods to mainstream patient care, and it remains unknown if such tools can accurately predict response from the limited number of recurrently altered cancer genes that are included in cNGS panels such as FoundationOne CDx (324 genes), Tempus xF+ (523 genes), and TruSight Oncology 500 (523 genes). To the best of our knowledge, no prior transfer learning methods have been evaluated on such a restricted subset of genes. Moreover, methods that have used mutations as inputs have not considered the variant-level information captured in cNGS reports; instead, they treat all alterations as equal, resulting in a loss of granularity and a potential reduction in predictive accuracy (Table S1).

In this paper, we make two contributions: (1) we evaluate the efficacy of extant DRP methods on the limited subset of genes available in cNGS panels, and (2) we develop a new model, called Drug IDentifier (DruID), specifically designed for use with cNGS panels and address the modeling challenges posed by such data. We first compare the performance of DRP models CODE-AE2 and Velodrome15 on subsets of genes from cNGS panels against an extended gene list from whole-exome sequencing data (WES with 19,536 genes). Although cNGS panels show no significant difference in performance compared to WES, the DRP performance itself is low for all panels. We attribute the inferior performance of existing methods to their inadequate modeling of sparse mutation data and neglecting the fine-grained variant-level information available in cNGS reports. We addressed these limitations by designing DruID. DruID leverages advanced deep learning and transfer learning techniques and a multi-stage approach comprising variant annotation-based feature engineering, unsupervised generative modeling, and supervised multi-task learning. DruID utilized both (1) fine-grained variant information in relatively abundant unlabeled (without drug response information) cell line and patient data and (2) limited labeled (with drug responses) patient data. The training procedure of DruID is carefully designed to account for both differences in mutation distribution and drug response across the domains of cell lines and patients.

DruID is shown to outperform existing state of the art DRP models in predicting response in a cohort of patients from The Cancer Genome Atlas (TCGA). Using clinical datasets from a tertiary oncology center in Singapore, DruID’s performance is validated in patients with advanced colorectal and ovarian cancer (OV), with robust response prediction seen in these clinical cohorts.

Results

Datasets used in this study

Four datasets were used in this study: DepMap (v2021Q3), TCGA, and two cancer-specific datasets, Integrated Molecular Analysis of Cancer (IMAC)-OV and IMAC-colorectal cancer (CRC), containing patients with advanced OV and CRC, respectively (Table 1). The cancer-specific datasets (IMAC-OV and IMAC-CRC) were collected as part of the ongoing IMAC and IMAC-Gynecologic Oncology (IMAC-GO) studies from the National University Cancer Institute, Singapore. The detailed inclusion and exclusion criteria, data pre-processing, and experimental procedures are documented in Methods. The mutational information present in IMAC-OV and IMAC-CRC cohorts was obtained using the FoundationOne CDx panel (324 genes); we conducted the majority of our experiments using only the genes available in this panel. We evaluated all DRP models on a subset of drugs with a sufficiently large number of recorded responses in patients.Table 1. Overview of datasets usedDepMapTCGAIMAC-OVIMAC-CRCNumber of samples/patients (N)68947010582Age, median (min-max)NA59 (24–85)59.7 (25–81)59 (37–83)Gender female (male)NA54.9% (45.1%)100%41.5% (47.3%) (missing info - 1 patient)Primary siteOvary/fallopian tube/peritoneum34 (4.9%)0105 (100%)0Colon and rectal43 (6.2%)34 (7.2%)082 (100%)Others (lung, stomach, head and neck, bladder, skin, uterus, breast, cervical, brain, pancreas, esophageal, liver, prostate, etc.)612 (88.8%)436 (92.8%)00TreatmentCisplatin/carboplatin537 (77.9%)206 (43.8%)105 (100%)0Paclitaxel676 (98.1%)113 (24%)102 (97.1%)05-Fluorouracil589 (85.5%)125 (26.6%)082 (100%)Irinotecan668 (97%)0030 (36.6%)Oxaliplatin555 (80.6%)0051 (62.2%)Omics availabilityMutationyesyesyesyesCopy numberyesyesyesyesGene expressionyesyesnonoDetails of the datasets used in this paper, including source from where each dataset was obtained, the type of cancer in each dataset, the number of available patients (samples for cell lines), and set of drugs administered to these patients, obtained after data preprocessing. TCGA, The Cancer Genome Atlas; CCLE, Cancer Cell Line Encyclopedia; IMAC, Integrated Molecular Analysis of Cancer; OV, ovarian cancer; CRC, colorectal cancer.

DRP models based on clinical NGS data perform sufficiently accurately

We first wished to evaluate if DRP models could be generated from mutational information on a subset of genes that are recurrently mutated in cancer and captured by cNGS panels. To do this, we compared the performance of CODE-AE,2 a state-of-the-art transfer learning approach, on mutation data from three subsets of genes. Subset1 comprises 324 genes included in FoundationOne CDx analyses,24 while Subset2 consists of 285 genes common across FoundationOne CDx, TruSight Oncology 500,25 and Tempus xF+26 cNGS panels. Subset3 includes 19,536 genes, nearly all those available from WES.

Pan-cancer data from TCGA (Table 1) were used to evaluate CODE-AE performance on 3 drugs (5-fluorouracil [5-FU], cisplatin, and paclitaxel) where sufficient samples (patient, drug pairs) were available. Three train-test splits were created by random sampling. In each split, CODE-AE was trained on cell line data from the Cancer Cell Line Encyclopedia and TCGA training set and evaluated on the corresponding TCGA test set. We evaluated the performance in classifying responders (categories complete response [CR] or partial response [PR] by Response Evaluation Criteria in Solid Tumors [RECIST] v.1.1 criteria) from non-responders (stable [SD] or progressive disease [PD]). In total, the test set had 203 samples (patient-drug pairs), with 90, 82, and 90 pairs across the three splits.

The classification performance of CODE-AE for the three gene subsets is shown in Figure 1. Area under receiver operating characteristics curve (AUROC) and area under precision recall curve (AUPRC) are comparable (Figure 1A), with no significant difference between gene subsets (p > 0.05, ANOVA). Figure 1 b shows the confusion matrices at a specific, arbitrarily chosen threshold (false positive rate = 0.3, true positive rate = 0.3). Subset3 (WES gene panel) enabled the identification of more responders than the Subset1 and Subset2 (47 vs. 41 and 47 vs. 44, respectively). Specificity, precision, and sensitivity metrics are equivalent across gene subsets (Table S3). Subset3 has the highest accuracy (specificity = 0.692, sensitivity = 0.311, precision = 0.746), followed by Subset2 (specificity = 0.692, sensitivity = 0.291, precision = 0.733) and Subset1 (specificity = 0.673, sensitivity = 0.272, precision = 0.707). A similar comparison, using another DRP, Velodrome, also showed no significant difference in AUROC and AUPRC between input gene subsets (Figure S2), suggesting that information from limited gene panels is sufficient to build a DRP model of similar accuracy to that from WES gene panels.Figure 1. Performance comparison across input subsets of genes(A) Comparison of AUROC (area under receiver operating characteristics, top left) and AUPRC (area under precision recall curve, top right) scores of response prediction using CODE-AE with different input subsets of genes. Baseline AUPRC is the fraction of positive labeled test (patient, drug) pairs with respect to all test (patient, drug) pairs. Performance (bottom) is measured over 3 randomly chosen test splits (mean ± SEM). Significance is assessed by ANOVA (p > 0.05).(B) Confusion matrices for different input subsets of genes on 203 samples (patient-drug pairs) from TCGA, predictions obtained using the method CODE-AE. Color indicates the input subset, and shade indicates magnitude of the values.

DruID: An improved model for predicting chemotherapy drug response with cNGS data

As seen in Figure 1, cNGS panels with limited subsets of genes have a predictive power similar to that of a WES panel. However, it can also be observed that when using CODE-AE and Velodrome (Figure S2), the overall performance is quite poor (AUROC <0.5 and AUPRC < baseline). This alludes to the need for building better predictive DRP models. We attributed this performance to (1) inadequate modeling of sparse mutation data and (2) loss of granularity by not utilizing variant information available in cNGS panels. In this section, we introduce DruID, a transfer learning-based drug identifier model, which addresses both these issues.

DruID: Model overview

There are two challenges, in building DRP models using data from cNGS panels, that we address. The first challenge arises due to sparsity in the input data. Most patients have just a few mutations among the panel of genes considered, which leads to highly sparse input features. For example, consider the FoundationOne CDx panel with 324 genes and a simple one-hot vector feature representation indicating the presence/absence of mutations. In such a case, each patient would be represented by a binary vector of length 324, which would typically have very few non-zero values. Moreover, if additional features are used per gene, the number of coordinates per gene increases and sparsity may increase further.

The second challenge arises due to limited labeled patient data for training. Previous works have utilized preclinical data (drug responses on cell lines) through domain adaptation techniques to address this challenge.2^,^15^,^23 Cell lines and patients are considered as two distinct domains as both the distributions of mutations and response to drugs differ across these two domains. The measurements of responses also differ—real-valued area under dose-response curve (AUDRC) or half-maximal inhibitory concentration values for cell lines and categorical RECIST scores for patients. While labeled patient data are limited, the number of unlabeled patient data samples (i.e., without drug responses) is much higher, and can be utilized during training, for a suitably designed model.

Our model, DruID, addresses these challenges through a synthesis of machine learning techniques (refer Table S2 for details). Figure 2A shows the three stages of DruID, with further details presented in Figure 3.

I.Variant annotations
II.Unsupervised domain-invariant representation learning
IIIMulti-task DRP Figure 2. DruID improves response prediction results(A) Overview of DruID: during training, DruID takes as input the set of mutations available from cell lines as well as from the clinical NGS (cNGS) sequencing reports of patients (pathogenic variants and variants of uncertain significance). The mutations from both cell lines and patients are passed through stage (I) variant annotation. ClinVar, Gene-to-Protein-to-Disease (GPD), and Annovar are used to obtain annotations for each mutation. GPD returns one of 3 categories—protein information unit (PIU), linker unit (LU), or non-coding unit (NCU) for each mutation. ClinVar returns one of 3 categories - benign, pathogenic, or variant of unknown significance (VUS). Annovar returns predictions indicating deleteriousness of a variant, from 17 algorithms, which are averaged to obtain a score. Next, all mutations in the same gene are aggregated over all 3 GPD categories and 3 ClinVar categories, using mean, max, sum, and count operations. These are further concatenated to obtain a 7,776-dimensional annotated mutation vector. In stage (II) unsupervised domain-invariant representation learning, the annotated mutation vectors of cell lines and patients are passed through two separate VAEs to obtain lower dimensional representations. An additional alignment is done to ensure domain-invariant representations are learnt. The VAEs also use zero-inflated distributions to model sparse data. The learnt representations along with drug Morgan fingerprints are passed to stage (III) multi-task drug response prediction, which predicts AUDRC score for cell lines and RECIST category for patients.(B) Performance of DruID and comparator methods on response prediction from TCGA patient cohort. Left: AUROC of 5 drug response prediction (DRP) methods. Right: AUPRC of 5 DRPs. Baseline AUPRC: 0.744. DruID significantly outperforms the second best performing method, Velodrome (AUROC: p = 0.004, AUPRC: p = 0.037).(C) Comparison of response prediction for each drug. Mean AUROC (above) and mean AUPRC (below) across 3 test splits with standard error corresponding to each drug. The difference in performance was not significant (p > 0.05).Figure 3. DruID method description(A) Stage (I) variant annotation starts with the extraction of mutations from the cNGS report. We first use the protein-level annotation feature of TransVar on the extracted mutations. This returns the genomic coordinates, consequence attributes for the mutation. The output of TransVar, specifically the genomic coordinate information, is used to obtain both ClinVar and Annovar annotations. ClinVar annotations, indicating the clinical significance of the mutations, are generated using filter-based Annovar annotations with ClinVar database. The resulting annotations are grouped into one of 3 categories—pathogenic, benign, and variants of unknown significance. Annovar annotations indicate whether a mutation is deleterious or tolerated, using the predictions from 17 algorithms. These annotations are encoded as a binary vector and mean aggregated to obtain a d-score per mutation. GPD annotation needs an Ensembl ID for each gene, which is generated using the MyGene package. GPD annotates each mutation based on its location as lying in a protein information unit (PIU), linker unit (LU), or non-coding unit (NCU). The output of the annotation is shown in the table on the bottom left. These are further aggregated at a gene level, as indicated on the right. All mutations in a gene belonging to each of the 3 GPD categories are aggregated using mean, max, sum, and count. This is repeated for each of the 3 ClinVar categories to obtain 4 features per category. This results in 24 features for each gene. Thus, for the Foundation One report comprising 324 genes, a 324 ∗ 24 = 7,776-dimensional annotated mutation vector is constructed per patient.(B) Stage (II) unsupervised domain-invariant representation learning involves the use of two separate variational autoencoders, one per domain (cell line or patient). The VAEs take as input the annotated mutation vectors generated in stage (I) variant annotations and learn a lower dimensional representation for each domain. To account for the sparse nature of the input data, the VAEs are trained to maximize the likelihood of the data following a zero-inflated distribution (zero-inflated negative binomial for count data and zero-inflated normal for real-valued data). To ensure the domain-invariant nature of representations, an alignment loss (CORAL loss) is introduced between the representations learnt from both VAEs. This stage does not require labeled samples and can be trained in a fully unsupervised manner.(C) Stage (III) multi-task drug response prediction uses the representations learnt for patients and cell lines from stage (II) unsupervised domain-invariant representation learning. This is achieved by attaching the trained encoders of the VAEs from stage (II). In addition, at this stage, we also introduce drug information in the form of the Morgan fingerprint. This is passed through a feedforward neural network. This is followed by a pair of task-specific feedforward neural networks—one for the regression task for AUDRC prediction on cell lines and another for the classification task of RECIST category prediction on patients. The representations of cell lines are concatenated with the drug representation, before being passed through the AUDRC predictor multi-layer perceptron (MLP). Likewise, the patient representation is concatenated with the drug representation and fed into the RECIST predictor MLP. Binary cross-entropy loss and mean square error are calculated for the classification and regression task, respectively. The network is trained for the two tasks through multi-objective optimization (Chebyshev scalarization).(D) During inference, a (patient, drug) pair is passed in. The trained network takes as input the patient mutations and annotates it using stage (I) variant annotations. This is passed through the trained encoder of the patient VAE to obtain the patient representation. The Morgan fingerprint of the drug in the test pair is passed through the trained drug embedder. The drug and patient representations are concatenated and passed through the RECIST predictor network, which returns a predicted probability of response (RECIST category of CR or PR).

In stage I, we design features (or numeric representations of the inputs) based on various functional annotations that provide fine-grained variant-specific information.27^,^28^,^29 This enables us to fully utilize the information available in cNGS panels. To the best of our knowledge, no previous approach has used variant-level information for DRP.

In stage II, numeric representations of cell lines and patients are used together to obtain another low-dimensional domain-invariant representation. This stage has multiple goals. Since mutation-based representations are extremely sparse, we use zero-inflated distributions to model them. Further, we use variational autoencoders (VAEs), which are specialized neural models, based on generative artificial intelligence, to obtain dense lower-dimensional representations. Separate VAEs for cell lines and patients are used to model distinct distributions. They are trained together to align their lower-dimensional representations such that their distributional characteristics are similar across the domains—these domain-invariant representations are then used to train a multi-task DRP model in stage III.

In stage III, a neural model is trained to predict both AUDRC for cell lines (a regression task) and response categories (responders [PR or CR] or non-responders [SD or PD]) in patients (a binary classification task) for a given input drug. The Morgan fingerprint of the drug is used as an additional input. The model is designed to simultaneously train on these two tasks, which enables both sharing of information across the two tasks and task-specific modeling in cell lines and patients, accounting for differences in their drug responses.

To validate the importance of each of the components in DruID, we conducted an ablation study by removing the variant annotation stage first, followed by the modification of the VAEs to exclude the zero-inflated distribution (Figure S3). We observe that there is a reduction in both AUROC and AUPRC with the removal of each component, indicating their importance in the overall performance.

Our modeling approach has several advantages. Since stage II is unsupervised, we can utilize large amounts of available unlabeled patient data to obtain accurate representations of patient data. Stages II and III can be first trained on pan-cancer data and then fine-tuned on input specific to a cancer type and/or drug to obtain cancer and/or drug-specific models. By using drug fingerprints as inputs in stage III, the model can predict on drugs not seen during training—important for potential applications in drug repurposing or discovery. Finally, the VAE in stage II can be extended to model multimodal data30 including additional genomic or transcriptomic inputs. Refer to Methods for further details comparing DruID against other DRP methods.

DruID improves response prediction results

We evaluate the performance of DruID and four other transfer learning-based approaches—CODE-AE,2 Velodrome,15 TCRP,19 and TUGDA23—on TCGA.

Figure 2B shows the receiver operating characteristic and precision recall curves along with the AUROC and AUPRC values of all the methods. DruID achieves the highest AUROC and AUPRC values of 0.606 and 0.782, respectively, while Velodrome is the only other DRP to achieve AUROC and AUPRC values above the respective baselines of 0.5 and 0.744. The performance of DruID is significantly better than that of Velodrome, in terms of both AUROC (p = 0.004) and AUPRC (p = 0.037).

Performance for each of three drugs, cisplatin, paclitaxel, and 5-FU, is shown in Figure 2C. DruID performs consistently across the compounds and is the only model to achieve AUROC above the 0.5 baseline for each drug, while other methods show more variations in performance. For cisplatin, DruID has the highest average AUROC (p = 0.111 compared to TCRP). For paclitaxel, AUROC of DruID and Velodrome is comparable (p = 0.496), but the variance of Velodrome is much higher. For 5-FU, the performance of DruID and CODE-AE is comparable (p = 0.371). With respect to precision recall, for cisplatin, DruID has the best mean AUPRC followed by that of TCRP (p = 0.212). For paclitaxel, DruID has the highest mean AUPRC (p = 0.408 compared to Velodrome). For 5-FU, CODE-AE has the highest mean AUPRC, followed by DruID. The difference in p value was not significant in these cases.

Inclusion of CNV or gene expression data does not improve DruID performance

We next evaluated the effect of including CNV on model performance, which is available in many cNGS panels. The data were used directly, unlike mutation data where features based on variant annotations were used. For the CNV of each gene, we had a count value indicating amplification, loss, or no change. DruID allows us to model such data using zero-inflated negative binomial distributions within the VAE training (detailed in STAR Methods).

TCGA was used for evaluation. We compared the performance of DruID on 3 different input types: annotated mutations, combined CNV and annotated mutations, and CNV alone. In all 3 cases, only 324 genes represented in FoundationOne CDx were used, and the performance was measured over 3 test splits. Results are seen in Figure 4A with DruID’s predictive performance with annotated mutations alone shown to be significantly better than annotated mutations with CNV information, both in terms of mean AUROC (p = 0.003) and AUPRC (p = 0.013) over the 3 test splits.Figure 4. Performance comparison with other omic inputs(A) Comparison of AUROC and AUPRC scores of response prediction for annotated mutations, copy-number variations (CNVs), and combination of the two. Performance was measured over 3 randomly chosen test splits (containing 90, 82, and 90 samples, respectively). Left: figure shows AUROC (left) and AUPRC (right) curves obtained after combining predictions on all 3 test splits. Right: figure shows mean AUROC (left) and AUPRC (right) measured over 3 test splits with standard error. DruID trained with annotated mutations only significantly performs better than with annotated mutations and CNV (AUROC: p = 0.003, AUPRC: p = 0.013).(B) Comparison of performance (mean AUROC, left and mean AUPRC, right) among annotated mutations, CNVs, and combination of the two. Results shown separately for each drug in the data. Mean AUROC and AUPRC of DruID with annotated mutations outperforms others (5-Fluorouracil AUROC: p = 0.004, AUPRC: p = 0.02; Paclitaxel AUROC: 0.01).(C) Comparison of AUROC and AUPRC scores of response prediction for annotated mutations, gene expression, and combination of the two. Performance was measured over 3 randomly chosen test splits. Left: figure shows AUROC (left) and AUPRC (right) curves obtained after combining predictions on all 3 test splits (containing 83, 80, and 84 samples, respectively). Right: figure shows mean AUROC (left) and AUPRC (right) measured over 3 test splits with standard error. Performance with annotated mutations was found to be significantly better than the other two input types (AUROC: p = 0.007, AUPRC: p = 0.04).(D) Comparison of performance (AUROC, left and AUPRC, right) among annotated mutations, gene expression, and combination of the two. Results shown separately for each drug in the data. ∗ indicates statistical significance using a t test between best and second-best performing inputs (p < 0.05).

Figure 4B shows the performance, across the three input data types, separately for cisplatin, paclitaxel, and 5-FU. The mean AUROC and AUPRC of DruID across the three agents are consistently higher when using annotated mutations alone as input, compared to CNV alone or CNV and annotated mutations combined. This reached statistical significance with AUPRC (p = 0.019) and AUROC (p = 0.004) for 5-FU and with AUROC for paclitaxel (p = 0.009).

Previous works have reported that gene expression has higher predictive value compared to mutation data3; however, transcriptomic data are not available in cNGS panel reports. We analyzed a subset of patients from TCGA with both transcriptomic and genomic data available to compare the performance of DruID on 3 different input types: annotated mutations only, gene expression only, and combined annotated mutations and gene expression. In all 3 cases, 324 genes represented in the FoundationOne CDx panel were used and performance measured over 3 test splits.

Figure 4C shows the performance in terms of mean AUROC and AUPRC across the 3 input types over 3 test splits. Mutation data yield the best performance, in terms of mean AUROC (p = 0.007) and AUPRC (p = 0.040). Figure 4D shows the AUROC and AUPRC values, across input data types by drug. The performance of DruID with mutational information alone was consistent across the three compounds and significantly better than gene expression containing inputs for cisplatin on both AUROC (p = 0.028) and AUPRC (p = 0.021). Differences between inputs for paclitaxel and 5-FU were non-significant.

We note that the performance of DruID, using annotated mutations, is comparable across all three drugs. However, when CNV, gene expression, or their combinations with mutations were used, the performance varied across the drugs. In our experiments, we consistently found that mutations with variant annotations yielded higher predictive signals.

Validation of DruID on real-world clinical datasets

We undertook cancer-specific clinical validation of DruID in two tumor types, CRC and OV. Data were collected from a single tertiary hospital (National University Hospital, Singapore) as part of an ongoing clinical study (Clinicaltrials.gov ID: NCT02078544). Patients enrolled underwent somatic NGS of tumor tissue or blood (ctDNA), and treatment outcomes were recorded. We included those patients sequenced via FoundationOne CDx, utilizing mutational information from the cNGS report (pathogenic and variant of unknown significance). In light of our aforementioned results, showing worse performance with addition of CNV data, we did not incorporate CNV for the model training and evaluation.

For each analysis, we divided the respective datasets (CRC, OV) into train and test splits. DruID was trained on the patient train splits (CRC or OV) and cell line datasets and was evaluated on the patient test split (CRC or OV). We evaluated the model ability to distinguish responders from non-responders ([PR or CR] versus [SD or PD] by RECIST v.1.1 criteria). These analyses were done separately and are presented further.

Cancer-specific validation

The CRC dataset includes response to 3 drugs (5-FU, irinotecan, and oxaliplatin), in the first-line metastatic settings. For the OV dataset, we included patients with advanced OV with evaluable first-line chemotherapy response (carboplatin/cisplatin and paclitaxel).

Results of the performance analysis are shown in Figure 5A with AUROC for each individual drug remaining above baseline of 0.5, with irinotecan most promising with AUROC = 0.759 in the CRC dataset and paclitaxel with AUROC = 0.638 in the OV dataset. Analysis of AUPRC (Figures 5B and 4C) highlights that DruID performs above baseline for the drugs considered.Figure 5. Validation on real-world datasets(A) Comparison of AUROC across 5-fluorouracil, irinotecan, and oxaliplatin in first-line treatment of a cohort of patients with stage IV colorectal cancer (left) and AUROC across cisplatin/carboplatin and paclitaxel in first-line non-surgical cohort of patients with ovarian cancer (right).(B) Comparison of drug-specific AUPRC across 5-fluorouracil, irinotecan, and oxaliplatin in first-line treatment of a cohort of patients with stage IV colorectal cancer. Overall AUPRC is obtained across all 3 drugs. AUPRC for each drug is better than the corresponding baseline.(C) Comparison of drug-specific AUPRC across cisplatin/carboplatin and paclitaxel in first-line treatment of a cohort of patients with ovarian cancer. Overall AUPRC is obtained across both drugs. AUPRC for each drug is better than the corresponding baseline.(D) Comparison of predicted probability of response, on true responders and non-responders in a cohort of first-line stage IV colorectal cancer patients (left) and first-line ovarian cancer patients (right). The plots indicate the DruID predicted probability of response (complete or partial response), for true responders and non-responders.(E) Oncoplots showing frequent alterations in patients with advanced (stage IV) colorectal cancer (IMAC dataset) based on DruID predicted response to 5-fluorouracil. i. Oncoplot (left) shows 15 most frequently altered genes in patients with predicted response in the bottom 20th percentile. Oncoplot (right) shows frequency of alterations in the same 15 genes listed in (i) in patients with predicted response in the top 20th percentile.(F) Oncoplots showing frequent alterations in patients with ovarian cancer (IMAC-GO dataset) based on DruID predicted response to cis/carboplatin. i. Oncoplot (left) shows 15 most frequently altered genes in patients with predicted response in the top 20th percentile. Oncoplot (right) shows frequency of alterations in the same 15 genes listed in (i) in patients with predicted response in the bottom 20th percentile.

In comparing DruID’s predicted probability of response (CR or PR) for true responders versus non-responders, we see mean output values trend higher for true responders (Figure 5D), for CRC and OV cancer patient cohorts.

To assess for obvious discriminating features between DruID predictions, we generated oncoplots (Figures 5E and 4F) for cases with DruID predictions ranked in the bottom 20th percentile versus top 20th percentile for specific drugs in the validation datasets. Figure 5E shows oncoplots for 5-FU predictions in patients from the IMAC dataset. Mutations in KRAS (59% vs. 29%) and NOTCH1 (29% vs. 6%) are more frequent in patients with a low predicted probability of response.

Oncoplots comparing gene alteration frequency in cases with top versus bottom 20th percentile of DruID response predictions to cisplatin/carboplatin across train and test splits of the IMAC-GO (OV) dataset are presented in Figure 5F. Low frequency of alterations limits interpretation. Alterations in KDM5A were seen in 4/17 (24%) cases with a high predicted probability of response to cisplatin/carboplatin but no cases with a low predicted probability.

Discussion

Prior DRP methods that perform transfer learning have largely relied on gene expression data and WES panels. However, these data are unavailable in a clinical setting, where often, only a subset of recurrently altered genes are sequenced using cNGS, to identify mutations and CNVs. Through our empirical evaluation (Figure 1), we have shown that state-of-the-art DRPs can perform comparably with mutational information from a cNGS panel and WES. This is of significance, potentially increasing the number of patients for which a DRP such as DruID could be utilized, as cNGS is increasingly being undertaken as a standard of care in oncology practice.

However, due to the relatively poor performance of existing methods on cNGS inputs, we propose a DRP model (DruID) that handles two key challenges arising in the clinical context namely (1) sparse nature of mutation data (refer STAR Methods section “modeling varying levels of sparsity” for associated experiments) and (2) limited availability of patient drug response data. While most methods handle the distributional differences between cell lines and patient genomic profiles, most methods do not handle the differences in drug response measurements across the two domains (Table S1). Further, most of these methods were trained and evaluated on gene expression data rather than mutation data. To the best of our knowledge, DruID is the first model that uses variant annotations for mutation data processing. Similar to prior methods (Velodrome and AITL), DruID simultaneously handles distribution differences in the mutation profiles and differences in the way drug response is measured across cell lines and patients. Unlike most of the prior methods, DruID has the capability to utilize unlabeled patient data. Table S21 shows a comparison between DruID and other DRP methods on gene expression data, on genes beyond those in the FoundationOne CDx report. We also conduct a comparison of DruID trained with extremely sparse data, without the cancer type-based filtering in TCGA (Table S23). The results indicate the importance of modeling data sparsity in model training, which can be explored in future work.

Our results show that DruID outperforms other state-of-the-art DRP methods (Figure 2) on publicly available TCGA mutation data. DruID shows robust performance on two clinical cohorts of patients with CRC and OV (Figure 5). These tumor types have widely different biology and molecular profiles,31^,^32 highlighting that DruID’s performance is not dependent upon the presence of certain gene/mutation signatures that may be specific to one tumor. Validation on other patient cohorts can further establish this generalizability. A limitation of our current validation in TCGA and IMAC/IMAC-GO datasets is the modest number of patients and drugs included, due to the restricted availability of labeled response data. This is a problem encountered in the validation of many DRPs, with the acquisition of reliable patient response data key to model training and performance.

The use of CNV and gene expression data did not improve results when compared to those obtained from annotated mutations alone (Figure 4). This is contradictory to prior work on cell lines, where gene expression data showed the best performance.3 The findings in our experiment suggest that DruID is superior in its ability to handle sparse mutation data; however, as our test set sizes are quite small, these results may not generalize to all patient cohorts and would need to be validated on larger patient datasets. In our experiments, we also find that the performance with combinations of different data types (mutations and CNV; mutations and gene expression) was found to be lower than that of each of the individual data types (CNV and gene expression, respectively), in most cases. This suggests that the modeling approach can be further improved with respect to integrating diverse data types. One approach could be to use multimodal techniques to handle different data types. However, this requires further investigation, which we leave to future studies.

Assessment of mutational profiles of patients with low versus high predicted probability of response suggests DruID’s ability to identify biomarkers of poor response/prognosis consistent with prior knowledge. Alterations in KRAS and NOTCH1 appeared more frequently in predicted non-responders from the CRC patient dataset (Figure 5E). This is consistent with the known function of KRAS alterations as a poor prognostic marker in CRC.33 The role of NOTCH1 as a prognostic marker is not as clear in CRC, but in esophageal squamous cell carcinoma, it is reported to be associated with cancer progression and lower response rates.34^,^35 In the analysis of patients with OV and their mutational profile by predicted probability of response, it is difficult to draw strong patterns from the frequencies presented (Figure 5F). A possible trend is seen for KDM5A alterations, appearing in patients with high predicted probability of response but absent in those with low predicted probability. This will benefit from further patient analysis. The role of KDM5A in cancer is continuing to be elucidated, but overexpression is thought to drive progression.36

DruID has significant potential as a clinically applicable DRP. Due to its design, it can be fine-tuned for any drug or cancer type, provided sufficient training data (both unlabeled and labeled) is available. Additionally, it has the possibility to be utilized as a drug repurposing tool (refer Table S22 for related experiments) as it can provide response predictions for previously unseen drugs by utilizing drug molecular information as a model input (Figure 2). This is of potential significance to patients with refractory advanced malignancies, who, in the absence of an actionable mutation being identified on cNGS, will often undergo empiric anti-cancer therapy with low expected response rates. The prospect of a drug repurposing tool that can utilize cNGS data to give a personalized treatment recommendation based on a tumor mutational profile is both exciting and appealing. Such an application of a DRP model requires prospective validation, the first steps of which we are undertaking in an ongoing trial incorporating DRP-recommended therapy in patients with refractory solid organ malignancies in Singapore (NCT05719428).

Conclusion

In this paper, we evaluated state-of-the-art DRP models on the limited subset of genes sequenced in cNGS panels and established that gene panels can perform as well as WES panels for DRP. To improve the performance of DRP models on cNGS panels, we present a transfer learning-based DRP algorithm—DruID—which can handle the sparse nature of cNGS mutation data and the limited availability of patient response to drugs. Results presented show DruID to be superior to the existing state-of-the-art DRP methods on a pan-cancer TCGA dataset with satisfactory performance seen on two cancer-specific clinical datasets. While we have utilized a panel of genes specific to one commercially available cNGS test, DruID can be altered to work on any panel gene set. Future work developing such tools for drug repurposing endeavors may provide further clinical applications.

Limitations of the study

The current design of our model could be improved further. While patients often undergo a treatment regimen comprising multiple drugs, in our experiments, we treat each drug as being administered independent of the others. To consider a regimen as a whole, a possible approach could be to use the combination of drugs as an input to the drug embedder network. The multi-task learning architecture in stage III of DruID also allows the addition of related tasks. Patient survival information can be incorporated into the model in the form of an additional task, assuming that sufficient patient data with survival outcomes are available in utilized datasets. Further improvements can be made to the model’s explainability as well. Currently, it is not inherently explainable. Explainable algorithms37 can be used over the model predictions to obtain useful insights to improve user confidence and guide clinical decision-making. The model can be improved to preserve the granular mutation-level information that may in part be lost during gene-level aggregation, to enhance the discovery of associations between drug response and alterations at the mutation level. The effect of different mutation callers on DRP performance is not clear. Our current evaluation uses TCGA mutations obtained from VarScan. We compared the variants called by VarScan38 against those obtained from a consensus mutation calling approach39 and identified that 71.43% samples have more than 60% overlap in mutations across both datasets (Figure S5). Studying the effect of different mutation callers lies outside the scope of our study but nonetheless may be a valuable area for future exploration.

In this paper, we did not focus on predicting response to immune checkpoint inhibitors (ICIs), a group of agents with an increasing number of indications in cancer treatment. Response to immune checkpoint inhibition relies on a complex interplay between cancer cells and the TME.40 Cell lines do not provide such complete cancer-TME interactions and are therefore not an adequate data source from which to model. For such a model, it would benefit from increased patient-level response data for patients treated with ICI, something our datasets lacked. This is an evolving area for future work with encouraging results seen through the use of expression data41 and other groups incorporating radiomics with clinical parameters into prediction models.42

Resource availability

Lead contact

All correspondence related to this publication are to be addressed to the lead contact, Dr. Vaibhav Rajan ([email protected]).

Materials availability

This paper does not propose the generation of new materials.

Data and code availability

•This paper analyzes existing, publicly available data, accessible at Figshare: https://doi.org/10.6084/m9.figshare.15160110.v2 and https://doi.org/10.1038/s41467-021-21997-5.
•Deidentified patient data have been deposited at Zenodo as Zenodo: https://doi.org/10.5281/zenodo.14773191. They are publicly available as of the date of publication.
•All original code has been deposited at Zenodo and is publicly available at Zenodo: https://doi.org/10.5281/zenodo.14742083 as of the date of publication.
•Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.

Acknowledgments

D.S.P.T. is supported by the National Medical Research Council, Singapore under its NMRC Clinician Scientist Award (MOH-001006) and has received charitable research funding from the Pangestu Family Foundation Gynaecological Cancer Research Fund. The ongoing IMAC study is supported by 10.13039/501100001381National Research Foundation, Singapore and National Medical Research Council, Singapore under its NMRC Centre Grant Program (NMRC/CG/M005/2017_NCIS).This research is supported by the 10.13039/501100001381National Research Foundation, Singapore under its AI Singapore Programme (award number: AISG-100E-2023-116). A.J. is supported by the National University of Singapore Research Scholarship. Images in this paper were created using FlatIcon and Freepik.

Author contributions

Conceptualization, A.D.J., V.R., J.J.P., A.J., and R.J.W.; data curation, A.J., R.J.W., K.K.H., D.S.P.T., T.Z.T., and R.M.; formal analysis, A.J., R.J.W., K.K.H., R.M., T.Z.T., J.J.P., A.D.J., and V.R.; funding acquisition, A.D.J., V.R., and D.S.P.T.; investigation, A.J., R.J.W., K.K.H., R.M., D.M., P.W.J., D.L., D.S.P.T., T.Z.T., J.J.P., A.D.J., and V.R.; methodology, A.J., R.J.W., K.K.H., R.M., T.Z.T., J.J.P., A.D.J., and V.R.; project administration, P.W.J. and T.Z.T.; resources, A.D.J., V.R., and D.S.P.T.; software, A.J., K.K.H., R.M., D.M., and V.R.; supervision, A.D.J. and V.R.; validation, A.J., R.J.W., K.K.H., R.M., T.Z.T., J.J.P., A.D.J., and V.R.; visualization, A.J., R.J.W., K.K.H., R.M., T.Z.T., J.J.P., A.D.J., and V.R.; writing – original draft, A.J., R.J.W., T.Z.T., A.D.J., and V.R.; writing – review and editing, A.J., R.J.W., K.K.H., R.M., D.M., P.W.J., D.L., D.S.P.T., T.Z.T., J.J.P., A.D.J., and V.R.

Declaration of interests

R.J.W. reported serving on the advisory board of Pfizer, Novartis, and Merck (MSD) and receiving honoraria from Pfizer, AstraZeneca, and Merck (MSD) outside the submitted work.

D.S.P.T. reports personal fees for advisory board membership from AstraZeneca, Bayer, Boehringer Ingelheim, Eisai, Genmab, GSK, MSD, and Roche; personal fees as an invited speaker from AstraZeneca, Eisai, GSK, Merck Serono, MSD, Roche, and Takeda; ownership of stocks/shares of Asian Microbiome Library; institutional research grants from AstraZeneca, Bayer, Karyopharm Therapeutics, and Roche; institutional funding as coordinating PI from AstraZeneca and Bergen Bio; institutional funding as local PI from Bayer, Byondis B.V., and Zeria Pharmaceutical Co., Ltd.; a previous non-remunerated role as Chair of the Asia-Pacific Gynecologic Oncology Trials Group (APGOT); a previous non-remunerated role as the Society President of the Gynecologic Cancer Group Singapore; non-remunerated membership of the Board of Directors of the GCIG; non-remunerated role as Chair of the Cervical Cancer Research Network of the GCIG; non-remunerated role as Protocol Committee Chair of APGOT; and product samples from AstraZeneca, Eisai, and MSD (non-financial interest).

R.M. and V.R. are co-founders of Spectrum Learning Analytics.

A.D.J. has received consultancy fees from DKSH/BeiGene, Roche, Gilead, Turbine Ltd, AstraZeneca, Antengene, Janssen, MSD, and IQVIA and research funding from Janssen and AstraZeneca.

STAR★Methods

Key resources table

REAGENT or RESOURCESOURCEIDENTIFIERDeposited dataTraining dataCCLE, TCGAFigshare: https://doi.org/10.6084/m9.figshare.15160110.v2 and https://doi.org/10.1038/s41467-021-21997-5Deidentified patient dataZenodo: https://doi.org/10.5281/zenodo.14773191Software and****algorithmsDruID algorithmThis paperZenodo: https://doi.org/10.5281/zenodo.14742082

Experimental model and study participant details

Datasets

TCGA

The raw mutation data was obtained from The Cancer Genome Atlas17 GDC portal.61 The response data was obtained from an earlier work.4 We only consider the cancer types belonging to the following TCGA projects/cancer types - LUAD, STAD, HNSC, SKCM, BLCA, UCEC, COAD, LUSC, BRCA, CESC. We further retain the TCGA samples which have a corresponding RECIST v1.1 response to the drugs Cisplatin, Paclitaxel, 5-Fluorouracil, Gemcitabine, Docetaxel, Cyclophosphamide. These are the drugs with at least 50 TCGA samples having a documented RECIST response. We further convert the RECIST labels into two categories - complete response(CR) and partial response(PR) are grouped together as responders while stable disease(SD) and progressive disease(PD) are grouped together as non-responders.

TCGA data filtration

We focused on only TCGA patients with RECIST response, which is necessary for the downstream Drug Response Prediction Multi-Task Learning task. From the GDC portal, mutation data in compressed Mutation Annotation Format (MAF) were obtained for each of the 33 cancer types present in TCGA. For each cancer type, we extracted out mutations present in the 324 genes available in FoundationOne CDx, based on their Entrez gene IDs. Only mutations which were “damaging” or “other non-conserving” were retained, as these mutations are found in the coding region of genes and are assumed to alter gene function. This included mutations classified as "Missense_Mutation", "In_Frame_Del", "Splice_Site", "Nonsense_Mutation", "Frame_Shift_Ins", "Frame_Shift_Del", "Nonstop_Mutation", "Translation_Start_Site", "In_Frame_Ins".62 For each cancer type, the number of genes with mutations across all patients of the cancer type considered were examined. We retained the cancer types with at least 300 genes mutated across all associated patients, to have sufficient representation across mutations in the data. This resulted in 10 cancer types - TCGA-BRCA, SKCM, COAD, LUSC, LUAD, HNSC, CESC, STAD, BLCA, UCEC, with a total of 4818 patients. We next checked the presence of copy number variations (CNV) in each of these patients, retaining only those with CNV and mutations in the 324 genes, yielding 4766 patients. We used this subset to perform subsequent evaluations comparing the viability of using various input types in the model. Next, we obtained TCGA responses for patients from an earlier work,4 for 2572 patients. We also removed patients who were given drugs without a SMILES string and drugs without a response measured in cell lines. This reduced the patients down to 596. We further removed patients with responses associated with drugs having less than 50 labelled samples. The remaining patients had responses to the drugs Cisplatin, Paclitaxel, 5-Fluorouracil, Gemcitabine, Docetaxel, Cyclophosphamide. This resulted in 470 samples.

Mutations

We first filter out and retain those TCGA samples which have a mutation classified as one of "Missense_Mutation", "In_Frame_Del", "Splice_Site", "Nonsense_Mutation", "Frame_Shift_Ins", "Frame_Shift_Del", "Nonstop_Mutation", "Translation_Start_Site", "In_Frame_Ins".

Gene expression

The gene expression data from the TCGA GDC portal(v1.29.0)^17^ was used directly.

IMAC colorectal cancer (IMAC-CRC)

The raw data was obtained from patients with advanced colorectal cancer enrolled and consented into the Integrated Molecular Analysis of Cancer (IMAC) study. The IMAC study is an ongoing prospective trial using broad panel sequencing of refractory solid-organ malignancies to identify targetable molecular alterations in the Phase I unit of the National University Cancer Institute, Singapore (NCIS). We retained patients with successful sequencing on FoundationOne CDx and available response data from their first line therapy in the metastatic setting. Drugs with more than 10 response events (patient, drug pairs) and available smiles string were included in subsequent training. These included 5-fluorouracil (includes capecitabine), irinotecan, oxaliplatin and cetuximab. We converted the RECIST labels into two categories - complete response(CR) and partial response(PR) are grouped together as responders while stable disease(SD) and progressive disease(PD) are grouped together as non-responders.

IMAC-GO ovarian cancer (IMAC-OV)

The raw data was obtained from patients with advanced ovarian cancer enrolled and consented into IMAC-Gynaecologic Oncology (IMAC-GO) study, a prospective study using broad panel sequencing of advanced gynaecological malignancies in National University Cancer Institute, Singapore (NCIS). We retained cases with successful sequencing on FoundationOne CDx and an evaluable response to first line treatment. Patients who had undergone upfront cytoreductive surgery with no remaining evaluable disease post-operatively were excluded. Regimens in retained cases included cis/carbo-platin (combined as for analysis purposes) and paclitaxel.

Both IMAC-CRC and IMAC-OV included cases utilising the FoundationOne CDx testing platform, giving mutational information of 324 genes of interest. We included reported pathogenic alterations and variants of uncertain significance.

Cell lines

The raw data was obtained from the CCLE63 DepMap portal.64 The drug response for these cell lines was obtained from the GDSC portal.65 We retained the cell lines which have a corresponding drug response measured in terms of AUDRC.

For training with datasets listed above (TCGA, IMAC-CRC and IMAC-OV) we filter cell lines with responses to drugs retained in the the dataset in question.

-For training with TCGA, we filter cell lines with responses to the drugs Cisplatin, Paclitaxel, 5-Fluorouracil, Gemcitabine, Docetaxel and Cyclophosphamide. This set of cell lines and TCGA patient samples is labelled CCLE-TCGA dataset.
-For training with IMAC-CRC, we filter cell lines with responses to the drugs 5-Fluorouracil, Irinotecan, Cetuximab and Oxaliplatin. We call this set of cell lines and IMAC-CRC samples the CCLE-CRC dataset.
-For training with IMAC-OV samples, we filter cell lines with responses to the drugs Cisplatin, Paclitaxel, Gemcitabine and Doxorubicin. This set of cell lines and IMAC-OV samples is labelled the CCLE-OV dataset.

The CCLE-TCGA dataset consists of 689 cell lines and 470 TCGA patients, the CCLE-CRC dataset contains 689 cell lines and 82 colorectal cancer patients, while the CCLE-OV dataset contains 677 cell lines and 105 ovarian cancer patients.

Mutations

We retain only the mutations that are annotated as “damaging” and “other non-conserving”.

Gene expression

The gene expression data from the CCLE DepMap portal was used directly.

Experiment settings

After processing described above, we have three datasets consisting of a combination of cell lines and patient samples: (1) CCLE-TCGA, (2) CCLE-CRC and (3) CCLE-OV. The 3 datasets (i.e. (patient, drug) pairs) were divided into 80-20 train-test splits, with 3 different random states to generate 3 splits (Tables S7–S17).

Evaluation metrics

Since we modelled the task of drug response prediction in patients as a classification problem, we evaluated the performance of the model in terms of Area Under the Receiver Operating Characteristic curve (AUROC) and the Area Under the Precision Recall Curve (AUPRC). These metrics were calculated on the held-out 20% test splits on all 3 splits for CCLE-TCGA, CCLE-OV and CCLE-CRC datasets. The overall AUROC and AUPRC were calculated for each split by considering the predictions and ground truth labels for all the drugs together. The final results were reported per drug on each dataset, for all those drugs with at least 80 RECIST responses for CCLE-OV and CCLE-TCGA datasets, and at least 10 RECIST responses for CCLE-CRC dataset. The baseline AUPRC is calculated as the fraction of positive labelled test (patient, drug) pairs with respect to all test (patient, drug) pairs.66

Features

We encode the mutation, gene expression and copy number variation data, for all datasets, into vectors of different dimensions (Table S19). These vectors can be binary (CNV, mutations) or real valued(annotated mutations, gene expression). These vectors also differ in the dimensionality based on the input genes used and the processing done. The statistics on data sparsity is also available in Table S18.

Experiments

Clinical NGS data is sufficient for DRP model performance

CODE-AE was trained on the train split of the CCLE-TCGA dataset (comprising patients and cell lines) and evaluated on the patients in the test split. 3 subsets of genes were considered:

(i)only FoundationOne cNGS panel genes (324 genes)
(ii)whole exome sequencing (WES) panel of 19536 genes and
(iii)285 genes that are common across FoundationOne, TruSight Oncology 500 and Tempus xF+ panels.

This was repeated across all 3 random train-test splits of the CCLE-TCGA dataset. Details of the input features in each subset are provided in Table S19. Results of the comparison are presented in Figures S1 and S2 and Table S3.

For WES, the feature space dimension was first reduced by using an autoencoder (AE) to project down to 324 dimensions before running CODE-AE. The AE had an encoder-decoder architecture with one bottleneck layer to project from 19536 to 324 dimensions. It was trained to minimise the Mean Squared Error (MSE) loss between input and reconstructed matrices, over 2000 epochs with a learning rate of 1e-4 and convergence threshold of 1e-5. This allowed us to train in our computing environment that had limited memory. For evaluation, the overall AUROC and AUPRC were calculated for each train-test split, by combining the predictions for each (patient, drug) pair in the test split. During evaluation only (patient, drug) pairs with 5-Fluorouracil, Cisplatin and Paclitaxel as the drug were considered. These drugs had more than 80 (patient, drug) pairs in the TCGA dataset. To calculate the overall AUROC and AUPRC, the predicted responses from CODE-AE for (patient, drug) pairs for all three drugs, 5-Fluorouracil, Cisplatin and Paclitaxel, and their ground truth RECIST labels, were considered together. To test if the performance was significantly different across the 3 gene subsets, we conducted an ANOVA test across the overall AUROC and AUPRC for the 3 test splits. This resulted in a p-value of 0.8367 for overall AUROC and 0.78 for overall AUPRC. As such, we could not reject the null hypothesis that all 3 subsets had a similar performance across 3 test splits.

Further, we combined all (patient, drug) pairs in the test splits of the 3 train-test splits, with respect to the predicted responses from CODE-AE and the ground truth RECIST labels. For (patient, drug) test pairs present in more than one test split, we took the mean predicted response across the test splits. This aggregation allowed us to combine the test split (patient, drug) pairs across the 3 test splits. In total the aggregated test set had 203 samples (patient, drug pairs on which the model predicts), with 90, 82 and 90 pairs across the three splits. To consider the differences across cancer types, we considered cancer types with more than 20 (patient, drug) test pairs. We repeated the above comparison across all 3 subsets of genes, using Velodrome as well (Figure S2; Table S4).

DruID: Predicting chemotherapy drug response with cNGS data

All the existing baseline models (CODE-AE, TCRP, TUGDA, Velodrome) were trained on the cell lines and patients in the train splits of the CCLE-TCGA dataset, and evaluated on the patients in the corresponding test splits. Only 324 genes from the FoundationOne panel were considered for all the experiments. This was repeated across all 3 train-test splits of the CCLE-TCGA dataset. For the baseline methods, the inputs were binary mutation vectors (i.e., without variant annotations). For DruID, initially a model was trained using CCLE-OV data (including all IMAC-GO patients) (annotation, unsupervised domain adaptation and multi-task learning). Then another DruID model was instantiated with these learnt weights for each of the 3 drugs - 5-Fluorouracil, Cisplatin and Paclitaxel for each train split. Each of these drug-split-specific models was trained using the CCLE-TCGA train split consisting of (patient,drug) train pairs where the drug matched the drug in the drug-split-specific model.

For evaluation, the overall AUROC and AUPRC were calculated for each train-test split, by combining the predictions for each (patient, drug) pair in the test split. To calculate the overall AUROC and AUPRC, the predicted responses from all the baseline methods for (patient, drug) pairs with 5-Fluorouracil, Cisplatin and Paclitaxel, along with their ground truth RECIST labels, were considered together. We checked the significance of overall AUROC and AUPRC across the 3 test splits for DruID and Velodrome (second-best performing model), using a t-test. We obtained a p-value of 0.004 on AUROC and 0.037 on AUPRC, indicating a significant difference between the performance of DruID and the second best-performing model.

Further, for each method, we combined all (patient, drug) pairs in the 3 test splits as described earlier.The aggregated test set had 203 samples (patient-drug pairs), with 90, 82 and 90 pairs across the three splits. This aggregated test set had 88 patients treated with Cisplatin, 58 patients treated with Paclitaxel and 57 patients treated with 5-Fluorouracil.

We also conducted an ablation study by successively removing components of the DruID architecture (Figure S3). In the first ablation, we removed the variant annotation step to obtain DruID-VA (“DruID minus VA”); instead 324-dimensional binary representations of cell lines and patients are used with each binary value indicating presence/absence of mutation(s) in the gene. We observe that DruID-VA achieves AUROC, AUPRC of 0.5715 and 0.7783 respectively, which is lower than that of DruID (0.6236 and 0.8206 respectively, paired t-test statistic 1.6884, pvalue 0.1667 for AUROC and statistic 2.7948, pvalue 0.0539 for AUPRC). In the second ablation, we also removed the zero inflated loss terms and the zero inflated layer in the unsupervised domain adaptation step. This further reduces the performance to AUROC of 0.5316 and AUPRC of 0.762 (paired t-test statistic 0.9576, pvalue 0.2197 for AUROC and statistic 0.8501, pvalue 0.2424 for AUPRC).

Copy number variation (CNV) information or gene expression data does not improve DruID performance

DruID was trained with different input data types, in this experiment. We used the cell lines and patients in the train splits of the CCLE-TCGA dataset for training and evaluated it on the patients in the test splits.

We compared the performance when using only CNV data (this was one hot encoded to indicate loss, no change and amplification), only variant annotated mutation data and a combination of the two. For combining binary CNV and real valued variant annotated mutation data, the UDA step involved the use of 2 separate ZI VAEs (ZINB for CNV and ZINorm for variant annotated mutations) per domain. The representations from both ZI VAEs were concatenated and used as the representation in the further layers of the architecture. For evaluation, the overall AUROC and AUPRC were calculated for each train-test split, by combining the predictions for each (patient, drug) pair in the test split. To check the significance of the performance using annotated mutations, we ran a t-test between AUROC and AUPRC from DruID trained using only variant annotated mutations and DruID trained using a combination of copy number variation and variant annotated mutations (second best performing model), across the 3 test splits. This yielded a p-value of 0.003 for AUROC and 0.013 for AUPRC, which indicated significant difference in performance of annotated mutations over copy number variation.

Further, for each method, we combined all (patient, drug) pairs in the 3 test splits to obtain an aggregated test split, as described earlier, to obtain the AUROC and AUPRC curves (Figure 4A).

We also compared the performance of these 3 input data types, across various drugs. For each input type, the mean AUROC and AUPRC for each drug across all 3 test splits were calculated, while plotting Figure 4B. The significance was tested by comparing the AUROC and AUPRC across the 3 test splits for each drug, between annotated mutations and a combination of annotated mutations and copy number variation using a t-test. Annotated mutations were significantly better with respect to AUPRC and AUROC for 5-Fluorouracil (p = 0.004), and for AUROC for Paclitaxel (p = 0.009).

Further, we compared the performance when using only gene expression data, only variant annotated mutation data and a combination of the two. Both data types were real valued and involved the use of ZINorm VAEs. Since some TCGA samples did not have gene expression data available, we dropped these from the test splits while comparing the performance across the various input types. For evaluation, the overall AUROC and AUPRC were calculated for each train-test split, by combining the predictions for each (patient, drug) pair in the test split. To check the significance of the performance using annotated mutations, we ran a t-test between AUROC and AUPRC from DruID trained using only variant annotated mutations and DruID trained using only gene expression (second best performing model), across the 3 test splits. This yielded a p-value of 0.007 for AUROC and 0.04 for AUPRC, which indicated significant difference in performance of annotated mutations over gene expression. Similar to CNV, we combined all (patient, drug) pairs in the test splits of the 3 train-test splits, with respect to the predicted responses from the method and the ground truth RECIST labels, to obtain the AUROC and AUPRC curves in Figure 4C.

We also compared the performance of these 3 input data types, across various drugs. The significance was tested by comparing the AUROC and AUPRC across the 3 test splits for each drug, between annotated mutations and gene expression using a t-test. Annotated mutations were significantly better with respect to AUPRC (p = 0.021) and AUROC for Cisplatin (p = 0.028). Details of the input feature vectors are in Table S19.

Validating DruID on clinical datasets

We evaluated DruID on two clinical datasets - CCLE-CRC and CCLE-OV, as described in the results section.

DruID was trained using the train splits of each dataset and evaluated on the corresponding test splits. We combined all (patient, drug) pairs in the 3 test splits (34, 33, 35 patient, drug pairs in each test split of CCLE-CRC and 32, 64, 32 patient, drug pairs in each test split of CCLE-OV dataset) as described earlier. The aggregated CRC test set had 38 patients treated with 5-Fluorouracil, 23 patients treated with Oxaliplatin and 15 patients treated with Irinotecan, resulting in a total of 76 test (patient, drug) pairs. The aggregated OV test set had 55 patients treated with Cisplatin/Carboplatin, 53 patients treated with Paclitaxel, resulting in a total of 108 test (patient, drug) pairs. The mean predicted probability of response (Figure 5D), across responders and non-responders to each drug, was calculated by passing the prediction from DruID through a sigmoid function.

Oncoplots for predicted non-responders and responders to 5FU and Cisplatin/carboplatin, across 3 train-test splits on patients in CCLE-CRC and CCLE-OV, were generated using the maftools R package67 (Figures 5E and 4F). Predicted responders were those with predicted response in the top 20th percentile of predicted responses to the drug and predicted non-responders were those with predicted response in the bottom 20th percentile of predicted responses to the drug.

Additional experiments

Modelling Varying levels of sparsity

DruID can handle not just sparse mutation data but also dense data like gene expression profiles. DruID also allows handling of different input types (real valued gene expression or binary valued mutations). Table S18 shows the sparsity levels of various input data types across the datasets used in this paper. Sparsity refers to the percentage of zeros in the input data matrix. Figure S4 shows the performance of DruID on TCGA test data, measured in AUROC across various input types, with varying sparsity levels.

Comparison across variant callers

The TCGA dataset used in our paper obtains mutations from the VarScan mutation caller. While studying the influence of various mutation callers on the results is a valuable direction for future exploration, here we identify the overlapping fraction of mutations across VarScan and a consensus mutation calling approach.39 Figure S5 indicates the count plot of fraction of overlapping mutations, in the 324 genes in FoundationOne CDx, across patients in both datasets. Although there are patients with a low overlapping fraction of mutations, the left skewed distribution, with mean overlap of 69.16% mutations denotes that there is a substantial overlap in mutations across the variant callers. 71.43% samples have more than 60% overlap.

Comparison of DRP methods on gene expression data

To show that DruID does well on gene expression data, on genes beyond the ones in the FoundationOne CDx panel, we conducted an experiment comparing DruID to other SOTA baselines, using gene expression data and response labels obtained from.2 We used the gene expression profiles provided and their corresponding response labels for this experiment, from cell lines and TCGA patients. The feature selection procedure follows that of2: the top 1000 most varied genes in terms of gene expression values, in cell lines and tumour samples separately are used to perform gene selection. The sets of 1000 genes across the two domains are combined to obtain a set of 1426 genes. The response labels were determined based on the median number of days to new tumour events. For cell lines, Z-scores were calculated across the AUDRC scores of all cell lines. These were then used to convert the AUDRC scores into binary labels (label 0 if Z-score < 0 and 1 otherwise). We considered the drugs 5-Fluorouracil, Temozolomide, Sorafenib, Cisplatin and Gemcitabine, each of which had at least 20 labelled patient samples associated and a drug Morgan fingerprint. We divided the labelled samples associated with each drug in an 80:20 ratio using 3 different random seeds to generate 3 train-test splits.

We evaluated the model performance of all baselines as well as DruID, with respect to patient response prediction. We compared the average AUROC and AUPRC scores across all 3 test folds. The mean +/- SD AUROC and AUPRC scores are reported in Table S21. DruID outperforms all other baselines in terms of AUROC and AUPRC, for 4 of 5 drugs, showing its promise in drug response prediction with gene expression data as well.

Performance of DruID on drugs unseen during training

As DruID uses drug features as inputs to the model, it can predict patient responses to drugs not seen during training. To showcase this, we trained DruID for TCGA patients, on the drugs Cisplatin, Paclitaxel, 5-Fluorouracil, Docetaxel and Cyclophosphamide, fine-tuned for 5-Fluorouracil. We created an inference dataset, from available labelled TCGA patient, drug pairs, by excluding drugs seen during training, and patients on regimen. Excluding drugs seen during training allows us to gauge how the drug embedding network in Stage III Multi-task Drug Response Prediction of DruID works on new drugs. This resulted in 27 and 16 (patient, drug) pairs with recorded response to Doxorubicin and Vinorelbine respectively. We then obtained predicted probability of response for these pairs and derived corresponding AUROC and AUPRC scores, as shown in Table S22. In both drugs, DruID achieves an AUPRC score higher than the baseline AUPRC score. Other baseline methods do not use drug information in the input and cannot be used to perform repurposing.

Performance of DruID when trained on all cancer types

In this experiment, we remove the data filtration criteria in TCGA, to include available data on all cancer types. We retain the test data splits from previous experiments and increase the number of train data points. As before, we pretrain DruID on CCLE-OV data and finetune for 5-Fluorouracil, Cisplatin and Paclitaxel. Although we have more training samples per drug, we do not see a significant improvement in AUROC and AUPRC (Table S23). Cisplatin and Paclitaxel show a drop in performance, while 5-Fluorouracil shows a slight, though insignificant improvement.

Overall, across the 3 drugs, model performance deteriorates despite increase in training data samples. This is likely due to the inclusion of extremely sparse data.

Method details

DruID Method Description

In the following sections, we describe each of the three stages of DruID(Figure 2A) in more detail.

Stage I: Variant annotations

For each point mutation in the input dataset, we generate annotations using the following tools:

1.ClinVar27 – provides clinical significance of each mutation. We group these together into 3 broad categories - pathogenic, benign and variants of unknown significance (VUS).
2.GPD28 – provides annotations based on the location of each mutation - protein information unit (PIU), linker unit (LU) and non-coding unit (NCU).
3.Annovar29 – provides annotations for each mutation that indicates if it is deleterious or not, from 17 different prediction algorithms (as shown in Table S6). These are aggregated (via mean) to calculate a d-score for each mutation.

The categories from GPD, Clinvar and the score from Annovar are shown on the top right of Figure 3A. These are aggregated to obtain gene-level features, shown on the bottom right of Figure 3A. Below we describe more details.

Processing for ClinVar, GPD and Annovar

Before annotations from ClinVar,27 Annovar29 and GPD28 can be obtained, we use TransVar,43 which takes as input a point mutation and provides the location of the mutation on the genome. The TransVar output is used by Annovar, GPD and ClinVar.

For GPD, the input is expected to be in the MAF(Mutation Annotation File) format. The TCGA dataset is directly available in this format. However, for the NUH ovarian cancer and colorectal cancer datasets, an additional processing step must be done to obtain the Variant Classification attribute. We use the Consequence field returned by TransVar, to obtain this attribute. The Consequence field indicates if the input point mutation is a missense, synonymous, nonsense, frameshift or splice site mutation. GPD also needs the Ensembl Gene ID, which we generate from Entrez Gene ID using myGene python package.44 GPD returns a category (from PIU, LU and NCU) for each point mutation based on its location.

For each point mutation, Annovar returns 17 categorical scores, one score for each of the algorithms listed in Table S6. We convert these 17 categorical values into a single score, called d-score, as follows: if x of the 17 algorithms flag a point mutation as deleterious, we set the d-score of the point mutation to be x/17. Thus after Annovar annotation, each point mutation has a d-score, whose value lies between 0 and 1. ClinVar annotation provides a clinical significance for each point mutation. Each point mutation is annotated as pathogenic, benign or as a variant of unknown significance(VUS). The mapping from ClinVar generated annotation to these 3 categories is available in Table S5.

Gene-level features

After obtaining annotations from ClinVar, Annovar and GPD, we aggregate the d-scores of the point mutations at a gene level. As shown in Figure 3A, a gene can have multiple point mutations, and each point mutation can belong to one of 3 ClinVar categories and one of 3 GPD categories. Each point mutation is also associated with a d-score. To aggregate these at a gene level, we obtain the mean, max, sum and count of point mutations present in each gene, in each of the ClinVar and GPD categories. For example, from Figure 3A, we see that the 4 point mutations in the ABL1 gene are distributed across PIU and NCU GPD categories and VUS ClinVar category. We can aggregate across all ABL1 mutations in the PIU and NCU categories, as well as all ABL1 mutations in the VUS category separately. Each gene now has 6 subcategories (GPD - PIU, LU, NCU and ClinVar - pathogenic, benign,VUS), each with 4 statistics (mean, max, sum, count), resulting in 24 features per gene per patient sample. If a gene has no mutations, it is represented as a 24-dimensional zero vector.

Stage II: Unsupervised domain-invariant representation learning

In this stage, we use Variational Autoencoders (VAE), which are unsupervised generative neural models,45 to obtain domain-invariant low-dimensional representations of the input data. One VAE is used for each domain (cell lines, patients) as shown in Figure 3B. The VAEs are trained to maximise the likelihood of the data assuming zero-inflated data distributions which allows us to model both count and real-valued data (through Zero-inflated Negative Binomial and Zero-inflated Normal distributions respectively) and with varying levels of noise and sparsity.30^,^46 The zero-inflated data distributions handle sparse nature of data. To ensure that a shared embedding space is learnt across the two domains, an alignment loss is introduced during training of the VAEs. This is achieved through the CORAL loss.47 Note that this stage does not need any drug response data from either domains and is done in an unsupervised manner. Doing so allows effective use of any patient or cell line data that has just genomic data, unlike previous approaches such as AITL, TCRP, PACE. We now formally describe the details.

Let $[eqn]$ and $[eqn]$ denote the input data associated with cell lines and patients respectively, where $[eqn]$ denotes the number of cell lines, $[eqn]$ denotes number of patients and $[eqn]$ denotes number of input features. In the VAEs used to obtain representations $[eqn]$ and $[eqn]$ for cell lines and patients respectively, the probabilistic encoders $[eqn]$ and $[eqn]$ - which learn distributions $[eqn]$ and $[eqn]$ respectively - infer the mean $[eqn]$ , $[eqn]$ and standard deviation $[eqn]$ and $[eqn]$ of the normal distributions of the latent variables. Thus, each input vector is mapped to a distribution and we use the inferred mean vectors as the latent representations $[eqn]$ and $[eqn]$ for downstream tasks. The probabilistic decoders $[eqn]$ and $[eqn]$ which learn distributions $[eqn]$ and $[eqn]$ respectively use zero-inflated distributions to model varying levels of sparsity in the reconstructed data. Let $[eqn]$ and $[eqn]$ denote the reconstructed outputs from the decoders. These are used to learn the parameters of the zero-inflated distribution ( $[eqn]$ ), using linear layers followed by relevant activation functions. In the equations below, can be or to denote cell lines or patients.

[eqn]

Each VAE is trained by minimising the negative log-likelihood of the data distribution and a KL divergence term (that acts like a regularizer). Thus, we have the combined loss term $[eqn]$ where $[eqn]$ , $[eqn]$ .

More details of the VAE construction and loss functions can be found in.30

The VAE architecture, as described above, can obtain low-dimensional dense representations of cell lines and patients. However, it does not ensure domain invariance in the representations. To achieve domain invariance, we use the CORAL loss47(Sun 2016) to minimise the difference in covariance of the domain-specific representations. If $[eqn]$ and $[eqn]$ are the covariance matrices of cell lines and patients representations $[eqn]$ and $[eqn]$ , the CORAL loss is defined as $[eqn]$ , where $[eqn]$ . Here $[eqn]$ is the number of samples in the batch, $[eqn]$ is the size of the dimensional space $[eqn]$ and $[eqn]$ is the mean of the representations, across samples in the batch. Thus the overall loss function optimised in the pretraining stage is $[eqn]$ . The hyperparameters used in this stage are listed in Table S20.

Stage III: Multi-task drug response prediction

Drug response in cell lines and patients are known to be different due to biological and environmental differences. Further, measurement of responses are different – real-valued AUDRC for cell lines and categorical RECIST scores for patients, which correspond to regression and classification tasks. To build a model that can learn from both domains and, at the same time, predict for each task separately we use multi-task learning (MTL), a well established paradigm for jointly learning models for multiple correlated tasks.

There are three inputs to the MTL model – a cell line representation, a patient representation and a drug representation. The first two are obtained from the encoders of the VAEs from stage II. To obtain a drug representation, we build another feedforward neural network, called the drug embedding network, which takes as input the drug’s binary Morgan fingerprint.48 Corresponding to each task, AUDRC prediction and RECIST prediction, we have a separate feedforward neural network. The output of the drug embedding network is concatenated separately with the cell line and patient representations from the respective encoders and passed through one of the two task-specific networks. The concatenated drug - cell line representation is passed through the AUDRC prediction network while the concatenated drug - patient representation is passed through the RECIST prediction network. Let $[eqn]$ be the input features associated with the drug and $[eqn]$ be the drug embedding from the drug network $[eqn]$ . The input to the AUDRC predictor network $[eqn]$ is $[eqn]$ where $[eqn]$ denotes concatenation. Likewise, the input to the RECIST predictor network $[eqn]$ is $[eqn]$ . It is possible to `attach’ the VAE encoders from stage I, to further train them with the rest of the network in a supervised manner; in most cases, this is found to improve performance.

The entire network (Figure 3C) is trained using two objective functions – the MSE loss for AUDRC regression and the BCE logit loss for RECIST classification. Let $[eqn]$ and $[eqn]$ denote the ground truth labels and $[eqn]$ and $[eqn]$ , the BCE logit loss and MSE loss are calculated as follows

[eqn]

[eqn]

A common approach to train MTL models is by minimising the weighted sum of the losses for each task, where the weights specify relative priorities among the tasks; this is known as the linear scalarization approach. However, when the tasks are conflicting, it may not be possible to optimise all the objectives simultaneously and trade-offs between tasks may be required. In such cases, Pareto optimal solutions, obtained through multi-objective optimization, are natural choices where each optimal solution is non-dominated, i.e., no objective value can be improved further without degrading some other objectives. The efficacy of multi-objective optimization for MTL has been demonstrated in, e.g.,.49^,^50

There can be multiple (possibly infinite) Pareto optimal solutions, represented by the Pareto front, each solution with a different trade-off between the conflicting objectives. For non-convex objective functions common in machine learning, linear scalarization cannot guarantee reaching every possible solution on the Pareto front.51^,^52^,^53 This can be guaranteed through the use of Chebyshev scalarization54 that we utilise for training our MTL network: $[eqn]$ where $[eqn]$ and $[eqn]$ are hyperparameters denoting the weights assigned to each loss term. Details of hyperparameters are listed in Table S20.

Inference

The trained network can be used to predict drug response for a given mutation profile and drug. Note that only the drug embedding network, patient-specific VAE encoder and RECIST predictor networks are required during inference as shown in Figure 3D.

Comparison of DruID with previous modelling approaches

Background on other machine learning approaches

Prior literature on drug response prediction has largely focused on cancer cell lines, such as Uno,8 GraphCDR,9 SubCDR,10 CDRScan11; we refer the reader to previous surveys5^,^7^,^12 for details. The availability of transcriptomic data in the form of gene expression, mutations, copy number variations in cell lines has resulted in a wide variety of machine learning models for drug response prediction. These methods range from linear regression and ensemble models to graph neural networks. However, DRP models trained on cell lines alone often translate poorly to patients.13^,^14^,^55 This is partly due to inherent biological differences, meaning cell lines do not accurately represent patient tumours. Cell lines are essentially a subpopulation of the primary tumour and do not exhibit heterogeneity seen in vivo. The absence of the tumour microenvironment and interactions with the host of stromal cells present in patients is also key.13^,^16 In addition, technical differences in response measurement in cell lines versus in patients, and differences in drug dosing between cell lines and patients will affect interpretation of results by a DRP model.

While omics data is increasingly available for many cancer patients,17^,^18 drug response data for these patients remains scarce and limited to standard of care therapies only. To address such challenges, transfer learning approaches including domain adaptation have been developed to train DRP models from both cell lines and patients.2^,^19^,^55 Transfer learning approaches are useful when there are limited samples available in the domain of interest (target domain), but a related domain (source domain) has a large number of labelled samples.56 have broadly grouped various transfer learning approaches based on whether the source and target domains are labelled or not. When both source and target domains are unlabeled, it is called unsupervised transfer learning. Methods which learn a shared representation space as part of pretraining, like CODE-AE,2 fall into this category. If the source domain has labelled samples but the target domain is unlabeled, it is called transductive transfer learning. Methods like TUGDA, PACE, Velodrome23^,^55^,^57 fall into this category. If both domains have labelled samples, it is called inductive transfer learning. Methods like AITL,58^,^59 TCRP,19 molecular pathway based model60 use this approach where they utilise the limited number of labelled target domain samples as well.

In all of these methods, they focus on one of two aspects that differentiate patients from cell lines - (1) distributional differences in omic profiles owing to differences in biological environment - termed “input space discrepancy”and (2) differences in the way drug response is measured - termed “output space discrepancy”.58 Input space discrepancy is handled by finding a shared embedding space that is common to both cell lines and patients. To deal with the output space discrepancy, one way is to have separate drug response prediction networks for cell lines and patients. Doing so allows both networks to learn nuances specific to each domain, in the output space. Except AITL and Velodrome, all other methods address only one of these two discrepancies. Models like,59^,^60 CODE-AE,2 PACE,57 PRECISE,13 TRANSACT,14 TUGDA23 and TCRP19 handle input space discrepancy but the same drug response prediction network is used by both the domains. The discrepancy in the output space is handled either by discretizing the cell line response based on empirically determined thresholds or by evaluating the predictions using correlation metrics. Models like Velodrome and AITL use a shared space for the domains to address input space discrepancy and use two separate prediction networks to handle discrepancies in the output space. In both cases, they train one model for each drug. When networks are trained in this manner, it becomes difficult to predict the response for a new/unseen drug since there is no trained network to perform inference with. This proves to be a challenge in the problem of drug repurposing.

State-of-the-art transfer learning methods, which evaluated their models on patient data, have largely restricted their analysis to gene expression data.23^,^55 For example, CODE-AE2 used a set of 1426 genes (selected based on percentage of unique gene expression values) and Velodrome55 used a set of 2128 genes (selected based on network propagation over a protein-protein interaction network). The genes selected in these methods are not captured based on their presence in cNGS panels; nor are the number of chosen genes comparable across cNGS and these methods. Moreover, unlabeled patient samples remain unused resulting in inefficient use of available data. Requiring transcriptomic input data represents a challenge in bringing these methods to mainstream patient care and it remains unknown if such tools can accurately predict response from the limited number of recurrently altered cancer genes that are included in cNGS panels such as FoundationOne CDx (324 genes), Tempus xF+ (523 genes), and TruSight Oncology 500 (523 genes). To the best of our knowledge, no prior transfer learning methods have been evaluated on such a restricted subset of genes. Moreover, methods which have used mutations as inputs, have not considered the variant level information captured in cNGS reports; instead they treat all alterations as equal, resulting in loss of granularity and potential reduction in predictive accuracy.

The architecture proposed above for DruID allows us to utilise unlabeled patient data, while simultaneously modelling both the distributional differences in omics inputs and differences in drug response measured across patients and cell lines. DruID also handles sparse nature of mutation data and enables the use of variant level information associated with these mutations. While some of these can be handled by extant DRP methods, DruID can perform all of these simultaneously. Table S1 shows a comparison of DruID against extant DRP methods.

DruID design choices

To address the shortcomings of prior DRP methods and the requirements listed in Table S1, we developed DruID. DruID utilises variant level information and unlabelled patient data, while modelling differences in drug responses and omics inputs across cell lines and patients. DruID was evaluated on various omics types, including mutations from cNGS panels. Unlike earlier methods, DruID can also be used to predict patient responses to drugs not seen during model training. Table S2 lists the design choices made in developing DruID, which addresses each of the aspects highlighted above.

Rationale for better performance of DruID

Although it is difficult to pinpoint the exact reasons for better performance of DruID, our experiments suggest a few potential reasons. One reason may be the inclusion of drug specific information in DruID, which has not been done in many other methods. This introduces sharing of information across multiple drugs as well. Another reason may be the ability to model the differences in response labels across patients and cell lines while also addressing the distribution differences across the mutation inputs in both domains. Use of variant annotation procedure to model granular information in sequencing reports and use of zero inflated loss terms, as evidenced by the ablation study (Figure S3) also allude to their contribution to the overall performance.

Quantification and statistical analysis

To evaluate the statistical significance of our experiments, we used ANOVA test while comparing the performance across three input sets of genes, and t-test for all other experiments (p-values and number of patient/cell line samples are reported in each of the experiment subsections, within STAR Methods).

Bibliography67

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Tsimberidou A.-M.Hong D.S.Wheler J.J.Falchook G.S.Janku F.Naing A.Fu S.Piha-Paul S.Cartwright C.Broaddus R.R.Long-term overall survival and prognostic score predicting survival: the IMPACT study in precision medicine J. Hematol. Oncol.12201914510.1186/s 13045-019-0835-131888672 PMC 6937824 · doi ↗ · pubmed ↗
2He D.Liu Q.Wu Y.Xie L.A context-aware deconfounding autoencoder for robust prediction of personalized clinical drug response from cell-line compound screening Nat. Mach. Intell.4202287989210.1038/s 42256-022-00541-038895093 PMC 11185412 · doi ↗ · pubmed ↗
3Partin A.Brettin T.S.Zhu Y.Narykov O.Clyde A.Overbeek J.Stevens R.L.Deep learning methods for drug response prediction in cancer: Predominant and emerging trends Front. Med.102023108609710.3389/fmed.2023.1086097 PMC 997516436873878 · doi ↗ · pubmed ↗
4Jia P.Hu R.Pei G.Dai Y.Wang Y.-Y.Zhao Z.Deep generative neural network for accurate drug response imputation Nat. Commun.122021174010.1038/s 41467-021-21997-533741950 PMC 7979803 · doi ↗ · pubmed ↗
5Adam G.Rampášek L.Safikhani Z.Smirnov P.Haibe-Kains B.Goldenberg A.Machine learning approaches to drug response prediction: challenges and recent progressnpj Precis. Oncol.420201910.1038/s 41698-020-0122-132566759 PMC 7296033 · doi ↗ · pubmed ↗
6Chen J.Zhang L.A survey and systematic assessment of computational methods for drug response prediction Briefings Bioinf.22202123224610.1093/bib/bbz 16431927568 · doi ↗ · pubmed ↗
7Firoozbakht F.Yousefi B.Schwikowski B.An overview of machine learning methods for monotherapy drug response prediction Briefings Bioinf.232022 bbab 40810.1093/bib/bbab 408PMC 876970534619752 · doi ↗ · pubmed ↗
8Xia F.Allen J.Balaprakash P.Brettin T.Garcia-Cardona C.Clyde A.Cohn J.Doroshow J.Duan X.Dubinkina V.A cross-study analysis of drug response prediction in cancer cell lines Briefings Bioinf.232022 bbab 35610.1093/bib/bbab 356PMC 876969734524425 · doi ↗ · pubmed ↗