Tutorial on survival modeling with applications to omics data

Zhi Zhao; John Zobolas; Manuela Zucknick; Tero Aittokallio

arXiv:2302.12542·stat.AP·March 5, 2024·Bioinform.

Tutorial on survival modeling with applications to omics data

Zhi Zhao, John Zobolas, Manuela Zucknick, Tero Aittokallio

PDF

Open Access

TL;DR

This paper provides a comprehensive workflow and tutorial for applying survival analysis techniques, especially Cox-type penalized regressions and Bayesian models, to high-dimensional omics data for identifying prognostic markers.

Contribution

It introduces a general, practical workflow and R tutorial for survival analysis with high-dimensional omics data, focusing on feature selection and model validation.

Findings

01

Effective application of Cox-type penalized regressions for high-dimensional data

02

Hierarchical Bayesian models improve feature selection in survival analysis

03

Tutorial implementation using TCGA omics data demonstrates practical utility

Abstract

Motivation: Identification of genomic, molecular and clinical markers prognostic of patient survival is important for developing personalized disease prevention, diagnostic and treatment approaches. Modern omics technologies have made it possible to investigate the prognostic impact of markers at multiple molecular levels, including genomics, epigenomics, transcriptomics, proteomics and metabolomics, and how these potential risk factors complement clinical characterization of patient outcomes for survival prognosis. However, the massive sizes of the omics data sets, along with their correlation structures, pose challenges for studying relationships between the molecular information and patients' survival outcomes. Results: We present a general workflow for survival analysis that is applicable to high-dimensional omics data as inputs when identifying survival-associated features and…

Tables3

Table 1. Table 1: An example of right-censored time-to-event (survival) outcomes, illustrated in an example where the outcome of interest is the time from disease diagnosis to death. If a patient was alive at the last observed time point, the measurement is censored at that time point.

\toprulePatient	Observed time	Status at	Factual time to death,
ID	(years)	observed time	possibly unobserved (years)
\midrule1001	11	censored	20
1002	4	dead	4
1003	5	censored	12
1004	9	dead	9
1005	1	censored	11
\botrule

Table 2. Table 2: Examples of clinical and demographic variables

\topruleVariable	Data type
\midruleSex	binary
Body mass index (BMI, $k g / m^{2}$ )	continuous
Ethnicity	nominal
Age at first diagnosis in years	integer
Pathological stage	ordinal
Therapy type (e.g. chemo-, hormone, immuno-)	nominal
\botrule

Table 3. Table 4: Cox-type supervised learning methods

\toprule Method	Feature selection via	Grouping	Uncertainty	Comment
		effects	quantification
\midrulePenalized regressions:	penalty
Lasso Cox [Tibshirani, (1997)]	$ℓ_{1}$ -norm	✗	✗
Adaptive Lasso Cox [Zhang and Lu, (2007)]	weighted $ℓ_{1}$ -norm	✗	✗	less false positives than Lasso
Elastic Net Cox [Simon et al., (2011)]	$ℓ_{1} / ℓ_{2}$ -norm	✓	✗
Group-Lasso Cox [Kim et al., (2012)]	$ℓ_{2}$ -norm	✓	✗	independent groups of features selected
Sparse Group-Lasso Cox [Simon et al., (2013)]	$ℓ_{1} / ℓ_{2}$ -norm	✓	✗
SCAD Cox [Fan and Li, (2002)]	quadratic spline and symmetric penalty	✓	✗	selection of relatively large effects
SIS Cox [Fan et al., (2010)]	top ranked variables and any penalty	✓	✗	suited to ultra-high dimensions
Bayesian models:	shrinkage prior
[Lee et al., (2011)]	Lasso (Laplace) prior	✗	✓	selection of posterior mean with a cutoff
[Lee et al., (2015)]	Elastic Net, group/fused Lasso priors	✓	✓	selection of posterior mean with a cutoff
[Konrath et al., (2013)]	Spike-and-slab prior	✗	✓
[Madjar et al., (2021)]	Spike-and-slab & MRF priors	✓	✓
[Mu et al., (2021)]	Horseshoe prior	✗	✓	selection of posterior mean with a cutoff
\botrule

Equations35

S (t) = \mathbbm P {T^{*} > t},

S (t) = \mathbbm P {T^{*} > t},

h (t) = Δ t \to 0 lim \frac{\mathbbm P { t \leq T ^{*} < t + Δ t ∣ T ^{*} \geq t }}{Δ t},

h (t) = Δ t \to 0 lim \frac{\mathbbm P { t \leq T ^{*} < t + Δ t ∣ T ^{*} \geq t }}{Δ t},

\hat{S} (t) = k = 1 T_{k} \leq t \prod K {1 - \frac{d _{k}}{Y ( T _{k} )}},

\hat{S} (t) = k = 1 T_{k} \leq t \prod K {1 - \frac{d _{k}}{Y ( T _{k} )}},

h (t ∣ X) = h_{0} (t) exp {X β},

h (t ∣ X) = h_{0} (t) exp {X β},

\frac{2}{n} ℓ (β ∣ D) - λ ∥ β ∥_{ℓ_{1}}

\frac{2}{n} ℓ (β ∣ D) - λ ∥ β ∥_{ℓ_{1}}

ℓ (β ∣ D) = lo g i = 1 \prod n {\frac{exp ( X _{i} β )}{\sum _{l \in R_{k}} exp ( X _{l} β )}}^{δ_{i}},

ℓ (β ∣ D) = lo g i = 1 \prod n {\frac{exp ( X _{i} β )}{\sum _{l \in R_{k}} exp ( X _{l} β )}}^{δ_{i}},

\frac{2}{n} lo g i = 1 \prod n {\frac{exp ( X _{0 i} β _{0} + X _{i} β )}{\sum _{l \in R_{k}} exp ( X _{0 l} β _{0} + X _{l} β )}}^{δ_{i}} - pen (β),

\frac{2}{n} lo g i = 1 \prod n {\frac{exp ( X _{0 i} β _{0} + X _{i} β )}{\sum _{l \in R_{k}} exp ( X _{0 l} β _{0} + X _{l} β )}}^{δ_{i}} - pen (β),

D E (β ∣ λ) = j = 1 \prod p \frac{λ}{2} exp {- λ ∣ β_{j} ∣}

D E (β ∣ λ) = j = 1 \prod p \frac{λ}{2} exp {- λ ∣ β_{j} ∣}

lo g f (β ∣ D, λ) = ℓ^{*} (β ∣ D) - λ ∥ β ∥_{ℓ_{1}} + C,

lo g f (β ∣ D, λ) = ℓ^{*} (β ∣ D) - λ ∥ β ∥_{ℓ_{1}} + C,

β_{j} ∣ γ_{j}, τ_{j}^{2}

β_{j} ∣ γ_{j}, τ_{j}^{2}

γ_{j}

\beta_{j}|\gamma_{j},s_{0},s_{1}\sim\gamma_{j}\mathcal{DE}\left(\beta_{j}\Big{|}\frac{1}{s_{0}}\right)+(1-\gamma_{j})\mathcal{DE}\left(\beta_{j}\Big{|}\frac{1}{s_{1}}\right).

\beta_{j}|\gamma_{j},s_{0},s_{1}\sim\gamma_{j}\mathcal{DE}\left(\beta_{j}\Big{|}\frac{1}{s_{0}}\right)+(1-\gamma_{j})\mathcal{DE}\left(\beta_{j}\Big{|}\frac{1}{s_{1}}\right).

β_{γ} ∣ γ \sim N (0, g I_{β_{γ}, β_{γ}}^{- 1}),

β_{γ} ∣ γ \sim N (0, g I_{β_{γ}, β_{γ}}^{- 1}),

β_{j} ∣ λ^{2}, τ_{j}^{2}

β_{j} ∣ λ^{2}, τ_{j}^{2}

λ_{j}

τ

C = P {S (t ∣ X_{i} (t)) < S (t ∣ X_{j} (t)) ∣ T_{i} < T_{j} & δ_{i} = 0},

C = P {S (t ∣ X_{i} (t)) < S (t ∣ X_{j} (t)) ∣ T_{i} < T_{j} & δ_{i} = 0},

ln (- ln (S_{KM} (t ∣ g))) = α + β ln (- ln (S_{model} (t ∣ g))) + ϵ,

ln (- ln (S_{KM} (t ∣ g))) = α + β ln (- ln (S_{model} (t ∣ g))) + ϵ,

BS (t) = \frac{1}{n} i = 1 \sum n {\frac{S ^ ( t ∣ t _{i} ) ^{2} \mathbbm 1 { t _{i} \leq t & δ _{i} = 1 }}{G ^ ( t _{i} )} + \frac{{ 1 - S ^ ( t ∣ t _{i} ) } ^{2} \mathbbm 1 { t _{i} > t }}{G ^ ( t )}},

BS (t) = \frac{1}{n} i = 1 \sum n {\frac{S ^ ( t ∣ t _{i} ) ^{2} \mathbbm 1 { t _{i} \leq t & δ _{i} = 1 }}{G ^ ( t _{i} )} + \frac{{ 1 - S ^ ( t ∣ t _{i} ) } ^{2} \mathbbm 1 { t _{i} > t }}{G ^ ( t )}},

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsHealth, Environment, Cognitive Aging · Metabolomics and Mass Spectrometry Studies

Full text

\journaltitle

Journal title\DOIhttps://doi.org/10.xxxx/xxx

\appnotesReview

\authormark

Zhao et al.

\corresp

[ $\ast$ ]Corresponding author. [email protected]

†These are joint last authors

Tutorial: guidelines for survival analysis with omics data

Zhi Zhao∗1,2

John Zobolas1,2

Manuela Zucknick†1

Tero Aittokallio†1,2,3

${}^{\text{\sf 1}}$ Oslo Centre for Biostatistics and Epidemiology (OCBE), Department of Biostatistics, Faculty of Medicine, University of Oslo, 0372 Oslo, Norway

${}^{\text{\sf 2}}$ Department of Cancer Genetics, Institute for Cancer Research, Oslo University Hospital, 0310 Oslo, Norway

${}^{\text{\sf 3}}$ Institute for Molecular Medicine Finland (FIMM), HiLIFE, University of Helsinki, 00014, Finland

(2023; 2023)

Abstract

The identification of genomic, molecular and clinical markers predictive of patient survival is important for developing personalized disease prevention, diagnostic and treatment approaches. Modern omics technologies have made it possible to investigate the prognostic impact of markers at multiple molecular levels, including genomics, epigenomics (e.g. DNA methylation), transcriptomics, proteomics and metabolomics, and how these potential risk factors complement clinical characterization of patients for survival prognosis. However, the massive sizes of the omics data sets pose challenges for studying relationships between the molecular information and patients’ survival outcomes. We present a general workflow for survival analysis, with emphasis on dealing with high-dimensional omics data as inputs when identifying survival-associated omics features and validating survival models. In particular, we focus on commonly used Cox-type penalized regressions and hierarchical Bayesian models for feature selection in survival analysis, but the framework and pipeline are applicable more generally. In cases where multi-omics data are available for survival modelling, an extra caution is needed to account for the underlying structure both within and between the omics data sets and features. A step-by-step R tutorial using The Cancer Genome Atlas survival and omics data for the execution and evaluation of survival models has been made available at https://ocbe-uio.github.io/survomics/survomics.html.

keywords:

Time-to-event data, omics, penalized regressions, sparse Bayesian models, feature selection, survival prediction, model validation

1 Introduction

Personalized medicine improves diagnosis and treatment by making use of patient-specific genomic and molecular markers that are indicative of disease development. Survival time, i.e. time to an event of interest (for example time to disease progression), is a widely-used end point and patient outcome for many diseases, and therefore it has become popular to identify genomic, molecular and clinical markers for survival prediction of patients suffering from complex diseases such as cancer. In this tutorial, we will only consider so-called right-censored survival data, where a patient has been followed for a certain time period and the event of interest is either observed in this time period or might occur at a later (as yet unobserved) time point. Right-censored survival data include both a time and the status of each patient at that time as joint outcomes, where time is a continuous variable and status is a binary variable indicating whether the event of interest has been observed up to the given time or not; in the latter case, we refer to this observation as censored at the observed time. See Table 1 for an example illustration, and more details of the survival data in Section “Data”. When using the status label as an outcome in an ordinary logistic regression, the regression coefficients will become increasingly uncertain and less reliable with increasing follow-up time (Green and Symons,, 1983). When using the observed time (or its transformation, e.g. logarithm of time) as an outcome in an ordinary linear regression, the presence of censored observations (i.e., patients still alive by the end of follow-up period) causes considerable difficulties for assessing the accuracy of predictions (Henderson,, 1995). For example, Table 1 shows an example where the proportion of patients surviving past 10 years is $1/5=20\%$ based on the observed data, but the (unobserved) factual proportion surviving past 10 years is $3/5=60\%$ .

The complexity of multiple types of genomic and molecular data poses further challenges for building effective statistical models to identify associated biomarkers that are prognostic for patient survival. For example, multi-omics profiles, such as those from mRNA expression, DNA copy number, single-point and other genetic mutations, may be available from the same patient, and these high-dimensional data come with intra- and inter-dataset correlations, heterogeneous measurement scales, missing values, technical variability and other background noise (Hasin et al.,, 2021). Bøvelstad et al., (2007) and Bøvelstad et al., (2009) compared various machine learning methods and showed improved survival prediction performance by coefficient shrinkage methods that combine several data sources, in this case clinical and gene expression data. A recent large-scale benchmarking study for survival prediction using TCGA multi-omics data demonstrated a slightly improved survival prediction performance by taking into account the group structure of multi-omics data sets (Herrmann et al.,, 2021). However, these studies did not evaluate the survival model performance with respect to feature selection, i.e. selection of most prognostic omics variables/features. Another recent study by Bommert et al., (2022) performed a benchmark of filter methods for feature selection in high-dimensional gene expression data for survival prediction, and recommended using the simple variance filter before fitting a $\ell_{2}$ -regularized Cox regression model (Simon et al.,, 2011) for accurate survival predictions and stable feature selection. Vinga, (2021) reviewed structured penalized regressions for analyzing high-dimensional omics data for survival prediction or evaluation. Multiple studies have shown that it is possible to further improve the prediction accuracy and feature selection by considering more complex structures, such as pathways, or by identifying significant features among functional relationships between the omics features (Wang et al.,, 2013; Chekouo et al.,, 2015; Kundu et al.,, 2018; Wang and Liu,, 2020; Madjar et al.,, 2021; Li et al.,, 2022).

In this tutorial, we describe a workflow for survival analysis with omics data, review the commonly used statistical methods for feature selection and survival prediction, and provide a step-by-step R tutorial using publicly available multi-omics data from The Cancer Genome Atlas (TCGA) project (http://cancergenome.nih.gov). The overall survival time, demographic and gene expression data from primary invasive breast cancer patients in TCGA (TCGA-BRCA) are retrieved from the Genomic Data Commons Data Portal data release 32.0-36.0. Compared to the previous reviews and benchmarks of survival models in bioinformatic applications, this tutorial provides a complete workflow ranging from data preparation to final calibrated models, with a particular focus on building survival models using high-dimensional (multi-)omics data as input, and as such covers both commonly used penalized regressions and Bayesian models for survival analysis with high-dimensional data.

2 Data

We use multi-omics data and overall survival outcomes from cancer patients as an example in this tutorial, but the methods are applicable also to other diseases with similar data types in personalized medicine applications. The ultimate goal of personalized medicine is to identify patient-specific risk factors to guide disease prevention, diagnostic and treatment strategies. The identification of potential risk factors for cancer patients often considers clinical, demographic, genomic and molecular information, and their associations with the patients’ time-to-event data (i.e. survival).

2.1 Time-to-event data

Time-to-event or survival data contain the event of interest (e.g., death is the event assumed in this section), together with the time from the beginning of a study period either to the occurrence of the event, end of the study, or patient loss to follow-up (i.e. right censoring; discussions of right, left and interval censoring can be found in Leung et al., (1997)). Figure 1c shows the survival or right-censored times since cancer diagnosis of patients. The exact survival time of a patient may be not observed due to censoring. Therefore, a patient has two outcome indices: censoring indicator $\delta$ (also called status) and observed time $\tilde{T}=\min\{T^{*},C\}$ , where $T^{*}$ is the exact survival time and $C$ is the censoring time. Indicator $\delta$ can also be denoted as $\delta=\mathbbm{1}\{T^{*}\leq C\}$ , where $\mathbbm{1}\{\cdot\}$ is an indicator function. To characterize the survival time of a patient, we can use survival function

[TABLE]

which gives the probability of the patient’s survival beyond time $t$ . Another useful quantity is the hazard function

[TABLE]

which is the instantaneous probability of the patient’s death at time $t$ conditional on the patient having survived up to that time point.

2.2 Clinical and demographic data

There are multiple sources of patient-level information that can be explored to identify risk factors for cancer patients. One can start with routinely collected and commonly used patient data, such as clinical and demographic variables. For example, an older male patient with a low body mass index (BMI) has a relatively high risk of gastric cancer (Nam et al.,, 2022). Table 2 illustrates selected clinical and demographic variables often available for cancer patients. Clinical and demographic variables are considered important sources of information for predicting survival and are often used to build reference models for omics-based prognostic models (Herrmann et al.,, 2021; Bommert et al.,, 2022).

2.3 Multi-omics data

Thanks to the rapid development of modern molecular biotechnology, large amounts of human genomic and molecular data have become available from many patient profiling projects. These projects often collect multiple levels of molecular information such as genomics data for DNA variation, transcriptomics data for mRNA expression, proteomics data for protein abundance and metabolomics data for metabolite processes, as illustrated in Figure 1a. Among the multiple omics levels, metabolomics is the closest to observable phenotypes, such as tumor growth and proliferation (Carins et al.,, 2011). To deeply understand the molecular biology of tumor development, integration of multi-omics data may deliver new insights into the circuits of molecular interactions that underlie the disease initiation and progression (Tarazona et al.,, 2021).

DNA-level omics data often include single nucleotide polymorphisms (SNP), DNA methylation, and somatic copy number variation, see illustration of these data types in Table 3. Each SNP feature can be coded as $\{0,1,2\}$ , according to the number of minor alleles at the given locus in each individual, i.e., AA, AG, GC or TT. DNA methylation reveals methyl groups added to the DNA molecule, which can be quantified as a $\beta$ -value $\beta=\frac{M}{M+U+a}\in[0,1]$ , where $M$ and $U$ are the fluorescence intensities of the methylated and unmethylated DNA at a CpG locus, and the offset $a$ is often set to 100 to stabilize the $\beta$ -value when $M$ and $U$ are small. Somatic copy number variation measures the number of repeated sections of the tumor genome. Table 3 shows the assumed distributions for the downstream statistical modelling. For example, DNA methylation $\beta$ -value is a proportion, often either close to 1 or 0, which can be characterized by a mixture of two beta distributions.

RNA-level omics data usually include messenger RNA (mRNA) expression and microRNA (miRNA) expression. Traditionally, DNA microarrays were used to measure the expression levels of DNA sequences called probes, which acted as a proxy for the amount of reads representing a genomic feature of interest. The reads of microarray expression level can be characterized by a negative binomial distribution directly, or a Gaussian distribution after log2 transformation (see Table 3). Nowadays, RNA sequencing (RNA-seq) has replaced DNA microarrays, since it allows for the sequencing of the whole transcriptome, while DNA microarrays only profile predefined transcripts or gene probes. miRNAs are small noncoding regulatory RNAs that play an important role in regulating gene expression and are highly evolutionary conserved in mammals (Bartel,, 2018).

Protein-level omics data usually originates from mass spectrometry (MS)-based proteomics profiling of a particular cell, tissue, organ, which can detect and measure abundance levels of entire or phosphorylated proteins. In contrast to global proteomics with MS, TCGA consortium has produced more targeted proteomics profiles, involving a set of 180-250 protein features using reverse-phase protein arrays (RPPA) (Akbani et al.,, 2014; Zhang et al.,, 2017). In contrast, The Clinical Proteomic Tumor Analysis Consortium (CPTAC) profiling has produced more than 10000 protein features using MS technology (Edwards et al.,, 2015). Protein expression data are often Gaussian distributed after log2 transformation. In global proteomics, there are often sample-specific missing values due to detection limits of protein quantification.

Metabolite-level omics data has similar statistical properties to proteomics data, and metabolomics profiling is usually done with MS-based technologies, enabling the detection and quantification of many thousands of metabolic features simultaneously. Nuclear magnetic resonance (NMR) spectroscopy is the other main analytical technology to profile metabolic processes. An important limitation of NMR spectroscopy is its relatively low sensitivity, which may lead to relatively few detected metabolites. Metabolic concentrations often follow logarithmic Gaussian distribution. Similar to large-scale proteomics, missing values due to detection limits should be treated differently than missing values due to measurement artefacts, which are more frequent in metabolite-level omics data (Sun and Xia,, 2023).

Single-cell sequencing is becoming increasingly more prevalent in many profiling studies. The modern single-cell multi-omics technologies can produce multiple levels of omics data derived from the same samples, such as transcriptomics and chromatin accessibility. The single-cell data types are similar to the illustration in Table 3, but each measurement originates from the level of individual cells. This article focuses mainly on survival analysis either with single bulk omics or multi-omics data, where each omics layer is treated independently. We briefly comment on single-cell data analysis in Section “Towards single-cell data analysis”.

2.4 Missing data

Missing values are often observed in many types of high-dimensional omics data due to various experimental reasons (Aittokallio,, 2009; Kong et al.,, 2022). For example, mRNA transcriptomics data from microarrays have 1%-10% missing values affecting up to 95% of genes due to corruption of image, scratches on the slides, poor hybridization, inadequate resolution, fabrication errors (de Brevern et al.,, 2004; Tuikkala et al.,, 2005); MS-based metabolomics data have 10%–20% missing values affecting up to 80% of variables due to lack of peak identification by chromatogram, limitation of computational detection, measurement error, and deconvolution errors (Hrydziuszko and Viant,, 2012; Sun and Xia,, 2023). The aforementioned technical reasons can lead to missing data that is either missing at random (MAR) or missing not at random (MNAR). When dealing with MNAR data, traditional imputation methods like multiple imputation may introduce bias. It is thus recommended to remove omics features which have large proportion (e.g. 50%) missingness over patients, and then apply imputation methods (e.g. $k$ -nearest neighbor imputation) for the rest of the features with missing values before doing any statistical analysis or modelling. Alternatively, imputation-free methods (e.g. mixture models) that can deal with missing values can be applied directly (Taylor et al.,, 2022). Single-cell RNA-seq (scRNA-seq) data has a vast number of zeros, so-called gene dropout events, leading highly scarce data matrices. Jiang et al., (2022) discussed the sources of biological and non-biological zeros in scRNA-seq data and the existing approaches for handling them.

3 Survival analysis with low-dimensional input data

Let us assume we have data $\mathcal{D}=\{(\tilde{T}_{i},\delta_{i},\mathbf{X}_{i}):i=1,\cdots,n\}$ for $n$ patients, where $\tilde{T}_{i}$ is the observed survival time, $\delta_{i}$ the censoring indicator and $\mathbf{X}_{i}$ contains $p$ covariates including clinical, demographic and multi-omics features. To estimate a survival function $S(t)$ given the data $\mathcal{D}$ , one needs to keep track both of the number of patients at risk and those who left the study at any time point (here we only consider the case of right censoring and assume no delayed entry). At time $t$ there are $Y(t)=\sum_{i=1}^{n}Y_{i}(t)$ patients at risk, where $Y_{i}(t)=\mathbbm{1}\{\tilde{T}_{i}\geq t\}$ is the indicator that patient $i$ is at risk. The non-parametric Kaplan-Meier (KM) estimator (Kaplan and Meier,, 1958) uses the multiplication rule for conditional probabilities to obtain an estimation of the survival function

[TABLE]

where all events happen (e.g. patients die) at $K$ distinct times, $T_{1}<T_{2}<\cdots<T_{K}$ $(K\leq n)$ and there are $d_{k}\geq 1$ events happened at the time $T_{k}$ . If no two events happen at the same time point, $d_{k}=1$ and $k=1,\cdots,n$ . The KM estimator gives an estimate of the marginal survival function, i.e., when you disregard the information from the covariates.

Figure 2a shows the KM curve for TCGA-BRCA primary breast cancer patients data. Some basic statistics can be revealed from the survival curve. For example, the estimated median survival time, i.e. the time when the survival probability is 50%, of all the breast cancer patients is 10.8 years (dashed line in Figure 2a), and 1-, 5- and 10-year survival probabilities are 0.988, 0.853 and 0.658, respectively. A log-rank test (Peto and Peto,, 1972) can be used to test whether two groups of patients (e.g. with treatment (pharmaceutical/radiation therapy) or nontreatment) have the same (null hypothesis) or different survival functions (alternative hypothesis), and provide a corresponding $p$ -value (see Figure 2b). The log-rank test can also be used to compare the survival probabilities of any subgroups of patients based on other categorical variables.

In the case where multiple clinical, demographic or omics features are available, one can first explore each variable’s association with survival outcomes separately. For a categorical variable, KM curves and log-rank tests can be used to investigate whether there is a difference between multiple survival curves categorized by the variable. For a continuous variable $X$ , the semi-parametric Cox proportional hazards model (Cox model, Cox, (1972)) is often used:

[TABLE]

where $h_{0}(t)$ is the baseline hazard function and is left unspecified. As the name “proportional hazards” implies, the Cox-model estimated hazard functions of two individuals (with different values of the covariate $X$ ) are indeed proportional, because $h_{0}(t)$ does not depend on $X$ and is thus assumed to be the same for all individuals. The functional form (1) describes the log-linear relationship between variable $X$ and the hazard $h(t|X)$ at any given timepoint $t$ . It may be difficult to satisfy the log-linear relationship based on the original scale of some omics features, e.g., gene expression data from a DNA microarray study; in those cases, the use of $\log_{2}$ -transformation of the data (Table 3) can be helpful. The Cox model can also be used to investigate the risk of a categorical variable if the assumptions above are satisfied, which provides an effect estimate (hazard ratio, HR) in addition to the $p$ -value.

To reduce the computational burden of multivariable analysis (see next Section), a heuristic is to consider preselection of a subset of omics features. A simply approach is to include omics features at a specific statistical significance level when fitting a univariate Cox model (1) for each omics feature, which would usually be higher than the commonly used 0.05 threshold to avoid losing important features, e.g. 0.1 or 0.2. However, this approach will focus on features which are independently associated with the outcome and might miss variables that are important in combination. Another quite simple approach is to preselect omics features by variance, since larger variability usually implies higher biological information. For example, one could preselect omics features explaining 50% of the total cumulative variances (Zhao and Zucknick,, 2020). Feature preselection is a recommended method to reduce dimensionality of hundreds of thousands of omics features and improve the stability of final feature selection (Bommert et al.,, 2022).

When drawing final conclusions on survival differences solely from the univariate Cox model (1) via $p$ -values, it is important to adjust the $p$ -values for multiple comparisons to be able to control false positives globally, e.g. by controlling the family-wise error rate or false discovery rates. The problem of multiple testing is beyond the scope of this article, but the interested readers are referred to a recent review article Korthauer et al., (2019). Note also that univariate analysis does not consider any relationships between multiple omics features, e.g. confounders for survival (Clark et al.,, 2003). To avoid making seriously misleading conclusions in such cases, it is necessary to perform multivariable survival analysis (Bradburn et al.,, 2003).

4 Survival analysis with high-dimensional input data

Multi-omics data can have hundreds of thousands of variables measured at various molecular levels, which challenges classical multivariable regression models for time-to-event endpoints, since the number of variables is larger than the number of patients (i.e. $p\gg n$ ). To proceed with survival analysis, one option is to reduce the dimension of the multi-omics features via unsupervised learning, and then investigate the association of the reduced low-dimensional variables with survival outcomes. An alternative approach is to directly use supervised learning methods, such as penalized regressions or sparse Bayesian models (Table 4), which enable the modelling of high-dimensional multi-omics features, and the selection of a few important features associated with survival outcomes.

4.1 Unsupervised learning

Unsupervised learning methods aim to identify hidden patterns or data groupings, and are for example useful when a phenotype (e.g. a cancer type) is to be divided into several subtypes (e.g. to explain heterogeneity among patients). For example, breast cancer has been traditionally categorized into five conceptual molecular classes, originally using pairwise average-linkage cluster analysis of DNA microarray data, to better understand tumor biology and guide clinical decision making (Perou et al.,, 2000; Sørlie et al.,, 2001). Unsupervised methods learn underlying patterns from unlabeled data by transforming high-dimensional omics features into a lower dimensional space. Principal component analysis (PCA) is a classical multivariate technique that represents high-dimensional features in a low-dimensional space by building orthogonal (uncorrelated) linear combinations of the features that best capture the variance in the high-dimensional data (Pearson,, 1901). Figure 3a shows the first two principal components of PCA using the RNA-Seq data of breast cancer patients from TCGA.

Different from the distance-based PCA with linear transformation, non-linear techniques have recently emerged, such as $t$ -stochastic neighbour embedding ( $t$ -SNE). $t$ -SNE uses pairwise similarities of individuals based on Kullback-Leibler divergence to project similarities into a lower dimensional space, ensuring that individuals with similar omics features are close in the generated embedding (van der Maaten and Hinton,, 2008; van der Maaten,, 2014), see Figure 3b. The focus of $t$ -SNE is on preserving local distances between neighbouring data points. Another non-linear alternative to PCA is UMAP (Uniform Manifold Approximation and Projection), which is a general dimension reduction method built upon Riemannian geometry (McInnes et al.,, 2018). Compared to other non-linear methods of dimension reduction such as $t$ -SNE, UMAP can sometimes provide better visualization quality in a shorter amount of time, while preserving the global structure of the omics data better, see Figure 3c.

Unsupervised methods only make use of the input data matrix $\mathbf{X}$ and are agnostic to the survival information. To identify omics features as risk factors associated with survival outcomes, a straightforward approach is to use a few representative components from PCA, $t$ -SNE or UMAP as covariates in a multivariable Cox model. An alternative is to use semi-supervised methods (Bair and Tibshirani,, 2004) that combine the clustering procedure and survival modelling together. However, these approaches lose interpretability of the individual omics features, since each component is a linear or nonlinear combination of all omics features.

4.2 Supervised learning via penalized regressions

For the purpose of personalized cancer medicine, one is typically interested in identifying risk factor combinations from clinical and omics features. These factors can be targeted (directly or indirectly) via therapeutic strategies or used for diagnostics. Therefore, the objective is to identify a parsimonious set of features linked to survival outcomes by utilizing the wealth of information present in, for example, the vast amount of available omics data. Penalized Lasso Cox regression (Tibshirani,, 1997) can be used to select a few relevant omics features by estimating their coefficients as non-zero (the non-relevant features’ coefficients are shrunk to zero) via maximizing the penalized partial log-likelihood function of the regression coefficients with $\ell_{1}$ -norm penalty

[TABLE]

Here, $2/n$ is a scaling factor for convenience, $\mathcal{D}=\{(\tilde{T}_{i},\delta_{i},\mathbf{X}_{i}):i=1,\cdots,n\}$ , $\mathbf{X}_{i}$ includes $p$ (omics) features of the $i$ -th patient, $\lambda$ is a tuning parameter to control the overall penalty strength of the coefficients, $\|\bm{\beta}\|_{\ell_{1}}=\sum_{j=1}^{p}|\beta_{j}|$ , and the partial log-likelihood is

[TABLE]

where $\mathcal{R}_{k}=\{l:Y_{l}(T_{k})=1\}$ is the risk set at time $T_{k}$ . The Elastic Net Cox model (Simon et al.,, 2011) considers both the Lasso feature selection and the grouping effect of correlated omics features in ridge regression via a combination of the $\ell_{1}$ - and $\ell_{2}$ -norm (i.e. $\ell_{1}/\ell_{2}$ -norm) penalty $\lambda\left\{\alpha\|\bm{\beta}\|_{\ell_{1}}+\frac{1}{2}(1-\alpha)\|\bm{\beta}\|_{\ell_{2}}^{2}\right\}$ (where $\|\bm{\beta}\|_{\ell_{2}}^{2}=\sum_{j=1}^{p}|\beta_{j}|^{2}$ ) , which can usually improve the prediction performance over the Lasso Cox model. Figure 4 shows an example of Elastic Net Cox model feature selection from gene expression features associated with breast cancer patients’ survival. Note that often we wish to include a small set of well-established clinical risk factors in a survival model. Since they are established as important covariates, they can be included in the Lasso or Elastic Net Cox model as mandatory covariates without penalization. Then the penalized partial log-likelihood function becomes

[TABLE]

where $\bm{\beta}_{0}$ are coefficients corresponding to the $i$ -th individual’s mandatory covariates $\mathbf{X}_{0i}$ , pen( $\bm{\beta}$ ) is a $\ell_{1}$ - or $\ell_{1}/\ell_{2}$ -norm penalty for feature selection of omics features $\mathbf{X}_{i}$ .

There are many alternative penalties that will achieve feature selection which have been applied to Cox proportional hazards regression for survival outcomes, such as the Adaptive Lasso Cox model that incorporates different penalties for different coefficients to retain important variables (Zhang and Lu,, 2007), Group-Lasso that performs group selection on (predefined) groups of variables (Kim et al.,, 2012), Sparse Group-Lasso that introduces sparse effects both on a group and within group level (Simon et al.,, 2013), the smoothly clipped absolute deviation (SCAD) Cox model that overcomes substantially biased estimates for large coefficients in ultra-sparse models (Fan and Li,, 2002), sure independence screening (SIS) procedure in combination with Cox model that speeds-up the feature selection dramatically and can also improve the accuracy of estimation when dimensionality becomes ultra-high, i.e., $\log(p)=\mathcal{O}(n^{\xi})$ for some $\xi>0$ (Fan et al.,, 2010). However, all these penalized Cox models do not directly provide uncertainty of feature selection or survival prediction. One empirical way for uncertainty quantification is through additional resampling-based methods, see Section “Survival model validation” for more details.

4.3 Supervised learning via Bayesian priors

Bayesian inference is an appealing approach for survival analysis due to its ability to provide straightforward uncertainty quantification (e.g. credible intervals) of parameters and survival probabilities. For instance, Lee et al., (2011) proposed a Bayesian version of the Lasso Cox (Bayesian Lasso Cox) model that provides credible intervals of coefficients fairly straightforward (see Figure 5), but it is not easy to derive confidence intervals of coefficients in a Lasso-type model. In general Bayesian regression models, one is interested in the posterior density $f(\bm{\beta}|\mathcal{D},\lambda)$ , which is proportional to the product of the prior density and the likelihood function. In the Bayesian Lasso Cox model, all the coefficients are assigned to the same double exponential (DE, also known as Laplace, Figure 6a) prior

[TABLE]

and a simple version of the derived log posterior density is

[TABLE]

where $\ell^{*}(\bm{\beta}|\mathcal{D})$ is the full log-likelihood function i.e. $\log(\int_{0}^{t}h_{0}(s)ds)+\ell(\bm{\beta}|\mathcal{D})$ , $C$ is a normalizing constant independent of $\bm{\beta}$ and the $\ell_{1}$ -norm penalty tends to choose only a few nonzero coefficients, which is similar to (2) in the frequentist Lasso Cox model. Markov chain Monte Carlo (MCMC) sampling can be performed for posterior inference of $\bm{\beta}$ . Note that instead of using the partial log-likelihood $\ell(\bm{\beta}|\mathcal{D})$ in (3), a full log-likelihood function is used, which includes the baseline hazard function. This can be achieved, for example, by assigning a stochastic process prior, e.g. a gamma process, for the cumulative baseline hazard function. More details about the prior setup and inference can be found in Lee et al., (2011). Lee et al., (2015) extended the Laplace prior to Elastic Net prior, fused Lasso prior and group Lasso prior, which are often more suitable for correlated omics features in survival analysis. But Lee et al., (2011) and Lee et al., (2015) assign the same shrinkage priors to all covariates indiscriminately, see Zucknick et al., (2015) for an extended Bayesian Lasso Cox model which permits the use of mandatory covariates.

Note that the Bayesian version of the Lasso with Laplace-type priors in practice do not result in automatic feature selection, because only the posterior modes of the coefficients are equivalent to the frequentist Lasso solution, while in Bayesian inference one usually focuses on the posterior means as point estimates. As an alternative, a particular omics feature can be excluded if the estimated credible interval of the corresponding coefficient covers zero. For example, Figure 5 indicates the selected variables as age, MELK, ORC6L, CDC20, KRT17, FOXC1, ESR1 and MMP11.

Stochastic search variable selection (SSVS) is an alternative approach to identify important covariates (George and McCulloch,, 1993; Konrath et al.,, 2013). SSVS uses independent spike-and-slab priors for regression coefficients, e.g.

[TABLE]

where $\gamma_{j}$ ( $j=1,\cdots,p$ ) is a latent variable (which can have a Bernoulli prior with a fixed probability $\pi$ ) for feature selection indicating $\beta_{j}\neq 0$ if $\gamma_{j}=1$ and $\beta_{j}=0$ if $\gamma_{j}=0$ , $\tau_{j}^{2}$ is an additional shrinkage parameter which can be assigned with an additional prior (e.g. exponential or inverse gamma prior), and $\delta_{0}(\cdot)$ is the Dirac delta function. Figure 6b shows the two components of the spike-and-slab mixture distribution. Tang et al., (2017) assigned a spike-and-slab mixture double-exponential (i.e. Laplace) prior for Cox model $\beta_{j}$ , i.e.

[TABLE]

A expectation maximization coordinate descent algorithm was implemented to optimize the joint posterior distribution of parameters, which provides the posterior modes of regression coefficients and corresponding frequentist confidence intervals (Tang et al.,, 2017). Recently, Madjar et al., (2021) proposed graph-structured feature selection priors for Cox model by assigning a Markov random field prior on the latent variables, in which the graph helps to identify pathways of functionally related genes or proteins that are simultaneously prognostic in different patient cohorts. Formulation (4) implies independence between the priors of the individual $\beta_{j}$ in the slab component. In contrast, a variant of the spike-and-slab prior has a $g$ -prior slab (Zellner,, 1986; Held et al.,, 2016)

[TABLE]

where $\bm{\beta}_{\bm{\gamma}}=\{\beta_{j}:\gamma_{j}=1,j=1,\cdots,p\}$ , $\bm{\gamma}=\{\gamma_{j}:j=1,\cdots,p\}$ , $g$ is either a scalar estimated by Empirical Bayes or assigned with additional prior, and $\mathcal{I}_{\bm{\beta}_{\bm{\gamma}},\bm{\beta}_{\bm{\gamma}}}^{-1}$ is the expected Fisher information for $\bm{\beta}_{\bm{\gamma}}$ .

Another popular shrinkage prior is the horseshoe prior, a continuous and global-local shrinkage prior, in which the global parameters allow for sharing information across omics features and the local parameters allow for adjustments at the individual omics feature level (Carvalho et al.,, 2009; Mu et al.,, 2021). In a similar setup to the Cox model with spike-and-slab priors in (4), a horseshoe prior for the regression coefficient is

[TABLE]

where the local parameter $\lambda_{j}$ and global parameter $\tau$ are both half-Cauchy distributed $\text{C}^{+}(\cdot,\cdot)$ . With the horseshoe prior, the posterior mean of $\beta_{j}$ will be shrunk by a weight $\kappa_{j}=\frac{1}{1+\lambda_{j}^{2}}\in(0,1)$ as in Figure 7, where $\kappa_{j}\to 0$ induces $\beta_{j}\to 0$ . Using a user-adjustable cutoff value, many coefficients can be shrunk to zero, enabling the selection of only a few associated omics features (with non-zero coefficients).

Although Bayesian models can quantify uncertainty of estimators more straightforward than penalized regressions, most Bayesian Cox-type models for high-dimensional covariates do not have user-friendly and standalone R packages on CRAN or GitHub. The main reason is the high computational cost of running a high-dimensional Bayesian Cox model. Advanced users with programming capabilities can contact the corresponding authors for original scripts. Since Bayesian priors are more flexible than frequentist Lasso-type penalties, it can be easier to extend Bayesian models by changing shrinkage priors while keeping almost the same algorithm framework. This means that the Bayesian framework can be more suitable if one is interested in tailoring the shrinkage effects, e.g. to include prior knowledge about the importance of omics features, for example features corresponding to a molecular pathway that is known to be affected in the disease under study (Zhao et al.,, 2023). For more information on different Bayesian priors in cancer prognosis, we suggest a recent review by Chu et al., (2022) which summarized other different shrinkage priors on regression coefficients, such as Gaussian-gamma, Gaussian, Cauchy, pMOM (product moment distribution), piMOM (product inverse moment distribution) and peNMIG (parameter-expanded normal-mixture-of-inverse-gamma distribution) priors.

5 Survival model validation

Model validation plays an important role in identifying potential issues, such as model misspecification or overfitting. This is achieved by revisiting the model’s specifications and assumptions following model estimation. For example, the Cox model (1) requires proportional hazards and the logarithm of the hazard to be linear with respect to the model covariates. The former assumption can for example be checked by the cumulative Schoenfeld residuals (Grambsch and Therneau,, 1994; Scheike and Zhang,, 2011), and the latter assumption by plotting a nonlinear functional form (e.g. spline) for the effect of a covariate. If the Cox model assumptions are not satisfied, one can try certain transformations of covariates (e.g. Box-Cox power transformations, Box and Cox, (1964)), allow time-varying coefficients or model interactions among covariates (Lin et al.,, 2016; Ng et al.,, 2023), or investigate patient stratification using unsupervised approaches (Garman et al.,, 2008). However, the assumption checks are usually suitable only for low-dimensional models, i.e., for a few clinical variables or a few factors projected from the high-dimensional omics feature space. Johnson and Long, (2011) used heuristic methods to investigate the Cox model assumptions by separately fitting univariate Cox models one feature at one time and testing with the cumulative Schoenfeld residuals correspondingly. An alternative approach is to loosen the model assumptions for a more robust modelling approach. One example developed specifically for feature selection in a high-dimensional space is concordance regression (Dunkler et al.,, 2010; Pang et al.,, 2012).

5.1 Feature stability analysis

One important aspect in model validation when using omics or other high-dimensional data is the potential instability of feature selection (Kalousis et al.,, 2007). Feature selection using penalized regressions as described in Section “Supervised learning via penalized regressions” heavily depends on the values of the penalty parameters (e.g. for the $\lambda$ parameter in Lasso Cox model (2)). The penalty parameters are often optimized by cross-validation (CV) or other resampling methods, and the uncertainty associated with the random selection of subsets may result in uncertainty in the feature selection, e.g. different CV folds will typically result in different selected features. A straightforward way to identify the most stable features is to find the overlap of identified omics features among different data subsets (e.g. CV folds or resamples) to avoid high false discovery rate (Zucknick et al.,, 2008; Zhao and Zucknick,, 2020). One can also perform stability selection (Meinshausen and Bühlmann,, 2010), which allows to select the most stable features at a given Type-I error level for a Lasso or Elastic Net Cox model (Sill et al.,, 2014).

For the Bayesian models in Section “Supervised learning via Bayesian priors”, feature selection stability is naturally assessed by the uncertainty of coefficients’ estimators, as reflected in the posterior variances of $\beta_{j}$ or the posterior selection probabilities $p(\gamma_{j}|\mathcal{D})$ (in SSVS), which is a natural benefit of utilizing full Bayesian inference. Although the uncertainty in feature selection introduces increased variability in the predicted survival probabilities, in the Bayesian framework, this can be addressed quite naturally by averaging the survival predictions over all models using Bayesian model averaging (Volinsky et al.,, 1997). If one is interested in a single model, rather than model averaging, the median probability model (Barbieri and Berger,, 2004) can be used for uncertainty analyses in survival and high-dimensional omics data (Madjar et al.,, 2021).

5.2 Survival prediction and calibration

To confirm that the identified clinical and omics features have prognostic power with respect to the prediction of patients’ survival outcomes, a model should be both accurate (low prediction error) and precise (low prediction uncertainty). The simplest way to demonstrate the survival prognostic power is to dichotomize the prognostic scores (i.e. linear predictor $\mathbf{X}_{i}\bm{\beta}$ in the Cox model) by its median value, and then use a log-rank test to compare the survival probabilities of the patients in the two groups, see Figure 2b. Similarly, one can categorize the prognostic scores by multiple quantiles (e.g., 25%, 50% median, and 75%) into multiple groups of patients and perform a log-rank test.

To validate a prediction model systematically (Rahman et al.,, 2017), the predictive performance of the model is commonly addressed by

•

discrimination: the ability of the model to distinguish between low and high risk patients,

•

calibration: the agreement between the observed and predicted survival probabilities, and

•

overall performance: the distance between the observed and predicted survival probabilities.

5.2.1 Discrimination performance

If one focuses on survival prediction at a fixed time point (e.g. 5-year survival probability), a receiver operating characteristic (ROC) curve can be used to evaluate the prognostic (i.e. prediction or discrimination) ability of the survival model, often summarized by its area under the ROC curve (AUC) (Heagerty et al.,, 2000), see Figure 10a for an example. An AUC of 0.5 is equivalent to the toss of a coin, and the closer the AUC is to 1, the more predictive is the model. When making predictions at multiple time points, ROC curves can be summarized as time-dependent AUC scores, i.e. AUC scores calculated at prespecified time points. Alternatively, the concordance index (C-index) provides a more global, time-independent assessment of the discrimination ability of a prognostic model, such that a better model predicts higher prognostic scores for patients with shorter survival times (Harrell et al.,, 1982), i.e.

[TABLE]

which means in the absence of censoring, any pair of individuals $\{i,j\}$ with survival times $T_{i}<T_{j}$ is concordant if and only if $S(t|\mathbf{X}_{i}(\mathbf{t}))<S(t|\mathbf{X}_{j}(\mathbf{t}))$ (equivalent to ranking the prognostic scores $\mathbf{X}_{i}\bm{\beta}>\mathbf{X}_{j}\bm{\beta}$ in a Cox model), where $\mathbf{t}$ denotes the time instants where there are covariate variations. The C-index can be expressed as a weighted average of the time-dependent AUC over time (Heagerty and Zheng,, 2005). Therefore, its interpretation is similar to the AUC, where a C-index of 0.5 indicates random predictions, while a perfect prognostic model would have a C-index of 1. There are multiple types of C-indices for survival modelling, in particular the most frequently used Harrell’s (Harrell et al.,, 1982) and Uno’s C-index (Uno et al.,, 2011). Uno’s C-index is more robust than Harrell’s C-index, in case there is dependence of the concordance on the study-specific censoring distribution (Uno et al.,, 2011).

In the classical Cox modelling framework, both Harrell’s and Uno’s C-indices only depend on the linear predictors $\mathbf{X}_{i}\bm{\beta}$ , which is independent of $t$ . But if a model includes covariates with time-dependent effects $\bm{\beta}(t)$ and/or time-dependent covariates $\mathbf{X}_{i}(t)$ , Harrell’s and Uno’s C-indices are difficult to be calculated, since they require the calculation of survival functions for each individual over time. In this context, Antolini et al., (2005) proposed a time-dependent C-index, which assesses the concordance of a model’s survival distribution predictions $\hat{S}(t)$ . This means that Antolini’s C-index requires the full specification of $\hat{S}(t)$ , even though a C-index only compares the survival probabilities between any pair of individuals, that is, it only assesses whether the relative order of estimated survival probabilities is concordant with observed survival outcomes (Blanche et al.,, 2019). Time-dependent prediction indices can better evaluate a model including candidate features with time-dependent effects and/or time-dependent features. To avoid C-hacking among different C-indices in model comparison, Sonabend et al., (2022) recommended that if all models make survival distribution predictions, then select a time-dependent C-index; otherwise choose a time-independent measure (e.g. Uno’s C-index); if there is a combination of risk- and distribution-predicting models, then choose a transformation method for analysis (e.g. expected mortality).

5.2.2 Calibration performance

Calibration is to quantify the agreement between the observed and predicted outcomes, which is useful for both internal and external model validation and is recommended to report routinely. The calibration slope is commonly used (van Houwelingen,, 2000), which is the slope of the regression of the observed/actual survival probabilities on the model-predicted survival probabilities. A survival model can be reported with the estimated $t$ -year survival probabilities in predefined subgroups, denoted as $S_{\text{model}}(t|g)$ for subgroups $g=1,\cdots,G$ . The observed/actual $t$ -year survival probabilities in the subgroups can be estimated by the KM method, denoted as $S_{\text{KM}}(t|g)$ . Using the $\ln(-\ln(\cdot))$ -link, the calibration model is

[TABLE]

where $\epsilon$ is an error term. If the intercept $\alpha=0$ and the slope $\beta=1$ , it means that the survival prediction model is well calibrated. For example, Figure 10b shows a calibration plot, visualizing the calibration of the estimated 5-year survival probabilities (with 95% confidence interval by bootstrapping) using the KM method for TCGA-BRCA patients grouped by the quartiles of Cox-model predicted survival probabilities. Furthermore, one can calibrate a Cox model in terms of the baseline cumulative hazard and prognostic score. For non-proportional hazard models, calibration using the model cumulative hazard function can be considered.

As an alternative to the calibration slope at a single time point, Andres et al., (2018) and Haider et al., (2020) suggested the distributional (D)-calibration for accounting survival probabilities across all time points. This can be useful when assessing the entire post-treatment survival prognosis, for example, assessing post liver transplantation survival utility in Andres et al., (2018).

5.2.3 Overall performance

Scoring rules can evaluate the accuracy and confidence of probabilistic predictions, and assess both discrimination and calibration (Gneiting and Raftery,, 2007; Avati et al.,, 2020). The idea of scoring rules dates back to Brier, (1950) which assigned a numerical score for verifying ensemble-based probabilistic predictions of discrete outcomes.

Graf et al., (1999) proposed the time-dependent Brier score, which is the expected mean-squared error of survival probability prediction at different time points (Figure 8), i.e.

[TABLE]

where $t_{i}$ is the survival time of $i$ -th individual, $\hat{S}(t|t_{i})$ is the Cox-model predicted survival probability and $\hat{G}(t)$ is the KM estimate of the censoring distribution. The benefit of the Brier score is that it does not only measure discrimination, similar to evaluation measures like the C-index, but also calibration performance of a survival model. The integrated Brier score (IBS) is as a single measure of prediction accuracy integrating BS( $t$ ) over an entire follow-up time period. Hielscher et al., (2010) presented a comparison between the IBS and a $D$ measure (Schemper and Henderson,, 2000), which is an integrated measure based on the mean absolute deviation rather than the mean-squared error used in IBS. The $D$ measure is more robust towards extreme observations and has a smaller variance than the IBS.

To overcome potential overfitting when using feature selection and model estimation, the survival predictions and model calibration should be evaluated in an independent validation data set. As validation data are seldom available, it is recommended to use resampling-based methods for estimating the uncertainty of the survival model’s performance (Sill et al.,, 2014). This can be done for example by repeatedly splitting the data to training and validation sets, and evaluating a survival model’s performance on the different validation sets using various discrimination or calibration indices. The .632+ bootstrap estimator for a discrimination or calibration index (Figure 8) can balance the apparent (training) error and the average out-of-bag bootstrap error, and in addition accounts for the relative overfitting based on a no-information error rate in high-dimensional settings (Schumacher et al.,, 2007; Binder and Schumacher, 2008a, ).

5.2.4 Graphical representation

After confirming that a model is valid (assumptions hold), accurate (low prediction error), precise (uncertainty of performance measures properly quantified) and its predictions generalizable beyond the training data set (using independent validation data if available), a prognostic nomogram (Kattan et al.,, 1999) can be used to summarize the prognostic effect of the identified clinical and omics features on the risk of a specific year’s overall survival (Figure 9), which may help the clinicians to enhance the patient management and personalized treatment strategies. For example, the red colored dots in Figure 9 show the information of the identified five variables from an example patient and the corresponding scoring points. The summed scoring points of 263 maps to the predicted 1-year, 3-year and 5-year survival probabilities of this patient. Note that most nomograms treat the identified variables independently in the risk calculation, even though there may be significant interactions among the model features that were used in the feature selection step. However, visualizing such interaction effects would make the nomograms less accessible and interpretable, and so, there is still a room for improvement in how to translate the multi-variate risk scores into clinical practice.

When an independent validation data set is available, it is recommended to report a calibration plot corresponding to the nomogram. Using independent validation data to obtain $S_{\text{model}}(t|g)$ in the calibration model (5) is for the generalization capacity of the model. Since we here do not have independent validation data besides TCGA-BRCA data, Figure 10 shows an example calibration plot at 5-year survival evaluation time point based on the built Cox model in Figure 9 for a split 20% TCGA-BRCA data set.

6 Beyond penalized and Bayesian Cox models

In this tutorial, we mainly focused on penalized regressions and Bayesian hierarchical models in the Cox proportional hazards framework. One can extend this framework in several ways. For instance, stay in a likelihood-based modelling framework, but replace the partial likelihood function of the semi-parametric Cox model by alternative likelihood functions (which do not necessarily need to imply proportional hazards), e.g. for parametric survival models like exponential, Weibull, or accelerated failure time (AFT) models, or for Aalen’s additive hazard model (Gorst-Rasmussen and Scheike,, 2012; Wang et al.,, 2019). Alternatively, one can move to a more algorithmic machine learning approach, such as tree-based boosting or bagging methods, e.g. random survival forests (Hothorn et al.,, 2006; Ishwaran et al.,, 2008; Binder and Schumacher, 2008b, ; Jaeger et al.,, 2019; Hornung and Wright,, 2019), and (deep) neural networks (Wiegrebe et al.,, 2023).

Hothorn et al., (2006) and Ishwaran et al., (2008) introduced ensemble tree methods for analyzing right-censored survival data, which construct ensembles from base learners, e.g., binary survival trees for each omics feature. Hothorn et al., (2006) also proposed a gradient boosting algorithm to predict the survival time of patients with acute myeloid leukemia (AML), based on clinical and omics features. Similarly, Binder and Schumacher, 2008b developed a likelihood-based boosting method, which aims to maximize the Cox partial likelihood function, for modelling time-to-event data based on high-dimensional omics input data and which also allows the inclusion of a small number of mandatory covariates. In general, one needs to be cautious if using some machine learning methods that are not well-suited for high-dimensional features. For example, Kvamme et al., (2019) proposed extensions of the Cox model with neural networks, which are only valid if the number of covariates is smaller than the number of samples, i.e. if $p<n$ . A systematic review of deep learning for survival analysis, which includes a survey of methods suitable for high-dimensional data ( $p>n$ ), is provided by Wiegrebe et al., (2023).

In the case of non-proportional hazards, many likelihood-based survival models beyond Cox-type models have also been extended to account for high-dimensional omics as input data. For example, Ma et al., (2006) combined Lin and Yin’s additive hazard model (Lin and Ying,, 1994) with principal component regression for dimension reduction of omics features, which was applied to the study of gene expression-based survival prognosis for diffuse large B-cell lymphoma. Engler and Li, (2009) added the Elastic Net penalty in an accelerated ATF model, which assumes that the effect of a covariate accelerates or decelerates the life course of patients. Schmid and Hothorn, (2008) and Barnwal et al., (2022) used boosting algorithms to learn parametric AFT models. Ha et al., (2014) considered the Lasso, SCAD and a penalized h-likelihood for feature selection in frailty models which assume that individuals have unobserved heterogeneity captured by a latent random term $Z$ , which adapts the Cox model (1) into $h(t|X)=Zh_{0}(t)\exp(X\beta)$ .

6.0.1 Advanced survival models: cure models, competing risks and multi-state models

In some situations, survival data may be different from Figure 1c (also Section “Time-to-event data”), where it was presumed that all individuals will eventually experience the event of interest. Liu et al., (2012) studied the Lasso and SCAD feature selection for the proportional hazard mixture cure model, in which a certain fraction of individuals will never experience the event of interest. Tapak et al., (2015) investigated Lasso, Elastic Net and likelihood-based boosting for microarray-based survival modelling with competing risks, such as “progression” versus “death from non-cancer cause”, i.e., the event of a patient can occur due to one of multiple distinct causes. For a single individual who can experience several possible events, Dutta et al., (2016) proposed a multi-state model to identify risk factors in different stages of disease based on high-dimensional input data.

7 Towards single-cell data analysis

The cellular heterogeneity of complex sample mixtures pose challenges and also opportunities for precision medicine and survival prediction. For example, Zhou et al., (2019) showed that tumor microenvironment-related gene expression signatures do not only accurately predict the survival among colon cancer patients, but also serve as biomarkers for identifying patients who could potentially benefit from adjuvant chemotherapy. scRNA-seq technologies provide an unprecedented opportunity for dissecting the interplay between the cancer cells and the associated tumor microenvironment, and the produced high-dimensional data should also augment existing survival modelling approaches. The emerging single-cell atlas will provide a detailed and quantitative overview of tissue composition and organization, and advance both biomedical research and clinical practice (Elmentaite et al.,, 2022).

Currently, the focus of statistical model development for single-cell data analysis is to understand cell type composition and its impact on gene regulation and transcriptional dynamics, usually based on only a small number of samples/individuals. On the one hand, the underlying statistical models for single-cell data analysis are still in development and continuously being re-evaluated and challenged (Andrews et al.,, 2021; Kharchenko,, 2021). On the other hand, survival analysis tackles disease or health problems on the individual level, and usually requires a relatively large number of individuals for sufficient statistical power. In particular, the development of improved computational methods is urgently needed to enable the consideration of multiple layers (e.g. individual-level, cellular-level, molecular-level) of information when integrating groups of individuals and omics data from a variety of molecular modalities. Therefore, current approaches often map the findings from single-cell omics data to large-scale bulk sequencing omics and survival data, rather than jointly analyzing single-cell omics and survival data. For example, Guo et al., (2018) introduced a generalizable approach to first study T-cells in 14 non-small cell lung cancer (NSCLC) patients and to identify gene signatures from the tumor-enriched T cell clusters, and then investigated these gene signatures using bulk RNA-seq and survival data from larger TCGA-NSCLC patient cohort. Similarly, Zhang et al., (2020) developed a scRNA-seq-based approach to reconstruct a multilayer signaling network based on 16 glioma patients, and then investigated the network genes using survival data from TCGA Chinese glioma genome atlas (TCGA-CGGA) patients. However, direct joint analysis of survival and single-cell omics data from multiple cellular hierarchies requires further methodological developments and new statistical and machine learning methods.

8 Discussion

Although survival analysis faces many modelling challenges, mainly due to censored outcomes, it represents a well-established methodology for finding risk factors associated with patients’ survival. The identification of omics biomarkers for survival prognosis may provide systematic means to guide patient management and personalized treatment and diagnostic strategies. In this tutorial, we provided a comprehensive workflow for survival analysis with omics and clinical data, with a specific focus on feature selection of survival-associated omics features and survival model validation. We covered many penalized regressions and Bayesian models for feature selection and survival prediction, accounting for their specific assumptions and applications. Examples of real data and R scripts have been made available to illustrate the use of different methods, which should help researchers to choose and apply suitable methods for their survival analysis applications (https://ocbe-uio.github.io/survomics/survomics.html). We note that this review only considers methods for right-censored time-to-event data, i.e. where all individuals are assumed to be followed continuously from the start of the study, but where the follow-up period might end before the event (e.g. death) was observed. Other types of censoring include interval censoring and left truncation, and appropriate statistical methods dealing with these censoring patterns should be chosen accordingly.

Most of the current methods for survival analysis do not take into account the complex structures within and between multi-omics data, such as gene regulation and DNA-protein interactions. Regulatory networks constructed either based on prior biological knowledge or using data-driven, yet biologically explainable approaches, may help establish useful methodologies for survival analysis that are more effective for deriving biological insights as well as enable improved clinical translation.

Another limitation of most of the reviewed methods is that they identify omics features predictive of survival, but they cannot determine causal relationships. Causality is a fundamental notion to understand omics features causing disease progression, which will allow one to reliably intervene omics features for targeted therapies. There are two popular models, Pearl’s structural causal model (SCM) and Rubin’s causal model (RCM), both of which introduce perturbations to draw causal inference. Farooq et al., (2023) utilized SCM-based causal discovery approaches to unravel relationships between omics features and survival of breast cancer patients. However, to identify reliable causal relations for clinical applications, laboratory-based experiments, e.g. clustered regularly interspaced short palindromic repeats (CRISPR) techniques (Wang and Doudna,, 2023), are often necessary to verify the functional relevance of the identified omics features. Causal mediation analysis is an important tool, which considers the problem of decomposing the causal effect of treatment/exposure into direct and indirect effects (Lange and Hansen,, 2011; VanderWeele,, 2011). The direct effect corresponds to the effect of a treatment directly on the survival outcome, while an indirect effect corresponds to the effect of a treatment on the outcome that is due to its effect on an intermediate variable (e.g. gene expression) that also has a causal effect on the survival outcome. High-dimensional RCM-based mediation analysis has been used to investigate the indirect effect transmitted by omics features between an exposure and survival outcomes (Song et al.,, 2020, 2021). Targeted learning also fills a much needed gap between statistical modelling and causal inference (van der Laan and Rose,, 2011, 2018). Tuglus and van der Laan, (2011) used targeted maximum likelihood estimation to provide an interpretable causal measure of variable importance for the discovery of biomarkers and omics features. Another way to formalize personalized medicine is dynamic treatment regimes (Tsiatis et al.,, 2019; Deliu et al.,, 2023) that encompasses causal inference and takes into account for the variability in omics, environment and lifestyle factors for each individual to improve the treatment of a particular patient.

Data availability

Supplementary step-by-step R tutorial is available online at https://ocbe-uio.github.io/survomics/survomics.html. TCGA data is publicly available at https://portal.gdc.cancer.gov.

Acknowledgements

The authors thank Professor Ørnulf Borgan for his valuable suggestions and comments. This work was supported by grants from Helse Sør-Øst (grant 2020026 to TA), the Norwegian Cancer Society (216104 to TA), the Radium Hospital Foundation, the Academy of Finland (grants 326238, 340141, 344698, 345803 to TA), Cancer Society of Finland (to TA), the European Union’s Horizon 2020 research and innovation programme (grant ‘PANCAIM’ 101016851 to JZ and TA, grant ‘RESCUER’ 847912 to MZ), the Innovative Medicines Initiative 2 Joint Undertaking of the European Union’s Horizon 2020 research and innovation programme and EFPIA and JDRF INTERNATIONAL (grant ‘imSAVAR’ 853988 to MZ), ERA PerMed under the ERA-NET Cofund scheme of the European Union’s Horizon 2020 research and innovation framework programme (grant ERAPerMed2021_330_SYMMETRY to MZ).

Bibliography135

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Aittokallio, (2009) Aittokallio, T. (2009). Dealing with missing values in large-scale studies: microarray data imputation and beyond. Briefings in Bioinformatics , 11(2):253–264.
2Akbani et al., (2014) Akbani, R., Ng, P. K. S., Werner, H. M. J., Shahmoradgoli, M., Zhang, F., Ju, Z., Liu, W., Yang, J.-Y., Yoshihara, K., Li, J., Ling, S., Seviour, E. G., Ram, P. T., Minna, J. D., Diao, L., Tong, P., Heymach, J. V., Hill, S. M., Dondelinger, F., Städler, N., Byers, L. A., Meric-Bernstam, F., Weinstein, J. N., Broom, B. M., Verhaak, R. G. W., Liang, H., Mukherjee, S., Lu, Y., and Mills, G. B. (2014). A pan-cancer proteomic perspective on the cancer genome atlas. N
3Andres et al., (2018) Andres, A., Montano-Loza, A., Greiner, R., Uhlich, M., Jin, P., Hoehn, B., Bigam, D., Shapiro, J. A. M., and Kneteman, N. M. (2018). A novel learning algorithm to predict individual survival after liver transplantation for primary sclerosing cholangitis. PLOS ONE , 13(3):e 0193523.
4Andrews et al., (2021) Andrews, T. S., Kiselev, V. Y., Mc Carthy, D., and Hemberg, M. (2021). Tutorial: guidelines for the computational analysis of single-cell rna sequencing data. Nature Protocols , 16(1):1–9.
5Antolini et al., (2005) Antolini, L., Boracchi, P., and Biganzoli, E. (2005). A time-dependent discrimination index for survival data. Statistics in Medicine , 24(24):3927–3944.
6Avati et al., (2020) Avati, A., Duan, T., Zhou, S., Jung, K., Shah, N. H., and Ng, A. Y. (2020). Countdown Regression: Sharp and Calibrated Survival Predictions. In Adams, R. P. and Gogate, V., editors, Proceedings of The 35th Uncertainty in Artificial Intelligence Conference , volume 115 of Proceedings of Machine Learning Research , pages 145–155. PMLR.
7Bair and Tibshirani, (2004) Bair, E. and Tibshirani, R. (2004). Semi-supervised methods to predict patient survival from gene expression data. PLOS Biology , 2(4):e 108.
8Barbieri and Berger, (2004) Barbieri, M. M. and Berger, J. O. (2004). Optimal predictive model selection. The Annals of Statistics , 32(3):870–897.