Replica analysis of overfitting in regression models for time-to-event data

ACC Coolen; JE Barrett; P Paga; CJ Perez-Vicente

arXiv:1705.01730·stat.AP·September 13, 2017

TL;DR

This paper develops a mathematical theory using the replica method to analyze and correct overfitting in regression models for time-to-event data, addressing a critical challenge in high-dimensional survival analysis.

Contribution

It introduces a novel application of the replica method to quantify and mitigate overfitting in survival regression models, including Cox's proportional hazards model.

Findings

1. The theory accurately predicts overfitting effects in Cox models.

2. Provides practical tools for correcting overfitting in survival analysis.

3. Enhances understanding of overfitting in high-dimensional clinical data.

Full text

Replica analysis of overfitting in regression models for time-to-event data

ACC Coolen*†‡, JE Barrett§, P Paga†‡*, CJ Perez-Vicente¶

${\dagger}$ Department of Mathematics, King’s College London,

The Strand, London WC2R 2LS, UK

${\ddagger}$ Institute for Mathematical and Molecular Biomedicine, King’s College London,

Hodgkin Building, Guy’s Campus, London SE1 1UL, UK

$\S$ Department of Primary Care and Public Health Sciences, King’s College

London, Addison House, Guy’s Campus, London SE1 1UL, UK

$\P$ Departament de Física Fonamental, Universitat de Barcelona,

08028 Barcelona, Spain

Abstract

Overfitting, which happens when the number of parameters in a model is too large compared to the number of data points available for determining these parameters, is a serious and growing problem in survival analysis. While modern medicine presents us with data of unprecedented dimensionality, these data cannot yet be used effectively for clinical outcome prediction. Standard error measures in maximum likelihood regression, such as p-values and z-scores, are blind to overfitting, and even for Cox’s proportional hazards model (the main tool of medical statisticians), one finds in literature only rules of thumb on the number of samples required to avoid overfitting. In this paper we present a mathematical theory of overfitting in regression models for time-to-event data, which aims to increase our quantitative understanding of the problem and provide practical tools with which to correct regression outcomes for the impact of overfitting. It is based on the replica method, a statistical mechanical technique for the analysis of heterogeneous many-variable systems that has been used successfully for several decades in physics, biology, and computer science, but not yet in medical statistics. We develop the theory initially for arbitrary regression models for time-to-event data, and verify its predictions in detail for the popular Cox model.

PACS: 05.70.Fh, 02.50.-r

1 Introduction

In the simplest possible scenario, survival analysis is concerned with data of the following form. We consider a cohort of $N$ individuals, each of whom is at risk of a specified irreversible event, such as the onset of a given disease or death. For each individual $i$ in this cohort we are given $p$ specific measurements $\mbox{\boldmath$ z $}_{i}=(z_{i1},\ldots,z_{ip})$ (the covariates) which were taken at a baseline time $t=0$, as well as the time $t_{i}>0$ at which for individual $i$ we either observed the irreversible event, or we ceased our observation without having observed the event yet (the latter case is called ‘censoring’). More complex scenarios could involve e.g. having multiple distinct risk types, such as distinct causes of death, or interval censoring, where rather than $t_{i}$ itself, one is given an interval that contains $t_{i}$. The theory developed in this paper can be generalised without serious difficulty to include such extensions, but in the interest of transparency we will focus for now strictly on the simplest case.

[Figure: schematic timelines of the individuals $i=1,\ldots,N$, starting from the baseline time $t=0$.]

The aim of survival analysis is regression, i.e. to use our data for detecting and quantifying probabilistic patterns (if any) that relate an individual’s failure time $t$ to their covariates $\mbox{\boldmath$ z $}$. Such patterns may allow us to predict individual patients’ clinical outcomes, distinguish between high-risk and low-risk patients, reveal general disease mechanisms, or design new data-driven therapeutic interventions (by changing the values of modifiable covariates). For general reviews of the considerable survival analysis literature we refer to textbooks such as [1, 2, 3, 4]. (Footnote 1: Non-medical applications of survival analysis include e.g. the study of the time to component failure in manufacturing, or of the duration of unemployment in economics.) Being able to use the extracted patterns to predict clinical outcomes for unseen patients is the only reliable test of whether our regression results represent true knowledge. Accurate prediction requires that we use as much of the available covariate information as possible, so our focus must be on multivariate regression methods.

Most multivariate survival analysis methods are based on postulating a suitable and plausible parametrisation of the covariate-conditioned event time distribution, whose parameters are estimated from the data via either the maximum likelihood protocol (ML), or (following Bayesian reasoning) via maximum a posteriori probability (MAP). The most popular parametrisation is undoubtedly the proportional hazards model of Cox [5], which uses ML inference, and assumes the event time distribution to be of the so-called proportional hazards form $p(t|\mbox{\boldmath$ z $})=-\frac{\mathrm{d}}{\mathrm{d}t}\exp[-\exp(\mbox{\boldmath$ \beta $}\!\cdot\!\mbox{\boldmath$ z $})\Lambda(t)]$. MAP versions of [5] are the so-called ‘penalised Cox’ or ‘ridge’ regression models (with Gaussian parameter priors), see e.g. [6, 7]. More complex parametrisation proposals, such as frailty or random effects models [8, 9, 10, 11] or latent class models [12], still tend to have proportional hazards type formulae as their building blocks. In all such models the number of parameters is always larger than or equal to the number $p$ of covariates. Hence, to avoid overfitting they can be used safely only when $N\gg p$. This limitation was harmless in the 1970s and 1980s, when many of the currently used models were devised, and when one would typically have datasets with $p\sim 10^{2}$ at most. For the data of post-genome medicine, however, where we regularly have $p\sim 10^{4}$ to $10^{6}$, it poses a serious problem which has for instance prevented us from using genomic covariates in rigorous multivariate regression protocols, forcing us instead to work with ‘gene signatures’.

Overfitting in survival analysis models [14, 15] can be visualized effectively by combining regression with cross-validation. For the Cox model, for instance, one can use the inferred association parameters $\beta$ of [5] in combination with Breslow’s [16] estimator for the base hazard rate (which is the canonical estimator for [5]), to predict whether an event will have happened by a given cutoff time, and compare the fraction of correct predictions in the training set (the data used for regression) to those in a validation set (the unseen data). When drawn as functions of the number of covariates used, the resulting curves typically exhibit the standard fingerprints of overfitting [17, 18]; see Figure 1. Simulations with synthetic data [19] showed that the optimal number of covariates in Cox regression (see arrows in Figure 1) tends to be roughly proportional to the number of samples $N$ . Given this observed phenomenology, it seems vital before doing multivariate regression to have a tool for estimating the minimum number of samples or events needed to avoid the overfitting regime. To our knowledge, there is no theory in the literature yet for predicting this number, not even for the Cox model [5]. One finds only rules of thumb – e.g. the number of failure events must exceed 10 times the number of independent covariates – and empirical bootstrapping protocols, often based on relatively small scale simulation data [19, 20, 21]. This situation is not satisfactory.

To increase our intuition for the problem, we first explore via simple simulation studies the relation between inferred and true parameters in Cox’s model [5]. The parameters of [5] are the vector $\mbox{\boldmath$ \beta $}=(\beta_{1},\ldots,\beta_{p})$ of regression coefficients (where $p$ is the number of covariates), and the base hazard rate $\lambda(t)=\mathrm{d}\Lambda(t)/\mathrm{d}t$. We generated association parameters and covariates randomly from zero-average Gaussian distributions, and corresponding synthetic survival data using Cox’s model without censoring (so all $N$ samples correspond to failure events), for different base hazard rates. To understand the nature of the overfitting-induced regression errors we plotted the $p$ pairs $(\beta_{\mu},\hat{\beta}_{\mu})$ as points in the plane, where $\beta_{\mu}$ and $\hat{\beta}_{\mu}$ are the true and inferred association parameters of covariate $\mu$, respectively, calculated via the recipes of [5]. This resulted in scatterplots as shown in Figure 2. Simulations were done for different values of the ratio $p/N$, with multiple independent runs such that the number of points in each panel is identical. The true association parameters were drawn independently from a zero-average Gaussian distribution with $\langle\beta_{\mu}^{2}\rangle=0.25$ for all $\mu$. Perfect regression would imply finding all points to lie on the diagonal. Rather than a widening of the variance (as with finite sample size regression errors), overfitting-induced errors are somewhat surprisingly seen to manifest themselves mainly as a reproducible tilt of the data cloud, which increases with $p/N$, and implies a consistent over-estimation of associations: both positive and negative $\beta_{\mu}$ will always be reported as more extreme than their true values. These observed errors in association parameters appear to be independent of the form of the true base hazard rate.
Similarly, we show in Figure 3 the inferred integrated base hazard rates $\hat{\Lambda}(t)$ versus time (solid lines), together with the true values (dashed), which again shows consistent and reproducible overfitting errors. A quantitative theory of overfitting that can predict both the observed tilt and width of the data clouds of Figure 2 and the deformed inferred hazard rates of Figure 3 would enable us to correct the inferred parameters of the Cox model for overfitting, and thereby enable reliable regression up to hitherto forbidden ratios of $p/N$ .
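The simulation protocol described above can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' code: it assumes a constant base hazard rate $\lambda(t)=1$, no ties and no censoring, the $1/\sqrt{p}$ scaling of the linear predictor that is introduced later for the Cox model, and a plain Newton iteration with step halving on the partial likelihood. The function names `cox_terms`, `fit_cox` and `tilt_experiment` are hypothetical.

```python
import numpy as np

def cox_terms(beta, X):
    """Partial log-likelihood, gradient and information matrix.
    Rows of X must be sorted by increasing event time (no ties, no censoring)."""
    eta = X @ beta
    w = np.exp(eta - eta.max())                      # rescaled for numerical stability
    # Reverse cumulative sums give sums over the risk set {j : t_j >= t_i}.
    S0 = np.cumsum(w[::-1])[::-1]
    S1 = np.cumsum((w[:, None] * X)[::-1], axis=0)[::-1]
    mean = S1 / S0[:, None]                          # risk-set average of covariates
    loglik = np.sum(eta - eta.max() - np.log(S0))
    grad = np.sum(X - mean, axis=0)
    a = np.cumsum(1.0 / S0)                          # a_j = sum_{i <= j} 1 / S0_i
    info = X.T @ ((w * a)[:, None] * X) - mean.T @ mean
    return loglik, grad, info

def fit_cox(X, max_iter=100, tol=1e-8):
    """Newton-Raphson with step halving (one simple choice of optimiser)."""
    beta = np.zeros(X.shape[1])
    loglik, grad, info = cox_terms(beta, X)
    for _ in range(max_iter):
        step = np.linalg.solve(info, grad)
        scale = 1.0
        while scale > 1e-8:
            trial = cox_terms(beta + scale * step, X)
            if trial[0] >= loglik:
                break
            scale *= 0.5
        beta = beta + scale * step
        loglik, grad, info = trial
        if np.max(np.abs(scale * step)) < tol:
            break
    return beta

def tilt_experiment(N=1000, p=300, S=0.5, seed=0):
    """Slope of inferred versus true association parameters for one data set."""
    rng = np.random.default_rng(seed)
    beta_true = rng.normal(0.0, S, size=p)           # <beta^2> = S^2 = 0.25
    X = rng.normal(size=(N, p)) / np.sqrt(p)         # assumed scaling z / sqrt(p)
    # Constant base hazard: t is exponential with rate exp(beta . z / sqrt(p)).
    t = rng.exponential(size=N) * np.exp(-X @ beta_true)
    beta_hat = fit_cox(X[np.argsort(t)])
    return (beta_true @ beta_hat) / (beta_true @ beta_true)

if __name__ == "__main__":
    for N, p in ((2000, 20), (1000, 300)):
        print(f"p/N = {p / N:.2f}  slope = {tilt_experiment(N=N, p=p):.3f}")
```

The returned slope is the least-squares slope through the origin of the cloud $(\beta_{\mu},\hat{\beta}_{\mu})$; a value above one corresponds to the tilt described in the text.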

There are mathematical obstacles to the development of a theory of overfitting in survival analysis, which probably explain why it has so far remained an open problem. First, unlike discriminant analysis, it is not immediately clear which error measure to study when outcomes to be predicted are event times. Second, in most survival analysis models (including Cox regression) the estimated parameters are to be solved from coupled transcendental equations, and cannot therefore be written in explicit form. Third, in the overfitting regime one will by definition find even for large $N$ that the inferred parameters depend on the realisation of the data set, while at the more macroscopic level of prediction accuracy there is no such dependence. It is thus not a priori clear which quantities to focus on in analytical studies of the regression process, and at which stage in the calculation (if any) averages over possible realisations of the data set may be performed safely.

Our present approach to the problem consists of distinct stages, each removing a specific obstacle, and this is reflected in the structure of our paper. We adapt to time-to-event regression the strategy proposed and executed several decades ago for binary classifiers in the groundbreaking paper by Gardner [22]. We first translate the problem of modelling overfitting into the calculation of a specific information-theoretic generating function, from which we can extract the information we need. Next we use Laplace’s argument to eliminate the maximisation over model parameters that comes with all ML methods, which is equivalent to writing the ground state energy of a statistical mechanical system as the zero temperature limit of the free energy. The third stage is devoted to making the resulting calculation of the generating function feasible, using the so-called replica method. This method has an impressive track record of several decades in the analysis of complex heterogeneous many-variable systems in physics [23, 24, 25, 26, 27], computer science [22, 28], biology [29, 30, 31], and economics [32, 33], and enables us to carry out analytically the average of the generating function over all possible realisations of the data set. Finally we exploit steepest descent integration for $N\to\infty$ , leading to the identification of the ‘natural’ macroscopic order parameters of the problem, for which we derive closed equations within the replica symmetric (RS) ansatz. Some technical arguments are placed in appendices, to improve the flow of the paper. We develop our methods initially for generic time-to-event regression models, and then specialise to the Cox model. The final RS equations obtained for the Cox model involve a small number of scalar order parameters, from which we can compute the link between true and inferred regression parameters, and the inferred base hazard rate. 
The functional saddle point equation for the base hazard rate is rather nontrivial; while we can calculate the asymptotic form of its solution analytically, we limit ourselves mostly to a variational approximation, which already turns out to be quite accurate. We close with a discussion of our results, their implications and applications, and avenues for future work.

2 Overfitting in Maximum Likelihood models for survival analysis

2.1 Definitions

We assume we have simple time-to-event data $\mathscr{D}$ of the standard type, consisting of $N$ independently drawn samples $i=1\ldots N$ , with just one active risk and no censoring. Each sample consists of a covariate vector $\mbox{\boldmath$ z $}_{i}\in{\rm I\!R}^{p}$ , drawn independently from a distribution $P(\mbox{\boldmath$ z $})$ , and an associated time to event $t_{i}\in[0,\infty)$ , drawn from $P(t|\mbox{\boldmath$ z $},\mbox{\boldmath$ \theta $}^{\star})$ :

$$\mathscr{D}=\{(\mbox{\boldmath$z$}_{1},t_{1}),\ldots,(\mbox{\boldmath$z$}_{N},t_{N})\} \qquad (1)$$

Here $P(t|\mbox{\boldmath$ z $},\mbox{\boldmath$ \theta $}^{\star})$ describes a parametrised time-generating model, with $q$ unknown real-valued parameters collected in a vector $\mbox{\boldmath$ \theta $}^{\star}\in{\rm I\!R}^{q}$ that we seek to estimate from the data $\mathscr{D}$ . We are not interested in estimating $P(\mbox{\boldmath$ z $})$ , so we take the covariate vectors $\{\mbox{\boldmath$ z $}_{1},\ldots,\mbox{\boldmath$ z $}_{N}\}$ as given. The data probability for each parameter choice $\theta$ is

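$$P(\mathscr{D}|\mbox{\boldmath$\theta$})=\prod_{i=1}^{N}P(t_{i}|\mbox{\boldmath$z$}_{i},\mbox{\boldmath$\theta$}) \qquad (2)$$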

We next define the empirical distribution of covariates and event times, given the observed data:

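$$\hat{P}(t,\mbox{\boldmath$z$}|\mathscr{D})=\frac{1}{N}\sum_{i=1}^{N}\delta(t-t_{i})\,\delta(\mbox{\boldmath$z$}-\mbox{\boldmath$z$}_{i}) \qquad (3)$$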

This allows us to write

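$$\frac{1}{N}\log P(\mathscr{D}|\mbox{\boldmath$\theta$})=\int\!\mathrm{d}\mbox{\boldmath$z$}\,\mathrm{d}t~\hat{P}(t,\mbox{\boldmath$z$}|\mathscr{D})\log P(t|\mbox{\boldmath$z$},\mbox{\boldmath$\theta$})=-H(t|\mbox{\boldmath$z$},\mathscr{D})-D(\hat{P}_{\mathscr{D}}||P_{\mbox{\boldmath$\theta$}}) \qquad (4)$$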

with the conditional differential Shannon entropy of the event time distribution, and the Kullback-Leibler distance [34] between the empirical distribution $\hat{P}(t|\mbox{\boldmath$ z $},\mathscr{D})$ and the parametrised form $P(t|\mbox{\boldmath$ z $},\mbox{\boldmath$ \theta $})$ :

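$$H(t|\mbox{\boldmath$z$},\mathscr{D})=-\int\!\mathrm{d}\mbox{\boldmath$z$}\,\mathrm{d}t~\hat{P}(t,\mbox{\boldmath$z$}|\mathscr{D})\log\hat{P}(t|\mbox{\boldmath$z$},\mathscr{D}) \qquad (5)$$

$$D(\hat{P}_{\mathscr{D}}||P_{\mbox{\boldmath$\theta$}})=\int\!\mathrm{d}\mbox{\boldmath$z$}\,\mathrm{d}t~\hat{P}(t,\mbox{\boldmath$z$}|\mathscr{D})\log\Big[\frac{\hat{P}(t|\mbox{\boldmath$z$},\mathscr{D})}{P(t|\mbox{\boldmath$z$},\mbox{\boldmath$\theta$})}\Big] \qquad (6)$$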

The parameters $\theta$ estimated via the ML recipe are those that maximise $P(\mathscr{D}|\mbox{\boldmath$ \theta $})$ . According to (4) they minimise the Kullback-Leibler distance $D(\hat{P}_{\mathscr{D}}||P_{\mbox{\boldmath$ \theta $}})$ between the empirical covariate-conditioned event time distribution and the parametrised event time distribution with parameter values $\theta$ :

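$$\mbox{\boldmath$\theta$}_{\rm ML}={\rm argmin}_{\mbox{\boldmath$\theta$}}~D(\hat{P}_{\mathscr{D}}||P_{\mbox{\boldmath$\theta$}}) \qquad (7)$$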

If $N\to\infty$ for fixed $p$ and $q$, the law of large numbers guarantees that $\lim_{N\to\infty}\hat{P}(t|\mbox{\boldmath$ z $},\mathscr{D})=P(t|\mbox{\boldmath$ z $},\mbox{\boldmath$ \theta $}^{\star})$ (in a distributional sense), and hence ML regression will indeed estimate the parameters $\mbox{\boldmath$ \theta $}$ asymptotically correctly, provided the chosen parametrisation is unambiguous:

$$\lim_{N\to\infty}\mbox{\boldmath$\theta$}_{\rm ML}=\mbox{\boldmath$\theta$}^{\star} \qquad (8)$$

In this paper, however, we focus on the regime of large datasets with high-dimensional covariate and parameter vectors where overfitting occurs, namely $p,q={\mathcal{O}}(N)$ and $N\to\infty$ . Here $\hat{P}(t|\mbox{\boldmath$ z $},\mathscr{D})$ no longer converges to $P(t|\mbox{\boldmath$ z $},\mbox{\boldmath$ \theta $}^{\star})$ for $N\to\infty$ in any mathematical sense, the identity (8) is therefore violated, and minimising $D(\hat{P}_{\mathscr{D}}||P_{\mbox{\boldmath$ \theta $}})$ as per the ML prescription is no longer appropriate. This is the information-theoretic description of the overfitting phenomenon in survival analysis.

2.2 An information-theoretic measure of under- and overfitting

Maximum likelihood regression algorithms report those parameters $\theta$ for which $P(t,\mbox{\boldmath$ z $}|\mbox{\boldmath$ \theta $})$ is as similar as possible to the empirical distribution $\hat{P}(t|\mbox{\boldmath$ z $},\mathscr{D})$ , as opposed to the true distribution $P(t|\mbox{\boldmath$ z $},\mbox{\boldmath$ \theta $}^{\star})$ from which the data $\mathscr{D}$ were generated. The optimal outcome of regression is for the inferred parameters to be identical to the true ones, i.e. to find ${\rm argmin}_{\mbox{\boldmath$ \theta $}}~{}D(\hat{P}_{\mathscr{D}}||P_{\mbox{\boldmath$ \theta $}})=\mbox{\boldmath$ \theta $}^{\star}$ . We therefore define

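$$E(\mbox{\boldmath$\theta$}^{\star}\!,\mathscr{D})=\min_{\mbox{\boldmath$\theta$}}D(\hat{P}_{\mathscr{D}}||P_{\mbox{\boldmath$\theta$}})-D(\hat{P}_{\mathscr{D}}||P_{\mbox{\boldmath$\theta$}^{\star}}) \qquad (9)$$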

This allows us to interpret the value of $E(\mbox{\boldmath$ \theta $}^{\star}\!,\mathscr{D})$ as a measure of ML regression performance:

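$$E(\mbox{\boldmath$\theta$}^{\star}\!,\mathscr{D})>0:\quad\mbox{underfitting} \qquad (10)$$

$$E(\mbox{\boldmath$\theta$}^{\star}\!,\mathscr{D})=0:\quad\mbox{optimal regression} \qquad (11)$$

$$E(\mbox{\boldmath$\theta$}^{\star}\!,\mathscr{D})<0:\quad\mbox{overfitting} \qquad (12)$$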

Optimal regression algorithms would reduce $D(\hat{P}_{\mathscr{D}}||P_{\mbox{\boldmath$ \theta $}})$ until $D(\hat{P}_{\mathscr{D}}||P_{\mbox{\boldmath$ \theta $}})=D(\hat{P}_{\mathscr{D}}||P_{\mbox{\boldmath$ \theta $}^{\star}})$ and then stop. Maximum likelihood regression will not do this; if it can reduce the Kullback-Leibler distance further it will do so, and thereby cause overfitting. For $N\to\infty$ we expect $E(\mbox{\boldmath$ \theta $}^{\star}\!,\mathscr{D})$ to depend on the data $\mathscr{D}$ only via $P(\mbox{\boldmath$ z $})$ and $\mbox{\boldmath$ \theta $}^{\star}$; this is the fundamental assumption behind any regression. It allows us to focus on the average of $E(\mbox{\boldmath$ \theta $}^{\star}\!,\mathscr{D})$ over all realisations of the data, given $P(\mbox{\boldmath$ z $})$ and $\mbox{\boldmath$ \theta $}^{\star}$:

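$$E(\mbox{\boldmath$\theta$}^{\star})=\langle E(\mbox{\boldmath$\theta$}^{\star}\!,\mathscr{D})\rangle_{\mathscr{D}}=\Big\langle\min_{\mbox{\boldmath$\theta$}}\Big\{\frac{1}{N}\sum_{i=1}^{N}\log\Big[\frac{P(t_{i}|\mbox{\boldmath$z$}_{i},\mbox{\boldmath$\theta$}^{\star})}{P(t_{i}|\mbox{\boldmath$z$}_{i},\mbox{\boldmath$\theta$})}\Big]\Big\}\Big\rangle_{\mathscr{D}} \qquad (13)$$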

in which

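$$\langle F(t_{1},\ldots,t_{N};\mbox{\boldmath$z$}_{1},\ldots,\mbox{\boldmath$z$}_{N})\rangle_{\mathscr{D}}=\int\!\prod_{i=1}^{N}\Big[\mathrm{d}\mbox{\boldmath$z$}_{i}\,\mathrm{d}t_{i}~P(\mbox{\boldmath$z$}_{i})P(t_{i}|\mbox{\boldmath$z$}_{i},\mbox{\boldmath$\theta$}^{\star})\Big]F(t_{1},\ldots,t_{N};\mbox{\boldmath$z$}_{1},\ldots,\mbox{\boldmath$z$}_{N}) \qquad (14)$$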

Evaluating $E(\mbox{\boldmath$ \theta $}^{\star})$ analytically for $N\to\infty$ is the focus of this paper. Clearly, if the relevant minimum over $\theta$ corresponds to the true value $\mbox{\boldmath$ \theta $}^{\star}$ for all $\mathscr{D}$ , then $E(\mbox{\boldmath$ \theta $}^{\star})=0$ .
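As a toy illustration of this measure, consider the simplest possible model: no covariates and a single parameter, $P(t|\theta)=\theta\,\rme^{-\theta t}$, for which the ML estimate is known in closed form. The sketch below assumes that $E(\theta^{\star},\mathscr{D})$ equals the per-sample log-likelihood ratio between the true parameter and the ML parameter, which is how we read the definition above; the function name is hypothetical.

```python
import numpy as np

def overfitting_measure_exponential(t, rate_true):
    """E(theta*, D) for the toy model P(t|theta) = theta * exp(-theta * t).

    Assumption: E = (1/N) sum_i log[ P(t_i|theta*) / P(t_i|theta_ML) ].
    """
    t_mean = np.mean(t)
    rate_ml = 1.0 / t_mean                        # closed-form ML estimate
    loglik_true = np.log(rate_true) - rate_true * t_mean
    loglik_ml = np.log(rate_ml) - rate_ml * t_mean
    return loglik_true - loglik_ml

rng = np.random.default_rng(0)
for N in (10, 100, 10_000):
    t = rng.exponential(scale=1.0 / 2.0, size=N)  # true rate theta* = 2
    print(N, overfitting_measure_exponential(t, rate_true=2.0))
```

Because the ML estimate maximises the likelihood on the observed sample, the value is never positive, and with one parameter and growing $N$ it shrinks towards zero. The regime studied in this paper is the one where the number of parameters grows with $N$, so that this shrinking no longer happens.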

2.3 Analytical evaluation of the average over data sets

Working out (13) analytically for large $N$ requires first that we deal with the minimisation over $\mbox{\boldmath$ \theta $}$. This can be done by converting the problem into the calculation of the ground state energy for a statistical mechanical system with degrees of freedom $\mbox{\boldmath$ \theta $}\in{\rm I\!R}^{q}$ and Hamiltonian $H(\mbox{\boldmath$ \theta $})=NE(\mbox{\boldmath$ \theta $})$ (footnote 2: the rescaling with $N$ of the Hamiltonian is done in anticipation of subsequent limits):

[TABLE]

For finite $\gamma$ , the quantity $E_{\gamma}(\mbox{\boldmath$ \theta $}^{\star})$ can be interpreted as the average result of a stochastic minimisation, based on carrying out gradient descent on the function $-\log P(\mathscr{D}|\mbox{\boldmath$ \theta $})$ , supplemented by a Gaussian white noise with variance proportional to $\gamma^{-1}$ .
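The idea of replacing a minimisation by a large-$\gamma$ limit can be checked numerically on a one-dimensional toy objective. The sketch below is only an illustration of Laplace's argument, not of the paper's specific formulae: the objective `f`, the grid, and the function name are assumptions. It evaluates both the free energy $-\gamma^{-1}\log\int\!\rmd\theta\,\rme^{-\gamma f(\theta)}$ and the Boltzmann average of $f$, and both approach $\min_{\theta}f(\theta)$ as $\gamma$ grows.

```python
import numpy as np

def laplace_estimates(f_values, dtheta, gamma):
    """Return (free energy, Boltzmann-averaged f) for weights exp(-gamma * f)
    evaluated on a uniform grid with spacing dtheta."""
    a = -gamma * f_values
    shift = a.max()                      # log-sum-exp shift for numerical stability
    weights = np.exp(a - shift)
    log_z = shift + np.log(weights.sum() * dtheta)
    free_energy = -log_z / gamma
    mean_f = np.sum(weights * f_values) / weights.sum()
    return free_energy, mean_f

theta = np.linspace(-5.0, 5.0, 200001)
f = (theta - 1.0) ** 2 + 0.3             # toy objective, minimum 0.3 at theta = 1
for gamma in (1.0, 10.0, 1e2, 1e4):
    print(gamma, laplace_estimates(f, theta[1] - theta[0], gamma))
```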

The remaining obstacle is the logarithm in (16), which prevents the average over all data sets $\mathscr{D}$ from factorising over the samples. This we handle using the so-called replica method, which is based on the identity $\langle\log Z\rangle=\lim_{n\to 0}n^{-1}\log\langle Z^{n}\rangle$ , and to our knowledge has not yet been applied in survival analysis. In the replica method the average $\langle Z^{n}\rangle$ is carried out for integer $n$ , and the limit $n\to 0$ is taken at the end of the calculation via analytical continuation. Application to (16) leads us after some simple manipulations to a new expression in which the average over data sets does factorise over samples:

[TABLE]
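The replica identity $\langle\log Z\rangle=\lim_{n\to 0}n^{-1}\log\langle Z^{n}\rangle$ can be checked numerically in a case where both sides are easy to estimate. The sketch below uses a toy log-normal $Z$ (an assumption made purely for illustration, unrelated to the Cox model), for which $n^{-1}\log\langle Z^{n}\rangle=\mu+n\sigma^{2}/2$ exactly, so the convergence for $n\to 0$ is visible directly.

```python
import numpy as np

def replica_log_average(log_z, n):
    """Estimate (1/n) * log <Z^n> from samples of log Z, using a stable shift."""
    a = n * log_z
    shift = a.max()
    return (shift + np.log(np.mean(np.exp(a - shift)))) / n

rng = np.random.default_rng(0)
log_z = rng.normal(loc=2.0, scale=1.0, size=200_000)   # toy: Z is log-normal
direct = log_z.mean()                                  # Monte Carlo <log Z>
for n in (1.0, 0.1, 0.001):
    print(n, replica_log_average(log_z, n), direct)
```

In the actual calculation the moments $\langle Z^{n}\rangle$ are computed analytically for integer $n$ and then continued to $n\to 0$; the numerical check only illustrates why the limit recovers the average of the logarithm.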

The average over data sets has now been done, and we are left with a completely general explicit expression for $E(\mbox{\boldmath$ \theta $}^{\star})$ in terms of the covariate statistics $P(\mbox{\boldmath$ z $})$ and the assumed parametrised data generating model $P(t|\mbox{\boldmath$ z $},\mbox{\boldmath$ \theta $})$ . We will now work out and study this expression for Cox’s proportional hazards model [5] with statistically independent zero-average Gaussian covariates.

2.4 Application to Cox regression

In Cox’s method [5] the model parameters are a base hazard rate $\lambda(t)\geq 0$ (with $t\geq 0$ ) and a vector $\mbox{\boldmath$ \beta $}\in{\rm I\!R}^{p}$ of regression coefficients. The assumed event time statistics are then of the following form:

[TABLE]

The factors $\sqrt{p}$ only induce an irrelevant scaling factor that will make it easier to take the limit $p\to\infty$ . In fact, for large $p$ it is inevitable that the typical association parameter in the Cox model will scale as ${\mathcal{O}}(p^{-\frac{1}{2}})$ , since otherwise one would not find finite nonzero event times.
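The role of the $\sqrt{p}$ scaling can be seen in a short numerical experiment. The sketch below assumes, as in the rest of this section, independent zero-average unit-variance Gaussian covariates, and association parameters with $\langle\beta_{\mu}^{2}\rangle=S^{2}$; the function name and sample sizes are hypothetical. Without the scaling the variance of $\mbox{\boldmath$ \beta $}\!\cdot\!\mbox{\boldmath$ z $}$ grows linearly with $p$, whereas $\mbox{\boldmath$ \beta $}\!\cdot\!\mbox{\boldmath$ z $}/\sqrt{p}$ has variance close to $S^{2}$ for any $p$.

```python
import numpy as np

def predictor_variances(p, S=0.5, n_samples=5000, seed=0):
    """Return (variance of beta.z, variance of beta.z/sqrt(p), beta.beta/p)."""
    rng = np.random.default_rng(seed)
    beta = rng.normal(0.0, S, size=p)
    z = rng.normal(size=(n_samples, p))
    raw = z @ beta
    return raw.var(), (raw / np.sqrt(p)).var(), beta @ beta / p

for p in (25, 100, 400):
    print(p, predictor_variances(p))
```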

For simplicity we assume that the covariates are distributed according to $P(\mbox{\boldmath$ z $})=(2\pi)^{-p/2}\exp(-\frac{1}{2}\mbox{\boldmath$ z $}^{2})$. This restriction of our analysis to uncorrelated covariates is no limitation, since for the Cox model one can always obtain, via a simple mapping, the regression results for data with correlated covariates from those obtained for uncorrelated covariates. This is demonstrated in Appendix A.

For the Cox model our general result (17) takes the following form, involving ordinary integration over $n$ -fold replicated vectors $\mbox{\boldmath$ \beta $}^{\alpha}$ and functional integration over $n$ -fold replicated base hazard rates $\lambda^{\alpha}$ :

[TABLE]

To enable efficient further analysis we define the short-hands

[TABLE]

and the $(n\!+\!1)$-dimensional vector $\mbox{\boldmath$ y $}=(y_{0},\ldots,y_{n})$. In addition we rename $(\mbox{\boldmath$ \beta $}^{\star},\lambda^{\star})=(\mbox{\boldmath$ \beta $}^{0},\lambda^{0})$, so that

[TABLE]

All $\{y_{\alpha}\}$ are linear combinations of Gaussian random variables, so also $p(\mbox{\boldmath$ y $}|\mbox{\boldmath$ \beta $}^{0},\ldots,\mbox{\boldmath$ \beta $}^{n})$ will be Gaussian (even for most non-Gaussian covariates this would still hold for large $p$ due to the central limit theorem), giving

[TABLE]

in which the entries of the $(n\!+\!1)\times(n\!+\!1)$ covariance matrix $\mbox{\boldmath$ C $}[\{\mbox{\boldmath$ \beta $}\}]$ are

[TABLE]
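The Gaussian form of $p(\mbox{\boldmath$ y $}|\mbox{\boldmath$ \beta $}^{0},\ldots,\mbox{\boldmath$ \beta $}^{n})$ can be checked by sampling. The sketch below assumes that the short-hands are $y_{\alpha}=\mbox{\boldmath$ \beta $}^{\alpha}\!\cdot\!\mbox{\boldmath$ z $}/\sqrt{p}$ and that the covariance entries are therefore $C_{\alpha\rho}=p^{-1}\sum_{\mu}\beta_{\mu}^{\alpha}\beta_{\mu}^{\rho}$; both follow from the scaling used above with unit-variance independent covariates, but they are our reading of the definitions. The function name `overlap_matrix` is hypothetical.

```python
import numpy as np

def overlap_matrix(betas):
    """betas has shape (n + 1, p); returns C[a, r] = (1/p) * sum_mu beta^a_mu beta^r_mu."""
    return betas @ betas.T / betas.shape[1]

rng = np.random.default_rng(1)
p, n_replicas = 50, 2
betas = rng.normal(size=(n_replicas + 1, p))      # rows play the role of beta^0 ... beta^n
z = rng.normal(size=(20000, p))                   # independent N(0, 1) covariates
y = z @ betas.T / np.sqrt(p)                      # assumed short-hand y_alpha

print(np.round(np.cov(y, rowvar=False), 3))       # empirical covariance of y
print(np.round(overlap_matrix(betas), 3))         # predicted covariance C
```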

We introduce integrals over $\delta$ -distributions to transport variables to more convenient places, by substituting for each pair $(\alpha,\rho)$ :

[TABLE]

We then obtain, after some simple manipulations,

[TABLE]

For finite $N$, expressions such as (26) are of course not easy to use, but as with all statistical theories we will be able to progress upon assuming $N$ to be large. (Footnote 3: Note that the standard use of Cox regression away from the overfitting regime, including its formulae for confidence intervals and for p-values (which require Gaussian approximations that build on large $N$ expansions around the most probable parameter values, and assume that uncertainty in base hazard rates can be neglected), is similarly valid only when $N$ is sufficiently large.) We therefore focus on the asymptotic behaviour of (26) for $N\to\infty$, but with a fixed ratio $p/N$, and will confirm a posteriori the extent to which the resulting theory describes what is observed for large but finite sample sizes.

3 Asymptotic analysis of overfitting in the Cox model

3.1 Conversion to a saddle-point problem

Following extensive experience with the replica method in other disciplines, with similar definitions, we assume that the two limits $N\to\infty$ and $n\to 0$ commute. The invariance of the right-hand side of (26) under all permutations of the sample indices $i\in\{1,\ldots,N\}$ implies that $E(\mbox{\boldmath$ \beta $}^{0},\lambda_{0})$ can depend on the true association parameters $\mbox{\boldmath$ \beta $}^{0}$ only via the distribution $P(\beta_{0})=p^{-1}\sum_{\mu=1}^{p}\delta[\beta_{0}-\beta_{\mu}^{0}]$ . With a modest amount of foresight we define $S^{2}=p^{-1}\sum_{\mu=1}^{p}(\beta_{\mu}^{0})^{2}$ , and obtain

[TABLE]

Writing the ratio of covariates over samples as $p/N=\zeta$ , to be kept fixed in the limit $N\to\infty$ , we may take the limit $N\to\infty$ and obtain an integral that can be evaluated using steepest descent:

[TABLE]

in which the function to be extremized is

[TABLE]

Differentiation with respect to $\hat{C}_{00}$ immediately gives $C_{00}=S^{2}$ . Moreover, for various integrals to be well-defined, the relevant saddle-point must (after contour deformation in the complex plane) be of a form where

[TABLE]

with $D_{\alpha\rho},d_{\rho}\in{\rm I\!R}$ , and where the $n\times n$ matrix $\mbox{\boldmath$ D $}=\{D_{\alpha\rho}\}$ is positive definite. Thus at the relevant saddle-point we will have

[TABLE]

Variation with respect to the $n$ components $\{d_{\alpha}\}$ gives $d_{\alpha}=-S^{-2}\sum_{\rho}D_{\alpha\rho}C_{0\rho}$ , so

[TABLE]

This intermediate result confirms that $\lim_{N\to\infty}E_{\gamma}(P,\lambda_{0})$ indeed depends on the distribution $P(\beta_{0})$ only via $S^{2}=\int\!\mathrm{d}\beta_{0}~P(\beta_{0})\beta_{0}^{2}$, so we may henceforth write the former quantity as $E_{\gamma}(S,\lambda_{0})$. Variation with respect to $\mbox{\boldmath$ D $}$ finally gives $(\mbox{\boldmath$ D $}^{-1})_{\alpha\rho}=C_{\alpha\rho}\!-\!C_{0\alpha}C_{0\rho}/S^{2}$. Hence we arrive at the following expression, in which the short-hand $\mbox{\boldmath$ C $}^{\prime}$ denotes the $n\times n$ matrix with entries $C^{\prime}_{\alpha\rho}=C_{\alpha\rho}\!-\!C_{0\alpha}C_{0\rho}/S^{2}$ (for $\alpha,\rho=1\ldots n$):

[TABLE]

The extremisation over $C$ is to be done subject to $C_{00}=S^{2}$ , and we have removed from $\Psi[\ldots]$ those terms that will vanish after taking $n\to 0$ and differentiating with respect to $\gamma$ .
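One way to read the matrix $\mbox{\boldmath$ C $}^{\prime}$ is as the Schur complement of the entry $C_{00}=S^{2}$ in the full $(n\!+\!1)\times(n\!+\!1)$ matrix, i.e. the covariance of $(y_{1},\ldots,y_{n})$ conditioned on $y_{0}$. This interpretation is ours and is not stated in the text; the sketch below merely verifies the underlying linear algebra identity, namely that $\mbox{\boldmath$ C $}^{\prime}$ is the inverse of the replica block of $\mbox{\boldmath$ C $}^{-1}$. The function name is hypothetical.

```python
import numpy as np

def reduced_overlap(C):
    """C'[a, r] = C[a, r] - C[0, a] * C[0, r] / C[0, 0] for a, r = 1..n."""
    return C[1:, 1:] - np.outer(C[0, 1:], C[0, 1:]) / C[0, 0]

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 6))
C = A @ A.T / 6.0                                   # random positive definite 4 x 4 matrix
C_prime = reduced_overlap(C)
block_of_inverse = np.linalg.inv(C)[1:, 1:]         # replica block of C^{-1}
print(np.allclose(C_prime, np.linalg.inv(block_of_inverse)))
```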

3.2 Replica symmetric extrema

The replica symmetry ansatz (RS) can be translated into the statement that the solution space of the regression algorithm is ergodic [25, 28, 18], i.e. the typical set of equivalent minima in regression parameter space is connected. Replica symmetric saddle-points of (3.1) are of the following form:

[TABLE]

In Appendix B we derive the equations corresponding to the RS ansatz for the stochastic generalization of the Cox model. With the short-hand ${\rm D}y=(2\pi)^{-1/2}\rme^{-\frac{1}{2}y^{2}}\rmd y$ , and upon removing terms that vanish upon differentiation with respect to $\gamma$ , we can summarise these equations in the limit of large data sets by the following compact expression:

[TABLE]

in which the order parameters $\{u,v,w;\lambda\}$ , which are related to the RS order parameters $\{C,c_{0},c\}$ via

[TABLE]

are to be evaluated at the saddle point of

[TABLE]
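Gaussian averages of the form $\int\!{\rm D}y~f(y)$ appear throughout the RS equations. A minimal sketch of how such averages could be evaluated numerically is given below; the helper name `gauss_avg`, the integration range and the number of grid points are illustrative assumptions, not part of the theory.

```python
import math

def gauss_avg(f, y_max=10.0, n=4000):
    """Approximate the Gaussian average  int Dy f(y)  by the trapezoidal rule.

    Dy = (2*pi)^(-1/2) * exp(-y^2/2) dy.  The cut-off y_max and the number
    of intervals n are illustrative choices (assumptions of this sketch).
    """
    h = 2.0 * y_max / n
    total = 0.0
    for i in range(n + 1):
        y = -y_max + i * h
        weight = 0.5 if i in (0, n) else 1.0  # trapezoid end-point weights
        total += weight * f(y) * math.exp(-0.5 * y * y)
    return total * h / math.sqrt(2.0 * math.pi)

if __name__ == "__main__":
    # Sanity checks against known Gaussian moments
    print(gauss_avg(lambda y: 1.0))                 # normalisation
    print(gauss_avg(lambda y: y * y))               # unit variance
    print(gauss_avg(lambda y: math.exp(0.5 * y)),   # int Dy e^{a y} = e^{a^2/2}
          math.exp(0.125))
```

For smooth, rapidly decaying integrands the trapezoidal rule on a uniform grid is already very accurate; Gauss-Hermite quadrature would be a natural alternative.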

3.3 Physical interpretation of order parameters

The physical meaning of the order parameters in the replica symmetric matrix $C$ is found in the usual manner for replica calculations [25], by direct application of our manipulations to the calculation of observables. We will write averages over the stochastic maximization of the data log-likelihood at finite $\gamma$ , for a fixed training set $\mathscr{D}$ , as $\langle\ldots\rangle$ , and averages over all data sets (as before) as $\langle\ldots\rangle_{\mathscr{D}}$ . Since the relevant quantities in the theory are found asymptotically to depend on the true association vector $\mbox{\boldmath$ \beta $}^{\star}$ only via $S^{2}=p^{-1}\sum_{\mu=1}^{p}(\beta_{\mu}^{\star})^{2}$ , there is no need for explicit averages over $\mbox{\boldmath$ \beta $}^{\star}$ . This results upon application to the Cox model in the following identifications, in the limit $n\to 0$ :

[TABLE]

In terms of the transformed order parameters $(u,v,w)$ this becomes

[TABLE]

Here $\beta$ is the outcome of maximum likelihood regression for data set $\mathscr{D}$ generated with true association parameters $\mbox{\boldmath$ \beta $}^{\star}$ . Fully random parameter guessing would give $c_{0}=c=0$ and $C>0$ . Perfect regression would imply $\mbox{\boldmath$ \beta $}=\mbox{\boldmath$ \beta $}^{\star}$ for all $\mathscr{D}$ and all $\mbox{\boldmath$ \beta $}^{\star}$ , and hence correspond to $c_{0}=c=C=S^{2}$ , giving $u=v=0$ and $w=S$ . It is reassuring to observe that for $\zeta=0$ , expression (3.2) indeed reproduces $E_{\gamma}(S,\lambda_{0})=0$ if in the right-hand side we substitute the values $u=v=0$ and $w=S$ .

From (40) follow useful inequalities that must hold at the relevant saddle-point in the limit $n\to 0$ , which are consistent with our claim that $u,v,w\geq 0$ :

[TABLE]

The first four inequalities are easy to derive. The fifth follows from:

[TABLE]

If, as suggested by the $\gamma\to\infty$ simulation results shown in Section 1, $\langle\mbox{\boldmath$ \beta $}\rangle\approx\kappa\mbox{\boldmath$ \beta $}^{\star}+\mbox{\boldmath$ \xi $}$ for some $\kappa>0$ , with a zero-average random vector $\xi$ that reflects data set variability, such that $\langle\mbox{\boldmath$ \xi $}\rangle_{\mathscr{D}}=\mbox{\boldmath$ 0 $}$ and with amplitude $\lim_{p\to\infty}p^{-1}\sum_{\mu=1}^{p}\langle\xi_{\mu}^{2}\rangle_{\mathscr{D}}=\sigma^{2}$ , then we would find the RS saddle point obeying $c_{0}=\kappa S^{2}$ and $c=\kappa^{2}S^{2}+\sigma^{2}$ . Hence we would find $v=\sigma$ and $\kappa=w/S$ , and we would expect $\lim_{\gamma\to\infty}u=0$ for $\zeta<1$ . Note that the above relations are true given our definition of the event time distribution as $P(t|\mbox{\boldmath$ z $},\mbox{\boldmath$ \beta $},\lambda)=-\frac{\rmd}{\rmd t}\exp[-\exp(\mbox{\boldmath$ \beta $}\cdot\mbox{\boldmath$ z $}/\sqrt{p})\Lambda(t)]$ . If we were to define this distribution instead without the rescaling factor $\sqrt{p}$ as $P(t|\mbox{\boldmath$ z $},\mbox{\boldmath$ \beta $},\lambda)=-\frac{\rmd}{\rmd t}\exp[-\exp(\mbox{\boldmath$ \beta $}\cdot\mbox{\boldmath$ z $})\Lambda(t)]$ (which is the convention of [5]), then the connection between regression of the form $\langle\mbox{\boldmath$ \beta $}\rangle\approx\kappa\mbox{\boldmath$ \beta $}^{\star}+\mbox{\boldmath$ \xi $}$ and our order parameters would be:

[TABLE]

We conclude that from our RS equations we can extract the dependence on the covariates/samples ratio $\zeta=p/N$ of the two main quantitative characteristics of the data clouds in Figure 2: their angle $\kappa$ and their width $\sigma$ .
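The slope $\kappa$ and width $\sigma$ of a data cloud can be estimated from a single regression experiment by a least-squares fit of the relation $\hat{\beta}_{\mu}\approx\kappa\beta^{\star}_{\mu}+\xi_{\mu}$. The following is a minimal sketch under that assumption; the function name `cloud_slope_and_width` is hypothetical, and the estimate uses one data set rather than the average over data sets that appears in the theory.

```python
import math

def cloud_slope_and_width(beta_true, beta_hat):
    """Least-squares slope kappa and residual width sigma of a data cloud.

    Models beta_hat[mu] ~ kappa * beta_true[mu] + xi[mu] and returns
    (kappa, sigma), with sigma^2 = p^{-1} sum_mu xi[mu]^2.
    """
    p = len(beta_true)
    if p == 0 or p != len(beta_hat):
        raise ValueError("need two non-empty vectors of equal length")
    s2 = sum(b * b for b in beta_true) / p                    # S^2
    if s2 == 0.0:
        raise ValueError("true association vector must be non-zero")
    c0 = sum(a * b for a, b in zip(beta_hat, beta_true)) / p  # overlap with truth
    kappa = c0 / s2
    residuals = [a - kappa * b for a, b in zip(beta_hat, beta_true)]
    sigma = math.sqrt(sum(r * r for r in residuals) / p)
    return kappa, sigma

if __name__ == "__main__":
    # Toy example: slope 1.5 plus noise that is orthogonal to beta_true
    beta_true = [1.0, -1.0, 1.0, -1.0]
    noise = [0.1, 0.1, -0.1, -0.1]
    beta_hat = [1.5 * b + x for b, x in zip(beta_true, noise)]
    print(cloud_slope_and_width(beta_true, beta_hat))
```

In the rescaled convention used in this section, the returned values would correspond to $\kappa=w/S$ and $\sigma=v$.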

Finally, let us turn to the interpretation of equation (3.2). We observe that this equation can be written as

[TABLE]

If we compare expression (47) with the definition of $E_{\gamma}(S,\lambda_{0})$ , which for the Cox model is

[TABLE]

we can infer that

[TABLE]

As a consistency test one can confirm that, as an alternative to retracing the replica derivation, the expressions (40) can also be derived explicitly from (48,50).

3.4 Derivation of RS saddle point equations

The equations from which to solve the replica symmetric order parameters $(u,v,w,\lambda)$ are obtained by extremization of (3.2). Using $\partial\log p(t|\xi)/\partial\xi=1-\rme^{\xi}\Lambda(t)$ , the three scalar equations are found to be

[TABLE]

Upon integrating by parts over $y$ , we can also write equation (51) as

[TABLE]

To work out the functional order parameter equation $\delta\Psi_{\rm RS}(u,v,w;\lambda)/\delta\lambda(s)=0$ we use $\delta\log p(t|\xi)/\delta\lambda(s)=\delta(t\!-\!s)/\lambda(s)-\rme^{\xi}\theta(t\!-\!s)$ , and the abbreviation $p(t)=\int\!{\rm D}y_{0}~{}p(t|Sy_{0},\lambda_{0})$ . This gives

[TABLE]

This latter equation can also be written in terms of the distribution (48), giving a form that reduces to Breslow’s [16] estimator when we subsequently use the interpretation identity (50):

[TABLE]
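The two derivative identities used in this subsection follow directly from the explicit form of the Cox event-time density. The short derivation below is a sketch, assuming $p(t|\xi)=-\frac{\rmd}{\rmd t}\exp[-\rme^{\xi}\Lambda(t)]$ with $\Lambda(t)=\int_{0}^{t}\!\rmd s~\lambda(s)$, in line with the definition of the event-time distribution given earlier.

```latex
% Assumption: Lambda(t) = \int_0^t \lambda(s) ds, so dLambda/dt = lambda(t)
% and the functional derivative of Lambda(t) w.r.t. lambda(s) is theta(t-s).
\begin{align*}
p(t|\xi) &= -\frac{\mathrm{d}}{\mathrm{d}t}
            \exp\!\big[-\mathrm{e}^{\xi}\Lambda(t)\big]
          = \lambda(t)\,\mathrm{e}^{\xi-\mathrm{e}^{\xi}\Lambda(t)}, \\
\log p(t|\xi) &= \log\lambda(t)+\xi-\mathrm{e}^{\xi}\Lambda(t), \\
\frac{\partial}{\partial\xi}\log p(t|\xi) &= 1-\mathrm{e}^{\xi}\Lambda(t), \\
\frac{\delta}{\delta\lambda(s)}\log p(t|\xi)
   &= \frac{\delta(t-s)}{\lambda(s)}
      -\mathrm{e}^{\xi}\,\frac{\delta\Lambda(t)}{\delta\lambda(s)}
    = \frac{\delta(t-s)}{\lambda(s)}-\mathrm{e}^{\xi}\,\theta(t-s).
\end{align*}
```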

The remaining integrations over $y$ in our equations are for finite $\gamma$ quite nontrivial. They can be expressed in terms of the Laplace transform of the lognormal distribution [36], or mapped onto the core integral in the Random Energy Model [37], both of which could in the past be evaluated analytically only in specific parameter limits.

4 Analysis of the RS equations for the Cox model

4.1 RS equations in the limit $\gamma\to\infty$

The original Cox model [5] corresponds to the limit $\gamma\to\infty$ of our equations. It turns out that the correct scaling with $\gamma$ of $u$ for $\gamma\to\infty$ is $u=\tilde{u}/\sqrt{\gamma}$ ; this is suggested by equation (3.4) and confirms the expectation that follows from the physical meaning of $u$ . Substituting $u=\tilde{u}/\sqrt{\gamma}$ as an ansatz into our equations, and assuming the other order parameters to have finite $\gamma\to\infty$ limits, allows us to simplify the three scalar saddle-point equations (those for $v$ and $w$ , together with (3.4)) and the functional equation (55) to

[TABLE]

The remaining complexities of the limit are concentrated in

[TABLE]

with

[TABLE]

After differentiation and rewriting the resulting equation, we find that $\varphi(\eta,t)$ can be written in explicit form in terms of the Lambert W-function [35] as:

[TABLE]

Hence

[TABLE]

Using the identity $\rme^{-W(z)}=W(z)/z$ , which follows directly from the definition of the Lambert $W$ -function, we can simplify the above result to

[TABLE]

Substitution into our $\gamma\to\infty$ order parameter equations finally gives:

[TABLE]

We observe that the choice $v=0$ always solves (67), but that for $\zeta>0$ it is ruled out by (66). Upon doing integration by parts over $z$ , using $\rmd W(z)/\rmd z=W(z)/z[1+W(z)]$ and dismissing the solution $v=0$ , we can simplify equation (67) further to

[TABLE]
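The two properties of the Lambert $W$-function used above, $\rme^{-W(z)}=W(z)/z$ and $\rmd W(z)/\rmd z=W(z)/z[1+W(z)]$, are easy to confirm numerically. The sketch below does so for a few arguments; the Newton solver `lambert_w` is an illustrative stand-in written for this check (restricted to $z\geq 0$), not code used in our study.

```python
import math

def lambert_w(z, tol=1e-14, max_iter=100):
    """Principal branch W(z) for z >= 0, via Newton iteration on w*exp(w) = z."""
    if z < 0:
        raise ValueError("this sketch only handles z >= 0")
    w = math.log1p(z)  # starting point to the right of the root
    for _ in range(max_iter):
        ew = math.exp(w)
        step = (w * ew - z) / (ew * (w + 1.0))
        w -= step
        if abs(step) <= tol * (1.0 + abs(w)):
            break
    return w

def check_identities(z, rel_h=1e-6):
    """Return the mismatches in the two identities at argument z > 0."""
    w = lambert_w(z)
    # Identity 1: exp(-W(z)) = W(z)/z
    err_exp = abs(math.exp(-w) - w / z)
    # Identity 2: dW/dz = W / (z (1 + W)), tested by central differences
    h = rel_h * z
    numeric = (lambert_w(z + h) - lambert_w(z - h)) / (2.0 * h)
    formula = w / (z * (1.0 + w))
    err_der = abs(numeric - formula) / formula
    return err_exp, err_der

if __name__ == "__main__":
    for z in (0.1, 1.0, 10.0, 1000.0):
        print(z, lambert_w(z), check_identities(z))
```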

To compute the corresponding value of the overfitting measure $E(S,\lambda_{0})=\lim_{\gamma\to\infty}E_{\gamma}(S,\lambda_{0})$ , we substitute $u=\tilde{u}/\sqrt{\gamma}$ into (3.2) and take the limit $\gamma\to\infty$ . This gives, using the short-hands (63) and $p(t)=\int\!{\rm D}y_{0}~{}p(t|Sy_{0},\lambda_{0})$ and the identity $\exp[-W(z)]=W(z)/z$ :

[TABLE]

The second integral can be worked out explicitly:

[TABLE]

Therefore

[TABLE]

In Appendix C we study the behaviour of the above equations in the two limits $\zeta\to 0$ and $\zeta\to 1$ . For $\zeta\to 0$ we recover the correct solution corresponding to perfect (overfitting-free) regression, as required. For $\zeta\to 1$ we find a phase transition, characterised by divergence of the order parameters $\{\tilde{u},v,w\}$ .

4.2 Numerical and asymptotic solution of RS equations

Solving the coupled order parameter equations (66,68,69,70) analytically seems for now too ambitious; solving them numerically is nontrivial, and requires some preparation. To cast the equation for $w$ into a form similar to the others, we need to do partial integration over $y_{0}$ :

[TABLE]

We also rewrite the functional equation in a form that involves $\Lambda(t)$ only:

[TABLE]

Numerical integration over $t>0$ can be transformed into integration over the survival function $s(t,y_{0})=\exp[-\rme^{Sy_{0}}\Lambda_{0}(t)]\in[0,1]$ , using $p(t|Sy_{0},\lambda_{0})\rmd t=-\rmd s$ and $t(s,y_{0})=\Lambda_{0}^{\rm inv}(\rme^{-Sy_{0}}\log(1/s))$ . We also define the short-hand $L(t)=\tilde{u}^{2}\rme^{\tilde{u}^{2}}\Lambda(t)$ . These definitions transform our RS equations to:

[TABLE]
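The change of variables from $t$ to the survival function $s$ is the same mapping one would use to generate synthetic event times by inverse transform sampling, since $s(t,y_{0})$ is uniformly distributed on $[0,1]$. The sketch below illustrates this; the choice $\Lambda_{0}(t)=t^{2}$ and all function names are assumptions made for the example only.

```python
import math
import random

def Lambda0(t):
    """Hypothetical integrated base hazard rate, Lambda0(t) = t^2."""
    return t * t

def Lambda0_inv(x):
    """Inverse of the hypothetical Lambda0 above."""
    return math.sqrt(x)

def survival(t, y0, S):
    """s(t, y0) = exp[-exp(S*y0) * Lambda0(t)]."""
    return math.exp(-math.exp(S * y0) * Lambda0(t))

def event_time(s, y0, S):
    """t(s, y0) = Lambda0^inv( exp(-S*y0) * log(1/s) ), for s in (0, 1]."""
    return Lambda0_inv(math.exp(-S * y0) * math.log(1.0 / s))

def sample_event_times(n, S, seed=0):
    """Draw n event times: y0 ~ N(0,1), s ~ U(0,1], then t = t(s, y0)."""
    rng = random.Random(seed)
    times = []
    for _ in range(n):
        y0 = rng.gauss(0.0, 1.0)
        s = 1.0 - rng.random()  # uniform on (0, 1], avoids log(1/0)
        times.append(event_time(s, y0, S))
    return times

if __name__ == "__main__":
    # Round trip: the survival function evaluated at t(s, y0) returns s
    t = event_time(0.37, -0.8, 1.5)
    print(t, survival(t, -0.8, 1.5))
    print(sample_event_times(5, S=1.0))
```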

We next study the functional equation (79) in more detail. We first rewrite it by differentiation with respect to time, and some simple rearrangements, into the more suitable form

[TABLE]

or, upon further differentiation:

[TABLE]

Using $\int_{0}^{1}\!\rmd s~{}\delta[t(s,y_{0})\!-\!t]=p(t|Sy_{0})$ , and upon multiplying both sides by $\frac{\rmd}{\rmd t}L(t)/p(t)$ , this becomes

[TABLE]

We write $L(t)$ in the form $L(t)=\Phi(\Lambda_{0}(t))$ , which is always possible since both $L(t)$ and $\Lambda_{0}(t)$ are monotonic functions of time, and we write $p(t)=\lambda_{0}(t)g(\Lambda_{0}(t))$ with

[TABLE]

Substitution of these conventions, and working out the various time derivatives, then leads to the following equation from which to solve $\Phi(x)$ :

[TABLE]

We now proceed to calculate the solution $\Phi(x)$ of the above equation, which gives us the form of the inferred integrated base hazard rates $\Lambda(t)$ as shown in Figure 3, for large times, i.e. in the regime where $x\to\infty$ and $\Phi(x)\to\infty$ . Here we can use the asymptotic form of the Lambert $W$ -function [35]: $W(z)=\log z-\log(\log z)+{\mathcal{O}}(\log(\log z)/\log z)$ (for $z\to\infty$ ), to obtain

[TABLE]
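The quality of the asymptotic form $W(z)\approx\log z-\log(\log z)$ can be inspected numerically. The sketch below compares it with a direct Newton solution of $W\rme^{W}=z$ and with the quoted error scale $\log(\log z)/\log z$; the solver `lambert_w` is an illustrative helper (for $z\geq 0$ only), not part of the theory.

```python
import math

def lambert_w(z, tol=1e-14, max_iter=100):
    """Principal branch W(z) for z >= 0, via Newton iteration on w*exp(w) = z."""
    if z < 0:
        raise ValueError("this sketch only handles z >= 0")
    w = math.log1p(z)  # starting point to the right of the root
    for _ in range(max_iter):
        ew = math.exp(w)
        step = (w * ew - z) / (ew * (w + 1.0))
        w -= step
        if abs(step) <= tol * (1.0 + abs(w)):
            break
    return w

def asymptotic_error(z):
    """Return (error, scale): W(z) - [log z - log log z] and log log z / log z."""
    l1 = math.log(z)
    l2 = math.log(l1)
    return lambert_w(z) - (l1 - l2), l2 / l1

if __name__ == "__main__":
    for z in (1e5, 1e10, 1e20):
        err, scale = asymptotic_error(z)
        print(f"z={z:.0e}  W={lambert_w(z):.6f}  error={err:.6f}  scale={scale:.6f}")
```

The error decays only logarithmically in $z$, which suggests that the leading-order results derived from this expansion become accurate only slowly as $x\to\infty$.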

We can do the remaining integral over $y_{0}$ via integration by parts, giving

[TABLE]

Hence

[TABLE]

To proceed we need the leading orders of $g(x)$ . These are derived in Appendix D:

[TABLE]

Our asymptotic equation for $\Phi(x)$ thereby becomes

[TABLE]

Inspection of this equation shows that the leading orders of the solution are

[TABLE]

or

[TABLE]

This remarkably simple expression, linking the inferred and the true integrated base hazard rates $\Lambda(t)$ and $\Lambda_{0}(t)$ , predicts that the relation between the two should approach a straight line when shown in a log-log plot. It is not only confirmed by simulations for large times (for which it was derived from our theory) but is in fact found to be quite accurate for all times. This is shown in Figure 4, and forms the basis of our variational approximations below.
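A straight line in a log-log plot corresponds to a power law of the form $\Lambda(t)=k\Lambda_{0}^{\rho}(t)$. As a purely diagnostic sketch, one could extract $k$ and $\rho$ from simulation output by ordinary least squares in log-log space, as below. This fit is an assumption of the example and is distinct from the extremization used in our theory; the function name `fit_power_law` is hypothetical.

```python
import math

def fit_power_law(lambda0_vals, lambda_vals):
    """Least-squares fit of log(Lambda) = log(k) + rho * log(Lambda0).

    Both inputs must contain strictly positive numbers, evaluated at the
    same time points. Returns the pair (k, rho).
    """
    if len(lambda0_vals) != len(lambda_vals) or len(lambda0_vals) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    xs = [math.log(x) for x in lambda0_vals]
    ys = [math.log(y) for y in lambda_vals]
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    if sxx == 0.0:
        raise ValueError("Lambda0 values must not all be identical")
    rho = sxy / sxx                       # slope in the log-log plot
    k = math.exp(y_bar - rho * x_bar)     # intercept, exponentiated
    return k, rho

if __name__ == "__main__":
    # Toy data generated from an exact power law with k = 2 and rho = 1.3
    lam0 = [0.1, 0.5, 1.0, 2.0, 5.0]
    lam = [2.0 * x ** 1.3 for x in lam0]
    print(fit_power_law(lam0, lam))
```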

4.3 Variational approximation

The main complexity of the RS theory is in solving the functional order parameter equation (82). This is the motivation for investigating variational approximations for $\Lambda(t)$ . Since our equations were obtained by solving an extremization problem, variational approaches are in the present context both natural and conceptually straightforward. The simulation data in Figure 4 suggest writing the functional order parameter in the form $\Lambda(t)=k\Lambda^{\rho}_{0}(t)$ . To compute the new scalar order parameters $k$ and $\rho$ we substitute this expression for $\Lambda(t)$ into the quantity (3.2) to be extremized. As before we then put $u=\tilde{u}/\sqrt{\gamma}$ and take the limit $\gamma\to\infty$ , and find that we need to extremize the following quantity over $(\tilde{u},v,w,k,\rho)$ :

[TABLE]

in which

[TABLE]

It is now easy to derive our order parameter equations, since all contributions to partial derivatives that involve $\varphi(wy_{0}\!+\!vz,t)$ vanish, by virtue of $\varphi(wy_{0}\!+\!vz,t)$ maximising the factor between the square brackets. Extremizing (93) over $(\tilde{u},v,w)$ recovers our earlier equations (76,77,78), with $L(t)=k\tilde{u}^{2}\rme^{\tilde{u}^{2}}\Lambda^{\rho}_{0}(t)$ , as expected. Extremizing (93) over the new order parameters $k$ and $\rho$ gives:

[TABLE]

Using $W(z)\exp[W(z)]=z$ and our definition of $L(t)$ , these two equations can be rewritten as

[TABLE]

In the second equation we rewrite the term with the explicit factor $y_{0}$ , using

[TABLE]

We thus arrive at five relatively simple closed equations from which to solve $(\tilde{u},v,w,k,\rho)$ in our variational approximation. Upon substituting the definition $t(s,y_{0})=\Lambda_{0}^{\rm inv}(\rme^{-Sy_{0}}\log(1/s))$ we can simplify the argument of Lambert’s $W$ -function, which appears in all equations, further to

[TABLE]

This enables us to combine the two Gaussian integrals appearing in each order parameter equation by a single zero-average Gaussian integral, with width

[TABLE]

We finally transform the variational order parameter $k$ to $q=k\tilde{u}^{2}\rme^{\tilde{u}^{2}}$ , and evaluate $\int\!\rmd t~{}p(t)\log\Lambda_{0}(t)=\int_{0}^{\infty}\!\rmd x~{}\rme^{-x}\log x=-C_{\rm E}$ [38], which involves Euler’s constant $C_{\rm E}=0.5772156649015\ldots$ . We then obtain

[TABLE]

In the same way we can work out the value of $E(S,\lambda_{0})$ for the variational solution, and find:

[TABLE]

For $q\to 0$ we may replace $W(q\rme^{\sigma x}\log^{\rho}(1/s))\approx q\rme^{\sigma x}\log^{\rho}(1/s)$ and use the integral $\int_{0}^{1}\!\rmd s~{}\log(1/s)\log\log(1/s)=1-C_{\rm E}$ , to recover after some simple expansions the correct $\zeta\to 0$ solution: $\lim_{\zeta\to 0}v=\lim_{\zeta\to 0}\tilde{u}=0$ , $\lim_{\zeta\to 0}w=S$ , $\lim_{\zeta\to 0}\rho=\lim_{\zeta\to 0}k=1$ , and $\lim_{\zeta\to 0}E(S,\lambda_{0})=0$ .
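The two integrals involving Euler's constant that were used above can be verified numerically. With $x=\log(1/s)$ the second integral becomes $\int_{0}^{\infty}\!\rmd x~x\,\rme^{-x}\log x$, so both are of the form $\int_{0}^{\infty}\!\rmd x~x^{m}\rme^{-x}\log x$. The sketch below evaluates this after the substitution $x=\rme^{u}$, which removes the logarithmic singularity; the grid parameters and the function name are illustrative assumptions.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler's constant C_E

def integral_exp_log(m, u_min=-40.0, u_max=5.0, n=9000):
    """Approximate int_0^inf x^m exp(-x) log(x) dx.

    Uses x = exp(u), giving int du u * exp((m+1)*u - exp(u)), evaluated with
    the trapezoidal rule on [u_min, u_max] (illustrative cut-offs).
    """
    h = (u_max - u_min) / n
    total = 0.0
    for i in range(n + 1):
        u = u_min + i * h
        weight = 0.5 if i in (0, n) else 1.0  # trapezoid end-point weights
        total += weight * u * math.exp((m + 1) * u - math.exp(u))
    return total * h

if __name__ == "__main__":
    # int_0^inf e^{-x} log x dx = -C_E
    print(integral_exp_log(0), -EULER_GAMMA)
    # int_0^1 ds log(1/s) log log(1/s) = int_0^inf x e^{-x} log x dx = 1 - C_E
    print(integral_exp_log(1), 1.0 - EULER_GAMMA)
```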

We observe that our above closed variational equations (102–106) are completely independent of the true base hazard rate $\lambda_{0}(t)$ . Hence they predict that the key quantities required for overfitting correction in the Cox model (the slope of the data cloud, and the deformation parameters of the base hazard rate) are independent of the true shape of the base hazard rate.

The easiest protocol for solving our equations numerically is to regard $q$ as an independent parameter, and compute $(\zeta,v,w,\tilde{u},\rho)$ for each $q$ by iterative mapping. Upon doing so (see Figure 5), one finds that the solution always exhibits $\rho=w/S$ , within numerical accuracy limitations. We have not yet been able to confirm this analytically, as that would require proving that the solution of our equation obeys

[TABLE]

but it is for small $\zeta$ in agreement with (91) (as it should be). If $\rho=w/S$ is indeed generally true for the solution of our variational equations, it implies that $\rho$ is identical to the slope of the data clouds in Figure 2, and that the values of $(v,\rho,q)$ (hence also of the slope and the width of the data clouds in Figure 2) are not only independent of $\lambda_{0}(t)$ but also independent of $S$ . It would also allow us to obtain a more compact closed theory in terms of just three scalar order parameters, as we will show now. Upon making directly the variational ansatz $\Lambda(t)=k\Lambda^{\rho}_{0}(t)$ with $w=\rho S$ , we need to extremize

[TABLE]

in which again $\varphi(\eta,t)=\tilde{u}-\tilde{u}^{-1}W(k\tilde{u}^{2}\rme^{\tilde{u}^{2}+\eta}\Lambda^{\rho}_{0}(t))$ . Following similar manipulations as used for the first variational analysis, and with the previous short-hand $q=k\tilde{u}^{2}\rme^{\tilde{u}^{2}}$ , we find upon extremization of $\Psi(\tilde{u},v,k,\rho)$ and after elimination of $\tilde{u}$ the following three closed equations for $(v,k,\rho)$ :

[TABLE]

Upon solving the trio (110,111,112), the values of $\tilde{u}$ , $w$ and $k$ then follow via

[TABLE]

Finally we note that all our equations in this section can also be written in a form that involves only integrations over the interval $[0,1]$ , using the general identity

[TABLE]

It is instructive at this stage to test the predictions of the above simple variational equations (110,111,112) against numerical simulations of Cox regression on synthetic data. According to (41,42,43), we must expect to find in our simulations that $v=\lim_{r,N\to\infty}v(r,N)$ and $w=\lim_{r,N\to\infty}w(r,N)$ , where

[TABLE]

Here $\{\hat{\beta}_{\mu}\}$ denotes the inferred values of the (rescaled) regression parameters, and the averages $\langle\ldots\rangle_{\mathscr{D}}$ are over $r$ randomly generated data sets. The results of measuring $v(r,N)$ and $w(r,N)$ in numerical simulations are shown in Figure 6 together with the variational predictions. In spite of the modest values in our simulations of $N=200$ and the finite number of training sets over which inferred parameters are averaged in evaluating (115,116) (which one expects to generate excess variability), the agreement between the variational predictions and the simulations is seen to be surprisingly good.

5 Tests and applications

We will now test the variational RS theory (110,111,112) further against numerical simulations, focusing on the dependence on the ratio $\zeta$ of the main characteristics of the regression parameter data clouds of Figure 2 (i.e. their slope $\kappa$ and their width $\sigma$ ), and of the integrated base hazard rates as shown e.g. in Figure 3. We know (46) that the theory predicts $\kappa=\rho$ and $\sigma=v/\sqrt{p}$ (for the standard scaling convention of the Cox model [5], i.e. for $p(t|\mbox{\boldmath$ z $})=-\frac{\rmd}{\rmd t}\exp[-\exp(\mbox{\boldmath$ \beta $}\cdot\mbox{\boldmath$ z $})\Lambda(t)]$ ), and these predictions are plotted in Figure 7 as solid lines, together with the values obtained in regression simulations of the Cox model on synthetic data (markers), for $N=200$ and $N=400$ , and for two distinct choices for the true base hazard rate $\lambda_{0}(t)$ . Modulo finite size effects, which increase as we approach the phase transition point $\zeta=1$ , there is again good agreement between theory and simulations. The data confirm also the prediction of the variational theory that both $\kappa$ and $\sigma$ are independent of the true base hazard rate $\lambda_{0}(t)$ .

In Figure 8 we compare the inferred integrated base hazard rates $\hat{\Lambda}(t)$ , obtained for synthetic data with $N=400$ , with the predictions of the variational RS theory (110,111,112), for two choices of the base hazard rate. The agreement is satisfactory for times of the order of the typical event times in the data. For larger times (where the theory has to extrapolate to times where available data are at best sparse) one observes increasing deviations, with the variational theory underestimating the impact of overfitting; this is indeed consistent with (92), since the variational approximation captures only the first (leading) term of the exact expansion (92). We can in principle obtain more accurate integrated base hazard rate predictions within the current framework, but this requires that we either solve (numerically) the full RS equations (76,77,78,79), or develop a more refined variational ansatz for the function $L(t)$ .

We found in our simulations that as the ratio $\zeta=p/N$ increases, higher numerical precision is required in solving Cox’s equations. For values $N\sim 10^{2}-10^{3}$ and $\zeta>0.4$ , using conventional C-code compiled with gcc at double floating point precision (data type ‘double’) will occasionally lead to degeneracies in the equations that cause the association parameters $\hat{\mbox{\boldmath$ \beta $}}$ to be ill-defined. Upon switching to quadruple floating point precision (data type ‘long double’) these degeneracies disappeared.

The present RS theory has so far been tested only for ‘normal’ regimes for the parameter $S$ , which represents the typical width of the sum $\sum_{\mu}\beta_{\mu}^{\star}z_{\mu}/\sqrt{p}$ , and hence the typical scale of the covariate-conditioned hazard rates. It turns out that upon carrying out Cox regression for synthetic survival data with large values of $\zeta$ and very large values of $S$ , we observe ergodicity breaking: upon plotting true versus inferred association parameters, as in Figure 2, for different simulation experiments with the same parameters $N$ and $p$ , we now find multiple data clouds with distinct slopes, as opposed to a single data cloud with unique reproducible characteristics. This suggests that the relevant saddle points in the replica calculation will no longer be replica-symmetric. This phenomenology, of which examples are shown in Figure 9, can be studied in a natural way within the replica formalism, but it requires so-called RSB (replica symmetry breaking) ansätze for the overlap matrix $C$ . One anticipates that for sufficiently large values of $\zeta$ there may be a critical value of $S/\sqrt{p}$ that marks an RSB transition, i.e. the onset of non-ergodicity; the preliminary data in Figure 9 suggest that this critical value may also depend on the shape of the true base hazard rate. Computing these critical values of $S$ from the replica formalism, in terms of the parameters $\zeta$ , $S$ and $\lambda(t)$ , will be the subject of a future study.

6 Discussion

The Cox model has been by far the most popular and effective statistical tool for the analysis of time-to-event data in medicine, since its publication nearly half a century ago. However, the demands on statistical methods in 21st century medicine are changing. We can now take measurements on individual patients of unprecedented dimensionality $p$ , such as gene expressions and high-resolution imaging data, but the typical number of samples $N$ in our medical data bases has not grown in proportion. As a result, the condition for maximum likelihood (ML) multivariate regression methods (including the model of Cox) to be applicable, being $p/N\ll 1$ in order to avoid overfitting, is nowadays very often not met. Apart from a few early (and modest) simulation experiments, there appear not to have been any published studies aimed at modelling mathematically the mechanism of overfitting in Cox regression, which is a prerequisite for the development of methods to deal with the overfitting problem. When the dimensionality of the data, relative to the number of available samples, is too high to justify using the multivariate Cox model, medical statisticians and epidemiologists are presently left having to resort to poor alternatives for proper regression: they can either limit a priori the number of covariates used in regression (and thereby limit outcome prediction potential), or switch to univariate analysis (which is undesirable since we know that univariate estimates of association parameters correlate poorly with their multivariate counterparts), or work with so-called ‘risk signatures’ (which tend to involve ad-hoc definitions, and ad-hoc recipes for interpretation). Thus, expensive and potentially informative high-dimensional clinical data remain under-utilised.

Our regression simulations with synthetic survival data show clearly that the mechanism of overfitting in Cox regression is surprisingly reproducible and consistent: it always leads to a clear bias, which reports association parameter values that are more extreme than their true values, underestimates base hazard rates for short times, and over-estimates base hazard rates for large times. This consistency suggests that it must in principle be possible to model overfitting mathematically, and that (if such modelling is successful) one should be able to correct the outcomes of Cox regression systematically for the impact of overfitting. This, in turn, would allow us to do multivariate regression reliably for significantly larger ratios of the number of covariates over the number of samples, and obtain more accurate and reproducible predictions of clinical outcomes.

In this paper we have presented such a theory, which is built on the mathematical methods of statistical mechanics and inspired by Gardner’s famous analysis of binary classifiers [22]. It assumes that $N$ is large, but with $p/N$ finite, and it combines three ideas: (i) the formulation of an information-theoretic measure of overfitting in time-to-event regression, (ii) translating the calculation of this quantity into computing the ground state of a statistical mechanical system, and (iii) dealing with the heterogeneity in the problem (here: the realisation of the data set) with the replica method. Our modeling approach is generic. It is developed initially for arbitrary parametrised time-to-event regression models, but we devote most of our paper to the Cox model, in recognition of its importance and dominance in the medical statistics field. We show that by combining the above three ideas, it is possible to derive explicit macroscopic equations, exact in the asymptotic limit, with which to characterise the regression process for finite values of the ratio $p/N$ . In this paper we assume that the regression process is ergodic, and make the so-called replica symmetric (RS) ansatz for the solution of our equations; this assumption is supported by numerical simulations, provided the true association parameters are not too large.

For the Cox model, the order parameters of the RS theory contain all the relevant information required to quantify the impact of overfitting, but since one of them is a function (the inferred integrated base hazard rate), we introduced a suitable variational approximation, which resulted in a much simpler three-parameter theory. The simplified theory makes various qualitative predictions that are confirmed by regression simulations with synthetic data: that the ‘inflation’ of inferred association parameters is independent of the amplitude of the true association parameters and of the true base hazard rate, that there is a phase transition when $p/N\to 1$ , that the base hazard rate is underestimated for short times and over-estimated for large times, and that the relation between inferred and true integrated base hazard rate is for large times of the form $\log\hat{\Lambda}(t)\sim\rho\log\Lambda_{0}(t)$ , with a parameter $\rho$ that increases with the ratio $\zeta=p/N$ . The quantitative agreement between our variational theory and regression simulations with synthetic data is generally very good, modulo finite size fluctuations, including the predicted overfitting-induced bias in association parameters. The only exception is the integrated base rate at large times, where available data are sparse, and where the variational ansatz (which incorporates only the leading order time dependence) under-estimates the impact of overfitting. Upon increasing the values of $\zeta$ and $S$ , we observe new phenomenology, such as ergodicity breaking in the regression process (which requires order parameters with broken replica symmetry, or RSB). The calculation of the RSB transition line will be the subject of a subsequent paper.

The present study represents only a first step. It demonstrates that it is possible to model overfitting in Cox regression mathematically, using the replica formalism. We envisage many direct extensions, such as increasing the precision of our predictions by constructing full non-variational solutions to our RS order parameter equations (analytically or numerically), the incorporation of censoring, and the addition of MAP-type regulariser terms. More technical potential follow-up studies could investigate RSB phenomena, including the calculation of the ergodicity breaking transition line, or the impact of having covariate distributions for which the sums $\sum_{\mu}\beta_{\mu}z_{\mu}$ no longer have Gaussian statistics. Casting the net somewhat wider, and given our more general initial formulation of the theory, we expect that there will be other survival analysis models for which a similar overfitting analysis can be done.

Last but certainly not least, we would now like to explore the potential of our methodology to provide practical tools with which to correct multivariate Cox regression analyses of real time-to-event data in medicine for the impact of overfitting. Such tools could be used retrospectively, to determine objectively which past results in the medical literature that were obtained with the Cox method can be trusted, and which perhaps cannot. They should hopefully also lead to more accurate clinical outcome predictions in the future, by allowing medical statisticians to include more covariates in multivariate regression by default, without overfitting danger, and enable the construction of sample size tables for multivariate regression that allow overfitting effects to be taken into account in the design of clinical trials. The results presented in this paper suggest that in the near future, with proper overfitting corrections, reliable multivariate regression for time-to-event data at ratios of up to $p/N\approx 0.5$ or higher will be quite feasible.

Acknowledgements

We would like to thank Bryan Lutchmanen for contributing to the regression simulation studies, and Anita Grigoriadis for the data used to produce Figure 1. We are also grateful for support from Saddle Point Science, the Engineering and Physical Sciences Research Council (EPSRC), and the Medical Research Council (MRC) of the United Kingdom.

References

[1]

Hougaard P 2001 Analysis of Multivariate Survival Data (New York: Springer)

[2]

Klein JP and Moeschberger ML 2003 Survival Analysis - Techniques for Censored and Truncated Data (New York: Springer)

[3]

Ibrahim JG, Chen MH and Sinha D 2010 Bayesian Survival Analysis (New York: Springer)

[4]

Crowder M 2012 Multivariate Survival Analysis and Competing Risks (London: CRC Press)

[5]

Cox DR 1972 J. Roy. Stat. Soc. B 34 187

[6]

Witten DM and Tibshirani R 2009 J. Roy. Stat. Soc. B 71 615

[7]

Witten DM and Tibshirani R 2010 Stat. Meth. Med. Res. 19 29

[8]

Keiding N, Andersen PK and Klein JP 1997 Statistics in Medicine 16 215

[9]

Vaida F and Xu R 2000 Statistics in Medicine 19 3309

[10]

Duchateau L and Janssen P 2008 The Frailty Model (Statistics for Biology and Health) (New York: Springer)

[11]

Wienke A 2010 Frailty Models in Survival Analysis (CRC Biostatistics Series) (Boca Raton: Chapman & Hall)

[12]

Rowley M, Garmö H, Van Hemelrijck M, Wulaningsih W, Grundmark B, Zethelius B, Hammar N, Walldius G, Inoue M, Holmberg L and Coolen ACC 2017 Statistics in Medicine DOI: 10.1002/sim.7246

[13]

Grigoriadis A, Gazinska P, Pai T, Irshad S, Wu Y, Naidoo K, Millis R, Gillett CE, Tutt A, Coolen ACC and Pinder S 2017 manuscript under review

[14]

Concato J, Feinstein AR and Holford TR 1993 Annals of Internal Medicine 118 201

[15]

Babyak MA 2004 Psychosomatic Medicine 66 411

[16]

Breslow NE 1972 Discussion section of the paper [5] by DR Cox

[17]

MacKay DJC 2003 Information Theory, Inference and Learning Algorithms (Cambridge: University Press)

[18]

Coolen ACC, Kühn R and Sollich P 2005 Theory of Neural Information Processing Systems (Oxford: University Press)

[19]

Peduzzi P, Concato J, Feinstein AR and Holford T 1995 J. Clin. Epidemiol. 48 1503

[20]

Kawada T 2011 Int. J. Cardiol. 153 110

[21]

Dobbin KK and Song X 2013 Biostatistics 14 639

[22]

Gardner E 1987 Europhys. Lett. 4 481

[23]

Sherrington D and Kirkpatrick S 1975 Phys. Rev. Lett. 35 1792

[24]

Parisi G 1979 Phys. Lett. A 73 203

[25]

Mézard M, Parisi G and Virasoro M A 1987 Spin glass theory and beyond (Singapore: World Scientific)

[26]

Monasson R 1998 J. Phys. A: Math. Gen. 31 513

[27]

Van Mourik J and Coolen ACC 2001 J. Phys. A: Math. Gen. 34 L111

[28]

Nishimori H 2001 Statistical Physics of Spin Glasses and Information Processing (Oxford: University Press)

[29]

Amit DJ, Gutfreund H and Sompolinsky H 1985 Phys. Rev. A 32 1007

[30]

Rabello S, Coolen ACC, Pérez-Vicente CJ and Fraternali F 2008 J. Phys. A: Math. Theor. 41 285004

[31]

Agliari E, Annibale A, Barra A, Coolen ACC and Tantari D 2013 J. Phys. A: Math. Theor. 46 415003

[32]

Challet D, Marsili M and Zecchina R 2000 Phys. Rev. Lett. 84 1824

[33]

Marsili M and Challet D 2001 Phys. Rev. E 64 056138

[34]

Cover TM and Thomas JA 1991 Elements of Information Theory (New York: Wiley)

[35]

Corless RM, Gonnet GH, Hare DEG, Jeffrey DJ and Knuth DE 1996 Adv. Comp. Math. 5 329

[36]

Asmussen S, Jensen JL and Rojas-Nandayapa L 2015 Methodol. Comput. Appl. Prob. DOI 10.1007/s11009-014-9430-7

[37]

Derrida B 1981 Phys. Rev. B 24 2613

[38]

Gradshteyn IS and Ryzhik IM 1979 Table of Integrals, Series and Products (London: Academic Press)

Appendix A Covariate correlations in Cox regression

In the absence of censoring, the equations from which to compute the inferred base hazard rate $\hat{\lambda}(t)$ and the inferred association parameters $\hat{\mbox{\boldmath$ \beta $}}\in{\rm I\!R}^{p}$ in Cox regression are the following [5]:

[TABLE]

Let us define the average values and correlations of the covariates as $\langle\mbox{\boldmath$ z $}\rangle=\bar{\mbox{\boldmath$ z $}}$ and $\langle(z_{\mu}\!-\!\bar{z}_{\mu})(z_{\nu}\!-\!\bar{z}_{\nu})\rangle=A_{\mu\nu}$ , with $\langle f(\mbox{\boldmath$ z $})\rangle=N^{-1}\sum_{i=1}^{N}f(\mbox{\boldmath$ z $}_{i})$ . We can then simply write the original $\{\mbox{\boldmath$ z $}_{i}\}$ in terms of zero-average and uncorrelated covariate vectors $\{\tilde{\mbox{\boldmath$ z $}}_{i}\}$ , by writing $\mbox{\boldmath$ z $}_{i}=\bar{\mbox{\boldmath$ z $}}+\mbox{\boldmath$ A $}^{\frac{1}{2}}\tilde{\mbox{\boldmath$ z $}}_{i}$ . The equation for the regression parameters thereby becomes

[TABLE]

Hence $\hat{\mbox{\boldmath$ \beta $}}=\mbox{\boldmath$ A $}^{-\frac{1}{2}}\tilde{\mbox{\boldmath$ \beta $}}$ , in which $\tilde{\mbox{\boldmath$ \beta $}}$ is the regression outcome of the Cox method applied to the zero-average, uncorrelated and normalized covariates $\{\tilde{\mbox{\boldmath$ z $}}_{i}\}$ , i.e.

[TABLE]

Similarly, for the base hazard rate we find:

[TABLE]

Hence $\hat{\lambda}(t)=\tilde{\lambda}(t)\exp(-\tilde{\mbox{\boldmath$ \beta $}}\!\cdot\!\mbox{\boldmath$ A $}^{-\frac{1}{2}}\bar{\mbox{\boldmath$ z $}})$ , in which $\tilde{\lambda}(t)$ is given by Breslow’s formula (the regression outcome for the base hazard rate of the Cox method) applied once more to the zero-average, uncorrelated and normalized covariates $\{\tilde{\mbox{\boldmath$ z $}}_{i}\}$ , i.e.

[TABLE]

We conclude that for the Cox model one can always express the regression outcomes for any choice of covariate vectors in terms of the regression outcomes for zero-average, normalized and uncorrelated covariates, where $\langle z_{\mu}\rangle=0$ and $\langle z_{\mu}z_{\nu}\rangle=\delta_{\mu\nu}$ .
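The change of variables used in this appendix can be checked numerically. The following Python sketch is an illustration only, not part of the original derivation. It whitens synthetic covariates for $p=2$ via $\mbox{\boldmath$ z $}_{i}=\bar{\mbox{\boldmath$ z $}}+\mbox{\boldmath$ A $}^{\frac{1}{2}}\tilde{\mbox{\boldmath$ z $}}_{i}$ and confirms that $\hat{\mbox{\boldmath$ \beta $}}\cdot\mbox{\boldmath$ z $}_{i}$ and $\tilde{\mbox{\boldmath$ \beta $}}\cdot\tilde{\mbox{\boldmath$ z $}}_{i}$ differ only by the constant $\hat{\mbox{\boldmath$ \beta $}}\cdot\bar{\mbox{\boldmath$ z $}}$ when $\hat{\mbox{\boldmath$ \beta $}}=\mbox{\boldmath$ A $}^{-\frac{1}{2}}\tilde{\mbox{\boldmath$ \beta $}}$. The synthetic data, the value of `beta_tilde` and all helper names are assumptions made for this example. The closed-form $2\times 2$ matrix square root is used only to avoid external libraries.

```python
import math
import random

def sqrt_2x2(a):
    """Symmetric square root of a 2x2 symmetric positive definite matrix."""
    s = math.sqrt(a[0][0] * a[1][1] - a[0][1] * a[1][0])  # sqrt(det A)
    t = math.sqrt(a[0][0] + a[1][1] + 2.0 * s)            # sqrt(tr A + 2 sqrt(det A))
    return [[(a[0][0] + s) / t, a[0][1] / t],
            [a[1][0] / t, (a[1][1] + s) / t]]

def inv_2x2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]

def matvec(m, v):
    return [m[0][0] * v[0] + m[0][1] * v[1],
            m[1][0] * v[0] + m[1][1] * v[1]]

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

# Synthetic, correlated, non-centred covariates (illustrative assumption).
random.seed(0)
N = 500
z = []
for _ in range(N):
    a, b = random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)
    z.append([1.0 + 2.0 * a, -0.5 + 1.2 * a + 0.7 * b])

# Empirical averages <z> and correlations A, both with 1/N normalisation.
z_bar = [sum(zi[k] for zi in z) / N for k in range(2)]
A = [[sum((zi[m] - z_bar[m]) * (zi[n] - z_bar[n]) for zi in z) / N
      for n in range(2)] for m in range(2)]

# Whitened covariates: z_tilde_i = A^{-1/2} (z_i - z_bar).
A_half_inv = inv_2x2(sqrt_2x2(A))
z_tilde = [matvec(A_half_inv, [zi[0] - z_bar[0], zi[1] - z_bar[1]]) for zi in z]

mean_tilde = [sum(zt[k] for zt in z_tilde) / N for k in range(2)]
corr_tilde = [[sum(zt[m] * zt[n] for zt in z_tilde) / N
               for n in range(2)] for m in range(2)]

# Hypothetical regression outcome for the whitened covariates.
beta_tilde = [0.3, -0.8]
beta_hat = matvec(A_half_inv, beta_tilde)   # beta_hat = A^{-1/2} beta_tilde
offset = dot(beta_hat, z_bar)               # constant absorbed by the base hazard rate

# Largest deviation between beta_hat.z_i - offset and beta_tilde.z_tilde_i.
max_gap = max(abs(dot(beta_hat, zi) - offset - dot(beta_tilde, zt))
              for zi, zt in zip(z, z_tilde))
```

In this sketch `mean_tilde` vanishes and `corr_tilde` equals the identity matrix up to rounding, while `max_gap` is at the level of floating-point error. The constant `offset` is the term that is absorbed into the base hazard rate.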

Appendix B Derivation of the replica symmetric equations

Assuming replica symmetry to hold converts our problem into calculating

[TABLE]

To proceed we need the determinant and inverse of the $(n\!+\!1)\times(n\!+\!1)$ covariance matrix $C$ , and the determinant of the $n\times n$ matrix $\mbox{\boldmath$ C $}^{\prime}$ . Both $C$ and $\mbox{\boldmath$ C $}^{-1}$ will inherit the assumed replica-symmetric (RS) structure of the saddle-point. Hence they must have the respective forms

[TABLE]

The RS eigenvectors $x$ and eigenvalues $\mu$ of $C$ are calculated easily:

[TABLE]

It follows that

[TABLE]

We obtain the parameters $(D,d,d_{00},d_{0})$ by multiplying the two matrices in (135) and demanding that this gives the identity matrix. After some simple algebra this results in:

[TABLE]

It is now a trivial matter to calculate also the quantity $\log{\rm Det}\mbox{\boldmath$ C $}^{\prime}$ , since the RS form of $C$ implies that for $\alpha,\rho=1\ldots n$ we have $C^{\prime}_{\alpha\rho}=\delta_{\alpha\rho}(C\!-\!c)+c-(c_{0}/S)^{2}$ . It has one eigenvector $(1,\ldots,1)$ with eigenvalue $C\!-\!c\!-\!nc_{0}^{2}/S^{2}+nc$ , and an $(n\!-\!1)$ -fold degenerate eigenspace with eigenvalue $C\!-\!c$ . Hence

[TABLE]
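The eigenvalue structure of $\mbox{\boldmath$ C $}^{\prime}$ stated above can be verified numerically for integer $n$. The Python sketch below is an illustration under assumed parameter values. It builds $C^{\prime}_{\alpha\rho}=\delta_{\alpha\rho}(C-c)+c-(c_{0}/S)^{2}$, checks the uniform eigenvector and one direction orthogonal to it, and compares the log-determinant implied by these eigenvalues with a direct evaluation. The numerical values of $(n,C,c,c_{0},S)$ and the function names are hypothetical.

```python
import math

def c_prime(n, C, c, c0, S):
    """RS matrix C'_{ab} = delta_{ab} (C - c) + c - (c0/S)^2 for a, b = 1..n."""
    off = c - (c0 / S) ** 2
    return [[(C - c if a == b else 0.0) + off for b in range(n)] for a in range(n)]

def matvec(m, v):
    return [sum(x * y for x, y in zip(row, v)) for row in m]

def log_det_spd(m):
    """log Det via Gaussian elimination without pivoting.

    Assumes m is symmetric positive definite, so all pivots are positive.
    """
    a = [row[:] for row in m]
    n = len(a)
    total = 0.0
    for k in range(n):
        total += math.log(a[k][k])
        for i in range(k + 1, n):
            f = a[i][k] / a[k][k]
            for j in range(k, n):
                a[i][j] -= f * a[k][j]
    return total

# Hypothetical parameter values, chosen so that C' is positive definite.
n, C, c, c0, S = 6, 2.0, 0.5, 0.6, 1.5
M = c_prime(n, C, c, c0, S)

lam_uniform = C - c - n * c0 ** 2 / S ** 2 + n * c   # eigenvector (1,...,1)
lam_degenerate = C - c                               # (n-1)-fold degenerate

ones = [1.0] * n
diff = [1.0, -1.0] + [0.0] * (n - 2)                 # orthogonal to (1,...,1)

# Residuals of the eigenvalue equations M v = lambda v.
res_uniform = max(abs(x - lam_uniform * y)
                  for x, y in zip(matvec(M, ones), ones))
res_degenerate = max(abs(x - lam_degenerate * y)
                     for x, y in zip(matvec(M, diff), diff))

# log Det C' from the eigenvalues versus direct elimination.
log_det_formula = math.log(lam_uniform) + (n - 1) * math.log(lam_degenerate)
log_det_direct = log_det_spd(M)
```

Both residuals vanish up to rounding, and the two log-determinants agree. The replica calculation itself requires the continuation to non-integer $n\to 0$, which a finite-$n$ check of this kind cannot probe.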

Inserting these results into (B) gives, with the short-hand ${\rm D}y=(2\pi)^{-1/2}\rme^{-\frac{1}{2}y^{2}}\rmd y$ , and upon carrying out successive Taylor expansions for small $n$ :

[TABLE]

This expression takes a simpler form if we introduce the following transformation of the trio $\{C,c,c_{0}\}$ to new non-negative variables $\{u,v,w\}$ :

[TABLE]

with inverse transformation

[TABLE]

With these definitions, and upon removing terms that vanish upon differentiation by $\gamma$ , we can summarise the current state of our RS calculations for the stochastic generalization of the Cox model, in the limit of large data sets, by the following compact expression:

[TABLE]

If we transform $y\to y+(wy_{0}+vz)/u$ , we can write this result equivalently as

[TABLE]

At the relevant saddle point, the order parameter derivative of the function that is being extremized will by definition be zero, so

[TABLE]

in which the order parameters $\{u,v,w;\lambda\}$ are to be evaluated at the saddle point of

[TABLE]

Appendix C The limits $\zeta\to 0$ and $\zeta\to 1$

For $\zeta\to 0$ , the limit of no overfitting, we immediately find from (66,70) that $\tilde{u},v\to 0$ . To find also $w$ and $\lambda(t)$ we need to go to the next order in $\zeta$ , using $W(z)=z+{\mathcal{O}}(z^{2})$ . This results in

[TABLE]

It follows that $v={\mathcal{O}}(\tilde{u})$ and $\tilde{u}={\mathcal{O}}(\sqrt{\zeta})$ for $\zeta\to 0$ , and that $\lim_{\zeta\to 0}w$ and $\lim_{\zeta\to 0}\lambda(t)$ are to be solved from the following two coupled equations:

[TABLE]

After some simple rewriting and integration by parts over time, they take the alternative forms

[TABLE]

From this we immediately confirm the correct solution $\lim_{\zeta\to 0}w=S$ and $\lim_{\zeta\to 0}\lambda(t)=\lambda_{0}(t)$ , which describes perfect inference, as expected for $\zeta\to 0$ . From the pair (47,48) we also find the correct corresponding value for $\lim_{\zeta\to 0}\lim_{\gamma\to\infty}E_{\gamma}(S,\lambda_{0})$ :

[TABLE]

Next we turn to the limit $\zeta\to 1$ . Here it follows from (70) that $\tilde{u}\to\infty$ , and we need the expansion of $W(z)$ for large arguments, i.e. $W(z)=\log z-\log(\log z)+\ldots$ . With a modest amount of foresight we make the ansatz $\tilde{u}=\kappa/\sqrt{1-\zeta}+{\mathcal{O}}(\log(1/(1-\zeta)))$ and $v,w={\mathcal{O}}(\log(1/(1-\zeta)))$ for $\zeta\to 1$ . Using

[TABLE]

our $\gamma\to\infty$ order parameter equations then give

[TABLE]

Our scaling ansatz is seen to be consistent with the three scalar order parameter equations. Hence $\tilde{u}$ , $v$ and $w$ all diverge at a phase transition point $\zeta=1$ , whereas for the functional order parameter equation we find in the limit $\zeta\to 1$ :

[TABLE]

From this it follows after differentiation that $\frac{\rmd}{\rmd t}[p(t)\Lambda(t)/\lambda(t)]=0$ , and after some further manipulations one arrives at the following degenerate solution for $\Lambda(t)$ :

[TABLE]

Apparently, as one varies the ratio $\zeta$ of the number of covariates over the number of samples in the deterministic Cox model, the integrated inferred base hazard rate changes from the correct shape $\Lambda_{0}(t)$ at $\zeta=0$ to a step function at the phase transition point $\zeta=1$ , with the discontinuity at some time point $\tau$ that should follow from inspecting sub-leading orders in $1-\zeta$ . Moreover, at this transition (if not even earlier) one expects to find breaking of the assumed replica symmetry.
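The two expansions of Lambert's $W$-function used in this appendix, $W(z)=z+{\mathcal{O}}(z^{2})$ for small $z$ and $W(z)=\log z-\log(\log z)+\ldots$ for large $z$, can be checked numerically. The Python sketch below is an illustration only. The Newton solver `lambert_w` is a hypothetical helper written for this example and restricted to the principal branch with $z\geq 0$.

```python
import math

def lambert_w(z, tol=1e-14, max_iter=100):
    """Principal branch W(z) for z >= 0, solving w * exp(w) = z by Newton iteration."""
    if z < 0.0:
        raise ValueError("this sketch only covers z >= 0")
    w = math.log1p(z)  # starting point with w >= W(z) for z >= 0
    for _ in range(max_iter):
        ew = math.exp(w)
        step = (w * ew - z) / (ew * (w + 1.0))
        w -= step
        if abs(step) <= tol * (1.0 + abs(w)):
            break
    return w

# Small-argument regime: W(z) - z should be of order z^2.
small_z = [1e-2, 1e-3, 1e-4]
small_err = [abs(lambert_w(z) - z) for z in small_z]

# Large-argument regime: compare W(z) with log z - log(log z).
large_z = [1e5, 1e10, 1e20]
large_err = []
for z in large_z:
    L1 = math.log(z)
    L2 = math.log(L1)
    large_err.append((abs(lambert_w(z) - (L1 - L2)), L2 / L1))

for z, err in zip(small_z, small_err):
    print(f"z={z:.0e}  |W(z)-z|={err:.3e}  z^2={z * z:.3e}")
for z, (err, scale) in zip(large_z, large_err):
    print(f"z={z:.0e}  |W-(log z - log log z)|={err:.4f}  log log z/log z={scale:.4f}")
```

The large-argument error decays only logarithmically in $z$, at the scale $\log\log z/\log z$, so sub-leading terms remain visible even for very large arguments.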

Appendix D Asymptotic form of the event time distribution

Here we calculate the asymptotic form of the function $g(x)=\int\!{\rm D}y~{}\rme^{Sy-x\exp(Sy)}$ for $x\to\infty$ , and derive expression (88). Working out the definition gives

[TABLE]

with

[TABLE]

Differentiation shows that the function $\varphi(y,\eta)$ is minimal at $y=-W(\eta S^{2})$ , where $W(z)$ is Lambert’s $W$ -function [35]. Expansion of $\varphi(y,\eta)$ around its minimum gives:

[TABLE]

This leads to the following Gaussian approximation of the integral over $y$ :

[TABLE]

Application to $\eta=x\rme^{S^{2}}$ then gives:

[TABLE]

Finally, for $x\to\infty$ we can use $W(z)=\log z-\log\log z+{\mathcal{O}}(\log\log z/\log z)$ to obtain

[TABLE]
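The quality of a Gaussian (saddle-point) approximation of $g(x)$ can be probed numerically. The Python sketch below is a minimal illustration and does not reproduce the intermediate expressions of this appendix. It applies the standard Laplace method directly to the defining integral in the original integration variable, where the exponent $-\frac{1}{2}y^{2}+Sy-x\rme^{Sy}$ is maximal at $y^{\star}=S-W/S$ with $W=W(xS^{2}\rme^{S^{2}})$ and has curvature $1+W$ there. This gives $g(x)\approx(1+W)^{-1/2}\exp[\frac{1}{2}S^{2}-(W^{2}+2W)/2S^{2}]$. This closed form, the function names, the quadrature settings and the test values of $x$ and $S$ are all assumptions of this sketch.

```python
import math

def lambert_w(z, tol=1e-14, max_iter=100):
    """Principal branch W(z) for z >= 0, solving w * exp(w) = z by Newton iteration."""
    w = math.log1p(z)  # starting point with w >= W(z) for z >= 0
    for _ in range(max_iter):
        ew = math.exp(w)
        step = (w * ew - z) / (ew * (w + 1.0))
        w -= step
        if abs(step) <= tol * (1.0 + abs(w)):
            break
    return w

def g_numeric(x, S, y_max=12.0, steps=48000):
    """Trapezoidal estimate of g(x) = int Dy exp(S y - x exp(S y))."""
    h = 2.0 * y_max / steps
    total = 0.0
    for k in range(steps + 1):
        y = -y_max + k * h
        f = math.exp(-0.5 * y * y + S * y - x * math.exp(S * y))
        total += 0.5 * f if k in (0, steps) else f
    return total * h / math.sqrt(2.0 * math.pi)

def g_laplace(x, S):
    """Gaussian approximation of g(x) around the maximum of the exponent."""
    W = lambert_w(x * S * S * math.exp(S * S))
    exponent = 0.5 * S * S - (W * W + 2.0 * W) / (2.0 * S * S)
    return math.exp(exponent) / math.sqrt(1.0 + W)

S = 1.0  # hypothetical test value
ratios = {x: g_laplace(x, S) / g_numeric(x, S) for x in (1.0, 1e2, 1e4)}
for x, r in ratios.items():
    print(f"x={x:.0e}  laplace/numeric={r:.4f}")
```

For $x\to 0$ the approximation reduces to $\rme^{S^{2}/2}$, which is the exact value of $g(0)$. For the test values above the ratio stays close to one.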
