Replica analysis of overfitting in regression models for time-to-event data
ACC Coolen, JE Barrett, P Paga, CJ Perez-Vicente

TL;DR
This paper develops a mathematical theory using the replica method to analyze and correct overfitting in regression models for time-to-event data, addressing a critical challenge in high-dimensional survival analysis.
Contribution
It introduces a novel application of the replica method to quantify and mitigate overfitting in survival regression models, including Cox's proportional hazards model.
Findings
The theory accurately predicts overfitting effects in Cox models.
Provides practical tools for correcting overfitting in survival analysis.
Enhances understanding of overfitting in high-dimensional clinical data.
Abstract
Overfitting, which happens when the number of parameters in a model is too large compared to the number of data points available for determining these parameters, is a serious and growing problem in survival analysis. While modern medicine presents us with data of unprecedented dimensionality, these data cannot yet be used effectively for clinical outcome prediction. Standard error measures in maximum likelihood regression, such as p-values and z-scores, are blind to overfitting, and even for Cox's proportional hazards model (the main tool of medical statisticians), one finds in literature only rules of thumb on the number of samples required to avoid overfitting. In this paper we present a mathematical theory of overfitting in regression models for time-to-event data, which aims to increase our quantitative understanding of the problem and provide practical tools with which to correct…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Replica analysis of overfitting in regression models for time-to-event data
ACC Coolen*†‡, JE Barrett§, P Paga†‡*, CJ Perez-Vicente¶
Department of Mathematics, King’s College London,
The Strand, London WC2R 2LS, UK
Institute for Mathematical and Molecular Biomedicine, King’s College London,
Hodgkin Building, Guy’s Campus, London SE1 1UL, UK
Department of Primary Care and Public Health Sciences, King’s College
London, Addison House, Guy’s Campus, London SE1 1UL, UK
Departament de Física Fonamental, Universitat de Barcelona,
08028 Barcelona, Spain
[[email protected], [email protected],
[email protected], [email protected]](mailto:[email protected],%[email protected],)
Abstract
Overfitting, which happens when the number of parameters in a model is too large compared to the number of data points available for determining these parameters, is a serious and growing problem in survival analysis. While modern medicine presents us with data of unprecedented dimensionality, these data cannot yet be used effectively for clinical outcome prediction. Standard error measures in maximum likelihood regression, such as p-values and z-scores, are blind to overfitting, and even for Cox’s proportional hazards model (the main tool of medical statisticians), one finds in literature only rules of thumb on the number of samples required to avoid overfitting. In this paper we present a mathematical theory of overfitting in regression models for time-to-event data, which aims to increase our quantitative understanding of the problem and provide practical tools with which to correct regression outcomes for the impact of overfitting. It is based on the replica method, a statistical mechanical technique for the analysis of heterogeneous many-variable systems that has been used successfully for several decades in physics, biology, and computer science, but not yet in medical statistics. We develop the theory initially for arbitrary regression models for time-to-event data, and verify its predictions in detail for the popular Cox model.
pacs:
05.70.Fh, 02.50.-r
Contents
[TABLE]
1 Introduction
In the simplest possible scenario, survival analysis is concerned with data of the following form. We consider a cohort of individuals, each of whom are at risk of a specified irreversible event, such as the onset of a given disease or death. For each individual in this cohort we are given specific measurements \mbox{\boldmathz}_{i}=(z_{i1},\ldots,z_{ip}) (the covariates) which were taken at a baseline time , as well as the time at which for individual we either observed the irreversible event, or we ceased our observation without having observed the event yet (the latter case is called ‘censoring’). More complex scenarios could involve e.g. having multiple distinct risk types, such as distinct causes of death, or interval censoring, where rather than itself, one is given an interval that contains . The theory developed in this paper can be generalised without serious difficulty to include such extensions, but in the interest of transparency we will focus for now strictly on the simplest case.
[TABLE]
x\vdots$$i\!=\!1$$i\!=\!2$$\vdots$$\bulletxxxi\!=\!N$$i\!=\!N\!-\!1$$t\!=\!0
The aim of survival analysis is regression, i.e. to use our data for detecting and quantifying probabilistic patterns (if any) that relate an individual’s failure time to their covariates . Such patterns may allow us to predict individual patients’ clinical outcomes, distinguish between high-risk and low-risk patients, reveal general disease mechanisms, or design new data-driven therapeutic interventions (by changing the values of modifiable covariates). For general reviews of the considerable survival analysis literature we refer to textbooks such as [1, 2, 3, 4]111Non-medical applications of survival analysis include e.g. the study of the time to component failure in manufacturing, or of the duration of unemployment in economics.. Being able to use the extracted patterns to predict clinical outcomes for unseen patients is the only reliable test of whether our regression results represent true knowledge. Accurate prediction requires that we use as much of the available covariate information as possible, so our focus must be on multivariate regression methods.
Most multivariate survival analysis methods are based on postulating a suitable and plausible parametrisation of the covariate-conditioned event time distribution, whose parameters are estimated from the data via either the maximum likelihood protocol (ML), or (following Bayesian reasoning) via maximum a posteriori probability (MAP). The most popular parametrisation is undoubtedly the proportional hazards model of Cox [5], which uses ML inference, and assumes the event time distribution to be of the so-called proportional hazards form p(t|\mbox{\boldmathz})=-\frac{\rmd}{\rmd t}\exp[-\exp(\mbox{\boldmath\beta}\!\cdot\!\mbox{\boldmathz})\Lambda(t)]. MAP versions of [5] are the so-called ‘penalised Cox’ or ‘ridge’ regression models (with Gaussian parameter priors), see e.g. [6, 7]. More complex parametrisation proposals, such as frailty or random effects models [8, 9, 10, 11] or latent class models [12], still tend to have proportional hazards type formulae as their building blocks. In all such models the number of parameters is always larger than or equal to the number of covariates. Hence, to avoid overfitting they can be used safely only when . This limitation was harmless in the 1970s and 1980s, when many of the currently used models were devised, and where one would typically have datasets with at most. For the data of post-genome medicine, however, where we regularly have , it poses a serious problem which has for instance prevented us from using genomic covariates in rigorous multivariate regression protocols, forcing us instead to work with ‘gene signatures’.
Overfitting in survival analysis models [14, 15] can be visualized effectively by combining regression with cross-validation. For the Cox model, for instance, one can use the inferred association parameters of [5] in combination with Breslow’s [16] estimator for the base hazard rate (which is the canonical estimator for [5]), to predict whether an event will have happened by a given cutoff time, and compare the fraction of correct predictions in the training set (the data used for regression) to those in a validation set (the unseen data). When drawn as functions of the number of covariates used, the resulting curves typically exhibit the standard fingerprints of overfitting [17, 18]; see Figure 1. Simulations with synthetic data [19] showed that the optimal number of covariates in Cox regression (see arrows in Figure 1) tends to be roughly proportional to the number of samples . Given this observed phenomenology, it seems vital before doing multivariate regression to have a tool for estimating the minimum number of samples or events needed to avoid the overfitting regime. To our knowledge, there is no theory in the literature yet for predicting this number, not even for the Cox model [5]. One finds only rules of thumb – e.g. the number of failure events must exceed 10 times the number of independent covariates – and empirical bootstrapping protocols, often based on relatively small scale simulation data [19, 20, 21]. This situation is not satisfactory.
To increase our intuition for the problem, we first explore via simple simulation studies the relation between inferred and true parameters in Cox’s model [5]. The parameters of [5] are the vector \mbox{\boldmath\beta}=(\beta_{1},\ldots,\beta_{p}) of regression coefficients (where is the number of covariates), and the base hazard rate . We generated association parameters and covariates randomly from zero-average Gaussian distributions, and corresponding synthetic survival data using Cox’s model without censoring (so all samples correspond to failure events), for different base hazard rates. To understand the nature of the overfitting-induced regression errors we plotted the pairs as points in the plane, where and are the true and inferred association parameters of covariate , respectively, calculated via the recipes of [5]. This resulted in scatterplots as shown in Figure 2. Simulations were done for different values of the ratio , with multiple independent runs such that the number of points in each panel is identical. The true association parameters were drawn independently from a zero-average Gaussian distribution with for all . Perfect regression would imply finding all points to lie on the diagonal. Rather than a widening of the variance (as with finite sample size regression errors) overfitting-induced errors are somewhat surprisingly seen to manifest themselves mainly as a reproducible tilt of the data cloud, which increases with , and implies a consistent over-estimation of associations: both positive and negative will always be reported as more extreme than their true values. These observed errors in association parameters appear to be independent of the form of the true base hazard rate. Similarly, we show in Figure 3 the inferred integrated base hazard rates versus time (solid lines), together with the true values (dashed), which again shows consistent and reproducible overfitting errors. A quantitative theory of overfitting that can predict both the observed tilt and width of the data clouds of Figure 2 and the deformed inferred hazard rates of Figure 3 would enable us to correct the inferred parameters of the Cox model for overfitting, and thereby enable reliable regression up to hitherto forbidden ratios of .
There are mathematical obstacles to the development of a theory of overfitting in survival analysis, which probably explain why it has so far remained an open problem. First, unlike discriminant analysis, it is not immediately clear which error measure to study when outcomes to be predicted are event times. Second, in most survival analysis models (including Cox regression) the estimated parameters are to be solved from coupled transcendental equations, and cannot therefore be written in explicit form. Third, in the overfitting regime one will by definition find even for large that the inferred parameters depend on the realisation of the data set, while at the more macroscopic level of prediction accuracy there is no such dependence. It is thus not a priori clear which quantities to focus on in analytical studies of the regression process, and at which stage in the calculation (if any) averages over possible realisations of the data set may be performed safely.
Our present approach to the problem consists of distinct stages, each removing a specific obstacle, and this is reflected in the structure of our paper. We adapt to time-to-event regression the strategy proposed and executed several decades ago for binary classifiers in the groundbreaking paper by Gardner [22]. We first translate the problem of modelling overfitting into the calculation of a specific information-theoretic generating function, from which we can extract the information we need. Next we use Laplace’s argument to eliminate the maximisation over model parameters that comes with all ML methods, which is equivalent to writing the ground state energy of a statistical mechanical system as the zero temperature limit of the free energy. The third stage is devoted to making the resulting calculation of the generating function feasible, using the so-called replica method. This method has an impressive track record of several decades in the analysis of complex heterogeneous many-variable systems in physics [23, 24, 25, 26, 27], computer science [22, 28], biology [29, 30, 31], and economics [32, 33], and enables us to carry out analytically the average of the generating function over all possible realisations of the data set. Finally we exploit steepest descent integration for , leading to the identification of the ‘natural’ macroscopic order parameters of the problem, for which we derive closed equations within the replica symmetric (RS) ansatz. Some technical arguments are placed in appendices, to improve the flow of the paper. We develop our methods initially for generic time-to-event regression models, and then specialise to the Cox model. The final RS equations obtained for the Cox model involve a small number of scalar order parameters, from which we can compute the link between true and inferred regression parameters, and the inferred base hazard rate. The functional saddle point equation for the base hazard rate is rather nontrivial; while we can calculate the asymptotic form of its solution analytically, we limit ourselves mostly to a variational approximation, which already turns out to be quite accurate. We close with a discussion of our results, their implications and applications, and avenues for future work.
2 Overfitting in Maximum Likelihood models for survival analysis
2.1 Definitions
We assume we have simple time-to-event data of the standard type, consisting of independently drawn samples , with just one active risk and no censoring. Each sample consists of a covariate vector \mbox{\boldmathz}_{i}\in{\rm I\!R}^{p}, drawn independently from a distribution P(\mbox{\boldmathz}), and an associated time to event , drawn from P(t|\mbox{\boldmathz},\mbox{\boldmath\theta}^{\star}):
[TABLE]
Here P(t|\mbox{\boldmathz},\mbox{\boldmath\theta}^{\star}) describes a parametrised time-generating model, with unknown real-valued parameters collected in a vector \mbox{\boldmath\theta}^{\star}\in{\rm I\!R}^{q} that we seek to estimate from the data . We are not interested in estimating P(\mbox{\boldmathz}), so we take the covariate vectors \{\mbox{\boldmathz}_{1},\ldots,\mbox{\boldmathz}_{N}\} as given. The data probability for each parameter choice is
[TABLE]
We next define the empirical distribution of covariates and event times, given the observed data:
[TABLE]
This allows us to write
[TABLE]
with the conditional differential Shannon entropy of the event time distribution, and the Kullback-Leibler distance [34] between the empirical distribution \hat{P}(t|\mbox{\boldmathz},\mathscr{D}) and the parametrised form P(t|\mbox{\boldmathz},\mbox{\boldmath\theta}):
[TABLE]
The parameters estimated via the ML recipe are those that maximise P(\mathscr{D}|\mbox{\boldmath\theta}). According to (4) they minimise the Kullback-Leibler distance D(\hat{P}_{\mathscr{D}}||P_{\mbox{\boldmath\theta}}) between the empirical covariate-conditioned event time distribution and the parametrised event time distribution with parameter values :
[TABLE]
If for fixed and , the law of large numbers guarantees that \lim_{N\to\infty}\hat{P}(t|\mbox{\boldmathz},\mathscr{D})=P(t|\mbox{\boldmathz},\mbox{\boldmath\theta}^{\star}) (in a distributional sense), and hence ML regression will indeed estimate the parameters asymptotically correctly, provided the chosen paramerisation is unambiguous:
[TABLE]
In this paper, however, we focus on the regime of large datasets with high-dimensional covariate and parameter vectors where overfitting occurs, namely and . Here \hat{P}(t|\mbox{\boldmathz},\mathscr{D}) no longer converges to P(t|\mbox{\boldmathz},\mbox{\boldmath\theta}^{\star}) for in any mathematical sense, the identity (8) is therefore violated, and minimising D(\hat{P}_{\mathscr{D}}||P_{\mbox{\boldmath\theta}}) as per the ML prescription is no longer appropriate. This is the information-theoretic description of the overfitting phenomenon in survival analysis.
2.2 An information-theoretic measure of under- and overfitting
Maximum likelihood regression algorithms report those parameters for which P(t,\mbox{\boldmathz}|\mbox{\boldmath\theta}) is as similar as possible to the empirical distribution \hat{P}(t|\mbox{\boldmathz},\mathscr{D}), as opposed to the true distribution P(t|\mbox{\boldmathz},\mbox{\boldmath\theta}^{\star}) from which the data were generated. The optimal outcome of regression is for the inferred parameters to be identical to the true ones, i.e. to find {\rm argmin}_{\mbox{\boldmath\theta}}~{}D(\hat{P}_{\mathscr{D}}||P_{\mbox{\boldmath\theta}})=\mbox{\boldmath\theta}^{\star}. We therefore define
[TABLE]
This allows us to interpret the value of E(\mbox{\boldmath\theta}^{\star}\!,\mathscr{D}) as a measure of ML regression performance:
[TABLE]
Optimal regression algorithms would reduce D(\hat{P}_{\mathscr{D}}||P_{\mbox{\boldmath\theta}}) until D(\hat{P}_{\mathscr{D}}||P_{\mbox{\boldmath\theta}})=D(\hat{P}_{\mathscr{D}}||P_{\mbox{\boldmath\theta}^{\star}}) and then stop. Maximum likelihood regression will not do this; if it can reduce the Kullback-Leibler distance further it will do so, and thereby cause overfitting. For we expect E(\mbox{\boldmath\theta}^{\star}\!,\mathscr{D}) to depend on the data only via P(\mbox{\boldmathz}) and \mbox{\boldmath\theta}^{\star}, this is the fundamental assumption behind any regression. It allows us to focus on the average of E(\mbox{\boldmath\theta}^{\star}\!,\mathscr{D}) over all realisations of the data, given P(\mbox{\boldmathz}) and \mbox{\boldmath\theta}^{\star}:
[TABLE]
in which
[TABLE]
Evaluating E(\mbox{\boldmath\theta}^{\star}) analytically for is the focus of this paper. Clearly, if the relevant minimum over corresponds to the true value \mbox{\boldmath\theta}^{\star} for all , then E(\mbox{\boldmath\theta}^{\star})=0.
2.3 Analytical evaluation of the average over data sets
Working out (13) analytically for large requires first that we deal with the minimisation over . This can be done by converting the problem into the calculation of the ground state energy for a statistical mechanical system with degrees of freedom \mbox{\boldmath\theta}\in{\rm I\!R}^{q} and Hamiltonian222The rescaling with of the Hamiltonian is done in anticipation of subsequent limits. H(\mbox{\boldmath\theta})=NE(\mbox{\boldmath\theta}):
[TABLE]
For finite , the quantity E_{\gamma}(\mbox{\boldmath\theta}^{\star}) can be interpreted as the average result of a stochastic minimisation, based on carrying out gradient descent on the function -\log P(\mathscr{D}|\mbox{\boldmath\theta}), supplemented by a Gaussian white noise with variance proportional to .
The remaining obstacle is the logarithm in (16), which prevents the average over all data sets from factorising over the samples. This we handle using the so-called replica method, which is based on the identity , and to our knowledge has not yet been applied in survival analysis. In the replica method the average is carried out for integer , and the limit is taken at the end of the calculation via analytical continuation. Application to (16) leads us after some simple manipulations to a new expression in which the average over data sets does factorise over samples:
[TABLE]
The average over data sets has now been done, and we are left with a completely general explicit expression for E(\mbox{\boldmath\theta}^{\star}) in terms of the covariate statistics P(\mbox{\boldmathz}) and the assumed parametrised data generating model P(t|\mbox{\boldmathz},\mbox{\boldmath\theta}). We will now work out and study this expression for Cox’s proportional hazards model [5] with statistically independent zero-average Gaussian covariates.
2.4 Application to Cox regression
In Cox’s method [5] the model parameters are a base hazard rate (with ) and a vector \mbox{\boldmath\beta}\in{\rm I\!R}^{p} of regression coefficients. The assumed event time statistics are then of the following form:
[TABLE]
The factors only induce an irrelevant scaling factor that will make it easier to take the limit . In fact, for large it is inevitable that the typical association parameter in the Cox model will scale as , since otherwise one would not find finite nonzero event times.
For simplicity we assume that the covariates are distributed according to P(\mbox{\boldmathz})=(2\pi)^{-p/2}\exp(-\frac{1}{2}\mbox{\boldmathz}^{2}). This restriction of our analysis to uncorrelated covariates is no limitation, since for the Cox model one can always obtain, via a simple mapping, the regression results for data with correlated covariates from those obtained for uncorrelated covariates. This is demonstrated in A.
For the Cox model our general result (17) takes the following form, involving ordinary integration over -fold replicated vectors \mbox{\boldmath\beta}^{\alpha} and functional integration over -fold replicated base hazard rates :
[TABLE]
To enable efficient further analysis we define the short-hands
[TABLE]
and the -dimensional vector \mbox{\boldmathy}=(y_{0},\ldots,y_{p}). In addition we rename (\mbox{\boldmath\beta}^{\star},\lambda^{\star})=(\mbox{\boldmath\beta}^{0},\lambda^{0}), so that
[TABLE]
All are linear combinations of Gaussian random variables, so also p(\mbox{\boldmathy}|\mbox{\boldmath\beta}^{0},\ldots,\mbox{\boldmath\beta}^{n}) will be Gaussian (even for most non-Gaussian covariates this would still hold for large due to the central limit theorem), giving
[TABLE]
in which the entries of the covariance matrix \mbox{\boldmathC}[\{\mbox{\boldmath\beta}\}] are
[TABLE]
We introduce integrals over -distributions to transport variables to more convenient places, by substituting for each pair :
[TABLE]
We then obtain, after some simple manipulations,
[TABLE]
For finite , expressions such as (26) are of course not easy to use, but as with all statistical theories we will be able to progress upon assuming to be large333Note that the standard use of Cox regression away from the overfitting regime, including its formulae for confidence intervals and for p-values (which require Gaussian approximations that build on large expansions around the most probable parameter values, and assume that uncertainty in base hazard rates can be neglected), is similarly valid only when is sufficiently large.. We therefore focus on the asymptotic behaviour of (26) for , but with a fixed ratio , and will confirm a posteriori the extent to which the resulting theory describes what is observed for large but finite sample sizes.
3 Asymptotic analysis of overfitting in the Cox model
3.1 Conversion to a saddle-point problem
Following extensive experience with the replica method in other disciplines, with similar definitions, we assume that the two limits and commute. The invariance of the right-hand side of (26) under all permutations of the sample indices implies that E(\mbox{\boldmath\beta}^{0},\lambda_{0}) can depend on the true association parameters \mbox{\boldmath\beta}^{0} only via the distribution . With a modest amount of foresight we define , and obtain
[TABLE]
Writing the ratio of covariates over samples as , to be kept fixed in the limit , we may take the limit and obtain an integral that can be evaluated using steepest descent:
[TABLE]
in which the function to be extremized is
[TABLE]
Differentiation with respect to immediately gives . Moreover, for various integrals to be well-defined, the relevant saddle-point must (after contour deformation in the complex plane) be of a form where
[TABLE]
with , and where the matrix \mbox{\boldmathD}=\{D_{\alpha\rho}\} is positive definite. Thus at the relevant saddle-point we will have
[TABLE]
Variation with respect to the components gives , so
[TABLE]
This intermediate result confirms that indeed depends on the distribution only via , hence we may henceforth write the former quantity as . Variation with respect to finally gives (\mbox{\boldmathD}^{-1})_{\alpha\rho}=C_{\alpha\rho}\!-\!C_{0\alpha}C_{0\rho}/S^{2}. Hence we arrive at the following expression, in which the short-hand \mbox{\boldmathC}^{\prime} denotes the matrix with entries (for ):
[TABLE]
The extremisation over is to be done subject to , and we have removed from those terms that will vanish after taking and differentiating with respect to .
3.2 Replica symmetric extrema
The replica symmetry ansatz (RS) can be translated into the statement that the solution space of the regression algorithm is ergodic [25, 28, 18], i.e. the typical set of equivalent minima in regression parameter space is connected. Replica symmetric saddle-points of (3.1) are of the following form:
[TABLE]
In B we derive the equations corresponding to the RS ansatz for the stochastic generalization of the Cox model. With the short-hand , and upon removing terms that vanish upon differentiation by , we can summarise these equations in the limit of large data sets, by the following compact expression:
[TABLE]
in which the order parameters , which are related to the RS order parameters via
[TABLE]
are to be evaluated at the saddle point of
[TABLE]
3.3 Physical interpretation of order parameters
The physical meaning of the order parameters in the replica symmetric matrix is found in the usual manner for replica calculations [25], by direct application of our manipulations to the calculation of observables. We will write averages over the stochastic maximization of the data log-likelihood at finite , for a fixed training set , as , and averages over all data sets (as before) as . Since the relevant quantities in the theory are found asymptotically to depend on the true association vector \mbox{\boldmath\beta}^{\star} only via , there is no need for explicit averages over \mbox{\boldmath\beta}^{\star}. This results upon application to the Cox model in the following identifications, in the limit :
[TABLE]
In terms of the transformed order parameters this becomes
[TABLE]
Here is the outcome of maximum likelihood regression for data set generated with true association parameters \mbox{\boldmath\beta}^{\star}. Fully random parameter guessing would give and . Perfect regression would imply \mbox{\boldmath\beta}=\mbox{\boldmath\beta}^{\star} for all and all \mbox{\boldmath\beta}^{\star}, and hence correspond to , giving and . It is reassuring to observe that for , expression (3.2) indeed reproduces if in the right-hand side we substitute the values and .
From (40) follow useful inequalities that must hold at the relevant saddle-point in the limit , which are consistent with our claim that :
[TABLE]
The first four inequalities are easy to derive. The fifth follows from:
[TABLE]
If, as suggested by the simulation results shown in Section 1, \langle\mbox{\boldmath\beta}\rangle\approx\kappa\mbox{\boldmath\beta}^{\star}+\mbox{\boldmath\xi} for some , with a zero-average random vector that reflects data set variability, such that \langle\mbox{\boldmath\xi}\rangle_{\mathscr{D}}=\mbox{\boldmath0} and with amplitude , then we would find the RS saddle point obeying and . Hence we would find and , and we would expect for . Note that the above relations are true given our definition of the event time distribution as P(t|\mbox{\boldmathz},\mbox{\boldmath\beta},\lambda)=-\frac{\rmd}{\rmd t}\exp[-\exp(\mbox{\boldmath\beta}\cdot\mbox{\boldmathz}/\sqrt{p})\Lambda(t)]. If we were to define this distribution instead without the rescaling factor as P(t|\mbox{\boldmathz},\mbox{\boldmath\beta},\lambda)=-\frac{\rmd}{\rmd t}\exp[-\exp(\mbox{\boldmath\beta}\cdot\mbox{\boldmathz})\Lambda(t)] (which is the convention of [5]), then the connection between regression of the form \langle\mbox{\boldmath\beta}\rangle\approx\kappa\mbox{\boldmath\beta}^{\star}+\mbox{\boldmath\xi} and our order parameters would be:
[TABLE]
We conclude that from our RS equations we can extract the dependence on the covariates/samples ratio of the two main quantitative characteristics of the data clouds in Figure 2: their angle and their width .
Finally, let us turn to the interpretation of equation (3.2). We observe that this equation can be written as
[TABLE]
If we compare expression (47) with the definition of , which for the Cox model is
[TABLE]
we can infer that
[TABLE]
As a consistency test one can confirm that, as an alternative to retracing the replica derivation, the expressions (40) can also be derived explicitly from (48,50).
3.4 Derivation of RS saddle point equations
The equations from which to solve the replica symmetric order parameters are obtained by extremization of (3.2). Using , the three scalar equations are found to be
[TABLE]
Upon integrating by parts over , we can also write equation (51) as
[TABLE]
To work out the functional order parameter equation we use , and the abbreviation . This gives
[TABLE]
This latter equation can also be written in terms of the distribution (48), giving a form that reduces to Breslow’s [16] estimator when we subsequently use the interpretation identity (50):
[TABLE]
The remaining integrations over in our equations are for finite quite nontrivial. They can be expressed in terms of the Laplace transform of the lognormal distribution [36], or mapped onto the core integral in the Random Energy Model [37], both of which could in the past be evaluated analytically only in specific parameter limits.
4 Analysis of the RS equations for the Cox model
4.1 RS equations in the limit
The original Cox model [5] corresponds to the limit of our equations. It turns out that the correct scaling with of for is ; this is suggested by equation (3.4) and confirms our expectation that follows from the physical meaning of . Upon substituting as an ansatz into our equations, assuming the other order parameters to have finite limits, allows us to simplify the trio (LABEL:eq:spe_v,LABEL:eq:spe_w,3.4) and the functional equation (55) to
[TABLE]
The remaining complexities of the limit are concentrated in
[TABLE]
with
[TABLE]
After differentiation and rewriting the resulting equation, we find that can be written in explicit form in terms of the Lambert W-function [35] as:
[TABLE]
Hence
[TABLE]
Using the identity , which follows directly from the definition of the Lambert -function, we can simplify the above result to
[TABLE]
Substitution into our order parameter equations finally gives:
[TABLE]
We observe that the choice always solves (67), but that for it is ruled out by (66). Upon doing integration by parts over , using and dismissing the solution , we can simplify equation (67) further to
[TABLE]
To compute the corresponding value of the overfitting measure , we substitute into (3.2) and take the limit . This gives, using the short-hands (63) and and the identity :
[TABLE]
The second integral can be worked out explicitly:
[TABLE]
Therefore
[TABLE]
In C we study the behaviour of the above equations in the two limits and . For we recover the correct solution corresponding to perfect (overfitting-free) regression, as required. For we find a phase transition, characterised by divergence of the order parameters .
4.2 Numerical and asymptotic solution of RS equations
Solving the coupled order parameter equations (66,68,69,70) analytically seems for now too ambitious; solving them numerically is nontrivial, and requires some preparation. To cast the equation for into a form similar to the others, we need to do partial integration over :
[TABLE]
We also rewrite the functional equation in a form that involves only:
[TABLE]
Numerical integration over can be transformed into integration over the survival function , using and . We also define the short-hand . These definitions transform our RS equations to:
[TABLE]
We next study the functional equation (79) in more detail. We first rewrite it by differentiation with respect to time, and some simple rearrangements, into the more suitable form
[TABLE]
or, upon further differentiation:
[TABLE]
Using , and upon multiplying both sides by , this becomes
[TABLE]
We write in the form , which is always possible since both and are monotonic functions of time, and we write with
[TABLE]
Substitution of these conventions, and working out the various time derivatives, then leads to the following equation from which to solve :
[TABLE]
We now proceed to calculate the solution of the above equation, which gives us the form of the inferred integrated base hazard rates as shown in Figure 3, for large times, i.e. in the regime where and . Here we can use use the asymptotic form of the Lambert -function [35]: (for ), to obtain
[TABLE]
We can do the remaining integral over via integration by parts, giving
[TABLE]
Hence
[TABLE]
To proceed we need the leading orders of . These are derived in D:
[TABLE]
Our asymptotic equation for thereby becomes
[TABLE]
Inspection of this equation shows that the leading orders of the solution are
[TABLE]
or
[TABLE]
This remarkably simple expression, linking the true and the inferred integrated base hazard rates and , predicts that the relation between the two should approach a straight line when shown in a log-log plot. It is not only confirmed by simulations for large times (for which it was derived from our theory) but is in fact found to be quite accurate for all times. This is shown in Figure 4, and forms the basis of our variational approximations below.
4.3 Variational approximation
The main complexity of the RS theory is in solving the functional order parameter equation (82). This is the motivation for investigating variational approximations for . Since our equations were obtained by solving an extremization problem, variational approaches are in the present context both natural and conceptually straightforward. The simulation data in Figure 4 suggest writing the functional order parameter in the form . To compute the new scalar order parameters and we substitute this expression for into the quantity (3.2) to be extremized. As before we then put and take the limit , and find that we need to extremize the following quantity over :
[TABLE]
in which
[TABLE]
It is now easy to derive our order parameter equations, since all contributions to partial derivatives that involve vanish, by virtue of maximising the factor between the square brackets. Extremizing (93) over recovers our earlier equations (76,77,78), with , as expected. Extremizing (93) over the new order parameters and gives:
[TABLE]
Using and our definition of , these two equations can be rewritten as
[TABLE]
In the second equation we rewrite the term with the explicit factor , using
[TABLE]
We thus arrive at five relatively simple closed equations from which to solve in our variational approximation. Upon substituting the definition we can simplify the argument of Lambert’s -function, which appears in all equations, further to
[TABLE]
This enables us to combine the two Gaussian integrals appearing in each order parameter equation by a single zero-average Gaussian integral, with width
[TABLE]
We finally transform the variational order parameter to , and evaluate [38], which involves Euler’s constant . We then obtain
[TABLE]
In the same way we can work out the value of for the variational solution, and find:
[TABLE]
For we may replace and use the integral , to recover after some simple expansions the correct solution: , , , and .
We observe that our above closed variational equations (102–106) are completely independent of the true base hazard rate . Hence they predict that the key quantities required for overfitting correction in the Cox model (the slope of the data cloud, and the deformation parameters of the base hazard rate) are independent of the true shape of the base hazard rate.
The easiest protocol for solving our equations numerically is to regard as an independent parameter, and compute for each by iterative mapping. Upon doing so (see Figure 5), one finds that the solution always exhibits , within numerical accuracy limitations. We have not yet been able to confirm this analytically, as that would require proving that the solution of our equation obeys
[TABLE]
but it is for small in agreement with (91) (as it should be). If is indeed generally true for the solution of our variational equations, it implies that is identical to the slope of the data clouds in Figure 2, and that the values of (hence also of the slope and the width of the data clouds in Figure 2) are not only independent of but also independent of . It would also allow us to obtain a more compact closed theory in terms of just three scalar order parameters, as we will show now. Upon making directly the variational ansatz with , we need to extremize
[TABLE]
in which again . Following similar manipulations as used for the first variational analysis, and with the previous short-hand , we find upon extremization of and after elimination of the following three closed equations for :
[TABLE]
Upon solving the trio (110,111,112), the values of , and then follow via
[TABLE]
Finally we note that all our equations in this section can also be written in a form that involves only integrations over the interval , using the general identity
[TABLE]
It is instructive at this stage to test the predictions of the above simple variational equations (110,111,112) against numerical simulations of Cox regression on synthetic data. According to (41,42,43), we must expect to find in our simulations that and , where
[TABLE]
Here denotes the inferred values of the (rescaled) regression parameters, and the averages are over randomly generated data sets. The results of measuring and in numerical simulations are shown in Figure 6 together with the variational predictions. In spite of the modest values in our simulations of and the finite number of training sets over which inferred parameters are averaged in evaluating (115,116) (which one expects to generate excess variability), the agreement between the variational predictions and the simulations is seen to be surprisingly good.
5 Tests and applications
We will now test the variational RS theory (110,111,112) further against numerical simulations, focusing on the the dependence on the ratio of the main characteristics of the regression parameter data clouds of Figure 2 (i.e. their slope and their width ), and of the integrated base hazard rates as shown e.g. in Figure 3. We know (46) that the theory predicts and (for the standard scaling convention of the Cox model [5], i.e. for p(t|\mbox{\boldmathz})=-\frac{\rmd}{\rmd t}\exp[-\exp(\mbox{\boldmath\beta}\cdot\mbox{\boldmathz})\Lambda(y)]), and these predictions are plotted in Figure 7 as solid lines, together with the values obtained in regression simulations of the Cox model on synthetic data (markers), for and , and for two distinct choices for the true base hazard rate . Modulo finite size effects, which increase as we approach the phase transition point , there is again good agreement between theory and simulations. The data confirm also the prediction of the variational theory that both and are independent of the true base hazard rate .
In Figure 8 we compare the inferred integrated base hazard rates , obtained for synthetic data with , with the predictions of the variational RS theory (110,111,112), for two choices of the base hazard rate. The agreement is satisfactory for times of the order of the typical event times in the data. For larger times (where the theory has to extrapolate to times where available data are at best sparse) one observes increasing deviations, with the variational theory underestimating the impact of overfitting; this is indeed consistent with (92), since the variational approximation captures only the first (leading) term of the exact expansion (92). We can in principle obtain more accurate integrated base hazard rate predictions within the current framework, but this requires that we either solve (numerically) the full RS equations (76,77,78,79), or develop a more refined variational ansatz for the function .
We found in our simulations that as the ratio increases, higher numerical precision is required in solving Cox’s equations. For values and , using conventional C-code compiled with gcc at double floating point precision (data type ‘double’) will occasional lead to degeneracies in the equations that cause the association parameters \hat{\mbox{\boldmath\beta}} to be ill-defined. Upon switching to quadruple floating point precision (data type ‘long double’) these degeneracies disappeared.
The present RS theory has so far been tested only for ‘normal’ regimes for the parameter , which represents the typical width of the sum , and hence the typical scale of the covariate-conditioned hazard rates. It turns out that upon carrying out Cox regression for synthetic survival data with large values of and very large values of , we observe ergodicity breaking: upon plotting true versus inferred association parameters, as in Figure 3, for different simulation experiments with the same parameters and , we now find multiple data clouds with distinct slopes, as opposed to a single data cloud with unique reproducible characteristics. This suggest that the relevant saddle points in the replica calculation will no longer be replica-symmetric. This phenomenology, of which examples are shown in Figure 9, can be studied in a natural way within the replica formalism, but it requires so-called RSB (replica symmetry breaking) ansätze for the overlap matrix . One anticipates that for sufficiently large values of there may be a critical value of that marks an RSB transition, i.e. the onset of non-ergodicity; the preliminary data in Figure 9 suggest that this critical value may also depend on the shape of the true base hazard rate. Computing these critical values of from the replica formalism, in terms of the parameters , and , will be the subject of a future study.
6 Discussion
The Cox model has been by far the most popular and effective statistical tool for the analysis of time-to-event data in medicine, since its publication nearly half a century ago. However, the demands on statistical methods in 21st century medicine are changing. We can now take measurements on individual patients of unprecedented dimensionality , such as gene expressions and high-resolution imaging data, but the typical number of samples in our medical data bases has not grown in proportion. As a result, the condition for maximum likelihood (ML) multivariate regression methods (including the model of Cox) to be applicable, being in order to avoid overfitting, is nowadays very often not met. Apart from a few early (and modest) simulation experiments, there appear not to have been any published studies aimed at modelling mathematically the mechanism of overfitting in Cox regression, which is a prerequisite for the development of methods to deal with the overfitting problem. When the dimensionality of the data, relative to the number of available samples, is too high to justify using the multivariate Cox model, medical statisticians and epidemiologists are presently left having to resort to poor alternatives for proper regression: they can either limit a priori the number of covariates used in regression (and thereby limit outcome prediction potential), or switch to univariate analysis (which is undesirable since we know that univariate estimates of association parameters correlate poorly with their multivariate counterparts), or work with so-called ‘risk signatures’ (which tend to involve ad-hoc definitions, and ad-hoc recipes for interpretation). Thus, expensive and potentially informative high-dimensional clinical data remain under-utilised.
Our regression simulations with synthetic survival data show clearly that the mechanism of overfitting in Cox regression is surprisingly reproducible and consistent: it always leads to a clear bias, which reports association parameter values that are more extreme than their true values, underestimates base hazard rates for short times, and over-estimates base hazard rates for large times. This consistency suggests that it must in principle be possible to model overfitting mathematically, and that (if such modelling is successful) one should be able to correct the outcomes of Cox regression systematically for the impact of overfitting. This, in turn, would allow us to do multivariate regression reliably for significantly larger ratios of the number of covariates over the number of samples, and obtain more accurate and reproducible predictions of clinical outcomes.
In this paper we have presented such a theory, which is built on the mathematical methods of statistical mechanics and inspired by Gardner’s famous analysis of binary classifiers [22]. It assumes that is large, but with finite, and it combines three ideas: (i) the formulation of an information-theoretic measure of overfitting in time-to-event regression, (ii) translating the calculation of this quantity into computing the ground state of a statistical mechanical system, and (iii) dealing with the heterogeneity in the problem (here: the realisation of the data set) with the replica method. Our modeling approach is generic. It is developed initially for arbitrary parametrised time-to-event regression models, but we devote most of our paper to the Cox model, in recognition of its importance and dominance in the medical statistics field. We show that by combining the above three ideas, it is possible to derive explicit macroscopic equations, exact in the asymptotic limit, with which to characterise the regression process for finite values of the ratio . In this paper we assume that the regression process is ergodic, and make the so-called replica symmetric (RS) ansatz for the solution of our equations; this assumption is supported by numerical simulations, provided the true association parameters are not too large.
For the Cox model, the order parameters of the RS theory contain all the relevant information required to quantify the impact of overfitting, but since one of them is a function (the inferred integrated base hazard rate), we introduced a suitable variational approximation, which resulted in a much simpler three-parameter theory. The simplified theory makes various qualitative predictions that are confirmed by regression simulations with synthetic data: that the ‘inflation’ of inferred association parameters is independent of the amplitude of the true association parameters and of the true base hazard rate, that there is a phase transition when , that the base hazard rate is underestimated for short times and over-estimated for large times, and that the relation between inferred and true integrated base hazard rate is for large times of the form , with a parameter that increases with the ratio . The quantitative agreement between our variational theory and regression simulations with synthetic data is generally very good, modulo finite size fluctuations, including the predicted overfitting-induced bias in association parameters. The only exception is the integrated base rate at large times, where available data are sparse, and where the variational ansatz (which incorporates only the leading order time dependence) under-estimates the impact of overfitting. Upon increasing the values of and , we observe new phenomenology, such as ergodicity breaking in the regression process (which requires order parameters with broken replica symmetry, or RSB). The calculation of the RSB transition line will be the subject of a subsequent paper.
The present study represents only a first step. It demonstrates that it is possible to model overfitting in Cox regression mathematically, using the replica formalism. We envisage many direct extensions, such as increasing the precision of our predictions by constructing full non-variational solutions to our RS order parameter equations (analytically or numerically), the incorporation of censoring, and the addition of MAP-type regulariser terms. More technical potential follow-up studies could investigate RSB phenomena, including the calculation of the ergodicity breaking transition line, or the impact of having covariate distributions for which the sums no longer have Gaussian statistics. Casting the net somewhat wider, and given our more general initial formulation of the theory, we expect that there will be other survival analysis models for which a similar overfitting analysis can be done.
Last but certainly not least, we would now like to explore the potential of our methodology to provide practical tools with which to correct multivariate Cox regression analyses of real time-to-event data in medicine for the impact of overfitting. Such tools could be used retrospectively, to determine objectively which past results in the medical literature that were obtained with the Cox method can be trusted, and which perhaps cannot. They should hopefully also lead to more accurate clinical outcome predictions in the future, by allowing medical statisticians to include more covariates in multivariate regression by default, without overfitting danger, and enable the construction of sample size tables for multivariate regression that allow overfitting effects to be taken into account in the design of clinical trials. The results presented in this paper suggest that in the near future, with proper overfitting corrections, reliable multivariate regression for time-to-event data at ratios of up to or higher will be quite feasible.
Acknowledgements
We would like to thank Bryan Lutchmanen for contributing to the regression simulation studies, and Anita Grigoriadis for the data used to produce Figure 1. We are also grateful for support from Saddle Point Science, the Engineering and Physical Sciences Research Council (EPSRC), and the Medical Research Council (MRC) of the United Kingdom.
References
- [1]
Hougaard P 2001 Analysis of Multivariate Survival Data (New York: Springer)
- [2]
Klein JP and Moeschberger ML 2003 Survival Analysis - Techniques for Censored and Truncated Data (New York: Springer)
- [3]
Ibrahim JG, Chen MH and Sinha D 2010 Bayesian Survival Analysis (New York: Springer)
- [4]
Crowder M 2012 Multivariate Survival Analysis and Competing Risks (London: CRC Press)
- [5]
Cox DR 1972 J. Roy. Stat. Soc. B 34 187
- [6]
Witten DM and Tibshirani R 2009 J. Roy. Stat. Soc. B 71 615
- [7]
Witten DM and Tibshirani R 2010 Stat. Meth. Med. Res. 19 29
- [8]
Keiding N, Andersen PK and Klein JP 1997 Statistics in Medicine 16 215
- [9]
Vaida F and Xu R 2000 Statistics in Medicine 19 3309
- [10]
Duchateau L and Jansen P 2008 The Frailty Model (Statistics for Biology and Health) (New York: Springer)
- [11]
Wienke A 2010 Frailty Models in Survival Analysis (CRC Biostatistics Series) (Boca Raton: Chapman & Hall)
- [12]
Rowley M, Garmö H, Van Hemelrijck M, Wulaningsih W, Grundmark B, Zethelius B, Hammar N, Walldius G, Inoue M, Holmberg L and Coolen ACC 2017 Statistics in Medicine DOI: 10.1002/sim.7246
- [13]
Grigoriadis A, Gazinzwa P, Pai T, Irshad S, Wu Y, Naidoo K, Millis R, Gillett CE, Tutt A, Coolen ACC and Pinder S 2017 manuscript under review
- [14]
Concato J, Feinstein AR and Holford TR 1993 Annals of Internal Medicine 118 201
- [15]
Babyak MA 2004 Psychosomatic Medicine 66 411
- [16]
Breslow NE 1972 Discussion section of the paper [5] by DR Cox
- [17]
MacKay DJC 2003 Information Theory, Inference and Learning Algorithms (Cambridge: University Press)
- [18]
Coolen ACC, Kühn R and Sollich P 2005 Theory of Neural Information Processing Systems (Oxford: University Press)
- [19]
Peduzzi P, Concato J, Feinstein AR and Holford T 1995 J. Clin. Epidemiol. 48 1503
- [20]
Kawada T 2011 Int. J. Cardiol. 153 110
- [21]
Dobbin KK and Song X 2013 Biostatistics 14 639
- [22]
Gardner E 1987 Europhys. Lett. 4 481
- [23]
Sherrington D and Kirkpatrick S 1975 Phys. Rev. Lett. 35 1792
- [24]
Parisi G 1979 Phys. Lett. A 73 203
- [25]
Mézard M, Parisi G and Virasoro M A 1987 Spin glass theory and beyond (Singapore: World Scientific)
- [26]
Monasson R 1998 J. Phys. A: Math. Gen. 31 513
- [27]
Van Mourik J and Coolen ACC 2001 J. Phys. A: Math. Gen. 34 L111
- [28]
Nishimori H 2001 Statistical Physics of Spin Glasses and Information Processing (Oxford: University Press)
- [29]
Amit DJ, Gutfreund H and Sompolinsky H 1985 Phys. Rev. A 32 1007
- [30]
Rabello S, Coolen ACC, Pérez-Vicente CJ and Fraternali F 2008 J. Phys. A: Math. Theor. 41 285004
- [31]
Agliari E, Annibale A, Barra A, Coolen ACC and Tantari D 2013 J. Phys. A: : Math. Theor. 46 415003
- [32]
Challet D, Marsili M and Zecchina R 2000 Phys. Rev. Lett. 84 1824
- [33]
Marsili M and Challet D 2001 Phys. Rev. E 64 056138
- [34]
Cover TM and Thomas JA 1991 Elements of Information Theory (New York: Wiley)
- [35]
Corless RM, Gonnet GH, Hare DEG, Jeffrey DJ and Knuth DE 1996 Adv. Comp. Math. 5 329
- [36]
Asmussen S, Jensen JL and Rojas-Nandayapa L 2015 Methodol. Comput. Appl. Prob. DOI 10.1007/s11009-014-9430-7
- [37]
Derrida B 1981 Phys. Rev. B 24 2613
- [38]
Gradshteyn IS and Rhyzik IM 1979 Table of Integrals, Series and Products (London: Academic Press)
Appendix A Covariate correlations in Cox regression
In the absence of censoring, the equations from which to compute the inferred base hazard rate and the inferred association parameters \hat{\mbox{\boldmath\beta}}\in{\rm I\!R}^{p} in Cox regression are the following [5]:
[TABLE]
Let us define the average values and correlations of the covariates as \langle\mbox{\boldmathz}\rangle=\bar{\mbox{\boldmathz}} and , with \langle f(\mbox{\boldmathz})\rangle=N^{-1}\sum_{i=1}^{N}f(\mbox{\boldmathz}_{i}). We can then simply write the original \{\mbox{\boldmathz}_{i}\} in terms of zero-average and uncorrelated covariate vectors \{\tilde{\mbox{\boldmathz}}_{i}\}, by writing \mbox{\boldmathz}_{i}=\bar{\mbox{\boldmathz}}+\mbox{\boldmathA}^{\frac{1}{2}}\tilde{\mbox{\boldmathz}}_{i}. The equation for the regression parameters thereby becomes
[TABLE]
Hence \hat{\mbox{\boldmath\beta}}=\mbox{\boldmathA}^{-\frac{1}{2}}\tilde{\mbox{\boldmath\beta}}, in which \tilde{\mbox{\boldmath\beta}} is the regression outcome of the Cox method applied to the zero-average, uncorrelated and normalized covariates \{\tilde{\mbox{\boldmathz}}_{i}\}, i.e.
[TABLE]
Similarly, for the base hazard rate we find:
[TABLE]
Hence \hat{\lambda}(t)=\tilde{\lambda}(t)\exp(-\tilde{\mbox{\boldmath\beta}}\!\cdot\!\mbox{\boldmathA}^{-\frac{1}{2}}\bar{\mbox{\boldmathz}}), in which is given by Breslow’s formula (the regression outcome for the base hazard rate of the Cox method) applied once more to the zero-average uncorrelated and normalised covariates \{\tilde{\mbox{\boldmathz}}_{i}\}, i.e.
[TABLE]
We conclude that for the Cox model one can always express the regression outcomes for any choice of covariate vectors in terms of the regression outcomes for zero-average, normalized and uncorrelated covariates, where and .
Appendix B Deriviation of the replica symmetric equations
Assuming replica symmetry to hold converts our problem into calculating
[TABLE]
To proceed we need the determinant and inverse of the covariance matrix , and the determinant of the matrix \mbox{\boldmathC}^{\prime}. Both and \mbox{\boldmathC}^{-1} will inherit the assumed replica-symmetric (RS) structure of the saddle-point. Hence they must have the respective forms
[TABLE]
The RS eigenvectors and eigenvalues of are calculated easily:
[TABLE]
It follows that
[TABLE]
We obtain the parameters by multiplying the two matrices in (135) and demanding that this gives the identity matrix. After some simple algebra this results in:
[TABLE]
It is now a trivial matter to calculate also the quantity \log{\rm Det}\mbox{\boldmathC}^{\prime}, since the RS form of implies that for we have . It has one eigenvector with eigenvalue , and an -fold degenerate eigenspace with eigenvalue . Hence
[TABLE]
Inserting these results into (B) gives, with the short-hand , and upon carrying out successive Taylor expansions for small :
[TABLE]
This expression takes a simpler form if we introduce the following transformation of the trio to new non-negative variables :
[TABLE]
with inverse transformation
[TABLE]
With these definitions, and upon removing terms that vanish upon differentiation by , we can summarise the current state of our RS calculations for the stochastic generalization of the Cox model, in the limit of large data sets, by the following compact expression:
[TABLE]
If we transform , we can write this result equivalently as
[TABLE]
At the relevant saddle point, the order parameter derivative of the function that is being extremized will by definition be zero, so
[TABLE]
in which the order parameters are to be evaluated at the saddle point of
[TABLE]
Appendix C The limits and
For , the limit of no overfitting, we immediately find from (66,70) that . To find also and we need to go to the next order in , using . This results in
[TABLE]
It follows that and for , and that and are to be solved from the following two coupled equations:
[TABLE]
After some simple rewriting and integration by parts over time, they take the alternative forms
[TABLE]
From this we immediately confirm the correct solution and , which describes perfect inference, as expected for . From the pair (47,48) we also find the correct corresponding value for :
[TABLE]
Next we turn to the limit . Here it follows from (70) that , and we need the expansion of for large arguments, i.e. . With a modest amount of foresight we make the ansatz and for . Using
[TABLE]
our order parameter equations then give
[TABLE]
Our scaling ansatz is seen to be consistent with the three scalar order parameter equations. Hence , and all diverge at a phase transition point , whereas for the functional order parameter equation we find in the limit :
[TABLE]
From this it follows after differentiation that , and after some further manipulations one arrives at the following degenerate solution for :
[TABLE]
Apparently, as one varies the ratio of the number of covariates over the number of samples in the deterministic Cox model, the integrated inferred base hazard rate changes from the correct shape at to a step function at the phase transition point , with the discontinuity at some time point that should follow from inspecting sub-leading orders in . Moreover, at this transition (if not even earlier) one expects to find breaking of the assumed replica symmetry.
Appendix D Asymptotic form of the event time distribution
Here we calculate the asymptotic form of the function for , and derive expression (88). Working out the definition gives
[TABLE]
with
[TABLE]
Differentiation shows that the function is mimimal at , where is Lambert’s -function [35]. Expansion of around its minimum gives:
[TABLE]
This leads to the following Gaussian approximation of the integral over :
[TABLE]
Application to then gives:
[TABLE]
Finally, for we can use to obtain
[TABLE]
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] Hougaard P 2001 Analysis of Multivariate Survival Data (New York: Springer)
- 2[2] Klein JP and Moeschberger ML 2003 Survival Analysis - Techniques for Censored and Truncated Data (New York: Springer)
- 3[3] Ibrahim JG, Chen MH and Sinha D 2010 Bayesian Survival Analysis (New York: Springer)
- 4[4] Crowder M 2012 Multivariate Survival Analysis and Competing Risks (London: CRC Press)
- 5[5] Cox DR 1972 J. Roy. Stat. Soc. B 34 187
- 6[6] Witten DM and Tibshirani R 2009 J. Roy. Stat. Soc. B 71 615
- 7[7] Witten DM and Tibshirani R 2010 Stat. Meth. Med. Res. 19 29
- 8[8] Keiding N, Andersen PK and Klein JP 1997 Statistics in Medicine 16 215
