Boosted nonparametric hazards with time-dependent covariates

Donald K.K. Lee; Ningyuan Chen; Hemant Ishwaran

arXiv:1701.07926·stat.ML·October 7, 2021

TL;DR

This paper introduces a gradient boosting method for nonparametric hazard estimation with time-dependent covariates, providing theoretical guarantees and practical implementation insights.

Contribution

It develops a convex representation of the nonparametric likelihood and a generic boosting algorithm, including an implementation with regression trees, with consistency and regularization analysis.

Findings

1. The estimator is consistent under correct model specification.
2. An oracle inequality is established for tree-based models.
3. Step-size restriction prevents convergence issues due to risk curvature.

Abstract

Given functional data from a survival process with time-dependent covariates, we derive a smooth convex representation for its nonparametric log-likelihood functional and obtain its functional gradient. From this, we devise a generic gradient boosting procedure for estimating the hazard function nonparametrically. An illustrative implementation of the procedure using regression trees is described to show how to recover the unknown hazard. The generic estimator is consistent if the model is correctly specified; alternatively, an oracle inequality can be demonstrated for tree-based models. To avoid overfitting, boosting employs several regularization devices. One of them is step-size restriction, but the rationale for this is somewhat mysterious from the viewpoint of consistency. Our work brings some clarity to this issue by revealing that step-size restriction is a mechanism for…

Tables

Table 1: Relative importance of variables in the boosted nonparametric estimator. The numbers are scaled so that the largest value in each row is 1.

$a$	Time	Age	ESI	Census	All other variables
0	1	0.21	0.025	0.0011	<0.0010
1	1	0.22	0.013	0.46	<0.0003
2	0.34	0.064	0.0020	1	0
3	0.11	0.011	<0.0001	1	0

Table 2: Comparative performances (%MSE) as the service rate (38) becomes increasingly dependent on the time-varying ward census variable (by increasing $a$).

$a$	blackboost (set to true log-normal distribution)	Transformation forest (set to true log-normal distribution)	Boosted hazards ($\varepsilon$ fixed for all iterations)	Ad-hoc (# splits fixed for all iterations)
0	5.0%	5.0%	7.8%	7.1%
1	17%	6.1%	4.5%	8.1%
2	46%	9.7%	5.4%	7.0%
3	67%	18%	7.2%	7.4%

Equations

$$\lambda(t,X(t))\,Y(t)\,dt.$$

$$\{1-\lambda(0,X_{i}(0))Y_{i}(0)dt\}\times\{1-\lambda(dt,X_{i}(dt))Y_{i}(dt)dt\}\times\cdots\times\lambda(T_{i},X_{i}(T_{i}))^{\Delta_{i}}\xrightarrow[dt\downarrow 0]{}e^{-\int_{0}^{1}Y_{i}(t)\lambda(t,X_{i}(t))dt}\,\lambda(T_{i},X_{i}(T_{i}))^{\Delta_{i}},$$

$$F(t,x)=\log\lambda(t,x),$$

$$R_{n}(F)=\frac{1}{n}\sum_{i=1}^{n}\int_{0}^{1}Y_{i}(t)e^{F(t,X_{i}(t))}dt-\frac{1}{n}\sum_{i=1}^{n}\Delta_{i}F(T_{i},X_{i}(T_{i})),$$

$$\frac{d}{d\theta}R_{n}(F+\theta f)\Big|_{\theta=0}=\frac{1}{n}\sum_{i=1}^{n}\int_{0}^{1}Y_{i}(t)e^{F(t,X_{i}(t))}f(t,X_{i}(t))dt-\frac{1}{n}\sum_{i=1}^{n}\Delta_{i}f(T_{i},X_{i}(T_{i})),$$

$$\langle g,f\rangle_{\dagger}$$

$$\langle g,f\rangle_{\ddagger}$$

$$\tilde{F}\leftarrow\tilde{F}-\nu\hat{f},\qquad\hat{f}=\mathop{\rm argmin}_{f\in\tilde{\mathscr{F}}}\bigg\|\frac{\partial L}{\partial\tilde{F}}-f\bigg\|_{2}.$$

$$\mu(B)$$

$$\mu_{n}(B)$$

$$\int f\,d\mu$$

$$\int f\,d\mu_{n}$$

$$\|f\|_{\mu_{n},1}$$

$$\|f\|_{\mu_{n},2}$$

$$\|f\|_{\infty}$$

$$\langle f_{1},f_{2}\rangle_{\mu_{n}}$$

$$\mathscr{F}=\left\{\sum_{j=1}^{d}c_{j}\phi_{j}:c_{j}\in\mathbb{R}\right\}.$$

$$B_{j}=\left\{(t,x)\in[0,1]\times\mathscr{X}\;:\;\begin{array}{c}t^{(j_{0})}<t\leq t^{(j_{0}+1)}\\ x^{(1,j_{1})}<x^{(1)}\leq x^{(1,j_{1}+1)}\\ \vdots\\ x^{(p,j_{p})}<x^{(p)}\leq x^{(p,j_{p}+1)}\end{array}\right\}.$$

$$(\mathscr{F},\langle\cdot,\cdot\rangle_{\mu_{n}}).$$

$$\{\varphi_{nj}(t,x)\}_{j}=\left\{\frac{I_{B_{j}}(t,x)}{\mu_{n}(B_{j})^{1/2}}:\mu_{n}(B_{j})>0\right\},$$

$$R_{n}(F)=\int(e^{F}-\lambda_{n}F)\,d\mu_{n},$$

$$\lambda_{n}(t,x)=\frac{1}{n}\sum_{j}\left\{\sum_{i=1}^{n}\Delta_{i}\varphi_{nj}(T_{i},X_{i}(T_{i}))\right\}\varphi_{nj}(t,x).$$

$$R_{n}(F+f)=R_{n}(F)+\langle g_{F},f\rangle_{\mu_{n}}+\frac{1}{2}\int e^{F+\rho f}f^{2}\,d\mu_{n}$$

$$g_{F}(t,x)=\sum_{j}\langle e^{F},\varphi_{nj}\rangle_{\mu_{n}}\varphi_{nj}(t,x)-\lambda_{n}(t,x)$$

$$\lambda_{n}(t,x)$$

$$R_{n}(F)$$

$$g_{F}(t,x)$$

$$\mathrm{Fail}_{j}=\sum_{i}\Delta_{i}I[\{T_{i},X_{i}(T_{i})\}\in B_{j}]$$

$$R_{n}(F)=\int e^{F}d\mu_{n}-\frac{1}{n}\sum_{i=1}^{n}\Delta_{i}F(T_{i},X_{i}(T_{i})).$$

$$\int\lambda_{n}F\,d\mu_{n}$$

$$\frac{d}{d\theta}R_{n}(F+\theta f)$$

$$\frac{d^{2}}{d\theta^{2}}R_{n}(F+\theta f)$$

$$R(F)=\mathbb{E}\{R_{n}(F)\}=\int(e^{F}-\lambda F)\,d\mu.$$

$$\frac{1}{2}R(F)\geq\frac{\Lambda_{L}}{\alpha_{\mathscr{F}}}\|F\|_{\infty}+\Lambda_{U}\min\{0,1-\log(2\Lambda_{U})\},$$

Full text

Donald K.K. Lee (Correspondence: [email protected]. Supported by a hyperplane), Ningyuan Chen (supported by the HKUST start-up fund R9382), Hemant Ishwaran (supported by the NIH grant R01 GM125072)

Emory University, University of Toronto, University of Miami

Preprint of Annals of Statistics 49:4:2101-2128 (2021)

Given functional data from a survival process with time-dependent covariates, we derive a smooth convex representation for its nonparametric log-likelihood functional and obtain its functional gradient. From this we devise a generic gradient boosting procedure for estimating the hazard function nonparametrically. An illustrative implementation of the procedure using regression trees is described to show how to recover the unknown hazard. The generic estimator is consistent if the model is correctly specified; alternatively an oracle inequality can be demonstrated for tree-based models. To avoid overfitting, boosting employs several regularization devices. One of them is step-size restriction, but the rationale for this is somewhat mysterious from the viewpoint of consistency. Our work brings some clarity to this issue by revealing that step-size restriction is a mechanism for preventing the curvature of the risk from derailing convergence.

MSC 2010 subject classifications. Primary 62N02; Secondary 62G05, 90B22.

Keywords. survival analysis, gradient boosting, functional data, step-size shrinkage, regression trees, likelihood functional.

1 Introduction

Flexible hazard models involving time-dependent covariates are indispensable tools for studying systems that track covariates over time. In medicine, electronic health records systems make it possible to log patient vitals throughout the day, and these measurements can be used to build real-time warning systems for adverse outcomes such as cancer mortality [2]. In financial technology, lenders track obligors’ behaviours over time to assess and revise default rate estimates. Such models are also used in many other fields of scientific inquiry since they form the building blocks for transitions within a Markovian state model. Indeed, this work was partly motivated by our study of patient transitions in emergency department queues and in organ transplant waitlist queues [20]. For example, allocation for a donor heart in the U.S. is defined in terms of coarse tiers [23], and transplant candidates are assigned to tiers based on their health status at the time of listing. However, a patient’s condition may change rapidly while awaiting a heart, and this time-dependent information may be the most predictive of mortality and not the static covariates collected far in the past.

The main contribution of this paper is to introduce a fully nonparametric boosting procedure for hazard estimation with time-dependent covariates. We describe a generic gradient boosting procedure for boosting arbitrary base learners for this setting. Generally speaking, gradient boosting adopts the view of boosting as an iterative gradient descent algorithm for minimizing a loss functional over a target function space. Early work includes Breiman [6, 7, 8] and Mason et al. [21, 22]. A unified treatment was provided by Friedman [13], who coined the term “gradient boosting” which is now generally taken to be the modern interpretation of boosting.

Most of the existing boosting approaches for survival data focus on time-static covariates and involve boosting the Cox proportional hazards model. Examples include the popular R-packages mboost (Bühlmann and Hothorn [10]) and gbm (Ridgeway [26]) which apply gradient boosting to the Cox partial likelihood loss. Related work includes the penalized Cox partial likelihood approach of Binder and Schumacher [4]. Other important approaches, but not based on the Cox model, include $L_{2}$ Boosting [11] with inverse probability of censoring weighting (IPCW) [17], boosted transformation models of parametric families [15], and boosted accelerated failure time models [18, 27].

While there are many boosting methods for dealing with time-static covariates, the literature is far more sparse for the case of time-dependent covariates. In fact, to our knowledge there is no general nonparametric approach for dealing with this setting. This is because in order to implement a fully nonparametric estimator, one has to contend with the issue of identifying the gradient, which turns out to be a non-trivial problem due to the functional nature of the data. This is unlike most standard applications of gradient boosting where the gradient can easily be identified and calculated.

*1.1. Time-dependent covariate framework.* To explain why this is so challenging, we start by formally defining the survival problem with time-dependent covariates. Our description follows the framework of Aalen [1]. Let $T$ denote the potentially unobserved failure time. Conditional on the history up to time $t-$, the probability of failing at $T\in[t,t+dt)$ equals

$$\lambda(t,X(t))\,Y(t)\,dt. \tag{1}$$

Here $\lambda(t,x)$ denotes the unknown hazard function, $X(t)\in{\mathscr{X}}\subseteq\mathbb{R}^{p}$ is a predictable covariate process, and $Y(t)\in\{0,1\}$ is a predictable indicator of whether the subject is at risk at time $t$ (footnote 1: The filtration of interest is $\sigma\{X(s),Y(s),I(T\leq s):s\leq t\}$. If $X(t)$ is only observable when $Y(t)=1$, we can set $X(t)=x^{c}\notin{\mathscr{X}}$ whenever $Y(t)=0$.). To simplify notation, without loss of generality we normalize the units of time so that $Y(t)=0$ for $t>1$ (footnote 2: Since the data is always observed up to some finite time, there is no information loss from censoring at that point. For example, if $T^{\prime}$ is the failure time in minutes and the longest duration in the data is $\tau^{\prime}=60$ minutes, the failure time in hours, $T$, is at most $\tau=1$ hour. The hazard function on the minute timescale, $\lambda_{T^{\prime}}(t^{\prime},X(t^{\prime}))$, can be recovered from the hazard function on the hourly timescale, $\lambda_{T}(t,X(t))$, via $\lambda_{T^{\prime}}(t^{\prime},X(t^{\prime}))=\frac{1}{\tau^{\prime}}\lambda_{T}(\frac{t^{\prime}}{\tau^{\prime}},X(\frac{t^{\prime}}{\tau^{\prime}}))$.). In other words, the subject is not at risk after time $t=1$, so we can restrict attention to the time interval $(0,1]$.
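The time normalization in the footnote can be sketched in a few lines. The function name `rescale_hazard` and the example hazard are hypothetical, chosen only to illustrate the conversion $\lambda_{T^{\prime}}(t^{\prime},x)=\frac{1}{\tau^{\prime}}\lambda_{T}(t^{\prime}/\tau^{\prime},x)$; the covariate is passed as its value at the evaluation time.

```python
def rescale_hazard(hazard_unit, tau):
    """Map a hazard on the unit timescale back to the original timescale.

    hazard_unit(t, x): hazard of T = T'/tau at rescaled time t in (0, 1].
    tau: longest observed duration in original units (e.g. 60 minutes).
    """
    def hazard_original(t_prime, x):
        # lambda_{T'}(t', x) = (1 / tau) * lambda_T(t' / tau, x)
        return hazard_unit(t_prime / tau, x) / tau
    return hazard_original


# Hypothetical example: 2 failures per hour, doubled when the covariate is positive.
hazard_per_hour = lambda t, x: 2.0 * (2.0 if x > 0 else 1.0)
hazard_per_minute = rescale_hazard(hazard_per_hour, tau=60.0)
```

Dividing by `tau` keeps the cumulative hazard over the observation window unchanged, which is why no information is lost by working on $(0,1]$.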

If failure is observed at $T\in(0,1]$ then the indicator $\Delta=Y(T)$ equals 1; otherwise $\Delta=0$ and we set $T$ to an arbitrary number larger than 1, e.g. $T=\infty$ . Throughout we assume we observe $n$ independent and identically distributed functional data samples $\{(X_{i}(\cdot),Y_{i}(\cdot),T_{i})\}_{i=1}^{n}$ . The evolution of observation $i$ ’s failure status can then be thought of as a sequence of coin flips at time increments $t=0,dt,2dt,\cdots$ , with the probability of “heads” at each time point given by (1). Therefore, observation $i$ ’s contribution to the likelihood is

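$$\{1-\lambda(0,X_{i}(0))Y_{i}(0)dt\}\times\{1-\lambda(dt,X_{i}(dt))Y_{i}(dt)dt\}\times\cdots\times\lambda(T_{i},X_{i}(T_{i}))^{\Delta_{i}}\xrightarrow[dt\downarrow 0]{}e^{-\int_{0}^{1}Y_{i}(t)\lambda(t,X_{i}(t))dt}\,\lambda(T_{i},X_{i}(T_{i}))^{\Delta_{i}},$$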

where the limit can be understood as a product integral. Hence, if the log-hazard function is

$$F(t,x)=\log\lambda(t,x),$$

then the (scaled) negative log-likelihood functional is

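$$R_{n}(F)=\frac{1}{n}\sum_{i=1}^{n}\int_{0}^{1}Y_{i}(t)e^{F(t,X_{i}(t))}dt-\frac{1}{n}\sum_{i=1}^{n}\Delta_{i}F(T_{i},X_{i}(T_{i})), \tag{2}$$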

which we shall refer to as the likelihood risk. The goal is to estimate the hazard function $\lambda(t,x)=e^{F(t,x)}$ nonparametrically by minimizing ${R}_{n}(F)$ .
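As a rough illustration of the likelihood risk, the sketch below approximates its time integral with a midpoint rule. The data layout (a dict per subject with `x_path`, `at_risk`, `T` and `delta`) is an assumption made for this example only, not a format used by the paper.

```python
import math


def likelihood_risk(F, subjects, n_grid=1000):
    """Approximate the likelihood risk R_n(F) with a midpoint rule on [0, 1]."""
    dt = 1.0 / n_grid
    total = 0.0
    for s in subjects:
        # Integral term: int_0^1 Y_i(t) * exp(F(t, X_i(t))) dt
        for k in range(n_grid):
            t = (k + 0.5) * dt
            if s["at_risk"](t):
                total += math.exp(F(t, s["x_path"](t))) * dt
        # Failure term: Delta_i * F(T_i, X_i(T_i))
        if s["delta"] == 1:
            total -= F(s["T"], s["x_path"](s["T"]))
    return total / len(subjects)


# Hypothetical data: one failure at t = 0.5 and one subject censored at t = 1.
subjects = [
    {"x_path": lambda t: 1.0, "at_risk": lambda t: t <= 0.5, "T": 0.5, "delta": 1},
    {"x_path": lambda t: 0.0 if t <= 0.4 else 2.0, "at_risk": lambda t: True,
     "T": math.inf, "delta": 0},
]
constant_log_hazard = lambda t, x: math.log(2.0)
risk = likelihood_risk(constant_log_hazard, subjects)
```

For a constant hazard of 2, the first subject contributes $2\times 0.5-\log 2$ and the second contributes $2\times 1$, so the risk is $(3-\log 2)/2$.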

*1.2. The likelihood does not have a gradient in generic function spaces.* As mentioned, our approach is to boost $F$ using functional gradient descent. However, the chief difficulty is that the canonical representation of the likelihood risk functional does not have a gradient. To see this, observe that the directional derivative of (2) equals

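$$\frac{d}{d\theta}R_{n}(F+\theta f)\Big|_{\theta=0}=\frac{1}{n}\sum_{i=1}^{n}\int_{0}^{1}Y_{i}(t)e^{F(t,X_{i}(t))}f(t,X_{i}(t))dt-\frac{1}{n}\sum_{i=1}^{n}\Delta_{i}f(T_{i},X_{i}(T_{i})), \tag{3}$$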

which is the difference of two different inner products $\left\langle e^{F},f\right\rangle_{\dagger}-\left\langle 1,f\right\rangle_{\ddagger}$ where

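$$\langle g,f\rangle_{\dagger}=\frac{1}{n}\sum_{i=1}^{n}\int_{0}^{1}Y_{i}(t)g(t,X_{i}(t))f(t,X_{i}(t))dt,\qquad\langle g,f\rangle_{\ddagger}=\frac{1}{n}\sum_{i=1}^{n}\Delta_{i}g(T_{i},X_{i}(T_{i}))f(T_{i},X_{i}(T_{i})).$$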

Hence, (3) cannot be expressed as a single inner product of the form $\langle g_{F},f\rangle$ for some function $g_{F}(t,x)$ . Were it possible to do so, $g_{F}$ would then be the gradient function.

In simpler non-functional data settings like regression or classification, the loss can be written as $L(Y,\tilde{F}(x))$, where $\tilde{F}$ is the non-functional statistical target and $Y$ is the outcome, so the gradient is simply $\partial L(Y,\tilde{F}(x))/\partial\tilde{F}(x)$. The negative gradient is then approximated using a base learner $f\in\tilde{{\mathscr{F}}}$ from a predefined class of functions $\tilde{{\mathscr{F}}}$ (either parametric, for example linear learners, or nonparametric, for example tree learners). Typically, the optimal base learner $\hat{f}$ is chosen to minimize the $L^{2}$-approximation error and then scaled by a regularization parameter $0<\nu\leq 1$ to obtain the updated estimate of $\tilde{F}$:

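$$\tilde{F}\leftarrow\tilde{F}-\nu\hat{f},\qquad\hat{f}=\mathop{\rm argmin}_{f\in\tilde{\mathscr{F}}}\bigg\|\frac{\partial L}{\partial\tilde{F}}-f\bigg\|_{2}.$$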

Importantly, in the simpler non-functional data setting the gradient does not depend on the space that $\tilde{F}$ belongs to. By contrast, a key insight of this paper is that the gradient of ${R}_{n}(F)$ can only be defined after carefully specifying an appropriate sample-dependent domain for ${R}_{n}(F)$ . The likelihood risk can then be re-expressed as a smooth convex functional, and an analogous representation also exists for the population risk. These representations resolve the difficulty above, allow us to describe and implement a gradient boosting procedure, and are also crucial to establishing guarantees for our estimator.
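For contrast with the functional setting, here is a minimal sketch of the standard (non-functional) gradient boosting update for the squared error loss $L=\frac{1}{2}(y-\tilde{F})^{2}$, whose gradient is simply $\tilde{F}-y$. The stump learner `fit_stump` and the toy data are illustrative assumptions, and the code assumes at least two distinct input values.

```python
def fit_stump(x, g):
    """Least-squares regression stump: best single split of 1-D inputs x for targets g."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    best = None
    for k in range(1, len(order)):
        left, right = order[:k], order[k:]
        if x[left[-1]] == x[right[0]]:
            continue  # no valid split between tied inputs
        mean_l = sum(g[i] for i in left) / len(left)
        mean_r = sum(g[i] for i in right) / len(right)
        sse = (sum((g[i] - mean_l) ** 2 for i in left)
               + sum((g[i] - mean_r) ** 2 for i in right))
        if best is None or sse < best[0]:
            best = (sse, 0.5 * (x[left[-1]] + x[right[0]]), mean_l, mean_r)
    _, split, mean_l, mean_r = best
    return lambda v: mean_l if v <= split else mean_r


def boost(x, y, nu=0.1, n_iter=100):
    """Gradient boosting for squared error: F <- F - nu * f_hat, f_hat fitted to dL/dF."""
    F = [0.0] * len(x)
    for _ in range(n_iter):
        grad = [F[i] - y[i] for i in range(len(x))]  # dL/dF for L = (y - F)^2 / 2
        f_hat = fit_stump(x, grad)                   # L2-approximation of the gradient
        F = [F[i] - nu * f_hat(x[i]) for i in range(len(x))]
    return F


x_toy, y_toy = [0.0, 1.0, 2.0, 3.0], [0.0, 0.0, 1.0, 1.0]
F_toy = boost(x_toy, y_toy)
```

The gradient here is available pointwise without reference to any function space, which is exactly what fails for the likelihood risk (2).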

*1.3. Contributions of the paper.* A key discovery that unlocks the boosted hazard estimator is Proposition 1 of Section 2.2. It provides an integral representation for the likelihood risk from which several results follow, including, importantly, an explicit representation for the gradient. Proposition 1 relies on defining a suitable space of log-hazard functions defined on the time-covariate domain $[0,1]\times{\mathscr{X}}$. Identifying this space is the key insight that allows us to rescue the likelihood approach and to derive the gradient needed to implement gradient boosting. Arriving at this framework is not conceptually trivial, and may explain the absence of boosted nonparametric hazard estimators until now.

Algorithm 1 of Section 2.3 describes our estimator. The algorithm minimizes the likelihood risk (2) over the defined space of log-hazard functions. In the special case of regression tree learners, expressions for the likelihood risk and its gradient are obtained from Proposition 1, which are then used to describe a tree-based implementation of our estimator in Section 4. In Section 5 we apply it to a high-dimensional dataset generated from a naturalistic simulation of patient service times in an emergency department.

Section 3 establishes the consistency of the procedure. We show that the hazard estimator is consistent if the space is correctly specified. In particular, if the space is the span of regression trees, then the hazard estimator satisfies an oracle inequality and recovers $\lambda$ up to some error tolerance (Propositions 3 and 4).

Another contribution of our work is to clarify the mechanisms used by gradient boosting to avoid overfitting. Gradient boosting typically applies two types of regularization to invoke slow learning: (i) a small step-size is used for the update; and (ii) the number of boosting iterations is capped. The number of iterations used in our algorithm is set using the framework of Zhang and Yu [31], whose work shows how stopping early ensures consistency. On the other hand, the role of step-size restriction is more mysterious. While [31] demonstrates that small step-sizes are needed to prove consistency, unrestricted greedy step-sizes are already small enough for classification problems [28] and also for commonly used regression losses (see the Appendix of [31]). We show in Section 3 that shrinkage acts as a counterweight to the curvature of the risk (see Lemma 2). Hence if the curvature is unbounded, as is the case for hazard regression, then the step-sizes may need to be explicitly controlled to ensure convergence. This important result adds to our understanding of statistical convergence in gradient boosting. As noted by Biau and Cadre [3], the literature for this is relatively sparse, which motivated them to propose another regularization mechanism that also prevents overfitting.

Concluding remarks can be found in Section 6. Proofs not appearing in the body of the paper can be found in the Appendix.

2. The boosted hazard estimator. In this section, we describe our boosted hazard estimator. To provide readers with concrete examples for the ideas introduced here, we will show how the quantities defined in this section specialize in the case of regression trees, which is one of a few possible ways to implement boosting.

We begin by defining in Section 2.1 an appropriate sample-dependent domain for the likelihood risk ${R}_{n}(F)$. As explained, this key insight allows us to re-express the likelihood risk and its population analogue as smooth convex functionals, thereby enabling us to compute their gradients in closed form in Propositions 1 and 2 of Section 2.2. Following this, the boosting algorithm is formally stated in Section 2.3.

*2.1. Specifying a domain for ${R}_{n}(F)$.* We will make use of two identifiability conditions (A1) and (A2) to define the domain for ${R}_{n}(F)$. Condition (A1) below is the same as Condition 1(iv) of Huang and Stone [19].

Assumption (A1).

The true hazard function $\lambda(t,x)$ takes values in some interval $[\Lambda_{L},\Lambda_{U}]\subset(0,\infty)$ on the time-covariate domain $[0,1]\times{\mathscr{X}}$.

Recall that we defined $X(\cdot)$ and $Y(\cdot)$ to be predictable processes, and so it can be shown that the integrals and expectations appearing in this paper are all well defined. Denoting the indicator function as $I(\cdot)$ , define the following population and empirical sub-probability measures on $[0,1]\times{\mathscr{X}}$ :

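$$\mu(B)=\mathbb{E}\left(\int_{0}^{1}Y(t)\cdot I[\{t,X(t)\}\in B]\,dt\right),\qquad\mu_{n}(B)=\frac{1}{n}\sum_{i=1}^{n}\int_{0}^{1}Y_{i}(t)\cdot I[\{t,X_{i}(t)\}\in B]\,dt,$$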

and note that $\mathbb{E}{\mu_{n}}(B)=\mu(B)$ because the data is i.i.d. by assumption. Intuitively, ${\mu_{n}}$ measures the denseness of the observed sample time-covariate paths on $[0,1]\times{\mathscr{X}}$ . For any integrable $f$ ,

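$$\int f\,d\mu=\mathbb{E}\left(\int_{0}^{1}Y(t)f(t,X(t))\,dt\right),\qquad\int f\,d\mu_{n}=\frac{1}{n}\sum_{i=1}^{n}\int_{0}^{1}Y_{i}(t)f(t,X_{i}(t))\,dt. \tag{5}$$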

This allows us to define the following (random) norms and inner products

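$$\|f\|_{\mu_{n},1}=\int|f|\,d\mu_{n},\qquad\|f\|_{\mu_{n},2}=\left(\int f^{2}\,d\mu_{n}\right)^{1/2},\qquad\|f\|_{\infty}=\sup_{(t,x)\in[0,1]\times\mathscr{X}}|f(t,x)|,\qquad\langle f_{1},f_{2}\rangle_{\mu_{n}}=\int f_{1}f_{2}\,d\mu_{n},$$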

and note that $\|\cdot\|_{{\mu_{n}},1}\leq\|\cdot\|_{{\mu_{n}},2}\leq\|\cdot\|_{\infty}$ because ${\mu_{n}}([0,1]\times{\mathscr{X}})\leq 1$ .
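To make the empirical measure concrete, the sketch below approximates $\int f\,d\mu_{n}$ by a midpoint rule and derives $\mu_{n}(B)$ and the $\mu_{n}$-norm from it. It assumes a scalar covariate, a box-shaped $B$, and the same hypothetical per-subject dict layout (`x_path`, `at_risk`); none of these names come from the paper.

```python
import math


def integrate_mu_n(f, subjects, n_grid=1000):
    """Approximate int f d(mu_n) = (1/n) sum_i int_0^1 Y_i(t) f(t, X_i(t)) dt."""
    dt = 1.0 / n_grid
    total = 0.0
    for s in subjects:
        for k in range(n_grid):
            t = (k + 0.5) * dt
            if s["at_risk"](t):
                total += f(t, s["x_path"](t)) * dt
    return total / len(subjects)


def mu_n(box, subjects):
    """Empirical measure of B = (t_lo, t_hi] x (x_lo, x_hi]."""
    t_lo, t_hi, x_lo, x_hi = box
    indicator = lambda t, x: 1.0 if (t_lo < t <= t_hi and x_lo < x <= x_hi) else 0.0
    return integrate_mu_n(indicator, subjects)


def norm_2(f, subjects):
    """The (random) L2(mu_n) norm of f."""
    return math.sqrt(integrate_mu_n(lambda t, x: f(t, x) ** 2, subjects))


# Hypothetical sample: subject 1 is at risk on (0, 0.5], subject 2 on (0, 1].
sample = [
    {"x_path": lambda t: 1.0, "at_risk": lambda t: t <= 0.5},
    {"x_path": lambda t: 0.0 if t <= 0.4 else 2.0, "at_risk": lambda t: True},
]
mass_box = mu_n((0.0, 1.0, 0.5, 1.5), sample)
mass_all = mu_n((0.0, 1.0, -math.inf, math.inf), sample)
```

Only the first subject's path passes through the box, for half a time unit, so its mass is $0.5/2$; the total mass is the average time at risk, which is at most 1 and explains the norm ordering stated above.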

By careful design, ${\mu_{n}}$ allows us to specify a natural domain for ${R}_{n}(F)$ . Let $\{\phi_{j}(t,x)\}_{j=1}^{d}$ be a set of bounded functions $[0,1]\times{\mathscr{X}}\mapsto[-1,1]$ that are linearly independent, in the sense that $\int_{[0,1]\times{\mathscr{X}}}(\sum_{j}c_{j}\phi_{j})^{2}dtdx=0$ if and only if $c_{1}=\cdots=c_{d}=0$ (when some of the covariates are discrete-valued, $dx$ should be interpreted as the product of a counting measure and the Lebesgue measure). The span of the functions is

$$\mathscr{F}=\left\{\sum_{j=1}^{d}c_{j}\phi_{j}:c_{j}\in\mathbb{R}\right\}.$$

For example, the span of all regression tree functions that can be defined on $[0,1]\times{\mathscr{X}}$ is ${\mathscr{F}}=\{\sum_{j}c_{j}I_{B_{j}}(t,x):c_{j}\in\mathbb{R}\}$ (footnote 3: It is clear that said span is contained in ${\mathscr{F}}$. For the converse, it suffices to show that ${\mathscr{F}}$ is also contained in the span of trees of some depth. This is easy to show for trees with $p+1$ splits, because they can generate partitions of the form $(-\infty,t]\times(-\infty,x^{(1)}]\times\cdots\times(-\infty,x^{(p)}]$ in $[0,1]\times{\mathscr{X}}$ (Section 3 of [8]).), which are linear combinations of indicator functions over disjoint time-covariate cubes indexed by $j=(j_{0},j_{1},\cdots,j_{p})$ (footnote 4: With a slight abuse of notation, the index $j$ is only considered multi-dimensional when describing the geometry of $B_{j}$, such as in (6). In all other situations $j$ should be interpreted as a scalar index.):

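$$B_{j}=\left\{(t,x)\in[0,1]\times\mathscr{X}\;:\;\begin{array}{c}t^{(j_{0})}<t\leq t^{(j_{0}+1)}\\ x^{(1,j_{1})}<x^{(1)}\leq x^{(1,j_{1}+1)}\\ \vdots\\ x^{(p,j_{p})}<x^{(p)}\leq x^{(p,j_{p}+1)}\end{array}\right\}. \tag{6}$$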

Remark 1.

The regions $B_{j}$ are formed using all possible split points $\{x^{(k,j_{k})}\}_{j_{k}}$ for the $k$ -th coordinate $x^{(k)}$ , with the spacing determined by the precision of the measurements. For example, if weight is measured to the closest kilogram, then the set of all possible split points will be $\{0.5,1.5,2.5,\cdots\}$ kilograms. Note that these split points are the finest possible for any realization of weight that is measured to the nearest kilogram. While abstract treatments of trees assume that there is a continuum of split points, in reality they fall on a discrete (but fine) grid that is pre-determined by the precision of the data.
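The grid of split points described in Remark 1 can be generated directly from the measurement precision. The helper name `candidate_splits` and its arguments are illustrative assumptions.

```python
def candidate_splits(lower, upper, precision):
    """Midpoints between consecutive representable values on a measurement grid.

    lower, upper: smallest and largest recordable values (inclusive).
    precision: spacing of the measurements, e.g. 1 for weights in whole kilograms.
    """
    n_values = int(round((upper - lower) / precision)) + 1
    return [lower + (k + 0.5) * precision for k in range(n_values - 1)]


# Weights recorded to the nearest kilogram between 0 and 3 kg.
weight_splits = candidate_splits(0, 3, 1)
```

Any finer split would separate no two recordable values, so this grid is the finest one the data can distinguish.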

When ${\mathscr{F}}$ is equipped with $\left\langle\cdot,\cdot\right\rangle_{{\mu_{n}}}$ , we obtain the following sample-dependent subspace of $L^{2}({\mu_{n}})$ , which is the appropriate domain for ${R}_{n}(F)$ :

$$({\mathscr{F}},\langle\cdot,\cdot\rangle_{\mu_{n}}).$$

Note that the elements in $({\mathscr{F}},\left\langle\cdot,\cdot\right\rangle_{{\mu_{n}}})$ are equivalence classes rather than actual functions that have well defined values at each $(t,x)$ . This is a problem because the likelihood risk (2) requires evaluating $F(t,x)$ at the points $(T_{i},X_{i}(T_{i}))$ where $\Delta_{i}=1$ . We resolve this by fixing an orthonormal basis $\{{\varphi}_{nj}(t,x)\}_{j}$ for $({\mathscr{F}},\left\langle\cdot,\cdot\right\rangle_{{\mu_{n}}})$ , and represent each member of $({\mathscr{F}},\left\langle\cdot,\cdot\right\rangle_{{\mu_{n}}})$ uniquely in the form $\sum_{j}c_{j}{\varphi}_{nj}(t,x)$ . For example in the case of regression trees, applying the Gram-Schmidt procedure to $\{\phi_{j}(t,x)=I_{B_{j}}(t,x)\}_{j}$ gives

$$\{\varphi_{nj}(t,x)\}_{j}=\left\{\frac{I_{B_{j}}(t,x)}{\mu_{n}(B_{j})^{1/2}}:\mu_{n}(B_{j})>0\right\},$$

which by design have disjoint support.

The second condition we impose is for $\{\phi_{j}\}_{j=1}^{d}$ to be linearly independent in $L^{2}(\mu)$ , that is $\|\sum_{j}c_{j}\phi_{j}\|_{\mu,2}^{2}=\sum_{ij}c_{i}\left(\int\phi_{i}\phi_{j}d\mu\right)c_{j}=0$ if and only if $c_{1}=\cdots=c_{d}=0$ . Since by construction $\{\phi_{j}\}_{j=1}^{d}$ are already linearly independent in $[0,1]\times{\mathscr{X}}$ , the condition intuitively requires the set of all possible time-covariate trajectories to be adequately dense in $[0,1]\times{\mathscr{X}}$ to intersect a sufficient amount of the support of every $\phi_{j}$ . This is weaker than the identifiability conditions 1(ii)-1(iii) in [19] which require $X(t)$ to have a positive joint probability density on $[0,1]\times{\mathscr{X}}$ .

Assumption (A2).

The Gram matrix $\Sigma_{ij}=\int\phi_{i}\phi_{j}d\mu$ is positive definite.

*2.2. Integral representations for the likelihood risk.* Having deduced the appropriate domain for ${R}_{n}(F)$, we can now recast the risk as a smooth convex functional on $({\mathscr{F}},\left\langle\cdot,\cdot\right\rangle_{{\mu_{n}}})$. Proposition 1 below provides closed form expressions for this and its gradient. We note that if the risk is actually of a certain simpler form, it might be possible to estimate its gradient empirically from our risk expression using [24].

Proposition 1.

For functions $F(t,x),f(t,x)$ of the form $\sum_{j}c_{j}{\varphi}_{nj}(t,x)$ , the likelihood risk (2) can be written as

$$R_{n}(F)=\int(e^{F}-\lambda_{n}F)\,d\mu_{n}, \tag{7}$$

where ${\lambda_{n}}\in({\mathscr{F}},\left\langle\cdot,\cdot\right\rangle_{{\mu_{n}}})$ is the function

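$$\lambda_{n}(t,x)=\frac{1}{n}\sum_{j}\left\{\sum_{i=1}^{n}\Delta_{i}\varphi_{nj}(T_{i},X_{i}(T_{i}))\right\}\varphi_{nj}(t,x).$$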

Thus there exists $\rho\in(0,1)$ (depending on $F$ and $f$ ) for which the Taylor representation

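$$R_{n}(F+f)=R_{n}(F)+\langle g_{F},f\rangle_{\mu_{n}}+\frac{1}{2}\int e^{F+\rho f}f^{2}\,d\mu_{n} \tag{8}$$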

holds, where the gradient

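$$g_{F}(t,x)=\sum_{j}\langle e^{F},\varphi_{nj}\rangle_{\mu_{n}}\varphi_{nj}(t,x)-\lambda_{n}(t,x) \tag{9}$$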

of ${R}_{n}(F)$ is the projection of $e^{F}-{\lambda_{n}}$ onto $({\mathscr{F}},\left\langle\cdot,\cdot\right\rangle_{{\mu_{n}}})$ . Hence if $g_{F}=0$ then the infimum of ${R}_{n}(F)$ over the span of $\{{\varphi}_{nj}(t,x)\}_{j}$ is uniquely attained at $F$ .

For regression trees the expressions (7) and (9) simplify further because ${\mathscr{F}}$ is closed under pointwise exponentiation, i.e. $e^{F}\in{\mathscr{F}}$ for $F\in{\mathscr{F}}$ . This is because the $B_{j}$ ’s are disjoint so $F=\sum_{j}c_{j}I_{B_{j}}$ and hence $e^{F}=\sum_{j}e^{c_{j}}I_{B_{j}}$ . Thus

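$$\lambda_{n}(t,x)=\sum_{j:\mu_{n}(B_{j})>0}\frac{\mathrm{Fail}_{j}}{n\mu_{n}(B_{j})}I_{B_{j}}(t,x),$$

$$R_{n}(F)=\sum_{j:\mu_{n}(B_{j})>0}\left(e^{c_{j}}\mu_{n}(B_{j})-\frac{\mathrm{Fail}_{j}\,c_{j}}{n}\right),$$

$$g_{F}(t,x)=\sum_{j:\mu_{n}(B_{j})>0}\left(e^{c_{j}}-\frac{\mathrm{Fail}_{j}}{n\mu_{n}(B_{j})}\right)I_{B_{j}}(t,x), \tag{12}$$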

where

$$\mathrm{Fail}_{j}=\sum_{i}\Delta_{i}I[\{T_{i},X_{i}(T_{i})\}\in B_{j}]$$

is the number of observed failures in the time-covariate region $B_{j}$ .
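For regression trees, substituting ${\varphi}_{nj}=I_{B_{j}}/{\mu_{n}}(B_{j})^{1/2}$ and $F=\sum_{j}c_{j}I_{B_{j}}$ into (7) and (9) reduces everything to per-region arithmetic. The sketch below evaluates those per-region quantities under that substitution; the function name and the toy numbers are assumptions for illustration.

```python
import math


def tree_quantities(c, mu, fail, n):
    """Per-region lambda_n, likelihood risk, and gradient for F = sum_j c[j] * I_{B_j}.

    c[j]: value of F on region B_j; mu[j]: mu_n(B_j);
    fail[j]: number of observed failures in B_j; n: sample size.
    Regions with mu_n(B_j) = 0 are skipped.
    """
    cells = [j for j in range(len(mu)) if mu[j] > 0]
    lam = {j: fail[j] / (n * mu[j]) for j in cells}                # lambda_n on B_j
    risk = sum(math.exp(c[j]) * mu[j] - fail[j] * c[j] / n for j in cells)
    grad = {j: math.exp(c[j]) - lam[j] for j in cells}             # g_F on B_j
    return lam, risk, grad


# Hypothetical example with n = 4 subjects and three regions (the last one unvisited).
mu_cells, fail_cells = [0.25, 0.5, 0.0], [1, 2, 0]
lam0, risk0, grad0 = tree_quantities([0.0, 0.0, 0.0], mu_cells, fail_cells, n=4)
lam1, risk1, grad1 = tree_quantities([math.log(2.0), 0.0, 0.0], mu_cells, fail_cells, n=4)
```

In this toy example the occurrence-over-exposure ratio $\mathrm{Fail}_{j}/(n{\mu_{n}}(B_{j}))$ equals 1 in both visited regions, so $F=0$ has zero gradient and minimizes the risk, while moving $c_{1}$ to $\log 2$ raises it.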

Proof of Proposition 1.

Fix a realization of $\{(X_{i}(\cdot),Y_{i}(\cdot),T_{i})\}_{i=1}^{n}$ . Using (5) we can rewrite (2) as

$$R_{n}(F)=\int e^{F}d\mu_{n}-\frac{1}{n}\sum_{i=1}^{n}\Delta_{i}F(T_{i},X_{i}(T_{i})).$$

We can express $F$ in terms of the basis $\{{\varphi}_{nk}\}_{k}$ as $F(t,x)=\sum_{k}c_{k}{\varphi}_{nk}(t,x)$ . Hence

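$$\begin{aligned}\int\lambda_{n}F\,d\mu_{n}&=\int\frac{1}{n}\sum_{j}\left\{\sum_{i=1}^{n}\Delta_{i}\varphi_{nj}(T_{i},X_{i}(T_{i}))\right\}\varphi_{nj}\sum_{k}c_{k}\varphi_{nk}\,d\mu_{n}\\&=\frac{1}{n}\sum_{i=1}^{n}\Delta_{i}\sum_{j}\varphi_{nj}(T_{i},X_{i}(T_{i}))\sum_{k}c_{k}\int\varphi_{nj}\varphi_{nk}\,d\mu_{n}\\&=\frac{1}{n}\sum_{i=1}^{n}\Delta_{i}\sum_{j}\varphi_{nj}(T_{i},X_{i}(T_{i}))\sum_{k}c_{k}\langle\varphi_{nj},\varphi_{nk}\rangle_{\mu_{n}}\\&=\frac{1}{n}\sum_{i=1}^{n}\Delta_{i}\sum_{j}c_{j}\varphi_{nj}(T_{i},X_{i}(T_{i}))\\&=\frac{1}{n}\sum_{i=1}^{n}\Delta_{i}F(T_{i},X_{i}(T_{i})),\end{aligned}$$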

where the fourth equality follows from the orthonormality of the basis. This completes the derivation of (7).

By an interchange argument we obtain

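$$\frac{d}{d\theta}R_{n}(F+\theta f)=\int(e^{F+\theta f}-\lambda_{n})f\,d\mu_{n},\qquad\frac{d^{2}}{d\theta^{2}}R_{n}(F+\theta f)=\int e^{F+\theta f}f^{2}\,d\mu_{n},$$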

the latter being positive whenever $f\neq 0$ ; i.e., ${R}_{n}(F)$ is convex. The Taylor representation (8) then follows from noting that $g_{F}$ is the orthogonal projection of $e^{F}-{\lambda_{n}}\in L^{2}({\mu_{n}})$ onto $({\mathscr{F}},\left\langle\cdot,\cdot\right\rangle_{{\mu_{n}}})$ . ∎

The expectation of the likelihood risk also has an integral representation. A special case of the representation (13) below is proved in Proposition 3.2 of [19] for right-censored data only, under assumptions that do not allow for internal covariates. In the statement of the proposition below recall that $\Lambda_{L}$ and $\Lambda_{U}$ are defined in (A1). The constant $\alpha_{\mathscr{F}}$ is defined later in (28).

Proposition 2.

For $F\in{\mathscr{F}}\cup\{\log\lambda\}$ ,

$$R(F)=\mathbb{E}\{R_{n}(F)\}=\int(e^{F}-\lambda F)\,d\mu. \tag{13}$$

Furthermore the restriction of $R(F)$ to ${\mathscr{F}}$ is coercive:

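$$\frac{1}{2}R(F)\geq\frac{\Lambda_{L}}{\alpha_{\mathscr{F}}}\|F\|_{\infty}+\Lambda_{U}\min\{0,1-\log(2\Lambda_{U})\}, \tag{14}$$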

and it attains its minimum at a unique point $F^{*}\in({\mathscr{F}},\left\langle\cdot,\cdot\right\rangle_{\mu})$ . If ${\mathscr{F}}$ contains the underlying log-hazard function then $F^{*}=\log\lambda$ .

Remark 2.

Coerciveness (14) implies that any $F$ with expected risk $R(F)$ less than $R(0)\leq 1<3$ is uniformly bounded:

[TABLE]

where the constant

[TABLE]

is by design no smaller than 1 in order to simplify subsequent analyses.

*2.3. The boosting procedure.* In gradient boosting the key idea is to update an iterate in a direction that is approximately aligned to the negative gradient. To model this direction formally, we introduce the concept of an $\varepsilon$-gradient.

Definition 1.

Suppose $g_{F}\neq 0$ . We say that a unit vector $g_{F}^{\varepsilon}\in({\mathscr{F}},\left\langle\cdot,\cdot\right\rangle_{{\mu_{n}}})$ is an $\varepsilon$ -gradient at $F$ if for some $0<\varepsilon\leq 1$ ,

[TABLE]

Call $-g_{F}^{\varepsilon}$ a negative $\varepsilon$ -gradient if $g_{F}^{\varepsilon}$ is an $\varepsilon$ -gradient.

Our boosting procedure seeks approximations $g_{F}^{\varepsilon}$ that satisfy (17) for some pre-specified alignment value $\varepsilon$ . The larger $\varepsilon$ is, the closer the alignment is between the negative gradient and the negative $\varepsilon$ -gradient, and the greater the risk reduction. In particular, $-g_{F}$ is the unique negative 1-gradient with maximal risk reduction. In practice, however, we find that using a smaller value of $\varepsilon$ leads to simpler approximations that prevent overfitting in finite samples. This is consistent with other implementations of boosting: It is well known that the statistical performance of gradient descent generally improves when simpler base learners are used.

Algorithm 1 describes the proposed boosting procedure for estimating $\lambda$. For a given level of alignment $\varepsilon$, Line 3 finds an $\varepsilon$-gradient $g_{{F}_{m}}^{\varepsilon}$ at ${F}_{m}$ satisfying (18) at the $m$-th iteration, and uses its negation for the boosting update in Line 4. If the $\varepsilon$-gradients are tree learners, as is the case with the implementation in Section 4, then the trees cannot be grown in the same way as the standard boosting algorithm in Friedman [13]. This is because the standard approach grows all regression trees to a fixed depth, which may or may not ensure $\varepsilon$-alignment at each boosting iteration.

To ensure $\varepsilon$-alignment, the depth of the trees is not fixed in the implementation in Section 4. Instead, at each boosting iteration a tree is grown to whatever depth is needed to satisfy (18). This can always be done because the alignment $\varepsilon$ is non-decreasing in the number of tree splits, and with enough splits we can recover the gradient $g_{{F}_{m}}$ itself up to ${\mu_{n}}$-almost everywhere (footnote 5: Split the tree until each leaf node contains just one of the regions $B_{j}$ in (6) with ${\mu_{n}}(B_{j})>0$. Then set the value of the node equal to the value of the gradient function (12) inside $B_{j}$.). As mentioned earlier, we recommend using small values of $\varepsilon$, which can be determined in practice using cross-validation. This differs from the standard approach where cross-validation is used to select a common tree depth to use for all boosting iterations.
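The idea of splitting until the alignment target is met can be sketched on a one-dimensional toy problem. This is only an illustration under stated assumptions, not the tree-growing implementation the paper describes later: alignment is taken here to be the normalized ${\mu_{n}}$-inner product (cosine) between the gradient and its piecewise-constant approximation, regions are ordered on a line, and each split is chosen greedily.

```python
import math


def project(g, w, breaks):
    """Weighted-mean projection of region values g onto the partition given by breaks."""
    h = [0.0] * len(g)
    edges = [0] + sorted(breaks) + [len(g)]
    for a, b in zip(edges[:-1], edges[1:]):
        mass = sum(w[a:b])
        mean = sum(w[j] * g[j] for j in range(a, b)) / mass if mass > 0 else 0.0
        for j in range(a, b):
            h[j] = mean
    return h


def alignment(g, h, w):
    """Cosine between g and h under region weights w (assumed alignment measure)."""
    dot = sum(wj * gj * hj for wj, gj, hj in zip(w, g, h))
    norm_g = math.sqrt(sum(wj * gj * gj for wj, gj in zip(w, g)))
    norm_h = math.sqrt(sum(wj * hj * hj for wj, hj in zip(w, h)))
    return dot / (norm_g * norm_h) if norm_g > 0 and norm_h > 0 else 0.0


def grow_until_aligned(g, w, eps):
    """Add one split at a time until the approximation is eps-aligned with g."""
    breaks = []
    while True:
        h = project(g, w, breaks)
        if alignment(g, h, w) >= eps or len(breaks) == len(g) - 1:
            return breaks, h
        candidates = [b for b in range(1, len(g)) if b not in breaks]
        breaks.append(max(candidates,
                          key=lambda b: alignment(g, project(g, w, breaks + [b]), w)))


# Hypothetical gradient values on five equally weighted regions.
g_cells, w_cells = [1.0, 1.0, -1.0, -1.0, 3.0], [0.2] * 5
breaks_09, h_09 = grow_until_aligned(g_cells, w_cells, eps=0.9)
breaks_03, h_03 = grow_until_aligned(g_cells, w_cells, eps=0.3)
```

With a loose target ($\varepsilon=0.3$) the constant approximation already suffices, whereas $\varepsilon=0.9$ forces two splits, at which point the approximation reproduces the gradient on every region.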

In addition to the gradient alignment $\varepsilon$ , Algorithm 1 makes use of two other regularization parameters, $\Psi_{n}$ and $\nu_{n}$ . The first defines the early stopping criterion (how many boosting iterations to use), while the second controls the step-sizes of the boosting updates. These are two common regularization techniques used in boosting:

Early stopping. The number of boosting iterations ${\hat{m}}$ is controlled by stopping the algorithm before the uniform norm of the estimator $\|F_{{\hat{m}}}\|_{\infty}$ reaches or exceeds

[TABLE]

where $W(y)$ is the branch of the Lambert function that returns the real root of the equation $ze^{z}=y$ for $y>0$ .

Step-sizes. The step-size $\nu_{n}\ll 1$ used in gradient boosting is typically held constant across iterations. While we can also do this in our procedure (footnote 6: the term $\nu_{n}^{2}e^{\Psi_{n}}$ in condition (20) would need to be replaced by ${\hat{m}}\nu_{n}^{2}e^{\Psi_{n}}$ if a constant step-size is used), the role of step-size shrinkage becomes more salient if we use $\nu_{n}/(m+1)$ instead as the step-size for the $m$ -th iteration in Algorithm 1. This step-size is controlled in two ways. First, it is made to decrease with each iteration according to the Robbins-Monro condition that the sum of the steps diverges while the sum of squared steps converges. Second, the shrinkage factor $\nu_{n}$ is selected to make the step-sizes decay with $n$ at rate

[TABLE]

This acts as a counterbalance to ${R}_{n}(F)$ ’s unbounded curvature:

[TABLE]

which is upper bounded by $e^{\Psi_{n}}$ when $\|F\|_{\infty}<\Psi_{n}$ and $\|f\|_{{\mu_{n}},2}=1$ .
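The two devices above can be illustrated numerically. The sketch below is only an illustration under stated assumptions: `lambert_w` and `step_sizes` are hypothetical helper names, the Newton iteration is one of several ways to evaluate $W(y)$, and the values `nu=0.5` and `num_iter=10_000` are arbitrary. It checks that $W(y)e^{W(y)}=y$, and that the schedule $\nu_{n}/(m+1)$ has a slowly diverging sum while its sum of squares stays below $\nu_{n}^{2}\pi^{2}/6$.

```python
import math

def lambert_w(y, tol=1e-12, max_iter=100):
    """Real root z of z * exp(z) = y for y > 0, found by Newton's method."""
    if y <= 0:
        raise ValueError("this sketch only handles y > 0")
    z = math.log1p(y)  # starting point satisfying z * exp(z) >= y
    for _ in range(max_iter):
        ez = math.exp(z)
        step = (z * ez - y) / (ez * (z + 1.0))
        z -= step
        if abs(step) < tol:
            break
    return z

def step_sizes(nu, num_iter):
    """Step-size nu / (m + 1) for iterations m = 0, ..., num_iter - 1."""
    return [nu / (m + 1) for m in range(num_iter)]

# Illustrative values only (not taken from the paper).
steps = step_sizes(nu=0.5, num_iter=10_000)
sum_steps = sum(steps)                    # grows like nu * log(num_iter)
sum_squares = sum(s * s for s in steps)   # bounded by nu**2 * pi**2 / 6
w_one = lambert_w(1.0)                    # root of z * exp(z) = 1
```

The harmonic decay is what lets the iterates keep moving (divergent sum) while the accumulated second-order terms remain controlled (convergent sum of squares).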

3. Consistency. Under (A1) and (A2), guarantees for our hazard estimator ${\hat{\lambda}}_{\textrm{boost}}$ in Algorithm 1 can be derived for two scenarios of interest. The guarantees rely on the regularizations described in Section 1 to avoid overfitting. In the following development, recall from Proposition 2 that $F^{*}$ is the unique minimizer of $R(F)$ , so it satisfies the first order condition

[TABLE]

for all $F\in{\mathscr{F}}$ . Recall that the span of all trees is closed under pointwise exponentiation ( $e^{F}\in{\mathscr{F}}$ ), in which case (22) implies that $\lambda^{*}=e^{F^{*}}$ is the orthogonal projection of $\lambda$ onto $({\mathscr{F}},\left\langle\cdot,\cdot\right\rangle_{\mu})$ .

Consistency when ${\mathscr{F}}$ is correctly specified. If the true log-hazard function $\log\lambda$ is in ${\mathscr{F}}$ , then Proposition 2 asserts that $F^{*}=\log\lambda$ . It will be shown in this case that ${\hat{\lambda}}_{\textrm{boost}}$ is consistent:

[TABLE]

Oracle inequality for regression trees. If ${\mathscr{F}}$ is closed under pointwise exponentiation, it follows from (22) that $\lambda^{*}$ is the best $L^{2}(\mu)$ -approximation to $\lambda$ among all candidate hazard estimators $\{e^{F}:F\in{\mathscr{F}}\}$ . It can then be shown that ${\hat{\lambda}}_{\textrm{boost}}$ converges to this best approximation:

[TABLE]

This oracle result is in the spirit of the type of guarantees available for tree-based boosting in the non-functional data setting. For example, if tree stumps are used for $L_{2}$ -regression, then the regression function estimate will converge to the best approximation to the true regression function in the span of tree stumps [9]. Similar results also exist for boosted classifiers [5].

Propositions 3 and 4 below formalize these guarantees by providing bounds on the error terms above. While sharper bounds may exist, the purpose of this paper is to introduce our generic estimator for the first time and to provide guarantees that apply across different implementations. More refined convergence rates may exist for a specific implementation, just like the analysis in Bühlmann and Yu [11] for $L_{2}$ Boosting when componentwise spline learners are specifically used.

En route to establishing the guarantees, Lemma 2 below clarifies the role played by step-size restriction in ensuring convergence of the estimator. As explained in the Introduction, explicit shrinkage is not necessary for classification and regression problems where the risk has bounded curvature. Lemma 2 suggests that it may, however, be needed when the risk has unbounded curvature, as is the case with ${R}_{n}(F)$ . Seen in this light, shrinkage is really a mechanism for controlling the growth of the risk curvature.

*3.1. Strategy for establishing guarantees.* The representations for ${R}_{n}(F)$ and its population analogue $R(F)$ from Section 1 are the key ingredients for formalizing the guarantees. We use them to first show that ${F}_{{\hat{m}}}\in({\mathscr{F}},\left\langle\cdot,\cdot\right\rangle_{{\mu_{n}}})$ converges to $F^{*}\in({\mathscr{F}},\left\langle\cdot,\cdot\right\rangle_{\mu})$ : Applying Taylor’s theorem to the representation for $R(F)$ in Proposition 2 yields

[TABLE]

The problem is thus transformed into one of risk minimization $R({F}_{{\hat{m}}})\rightarrow R(F^{*})$ , for which [31] suggests analyzing separately the terms of the decomposition

[TABLE]

The authors argue that in boosting, the point of limiting the number of iterations ${\hat{m}}$ (enforced by lines 5-10 in Algorithm 1) is to prevent ${F}_{{\hat{m}}}$ from growing too fast, so that (I) converges to zero as $n\rightarrow\infty$ . At the same time, ${\hat{m}}$ is allowed to grow with $n$ in a controlled manner so that the empirical risk ${R}_{n}({F}_{{\hat{m}}})$ in (III) is eventually minimized as $n\rightarrow\infty$ . Lemmas 1 and 2 below show that our procedure achieves both goals. Lemma 1 makes use of complexity theory via empirical processes, while Lemma 2 deals with the curvature of the likelihood risk. The term (II) will be bounded using standard concentration results.

*3.2. Bounding (I) using complexity.* To capture the effect of using a simple negative $\varepsilon$ -gradient (17) as the descent direction, we bound (I) in terms of the complexity of the class below (footnote 7: for technical convenience, ${\mathscr{F}}_{\varepsilon}$ has been enlarged from ${\mathscr{F}}_{\varepsilon,\textrm{boost}}$ to include the unit ball):

[TABLE]

Depending on the choice of weak learners for the $\varepsilon$ -gradients, ${\mathscr{F}}_{\varepsilon}$ may be much smaller than ${\mathscr{F}}$ . For example, coordinate descent might only ever select a small subset of basis functions $\{\phi_{j}\}_{j}$ because of sparsity. As another example, if $\lambda(t,x)$ is additively separable in time and also in each covariate, then regression trees might only ever select simple tree stumps (one tree split).

The measure of complexity we use below comes from empirical process theory. Define ${\mathscr{F}}_{\varepsilon}^{\Psi}=\{F\in{\mathscr{F}}_{\varepsilon}:\|F\|_{\infty}<\Psi\}$ for $\Psi>0$ and suppose that $Q$ is a sub-probability measure on $[0,1]\times{\mathscr{X}}$ . Then the $L^{2}(Q)$ -ball of radius $\delta>0$ centred at some $F\in L^{2}(Q)$ is $\{F^{\prime}\in{\mathscr{F}}_{\varepsilon}^{\Psi}:\|F^{\prime}-F\|_{Q,2}<\delta\}$ . The covering number ${\mathcal{N}}(\delta,{\mathscr{F}}_{\varepsilon}^{\Psi},Q)$ is the minimum number of such balls needed to cover ${\mathscr{F}}_{\varepsilon}^{\Psi}$ (Definitions 2.1.5 and 2.2.3 of van der Vaart and Wellner [29]), so ${\mathcal{N}}(\delta,{\mathscr{F}}_{\varepsilon}^{\Psi},Q)=1$ for $\delta\geq\Psi$ . A complexity measure for ${\mathscr{F}}_{\varepsilon}$ is

[TABLE]

where the supremum is taken over $\Psi>0$ and over all non-zero sub-probability measures. As discussed, $J_{{\mathscr{F}}_{\varepsilon}}$ is never greater than, and potentially much smaller than $J_{\mathscr{F}}$ , the complexity of ${\mathscr{F}}$ , which is fixed and finite.

Before stating Lemma 1, we note that the result also shows an empirical analogue to the norm equivalences

[TABLE]

exists, where

[TABLE]

The factor of 2 serves to simplify the presentation, and can be replaced with anything greater than 1.

Lemma 1.

There exists a universal constant $\kappa$ such that for any $0<\eta<1$ , with probability at least

[TABLE]

an empirical analogue to (27) holds for all $F\in{\mathscr{F}}$ :

[TABLE]

and for all $F\in{\mathscr{F}}_{\varepsilon}^{\Psi_{n}}$ ,

[TABLE]

Remark 3.

The equivalences (29) imply that $\dim({\mathscr{F}},\left\langle\cdot,\cdot\right\rangle_{{\mu_{n}}})$ equals its upper bound $\dim{\mathscr{F}}=d$ . That is, if $\|\sum_{j}c_{j}\phi_{j}\|_{{\mu_{n}},2}=0$ , then $\|\sum_{j}c_{j}\phi_{j}\|_{\infty}=0$ , so $c_{1}=\cdots=c_{d}=0$ because $\{\phi_{j}\}_{j=1}^{d}$ are linearly independent on $[0,1]\times{\mathscr{X}}$ .

*3.3. Bounding (III) using curvature.* We use the representation in Proposition 1 to study the minimization of the empirical risk ${R}_{n}(F)$ by boosting. Standard results for exact gradient descent like Theorem 2.1.15 of Nesterov [25] are in terms of the norm of the minimizer, which may not exist for ${R}_{n}(F)$ . (Footnote 8: The infimum of ${R}_{n}(F)$ is not always attainable: If $f$ is non-positive and vanishes on the set $\{\{T_{i},X_{i}(T_{i})\}:\Delta_{i}=1\}$ , then ${R}_{n}(F+\theta f)=\int(e^{F+\theta f}-{\lambda_{n}}F)d{\mu_{n}}$ is decreasing in $\theta$ so $f$ is a direction of recession. This is however not an issue for boosting because of early stopping.) If coordinate descent is used instead, Section 4.1 of [31] can be applied to convex functions whose infimum may not be attainable, but its curvature is required to be uniformly bounded above. Since the second derivative of ${R}_{n}(F)$ is unbounded (21), Lemma 2 below provides two remedies: (i) Use the shrinkage decay (20) of $\nu_{n}$ to counterbalance the curvature; (ii) Use coercivity (15) to show that with increasing probability, $\{{F}_{m}\}_{m=0}^{\hat{m}}$ are uniformly bounded, so the curvatures at those points are also uniformly bounded. Lemma 2 combines both to derive a result that is simpler than what can be achieved from either one alone. In doing so, the role played by step-size restriction becomes clear. The lemma relies in part on adapting the analysis in Lemma 4.1 of [31] for coordinate descent to the case for generic $\varepsilon$ -gradients. The conditions required below will be shown to hold with high probability.

Lemma 2.

Suppose (29) holds and that

[TABLE]

Then the largest gap between $F^{*}$ and $\{{F}_{m}\}_{m=0}^{\hat{m}}$ ,

[TABLE]

is bounded by a constant no greater than $2\alpha_{\mathscr{F}}\beta_{\Lambda}$ , and for $n\geq 55$ ,

[TABLE]

Remark 4.

The last term in (32) suggests that the role of the step-size shrinkage $\nu_{n}$ is to keep the curvature of the risk in check, to prevent it from derailing convergence. Recall from (21) that $e^{\Psi_{n}}$ describes the curvature of ${R}_{n}({F}_{m})$ . Thus our result clarifies the role of step-size restriction in boosting functional data.
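As a brief reminder of why $e^{\Psi_{n}}$ bounds the curvature, the following derivation differentiates the empirical risk along a direction $f$. It is a sketch that takes ${R}_{n}(F)=\int(e^{F}-{\lambda_{n}}F)d{\mu_{n}}$ as the working form of the risk (the form used in the discussion of recession directions in Section 3.3), and it assumes differentiation under the integral is permitted, which holds here because ${\mu_{n}}$ is a finite measure and $F$, $f$ are bounded.

```latex
\begin{align*}
R_n(F+\theta f) &= \int \bigl(e^{F+\theta f} - \lambda_n (F+\theta f)\bigr)\, d\mu_n, \\
\frac{d}{d\theta} R_n(F+\theta f) &= \int \bigl(e^{F+\theta f} - \lambda_n\bigr) f \, d\mu_n, \\
\frac{d^2}{d\theta^2} R_n(F+\theta f)\Big|_{\theta=0} &= \int e^{F} f^2 \, d\mu_n
  \;\le\; e^{\|F\|_\infty}\, \|f\|_{\mu_n,2}^2 \;<\; e^{\Psi_n}
  \quad\text{when } \|F\|_\infty < \Psi_n,\ \|f\|_{\mu_n,2} = 1 .
\end{align*}
```

The linear term in $F$ contributes nothing to the second derivative, so the curvature is driven entirely by $e^{F}$, which is why bounding $\|F\|_{\infty}$ through early stopping also bounds the curvature.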

Remark 5.

Regardless of whether the risk curvature is bounded or not, smaller step-sizes always improve the convergence bound. This can be seen from the parsimonious relationship between $\nu_{n}$ and (32). Fixing $n$ , pushing the value of $\nu_{n}$ down towards zero yields the lower limit

[TABLE]

However, this limit is unattainable as $\nu_{n}$ must be positive in order to decrease the risk. This effect has been observed in practical applications of boosting. Friedman [13] noted improved performance for gradient boosting with the use of a small shrinkage factor $\nu$ . At the same time, it was also noted there was diminishing performance gain as $\nu$ became very small, and this came at the expense of an increased number of boosting iterations. This same phenomenon has also been observed for $L_{2}$ Boosting [11] with componentwise linear learners. It is known that the solution path for $L_{2}$ Boosting closely matches that of lasso as $\nu\rightarrow 0$ . However, the algorithm exhibits cycling behaviour for small $\nu$ , which greatly increases the number of iterations and offsets the performance gain in trying to approximate the lasso (see Ehrlinger and Ishwaran [12]).

*3.4. Formal statements of guarantees.* As a reminder, we have defined the following quantities:

[TABLE]

To simplify the results, we will assume that $n\geq 55$ and also set the shrinkage to satisfy $\nu_{n}^{2}e^{\Psi_{n}}=\log n/(64n^{1/4})$ . Our first guarantee shows that our hazard estimator is consistent if the model is correctly specified.

Proposition 3.

(Consistency under correct model specification). Suppose ${\mathscr{F}}$ contains the true log-hazard function $\log\lambda$ . Then with probability

[TABLE]

we have that $\|{\hat{\lambda}}_{\textrm{boost}}\|_{\infty}$ is bounded and

[TABLE]

Thus ${\hat{\lambda}}_{\textrm{boost}}$ is consistent.

Via the tension between $\varepsilon$ and $J_{{\mathscr{F}}_{\varepsilon}}$ , Proposition 3 captures the trade-off in statistical performance between weak and strong learners in gradient boosting. The advantage of low complexity (weak learners) is reflected in the increased probability of the $L^{2}(\mu)$ -bound holding, with this probability being maximized when $J_{{\mathscr{F}}_{\varepsilon}}\rightarrow 0$ , which generally occurs as $\varepsilon\rightarrow 0$ . However, diametrically opposed to this, we find that the $L^{2}(\mu)$ -bound is minimized by $\varepsilon\rightarrow 1$ , which occurs with the use of stronger learners that are more aligned with the gradient. This same trade-off is also captured by our second guarantee which establishes an oracle inequality for tree learners.

Proposition 4.

(Oracle inequality for tree learners). Suppose $e^{F}\in{\mathscr{F}}$ for $F\in{\mathscr{F}}$ . Then among $\{e^{F}:F\in{\mathscr{F}}\}$ , $\lambda^{*}$ is the best $L^{2}(\mu)$ -approximation to $\lambda$ , that is

[TABLE]

Moreover, ${\hat{\lambda}}_{\textrm{boost}}$ converges to this best approximation $\lambda^{*}$ : With probability

[TABLE]

we have that $\|{\hat{\lambda}}_{\textrm{boost}}\|_{\infty}$ is bounded and

[TABLE]

where $\rho_{\mathscr{F}}^{2}=\|\lambda^{*}-\lambda\|_{\mu,2}^{2}$ is the smallest error one can achieve from using functions in $\{e^{F}:F\in{\mathscr{F}}\}$ to approximate $\lambda$ .

For tree learners, $\lambda^{*}(t,x)$ is constant over each region $B_{j}$ in (6), and its value equals the local average of $\lambda$ over $B_{j}$ ,

[TABLE]

Hence if the $B_{j}$ ’s are small, $\lambda^{*}$ should closely approximate $\lambda$ (recall from Remark 1 that the size of the $B_{j}$ ’s is fixed by the data). To estimate the approximation error $\rho_{\mathscr{F}}$ in terms of $B_{j}$ , suppose that $\lambda$ is sufficiently smooth, e.g. Hölder continuous $|\lambda(t,x)-\lambda(t^{\prime},x^{\prime})|\precsim\|(t-t^{\prime},x-x^{\prime})\|^{b}$ for some $b>0$ . Then since $\inf_{B_{j}}\lambda\leq\lambda^{*}|_{B_{j}}\leq\sup_{B_{j}}\lambda$ ,

[TABLE]

4. A tree-based implementation. Here we describe an implementation of Algorithm 1 using regression trees, whereby the $\varepsilon$ -gradient $g_{{F}_{m}}^{\varepsilon}$ is obtained by growing a tree to satisfy (18) for a pre-specified $\varepsilon$ .

To explain the tree growing process, first observe that the $m$ -th step log-hazard estimator is an additive expansion of CART basis functions. Thus it can be written as

[TABLE]

where $A_{b,l}$ is the $l$ -th leaf region of the $b$ -th tree. Recall from Section 1 that each tree is grown until (18) is satisfied, so the number of leaf nodes $L_{b}$ can vary from tree to tree. The leaf regions are typically large subsets of the time-covariate space $[0,1]\times{\mathscr{X}}$ adaptively determined by the tree growing process (to be discussed shortly). Since each leaf region can be further decomposed into the finer disjoint regions $B_{j}$ in (6), ${F}_{m}(t,x)$ can be rewritten as (33). However, many of these regions will share the same coefficient value, so (33) can be written more compactly as

[TABLE]

where $B_{m,j}^{\prime}$ is the union of contiguous regions whose coefficient equals $c_{m,j}$ . This smooths the hazard estimator ${\hat{\lambda}}_{\textrm{boost}}(t,x)$ over $[0,1]\times{\mathscr{X}}$ , thanks to the regularization imposed by limiting the number of trees (early stopping) and also by the use of weak tree learners. This is unlike the unconstrained hazard MLE ${\lambda_{n}}(t,x)$ defined in (10), which can take on a different value in each region $B_{j}$ , making it prone to overfit the data.

To construct an $\varepsilon$ -gradient $g_{{F}_{m}}^{\varepsilon}$ with $\varepsilon$ -alignment to $g_{{F}_{m}}$ defined by (12),

[TABLE]

the tree splits are adaptively chosen to reduce the $L^{2}({\mu_{n}})$ -approximation error between $g_{{F}_{m}}^{\varepsilon}$ and $g_{{F}_{m}}$ . We implement tree splits for both time and covariates. Specifically, suppose we wish to split a leaf region $A\subseteq[0,1]\times{\mathscr{X}}$ into left and right daughter subregions $A_{1}$ and $A_{2}$ , and assign values $\gamma_{1}$ and $\gamma_{2}$ to them. For example, a split on the $k$ -th covariate could propose left and right daughters such as

[TABLE]

or a split on time $t$ could propose regions

[TABLE]

Now note that $g_{{F}_{m}}$ is constant within each region $B_{j}$ . We denote its value by $g_{{F}_{m}}(t_{B_{j}},x_{B_{j}})$ where $(t_{B_{j}},x_{B_{j}})$ is the centre of $B_{j}$ . Hence the best split of $A$ into $A_{1}$ and $A_{2}$ is the one that minimizes

[TABLE]

where

[TABLE]

represents the $j$ -th pseudo-response, $z_{j}=(t_{B_{j}},x_{B_{j}})$ its covariate and $w_{j}={\mu_{n}}(B_{j})$ its weight. Thus the splits use a weighted least squares criterion, which can be efficiently computed as usual.
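The weighted least squares split search can be sketched in a few lines. The code below is a minimal illustration, not the implementation used in the paper: `best_split` is a hypothetical helper that scans a single coordinate of $z_{j}$ (time or one covariate), assumes the daughter values $\gamma_{1},\gamma_{2}$ are the weighted means of the pseudo-responses, and places thresholds midway between adjacent distinct values.

```python
def best_split(values, responses, weights):
    """Best threshold on one coordinate under weighted least squares.

    values[j]    : coordinate of z_j being split on (hypothetical input layout)
    responses[j] : pseudo-response for region B_j
    weights[j]   : weight w_j > 0
    Returns (threshold, gamma_left, gamma_right, weighted_sse), or None
    if no split is possible.
    """
    order = sorted(range(len(values)), key=lambda j: values[j])
    v = [values[j] for j in order]
    y = [responses[j] for j in order]
    w = [weights[j] for j in order]

    total_w = sum(w)
    total_wy = sum(wj * yj for wj, yj in zip(w, y))
    total_wyy = sum(wj * yj * yj for wj, yj in zip(w, y))

    best = None
    left_w = left_wy = 0.0
    for i in range(len(v) - 1):
        left_w += w[i]
        left_wy += w[i] * y[i]
        if v[i] == v[i + 1]:
            continue  # cannot split between tied values
        right_w = total_w - left_w
        right_wy = total_wy - left_wy
        # Weighted SSE of both daughters: sum(w y^2) - (sum w y)^2 / sum(w) per side.
        sse = total_wyy - left_wy ** 2 / left_w - right_wy ** 2 / right_w
        if best is None or sse < best[3]:
            threshold = 0.5 * (v[i] + v[i + 1])
            best = (threshold, left_wy / left_w, right_wy / right_w, sse)
    return best

# Toy example: two clearly separated groups of pseudo-responses.
split = best_split([0.1, 0.2, 0.3, 0.4], [1, 1, 3, 3], [1, 1, 1, 1])
```

Running sums make each candidate threshold an $\mathcal{O}(1)$ update after sorting, which is the usual reason the criterion is cheap to evaluate.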

We split the tree until (18) is satisfied, resulting in $L_{m}$ leaf nodes ( $L_{m}-1$ splits). As discussed in Section 1, we can always find a deep enough tree that is an $\varepsilon$ -gradient because with enough splits we can recover the gradient $g_{{F}_{m}}$ itself. Recall also that a small value of $\varepsilon$ performs best in practice, and this can be chosen by cross-validating on a set of small-sized candidates: For each one we implement Algorithm 1, and we select the one that minimizes the cross-validated risk ${R}_{n}(F)$ defined in (11). By contrast, the standard boosting algorithm [13] uses cross-validation to select a common number of splits to use for all trees, which does not ensure that each tree is an $\varepsilon$ -gradient.
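The idea of growing each tree only until it is sufficiently aligned with the gradient can be sketched as follows. This is an illustration under explicit assumptions: the alignment is taken to be the $w$-weighted cosine between the gradient values and the fitted tree values, the covariate is one-dimensional, and `grow_until_aligned` is a hypothetical name. The exact criterion is the one in (18).

```python
import math

def weighted_cosine(g, fit, w):
    """Assumed alignment measure: weighted cosine between g and its tree fit."""
    dot = sum(wj * a * b for wj, a, b in zip(w, g, fit))
    norm_g = math.sqrt(sum(wj * a * a for wj, a in zip(w, g)))
    norm_fit = math.sqrt(sum(wj * b * b for wj, b in zip(w, fit)))
    return dot / (norm_g * norm_fit) if norm_g > 0 and norm_fit > 0 else 0.0

def leaf_mean(idx, g, w):
    return sum(w[j] * g[j] for j in idx) / sum(w[j] for j in idx)

def leaf_sse(idx, g, w):
    m = leaf_mean(idx, g, w)
    return sum(w[j] * (g[j] - m) ** 2 for j in idx)

def grow_until_aligned(z, g, w, eps):
    """Greedily split 1-D leaves until the fitted values are eps-aligned with g."""
    leaves = [sorted(range(len(z)), key=lambda j: z[j])]
    while True:
        fit = [0.0] * len(z)
        for idx in leaves:
            m = leaf_mean(idx, g, w)
            for j in idx:
                fit[j] = m
        if weighted_cosine(g, fit, w) >= eps:
            return leaves, fit
        # Pick the split with the largest reduction in weighted SSE.
        best = None
        for k, idx in enumerate(leaves):
            parent = leaf_sse(idx, g, w)
            for cut in range(1, len(idx)):
                if z[idx[cut - 1]] == z[idx[cut]]:
                    continue
                gain = parent - leaf_sse(idx[:cut], g, w) - leaf_sse(idx[cut:], g, w)
                if best is None or gain > best[0]:
                    best = (gain, k, cut)
        if best is None:
            return leaves, fit  # no further split is possible
        _, k, cut = best
        idx = leaves.pop(k)
        leaves[k:k] = [idx[:cut], idx[cut:]]

# Toy gradient on four regions with unit weights.
z, g, w = [0, 1, 2, 3], [3.0, 1.0, -1.0, -3.0], [1.0, 1.0, 1.0, 1.0]
leaves_weak, _ = grow_until_aligned(z, g, w, eps=0.8)
leaves_mid, _ = grow_until_aligned(z, g, w, eps=0.9)
leaves_strong, fit_strong = grow_until_aligned(z, g, w, eps=0.99)
```

In this toy case a larger `eps` forces more leaves, which mirrors the point made above: the number of splits is an output of the alignment requirement rather than a tuning parameter fixed in advance.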

Regarding the possible split points for the covariates (34), note that the $k$ -th covariate $x^{(k)}=x^{(k)}(t)$ is a time series that is sampled periodically. This yields a set of unique values equal to the union of all of the sampled values for the $n$ observations. In direct analogy to non-functional data boosting, we place candidate split points in-between the sorted values in this set. In other words, splits for covariates only occur at values corresponding to the observed data just as in non-functional boosting.
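A small sketch of how candidate covariate split points might be formed from sampled trajectories is given below. The helper name `candidate_splits` and the choice of midpoints are assumptions made for illustration; the text only requires that the candidates lie in-between the sorted observed values.

```python
def candidate_splits(trajectories):
    """Midpoints between consecutive sorted unique values pooled over all samples.

    trajectories: one list of sampled values of the k-th covariate per observation.
    """
    pooled = sorted({value for traj in trajectories for value in traj})
    return [0.5 * (a + b) for a, b in zip(pooled, pooled[1:])]

# Three hypothetical census trajectories sampled at different times.
splits = candidate_splits([[30, 32, 32], [28, 30], [35]])
```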

The resolution for the grid of candidate time splits (35) is set equal to the temporal resolution. For example, the covariate trajectories in the simulation in Section 1 are piecewise constant and may change every 0.002 days. Placing the candidate split points at $\{0.002,0.004,\ldots\}$ days simplifies the exact computation of ${\mu_{n}}(B_{j})$ because every covariate trajectory is constant between these points. Again, notice that the splits for time only occur at values informed by the observed data.

Putting it together, the setup above leverages our insight in (36) by transforming the survival functional data into the data values $\{w_{j},\tilde{y}_{j},z_{j}\}_{j:w_{j}>0}$ , which enables the implementation to proceed like standard gradient boosting for non-functional data. Only the pseudo-response $\tilde{y}_{j}$ in $\{w_{j},\tilde{y}_{j},z_{j}\}$ needs to be updated at each boosting iteration, while the other two do not change. In terms of storage it costs $\mathcal{O}(np|\mathcal{T}|)$ to store $\{w_{j},\tilde{y}_{j},z_{j}\}_{j:w_{j}>0}$ , where $|\mathcal{T}|$ is the cardinality of the set of candidate time splits. (Footnote 9: Each $\{w_{j},\tilde{y}_{j},z_{j}\}$ is of dimension $p+3$ and the number of time-covariate regions $B_{j}$ with $w_{j}>0$ is at most $n(|\mathcal{T}|+1)$ . To show the latter, observe that $B_{j}$ will only have $w_{j}={\mu_{n}}(B_{j})>0$ if it is traversed by at least one sample covariate trajectory. Then note that each of the $n$ sample covariate trajectories can traverse at most $|\mathcal{T}|+1$ unique regions.) Computationally, choosing a new tree split requires testing $\mathcal{O}(np|\mathcal{T}|)$ candidate splits. (Footnote 10: A sample covariate trajectory can have at most $|\mathcal{T}|$ unique observed values for the $k$ -th covariate $x^{(k)}$ , so there are at most $n|\mathcal{T}|$ candidate splits for $x^{(k)}$ . Thus there are $\mathcal{O}(np|\mathcal{T}|)$ candidate splits for $p$ covariates. The number of candidate splits on time is $|\mathcal{T}|$ .) The space and time complexities of the implementation are reasonable given that they are $\mathcal{O}(np)$ for non-functional data boosting: In the functional data setting, each sample can have up to $|\mathcal{T}|$ observations, so $n$ functional data samples is akin to $\mathcal{O}(n|\mathcal{T}|)$ samples in a non-functional data setting.

5. Numerical experiment. We now apply the boosting procedure of Section 1 to a high-dimensional dataset generated from a naturalistic simulation. This allows us to compare the performance of our estimator to existing boosting methods. The simulation is of patient service times in an emergency department (ED), and the hazard function of interest is the patient service rate in the ED. The study of patient transitions in an ED queue is an important one in healthcare operations, because without a high-resolution model of patient flow dynamics, the ED may be suboptimally utilized, which would deny patients timely critical care.

*5.1. Service rate.* The service rate model used in the simulation is based upon a service time dataset from the ED of an academic hospital in the United States. The dataset contains information on 86,983 treatment encounters from 2014 to early 2015. Recorded for each encounter were: age, gender, Emergency Severity Index (ESI), time of day when treatment in the ED ward began, day of week of ED visit, and ward census. (Footnote 11, on ESI: Level 1 is the most severe (e.g., cardiac arrest) and level 5 is the least (e.g., rash). We removed level 1 patients from the dataset because they were treated in a separate trauma bay.) Ward census represents the total number of occupied beds in the ED ward, which varies over the course of the patient’s stay. Hence it is a time-dependent variable. Lastly, we also have the duration of the patient’s stay (service time).

The service rate function is developed from the data in the following way. First, we apply our nonparametric estimator to the data to perform exploratory analysis. We find that:

The key variables affecting the service rate (based on relative variable importance [13]) are ESI, age, and ward census. In addition, two of the most pronounced interaction terms identified by the tree splits are $(\textrm{AGE}\geq 34,\textrm{ESI}=5)$ and $(\textrm{AGE}\geq 34,\textrm{ESI}\leq 4)$ .

Holding all the variables fixed, the shapes of the estimated service rate function resemble the hazard functions of log-normal distributions. This agrees with the queuing literature, which finds log-normality to be a reasonable parametric fit for service durations.

Guided by these findings, we specify the service rate $\lambda(t,X(t))$ for the simulation as a log-normal accelerated failure time (AFT) model, and estimate its parameters from data. This yields the service rate

[TABLE]

where $\phi_{l}(\cdot;m,\sigma)$ and $\Phi_{l}(\cdot;m,\sigma)$ are the PDF and CDF of the log-normal distribution with log-mean $m=-1.8$ and log-standard deviation $\sigma=0.74$ . The function $\theta(x)$ captures the dependence of the service rate on the covariates:

[TABLE]

The specification for $\theta(X(t))$ above is a slight modification of the original estimate, with the free parameter $a$ allowing us to study the effect of time-dependent covariates on hazard estimation. When $a=0$ , the service rate does not depend on time-varying covariates, but as $a$ increases, the dependency becomes more and more significant. In the data, the ward census never exceeds 70, so we set the capacity of the simulated ED to 70 as well. The $\min$ operator caps the impact that census can have on the simulated service rate as $a$ grows. The irrelevant covariates $\textrm{NUISANCE}_{1},\cdots,\textrm{NUISANCE}_{43}$ are added to the data in order to assess how boosting performs in high dimensions. We explicitly include them in (38) to remind ourselves that the simulated data is high-dimensional. Forty of the irrelevant variables are generated synthetically as described in the next subsection, while the rest are variables from the original dataset not used in the simulation.
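For concreteness, the log-normal hazard built from $\phi_{l}$ and $\Phi_{l}$ can be evaluated with the standard library alone. The sketch below uses the stated values $m=-1.8$ and $\sigma=0.74$ . The function names are hypothetical, and the accelerated failure time scaling $\theta\,h_{0}(\theta t)$ in `aft_hazard` is the textbook AFT form, assumed here for illustration; the exact specification is the one in (37).

```python
import math

LOG_MEAN = -1.8  # m in the text
LOG_SD = 0.74    # sigma in the text

def lognormal_pdf(t, m=LOG_MEAN, sigma=LOG_SD):
    """Log-normal density phi_l(t; m, sigma) for t > 0."""
    u = (math.log(t) - m) / sigma
    return math.exp(-0.5 * u * u) / (t * sigma * math.sqrt(2.0 * math.pi))

def lognormal_sf(t, m=LOG_MEAN, sigma=LOG_SD):
    """Survival function 1 - Phi_l(t; m, sigma), via erfc for tail accuracy."""
    u = (math.log(t) - m) / sigma
    return 0.5 * math.erfc(u / math.sqrt(2.0))

def baseline_hazard(t):
    """Hazard of the log-normal distribution: density over survival."""
    return lognormal_pdf(t) / lognormal_sf(t)

def aft_hazard(t, theta):
    """Assumed AFT scaling: theta * h0(theta * t), with theta > 0."""
    return theta * baseline_hazard(theta * t)

# At the median exp(m) the survival function is 1/2, so the hazard is twice the density.
median = math.exp(LOG_MEAN)
hazard_at_median = baseline_hazard(median)
```

Under this scaling, a larger $\theta$ compresses the time axis, so patients with larger $\theta(x)$ are served faster on average.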

*5.2. Simulation model.* Using (37) and (38), we simulate a naturalistic dataset of 10,000 patient visit histories. The value of $a$ will be varied from 0 to 3 in order to study the impact of time-dependent covariates on hazard estimation. Each patient is associated with a 46-dimensional covariate vector consisting of:

• The time-varying ward census. The initial value is sampled from its marginal empirical distribution in the original dataset. To simulate its trajectory over a patient’s stay, for every timestep advance of 0.002 days ( $\approx$ 3 minutes), a Bernoulli(0.02) random variable is generated. If it is one, then the census is incremented by a normal random variable with zero mean and standard deviation 10. The result is truncated if it lies outside the range $[1,70]$ , the upper end being the capacity of the ED.

• The other five time-static covariates in the original dataset. These are sampled from their marginal empirical distributions in the original dataset. Two of the variables (age and ESI) influence the service rate, while the other three are irrelevant.

• An additional forty time-static covariates that do not affect the service rate (irrelevant covariates). Their values are drawn uniformly from [0,1].
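The census dynamics in the first item can be sketched as a simple random walk with rare jumps. The code below is an illustration only: `simulate_census` is a hypothetical name, "truncated" is interpreted as clipping to the nearest endpoint of $[1,70]$ , the census is left unrounded, and the starting value 35 is arbitrary rather than drawn from the empirical distribution.

```python
import random

def simulate_census(initial, num_steps, rng, jump_prob=0.02, jump_sd=10.0,
                    low=1.0, high=70.0):
    """Piecewise-constant census path, one entry per 0.002-day timestep."""
    census = [initial]
    for _ in range(num_steps):
        current = census[-1]
        if rng.random() < jump_prob:  # Bernoulli(0.02) indicator of a change
            current += rng.gauss(0.0, jump_sd)
            current = min(max(current, low), high)  # assumed clipping to [1, 70]
        census.append(current)
    return census

# 500 steps of 0.002 days cover one day.
trajectory = simulate_census(35.0, 500, random.Random(0))
num_changes = sum(1 for a, b in zip(trajectory, trajectory[1:]) if a != b)
```

With a jump probability of 0.02 per step, a path changes value about ten times per day on average, so the trajectories are piecewise constant with long flat stretches.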

We also generate independent censoring times (rounded to the nearest 0.002 days) for each visit from an exponential distribution. For each simulation, the rate of the exponential distribution is set to achieve an approximate target of 25% censoring.
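One way to calibrate the exponential censoring rate is bisection against the empirical censoring fraction, as sketched below. Everything here is an assumption for illustration: the function names, the bracketing interval, the use of common random numbers, and the toy log-normal service times that ignore covariates. The text does not say how the rate was actually chosen.

```python
import math
import random

def censoring_fraction(rate, service_times, uniforms, step=0.002):
    """Share of visits whose rounded exponential censoring time precedes the service time."""
    censored = 0
    for t, u in zip(service_times, uniforms):
        c = -math.log(u) / rate      # exponential censoring time by inversion
        c = round(c / step) * step   # round to the nearest 0.002 days
        if c < t:
            censored += 1
    return censored / len(service_times)

def calibrate_rate(service_times, uniforms, target=0.25, lo=1e-6, hi=1e3, iters=60):
    """Bisection on the log scale; the fraction is non-decreasing in the rate."""
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        if censoring_fraction(mid, service_times, uniforms) < target:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)

rng = random.Random(1)
# Toy service times (hypothetical): log-normal with the stated m and sigma.
service_times = [rng.lognormvariate(-1.8, 0.74) for _ in range(2000)]
uniforms = [1.0 - rng.random() for _ in range(2000)]  # values in (0, 1]
rate = calibrate_rate(service_times, uniforms)
achieved = censoring_fraction(rate, service_times, uniforms)
```

Reusing the same uniforms across candidate rates keeps the censoring fraction monotone in the rate, which is what makes bisection valid.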

*5.3. Comparison benchmarks.* When the covariates are static in time, a few software packages are available for performing hazard estimation with tree ensembles. Given that the data is simulated from a log-normal hazard, we compare our nonparametric method to two correctly specified parametric estimators:

The blackboost estimator in the R package mboost [10] provides a tree boosting procedure for fitting the log-normal hazard function. In order to apply this to the simulated data, we make ward census a time-static covariate by fixing it at its initial value.

Transformation forests [16] in the R package trtf can also fit log-normal hazards. Moreover, it allows for left-truncated and right-censored data. Since the ward census variable is simulated to be piecewise constant over time, we can treat each segment as a left-truncated and right-censored observation. Thus for this simulation, transformation forests are able to handle time-dependent covariates with time-static effects. This falls in between the static covariate/static effect blackboost estimator and our fully nonparametric one.

Since the service rate model used in the simulations is in fact log-normal, the benchmark methods above enjoy a significant advantage over our nonparametric one, which is not privy to the true distribution. In fact, when $a=0$ the log-normal hazard (37) depends only on time-static covariates, so the benchmarks should outperform our nonparametric estimator. However, as $a$ grows, we would expect a reversal in relative performance.

To compare the performances of the estimators, we use Monte Carlo integration to evaluate the relative mean squared error

[TABLE]

The Monte Carlo integrations are conducted using an independent test set of 10,000 uncensored patient visit histories. For the test set, ward census is held fixed over time at the initial value, and we use the grid $\{0,0.02,0.04,\cdots,1\}$ for the time integral. The numerator above is then estimated by the average of $\{\lambda(t,x)-\hat{\lambda}(t,x)\}^{2}$ evaluated at the 51 $\times$ 10,000 points of $(t,x)$ . The denominator is estimated in the same manner.
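The Monte Carlo evaluation can be sketched as a ratio of two averages over the same grid of $(t,x)$ points. The sketch assumes the denominator is the average of $\lambda(t,x)^{2}$ ; this is a guess at the normalization, and the definition in the displayed formula above takes precedence. The function name and the toy hazards are hypothetical.

```python
def relative_mse(true_hazard, est_hazard, time_grid, covariates):
    """Ratio of averaged squared error to averaged squared true hazard (assumed form)."""
    num = den = 0.0
    for x in covariates:
        for t in time_grid:
            lam = true_hazard(t, x)
            num += (lam - est_hazard(t, x)) ** 2
            den += lam ** 2
    return num / den  # the common factor 1 / (number of points) cancels

time_grid = [0.02 * k for k in range(51)]  # {0, 0.02, ..., 1}
covariates = [0.0, 0.5, 1.0]               # toy stand-in for the test-set covariates

def toy_true(t, x):
    return 1.0 + t + x

def toy_estimate(t, x):
    return 0.9 * toy_true(t, x)  # estimator that is uniformly 10% too low

toy_error = relative_mse(toy_true, toy_estimate, time_grid, covariates)
```

An estimator that is uniformly 10% too low therefore has a relative MSE of about 0.01 under this normalization.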

*5.4. Results.* For the implementation of our estimator in Section 1, the value of $\varepsilon$ and the number of trees $\hat{m}$ are jointly determined using ten-fold cross-validation. The candidate values we tried for $\varepsilon$ are $\{0.003,0.004,0.005,0.006,0.007\}$ , and we limit $\hat{m}$ to no more than 1,000 trees. A wider range of values can of course be explored for better performance (at the cost of more computation). As a comparison, we run an ad-hoc version of our algorithm in which all trees use the same number of splits, as is the case in standard boosting [13]. This approach does not explicitly ensure that the trees will be $\varepsilon$ -gradients for a pre-specified $\varepsilon$ . The number of splits and the number of trees used in the ad-hoc method are jointly determined using ten-fold cross-validation.

In order to speed up convergence at the $m$ -th iteration for both approaches, instead of using the step-size $\nu_{n}/(m+1)$ of Algorithm 1, we performed line-search within the interval $(0,\nu_{n}/(m+1)]$ . While Lemma 2 shows that a smaller shrinkage $\nu_{n}$ is always better, this comes at the expense of a larger $\hat{m}$ and hence computation time. For simplicity we set $\nu_{n}=1$ for all the experiments here.
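A line search restricted to $(0,\nu_{n}/(m+1)]$ could look like the sketch below. The search method is not specified in the text, so golden-section search is an assumption, as are the function names, the discrete form of the risk $\sum_{j}w_{j}(e^{F_{j}}-\lambda_{j}F_{j})$ , and the toy descent direction $e^{F}-\lambda$ .

```python
import math

def empirical_risk(F, lam, w):
    """Discrete analogue of the risk: sum_j w_j * (exp(F_j) - lam_j * F_j)."""
    return sum(wj * (math.exp(Fj) - lj * Fj) for Fj, lj, wj in zip(F, lam, w))

def line_search_step(F, direction, lam, w, max_step, iters=80):
    """Golden-section search over [0, max_step] for the risk of F - s * direction."""
    def risk_at(s):
        return empirical_risk([Fj - s * dj for Fj, dj in zip(F, direction)], lam, w)

    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = 0.0, max_step
    c = b - ratio * (b - a)
    d = a + ratio * (b - a)
    for _ in range(iters):
        if risk_at(c) < risk_at(d):
            b, d = d, c              # minimizer lies in [a, d]
            c = b - ratio * (b - a)
        else:
            a, c = c, d              # minimizer lies in [c, b]
            d = a + ratio * (b - a)
    return 0.5 * (a + b)

# Toy problem on two regions; the unconstrained optimal step is 2 * log(2).
F, lam, w = [0.0, 0.0], [1.0, 0.5], [1.0, 1.0]
direction = [math.exp(Fj) - lj for Fj, lj in zip(F, lam)]
step_capped = line_search_step(F, direction, lam, w, max_step=1.0)
step_free = line_search_step(F, direction, lam, w, max_step=5.0)
```

Because the risk is convex along any direction, it is unimodal in the step, and the search returns the cap itself whenever the unconstrained minimizer lies beyond it.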

For fitting the blackboost estimator, we use the default setting of nu $=0.1$ for the step-size taken at each iteration. The other hyperparameters, mstop (the number of trees) and maxdepth (maximum depth of trees), are chosen to directly minimize the relative MSE on the test set. This of course gives the blackboost estimator an unfair advantage over our estimator, which is on top of the fact that it is based on the same distribution as the true model. Transformation forest (using also the true distribution) is fit using code kindly provided by Professor T. Hothorn. (Footnote 12: In the code, 100 trees are used in the forest, which takes about 700 megabytes to store the fitted object when applied to our simulated data.)

Variable selection. The relative importance of variables [13] for our estimator is given in Table 1 for all four cases $a=0,1,2,3$ . The four factors that influence the service rate (38) are explicitly listed, while the irrelevant covariates are grouped together in the last column. When $a=0$ , the service rate does not depend on census, and we see that the importance of census and the other irrelevant covariates is at least an order of magnitude smaller than that of the relevant ones. As $a$ increases, census becomes more and more important, as correctly reflected in the table. Across all the cases the importance of the relevant covariates is at least an order of magnitude larger than that of the others, suggesting that our estimator is able to pick out the influential covariates and largely avoid the irrelevant ones.

Presence of time-dependent covariates. Table 2 presents the relative MSEs for the estimators as the service rate function (38) becomes increasingly dependent on the time-varying census variable. When $a=0$ the service rate depends only on time-static covariates, so as expected, the parametric log-normal benchmarks perform the best when applied to data simulated from a log-normal AFT model.

However, as $a$ increases, the service rate becomes increasingly dependent on census. The corresponding performances of both benchmarks deteriorate dramatically, and both are handily outperformed by the proposed estimator. We note that the inclusion of just one time-dependent covariate is enough to degrade the performances of the benchmarks, despite the fact that they have the exact same parametric form as the true model.

Finally, we find comparable performance between the ad-hoc boosted estimator and our proposed one, although a slight edge goes to the latter, especially in the more difficult simulations with larger $a$. The results here demonstrate that there is a place in the survival boosting literature for fully nonparametric methods like this one that can flexibly handle time-dependent covariates.

6. Discussion. Our estimator can also potentially be used to evaluate the goodness-of-fit of simpler parametric hazard models. Since our approach is likelihood-based, future work might examine whether model selection frameworks like those in [30] can be extended to cover likelihood functionals. For this, [10] provides some guidance for determining the effective degrees of freedom for the boosting estimator. The ideas in [32] may also be germane.

The implementation presented in Section 1 is one of many possible ways to implement our estimator. We defer the design of a more refined implementation to future research, along with open-source code.

Acknowledgements. The review team provided many insightful comments that significantly improved our paper. We are grateful to Brian Clarke, Jack Hall, Sahand Negahban, and Hongyu Zhao for helpful discussions. Special thanks to Trevor Hastie for early formative discussions. The dataset used in Section 1 was kindly provided by Dr. Kito Lord.

APPENDIX: PROOFS

Proof of Proposition 2

Proof.

Writing

[TABLE]

we can apply (4) to establish the first part of the integral in (13) when $F\in{\mathscr{F}}\cup\{\log\lambda\}$ . To complete the representation, it suffices to show that the point process

[TABLE]

has mean $\int_{B}\lambda d\mu$ , and then apply Campbell’s formula. To this end, write $N(t)=I(T\leq t)$ and consider the filtration $\sigma\{X(s),Y(s),N(s):s\leq t\}$ . Then $N(t)$ has the Doob-Meyer form $dN(t)=\lambda(t,X(t))Y(t)dt+dM(t)$ where $M(t)$ is a martingale. Hence

[TABLE]

where the last equality follows from (4). Since $I[\{t,X(t)\}\in B]$ is predictable because $X(t)$ is, the desired result follows if the stochastic integral $\int_{0}^{1}I[\{t,X(t)\}\in B]dM(t)$ is a martingale. By Section 2 of Aalen [1], this is true if $M(t)$ is square-integrable. In fact, $M(t)=N(t)-\int_{0}^{t}\lambda(t,X(t))dt$ is bounded because $\lambda(t,x)$ is bounded above by (A1). This establishes (13).

Now note that for a positive constant $\Lambda$ the function $e^{y}-\Lambda y$ is bounded below by both $-\Lambda y$ and $\Lambda y+2\Lambda\{1-\log 2\Lambda\}$ , hence $e^{y}-\Lambda y\geq\Lambda|y|+2\Lambda\min\{0,1-\log 2\Lambda\}$ . Since $\Lambda\min\{0,1-\log 2\Lambda\}$ is non-increasing in $\Lambda$ , (A1) implies that

[TABLE]

Integrating both sides and using the norm equivalence relation (27) shows that

[TABLE]

The lower bound (14) then follows from the second inequality. The last inequality shows that $R(F)$ is coercive on $({\mathscr{F}},\left\langle\cdot,\cdot\right\rangle_{\mu})$. Moreover the same argument used to derive (8) shows that $R(F)$ is smooth and convex on $({\mathscr{F}},\left\langle\cdot,\cdot\right\rangle_{\mu})$. Therefore a unique minimizer $F^{*}$ of $R(F)$ exists in $({\mathscr{F}},\left\langle\cdot,\cdot\right\rangle_{\mu})$. Since (A2) implies there is a bijection between the equivalence classes of $({\mathscr{F}},\left\langle\cdot,\cdot\right\rangle_{\mu})$ and the functions in ${\mathscr{F}}$, $F^{*}$ is also the unique minimizer of $R(F)$ in ${\mathscr{F}}$. Finally, since $e^{F(t,x)}-\lambda(t,x)F(t,x)$ is pointwise bounded below by $\lambda(t,x)\{1-\log\lambda(t,x)\}$, $R(F)\geq\int(\lambda-\lambda\log\lambda)d\mu=R(\log\lambda)$ for all $F\in{\mathscr{F}}$. ∎
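The two pointwise bounds invoked in the proof of Proposition 2 both follow from minimizing a convex function of one variable. The derivation below is a sketch of that reasoning; the generic constant $c>0$ is introduced here only for illustration and does not appear in the paper.

```latex
% For c > 0, y -> e^{y} - c y is convex with derivative e^{y} - c,
% so it is minimized at y = \log c:
\[
  e^{y} - c\,y \;\ge\; c\,(1-\log c) \qquad \text{for all } y \in \mathbb{R}.
\]
% Taking c = 2\Lambda and adding \Lambda y to both sides:
\[
  e^{y} - \Lambda y \;=\; (e^{y} - 2\Lambda y) + \Lambda y
  \;\ge\; \Lambda y + 2\Lambda\{1-\log 2\Lambda\}.
\]
% Since e^{y} > 0, we also have
\[
  e^{y} - \Lambda y \;\ge\; -\Lambda y.
\]
% Use the first bound when y \ge 0 and the second when y < 0:
\[
  e^{y} - \Lambda y \;\ge\; \Lambda |y| + 2\Lambda \min\{0,\, 1-\log 2\Lambda\}.
\]
% Taking c = \lambda(t,x) and y = F(t,x) in the first display:
\[
  e^{F(t,x)} - \lambda(t,x) F(t,x) \;\ge\; \lambda(t,x)\{1-\log\lambda(t,x)\},
  \qquad \text{with equality iff } F(t,x) = \log\lambda(t,x).
\]
```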

Proof of Lemma 1

Proof.

By a pointwise-measurable argument (Example 2.3.4 of [29]) it can be shown that all suprema quantities appearing below are sufficiently well behaved, so outer integration is not required. Define the Orlicz norm ${\left\|X\right\|_{\Phi}}=\inf\{C>0:\mathbb{E}\Phi(|X|/C)\leq 1\}$ where $\Phi(x)=e^{x^{2}}-1$ . Suppose the following holds:

[TABLE]

where $J_{{\mathscr{F}}_{\varepsilon}}$ is the complexity measure (26), and $\kappa^{\prime},\kappa^{\prime\prime}$ are universal constants. Then by Markov’s inequality, (30) holds with probability at least $1-2\exp[-\{\eta n^{1/4}/(\kappa^{\prime}J_{{\mathscr{F}}_{\varepsilon}})\}^{2}]$ , and

[TABLE]

holds with probability at least $1-2\exp[-\{n^{1/2}/(\alpha_{\mathscr{F}}\kappa^{\prime\prime}J_{{\mathscr{F}}_{\varepsilon}})\}^{2}]$. Since $\alpha_{\mathscr{F}}>1$ and $\eta<1$, (30) and (41) jointly hold with probability at least $1-4\exp[-\{\eta n^{1/4}/(\kappa\alpha_{\mathscr{F}}J_{{\mathscr{F}}_{\varepsilon}})\}^{2}]$. The lemma then follows if (41) implies (29). Indeed, for any non-zero $F\in{\mathscr{F}}$, its normalization $G=F/\|F\|_{\infty}$ is in ${\mathscr{F}}_{\varepsilon}$ by construction (25). Then (41) implies that

[TABLE]

because

[TABLE]

where the last inequality follows from the definition of $\alpha_{\mathscr{F}}$ (28).
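The joint probability statement for (30) and (41) in this proof is a union bound. One way to see it is sketched below, taking $\kappa=\max(\kappa^{\prime},\kappa^{\prime\prime})$; this particular choice of $\kappa$ is an assumption made here for illustration, since the proof only needs some universal constant. The sketch also assumes $n\geq 1$ and $J_{{\mathscr{F}}_{\varepsilon}}>0$.

```latex
% Compare each exponent with the common, smaller one.
\begin{align*}
  \frac{\eta n^{1/4}}{\kappa' J_{\mathscr{F}_\varepsilon}}
    &\ge \frac{\eta n^{1/4}}{\kappa\,\alpha_{\mathscr{F}}\, J_{\mathscr{F}_\varepsilon}}
    && \text{since } \kappa\,\alpha_{\mathscr{F}} \ge \kappa', \\
  \frac{n^{1/2}}{\alpha_{\mathscr{F}}\,\kappa''\, J_{\mathscr{F}_\varepsilon}}
    &\ge \frac{\eta n^{1/4}}{\kappa\,\alpha_{\mathscr{F}}\, J_{\mathscr{F}_\varepsilon}}
    && \text{since } n^{1/2} \ge n^{1/4} > \eta n^{1/4} \text{ and } \kappa \ge \kappa''.
\end{align*}
% Hence each failure probability is at most 2\exp[-\{\cdot\}^2] with the common exponent,
% and the union bound gives
\[
  \mathbb{P}\bigl(\text{(30) or (41) fails}\bigr)
  \;\le\; 4\exp\!\left[-\left\{\frac{\eta n^{1/4}}{\kappa\,\alpha_{\mathscr{F}}\, J_{\mathscr{F}_\varepsilon}}\right\}^{2}\right].
\]
```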

Thus it remains to establish (39) and (40), which can be done by applying the symmetrization and maximal inequality results in Sections 2.2 and 2.3.2 of [29]. Write ${R}_{n}(F)=(1/n)\sum_{i=1}^{n}l_{i}(F)$ where $l_{i}(F)=\int_{0}^{1}Y_{i}(t)e^{F(t,X_{i}(t))}dt-\Delta_{i}F(T_{i},X_{i}(T_{i}))$ are independent copies of the loss

$$l(F)=\int_{0}^{1}Y(t)e^{F(t,X(t))}dt-\Delta F(T,X(T)),\qquad(42)$$

which is a stochastic process indexed by $F\in{\mathscr{F}}$. As was shown in Proposition 2, $\mathbb{E}\{l(F)\}=R(F)$. Let $\zeta_{1},\cdots,\zeta_{n}$ be independent Rademacher random variables that are independent of $Z=\{(X_{i}(\cdot),Y_{i}(\cdot),T_{i})\}_{i=1}^{n}$. It follows from the symmetrization Lemma 2.3.6 of [29] for stochastic processes that the left hand side of (39) is bounded by twice the Orlicz norm of

[TABLE]

Now hold $Z$ fixed so that only $\zeta_{1},\cdots,\zeta_{n}$ are stochastic, in which case the sum in the second line of (1) becomes a separable subgaussian process. Since the Orlicz norm of $\sum_{i=1}^{n}\zeta_{i}a_{i}$ is bounded by $(6\sum_{i=1}^{n}a_{i}^{2})^{1/2}$ for any constants $a_{i}$, we obtain the following Lipschitz property for any $F_{1},F_{2}\in{\mathscr{F}}_{\varepsilon}^{\Psi_{n}}$:

[TABLE]

where the second inequality follows from $|e^{x}-e^{y}|\leq e^{\max(x,y)}|x-y|$ and the last from the Cauchy-Schwarz inequality. Putting the Lipschitz constant $(6n)^{1/2}e^{\Psi_{n}}$ obtained above into Theorem 2.2.4 of [29] yields the following maximal inequality: There is a universal constant $\kappa^{\prime}$ such that

[TABLE]

where the last line follows from (26). Likewise the conditional Orlicz norm for the supremum of $\left|\sum_{i=1}^{n}\zeta_{i}\Delta_{i}F(T_{i},X_{i}(T_{i}))\right|$ is bounded by $\kappa^{\prime}J_{{\mathscr{F}}_{\varepsilon}}n^{1/2}\Psi_{n}$. Since neither bound depends on $Z$, plugging back into (1) establishes (39):

[TABLE]

where $\Psi_{n}e^{\Psi_{n}}=n^{1/4}$ by (19). On noting that

[TABLE]

(40) can be established using the same approach. ∎
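The proof of Lemma 1 relies on the bound $(6\sum_{i}a_{i}^{2})^{1/2}$ for the Orlicz norm of a Rademacher sum $\sum_{i}\zeta_{i}a_{i}$ with $\Phi(x)=e^{x^{2}}-1$. For a small number of terms this can be checked exactly by enumerating all sign patterns. The following is a minimal sketch; the function names and the coefficient vector are hypothetical and chosen only for illustration.

```python
import itertools
import math


def orlicz_moment(a, C):
    """Exact E[Phi(|S|/C)] for S = sum_i zeta_i * a_i with Rademacher zeta_i,
    where Phi(x) = exp(x^2) - 1. Enumerates all 2^len(a) sign patterns."""
    total = 0.0
    count = 0
    for signs in itertools.product((-1.0, 1.0), repeat=len(a)):
        s = sum(z * ai for z, ai in zip(signs, a))
        total += math.exp((s / C) ** 2) - 1.0
        count += 1
    return total / count


def claimed_bound(a):
    """The bound (6 * sum_i a_i^2)^(1/2) quoted in the proof."""
    return math.sqrt(6.0 * sum(ai * ai for ai in a))


# Hypothetical coefficients, for illustration only.
a = [0.5, -1.0, 2.0, 0.25, 1.5, -0.75]
C = claimed_bound(a)
moment = orlicz_moment(a, C)
# The Orlicz norm is the smallest C with E[Phi(|S|/C)] <= 1, so the norm is
# at most claimed_bound(a) exactly when this moment does not exceed 1.
norm_is_bounded = moment <= 1.0
```

Because $\mathbb{E}\Phi(|S|/C)$ is decreasing in $C$, checking the moment at $C=(6\sum_{i}a_{i}^{2})^{1/2}$ is enough to confirm the norm bound for the chosen coefficients.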

Proof of Lemma 2

Proof.

For $m<{\hat{m}}$ , applying (8) to ${R}_{n}({F}_{m+1})={R}_{n}({F}_{m}-\frac{\nu_{n}}{m+1}g_{{F}_{m}}^{\varepsilon})$ yields

[TABLE]

where the bound for the second term is due to (18) and the bound for the integral follows from $\int(g_{{F}_{m}}^{\varepsilon})^{2}d{\mu_{n}}=1$ (Definition 1 of an $\varepsilon$ -gradient) and $\|{F}_{m}\|_{\infty},\|{F}_{m+1}\|_{\infty}<\Psi_{n}$ for $m<{\hat{m}}$ (lines 5-6 of Algorithm 1). Hence for $m\leq{\hat{m}}$ , (44) implies that

[TABLE]

because $\nu_{n}^{2}e^{\Psi_{n}}<1$ under (20). Since $\max_{m\leq\hat{m}}\|{F}_{m}\|_{\infty}<\Psi_{n}$ , and using our assumption $\sup_{F\in{\mathscr{F}}_{\varepsilon}^{\Psi_{n}}}|{R}_{n}(F)-R(F)|<1$ in the statement of the lemma, we have

[TABLE]

Clearly the minimizer $F^{*}$ also satisfies $R(F^{*})\leq R(0)<3$ . Thus coercivity (15) implies that

[TABLE]

so the gap $\hat{\gamma}$ defined in (31) is bounded as claimed.

It remains to establish (32), for which we need only consider the case ${R}_{n}({F}_{\hat{m}})-{R}_{n}(F^{*})>0$ . The termination criterion $g_{{F}_{m}}=0$ in Algorithm 1 is never triggered under this scenario, because by Proposition 1 this would imply that ${F}_{{\hat{m}}}$ minimizes ${R}_{n}(F)$ over the span of $\{{\varphi}_{nj}(t,x)\}_{j}$ , which also contains $F^{*}$ (Remark 3). Thus either ${\hat{m}}=\infty$ , or the termination criterion $\|{F}_{{\hat{m}}}-\frac{\nu_{n}}{{\hat{m}}+1}\hat{g}_{{F}_{{\hat{m}}}}^{\varepsilon}\|_{\infty}\geq\Psi_{n}$ in line 5 of Algorithm 1 is met. In the latter case

[TABLE]

where the inequalities follow from (29) and from $\|g_{{F}_{m}}^{\varepsilon}\|_{{\mu_{n}},2}=1$ . Since the sum is diverging, the inequality also holds for ${\hat{m}}$ sufficiently large (e.g. ${\hat{m}}=\infty$ ).

Given that $F^{*}$ lies in the span of $\{{\varphi}_{nj}(t,x)\}_{j}$ , the Taylor expansion (8) is valid for ${R}_{n}(F^{*})$ . Since the remainder term in the expansion is non-negative, we have

[TABLE]

Furthermore for $m\leq\hat{m}$ ,

[TABLE]

Putting both into (44) gives

[TABLE]

Subtracting ${R}_{n}(F^{*})$ from both sides above and denoting $\delta_{m}={R}_{n}({F}_{m})-{R}_{n}(F^{*})$ , we obtain

[TABLE]

Since the term inside the first parenthesis is between 0 and 1, solving the recurrence yields

[TABLE]

where in the second inequality we used the fact that $0\leq 1+y\leq e^{y}$ for $|y|<1$ , and the last line follows from (45).

The Lambert function (19) in $\Psi_{n}=W(n^{1/4})$ is asymptotically $\log y-\log\log y$ , and in fact by Theorem 2.1 of [14], $W(y)\geq\log y-\log\log y$ for $y\geq e$ . Since by assumption $n\geq 55>e^{4}$ , the above becomes

[TABLE]
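The two properties of $\Psi_{n}=W(n^{1/4})$ used in these proofs, namely $\Psi_{n}e^{\Psi_{n}}=n^{1/4}$ and $W(y)\geq\log y-\log\log y$ for $y\geq e$, can be checked numerically. The sketch below computes the principal branch of the Lambert function with a plain Newton iteration rather than a library routine; the helper names `lambert_w` and `psi` are hypothetical.

```python
import math


def lambert_w(y, tol=1e-14, max_iter=100):
    """Principal branch W(y) for y > 0, via Newton's method on w * exp(w) = y."""
    w = math.log1p(y)  # starting guess
    for _ in range(max_iter):
        ew = math.exp(w)
        step = (w * ew - y) / (ew * (w + 1.0))
        w -= step
        if abs(step) <= tol * (1.0 + abs(w)):
            break
    return w


def psi(n):
    """Psi_n = W(n^{1/4}), the sup-norm cap used in the proofs."""
    return lambert_w(n ** 0.25)


def lower_bound(y):
    """log y - log log y, the lower bound on W(y) quoted for y >= e."""
    return math.log(y) - math.log(math.log(y))


# n >= 55 > e^4 guarantees n^{1/4} >= e, so the lower bound applies.
sample_sizes = (55, 1000, 10**6)
checks = [(n, psi(n), lower_bound(n ** 0.25)) for n in sample_sizes]
```

Each tuple in `checks` holds the sample size, the value of $\Psi_{n}$, and the lower bound $\log y-\log\log y$ evaluated at $y=n^{1/4}$.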

The last step is to control $\delta_{0}$ , which is bounded by $1-{R}_{n}(F^{*})$ because ${R}_{n}({F}_{0})={R}_{n}(0)\leq 1$ . Then under the hypothesis $|{R}_{n}(F^{*})-R(F^{*})|<1$ , we have

[TABLE]

Since (14) implies $R(F^{*})\geq 2\Lambda_{U}\min\{0,1-\log(2\Lambda_{U})\}$ ,

[TABLE]

∎

Proof of Proposition 3

Proof.

Let $\delta=\log n/(4n^{1/4})$ which is less than one for $n\geq 55>e^{4}$. Since $\alpha_{\mathscr{F}},\hat{\gamma}\geq 1$ it follows that

[TABLE]

Now define the following probability sets

[TABLE]

and fix a sample realization from $\cap_{k=1}^{4}S_{k}$ . Then the conditions required in Lemma 2 are satisfied with $\sup_{F\in{\mathscr{F}}_{\varepsilon}^{\Psi_{n}}}|{R}_{n}(F)-R(F)|<2\delta/3$ , so $\hat{\gamma}$ (and hence $\|{\hat{\lambda}}_{\textrm{boost}}\|_{\infty}$ ) is bounded and (32) holds. Since Algorithm 1 ensures that $\|{F}_{{\hat{m}}}\|_{\infty}<\Psi_{n}$ , we have ${F}_{{\hat{m}}}\in{\mathscr{F}}_{\varepsilon}^{\Psi_{n}}$ and therefore it also follows that $|{R}_{n}({F}_{{\hat{m}}})-R({F}_{{\hat{m}}})|<2\delta/3$ . Combining (23) and (1) gives

[TABLE]

where the second inequality follows from (32) and $\nu_{n}^{2}e^{\Psi_{n}}=\log n/(64n^{1/4})$ , and the last from (46). Now, using the inequality $|e^{x}-e^{y}|\leq\max(e^{x},e^{y})|x-y|$ yields

[TABLE]

and the stated bound follows from $F^{*}=\log\lambda$ since ${\mathscr{F}}$ is correctly specified (Proposition 2).

The next task is to lower bound $\mathbb{P}(\cap_{k=1}^{4}S_{k})$ . It follows from Lemma 1 that

[TABLE]

Bounds on $\mathbb{P}(S_{2})$ and $\mathbb{P}(S_{3})$ can be obtained using Hoeffding’s inequality. Note from (2) that ${R}_{n}(0)=\sum_{i=1}^{n}\int_{0}^{1}Y_{i}(t)dt/n$ and ${R}_{n}(F^{*})=\sum_{i=1}^{n}l_{i}(F^{*})/n$ for the loss $l(\cdot)$ defined in (42). Since $0\leq\int_{0}^{1}Y_{i}(t)dt\leq 1$ and $-\|F^{*}\|_{\infty}<l(F^{*})\leq\|e^{F^{*}}\|_{\infty}+\|F^{*}\|_{\infty}$ ,

[TABLE]
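For reference, the standard two-sided Hoeffding inequality for the mean of $n$ independent variables taking values in an interval of width $w$ reads $\mathbb{P}(|\bar{Z}_{n}-\mathbb{E}\bar{Z}_{n}|\geq\epsilon)\leq 2\exp(-2n\epsilon^{2}/w^{2})$. The sketch below evaluates it with the two ranges stated above: width $1$ for $\int_{0}^{1}Y_{i}(t)dt$ and width $\|e^{F^{*}}\|_{\infty}+2\|F^{*}\|_{\infty}$ for $l(F^{*})$. The deviation level `eps` and the numerical inputs are hypothetical and are not the thresholds that define $S_{2}$ and $S_{3}$.

```python
import math


def hoeffding_two_sided(n, eps, width):
    """Upper bound 2 * exp(-2 n eps^2 / width^2) on the probability that the
    mean of n independent variables, each confined to an interval of the given
    width, deviates from its expectation by at least eps."""
    return min(1.0, 2.0 * math.exp(-2.0 * n * eps ** 2 / width ** 2))


def loss_range_width(sup_exp_f_star, sup_f_star):
    """Width of the interval (-||F*||, ||e^{F*}|| + ||F*||] containing l(F*)."""
    return sup_exp_f_star + 2.0 * sup_f_star


# Hypothetical inputs, for illustration only.
n, eps = 1000, 0.1
bound_rn_zero = hoeffding_two_sided(n, eps, width=1.0)
bound_rn_fstar = hoeffding_two_sided(
    n, eps, width=loss_range_width(sup_exp_f_star=2.0, sup_f_star=math.log(2.0))
)
```

The wider range of $l(F^{*})$ makes the second bound looser than the first at the same deviation level.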

By increasing the value of $\kappa$ and/or replacing $J_{{\mathscr{F}}_{\varepsilon}}$ with $\max(1,J_{{\mathscr{F}}_{\varepsilon}})$ if necessary, we can combine the inequalities to get a crude but compact bound:

[TABLE]

Finally, since $\|F^{*}\|_{\infty}=\|\log\lambda\|_{\infty}<\max\{|\log\Lambda_{L}|,|\log\Lambda_{U}|\}$ , we can replace $e^{\|F^{*}\|_{\infty}}$ in the probability bound above by $\Lambda_{L}^{-1}\vee\Lambda_{U}$ . ∎

Proof of Proposition 4

Proof.

It follows from (22) that $\lambda^{*}$ is the orthogonal projection of $\lambda$ onto $({\mathscr{F}},\left\langle\cdot,\cdot\right\rangle_{\mu})$ . Hence

[TABLE]

where the inequality follows from $|e^{x}-e^{y}|\leq\max(e^{x},e^{y})|x-y|$ . Bounding the last term in the same way as Proposition 3 completes the proof. To replace $e^{\|F^{*}\|_{\infty}}$ in (47) by $\Lambda_{L}^{-1}\vee\Lambda_{U}$ , it suffices to show that $\Lambda_{L}\leq\lambda^{*}(t,x)\leq\Lambda_{U}$ . Since the value of $\lambda^{*}$ over one of its piecewise constant regions $B$ is $\int_{B}\lambda d\mu/\mu(B)$ , the desired bound follows from (A1). We can also replace $\max_{t,x}(\lambda^{*}\vee{\hat{\lambda}}_{\textrm{boost}})$ and $\min_{t,x}(\lambda^{*}\wedge{\hat{\lambda}}_{\textrm{boost}})$ with $\max_{t,x}(\Lambda_{U}\vee{\hat{\lambda}}_{\textrm{boost}})$ and $\min_{t,x}(\Lambda_{L}\wedge{\hat{\lambda}}_{\textrm{boost}})$ respectively. ∎
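The last step of the proof of Proposition 4 uses the fact that a $\mu$-weighted average of $\lambda$ over a region $B$ cannot leave the interval $[\Lambda_{L},\Lambda_{U}]$ when $\lambda$ itself stays inside it. A minimal discretized sketch follows, in which a region is represented by a finite set of points with nonnegative weights; the function name, the bounds, and all numbers are hypothetical.

```python
def projected_hazard(hazard_values, weights):
    """Discrete analogue of int_B lambda dmu / mu(B): the weighted average of
    the hazard over the points of one piecewise-constant region B."""
    mass = sum(weights)
    if mass <= 0:
        raise ValueError("region must have positive mass")
    return sum(lam * w for lam, w in zip(hazard_values, weights)) / mass


# Hypothetical bounds from (A1) and hypothetical hazard values inside them.
LAMBDA_L, LAMBDA_U = 0.2, 3.0
hazard_values = [0.2, 0.9, 1.7, 3.0, 2.4]
weights = [0.05, 0.10, 0.30, 0.15, 0.40]

lam_star = projected_hazard(hazard_values, weights)
within_bounds = LAMBDA_L <= lam_star <= LAMBDA_U
```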

REFERENCES


[1] O. O. Aalen. Nonparametric inference for a family of counting processes. Annals of Statistics, 6(4):701–726, 1978.
[2] K. Adelson, D. K. K. Lee, S. Velji, J. Ma, S. Lipka, J. Rimar, P. Longley, T. Vega, J. Perez-Irizarry, E. Pinker, and R. Lilenbaum. Development of Imminent Mortality Predictor for Advanced Cancer (IMPAC), a tool to predict short-term mortality in hospitalized patients with advanced cancer. Journal of Oncology Practice, 14(3):e168–e175, 2018.
[3] G. Biau and B. Cadre. Optimization by gradient boosting. arXiv preprint arXiv:1707.05023, 2017.
[4] H. Binder and M. Schumacher. Allowing for mandatory covariates in boosting estimation of sparse high-dimensional survival models. BMC Bioinformatics, 9(1):14, 2008.
[5] G. Blanchard, G. Lugosi, and N. Vayatis. On the rate of convergence of regularized boosting classifiers. Journal of Machine Learning Research, 4(Oct):861–894, 2003.
[6] L. Breiman. Arcing the edge. U.C. Berkeley Dept. of Statistics Technical Report, 486, 1997.
[7] L. Breiman. Prediction games and arcing algorithms. Neural Computation, 11(7):1493–1517, 1999.
[8] L. Breiman. Population theory for boosting ensembles. Annals of Statistics, 32(1):1–11, 2004.