Measuring the Algorithmic Convergence of Randomized Ensembles: The Regression Setting

Miles E. Lopes; Suofei Wu; Thomas C. M. Lee

arXiv:1908.01251·stat.ML·August 6, 2019


TL;DR

This paper introduces a bootstrap method to assess whether a randomized ensemble in regression has converged to near-optimal performance, providing practical guarantees and adaptability for variable selection.

Contribution

It develops a bootstrap approach for measuring ensemble convergence in regression, with weaker assumptions and applications to variable selection, complementing prior classification work.

Findings

1. Method effectively measures ensemble convergence in regression.
2. The approach requires weaker assumptions than previous methods.
3. Numerical experiments show strong performance across various scenarios.



Equations

$$\bar{T}_{t}(x) := \frac{1}{t}\sum_{i=1}^{t}T_{i}(x).$$

$$T_{i}(x) = \varphi(x;\mathcal{D},\xi_{i}),$$

$$\textsc{mse}_{t} := \int_{\mathcal{X}\times\mathbb{R}}\big(y-\bar{T}_{t}(x)\big)^{2}\,d\nu(x,y) = \mathbb{E}\Big[(Y-\bar{T}_{t}(X))^{2}\,\Big|\,\boldsymbol{\xi}_{t},\mathcal{D}\Big],$$

$$q_{1-\alpha}(t) := \inf\Big\{q\in\mathbb{R}\,\Big|\,\mathbb{P}\big(\textsc{mse}_{t}-\text{mse}_{\infty}\leq q\,\big|\,\mathcal{D}\big)\geq 1-\alpha\Big\}.$$

$$r_{t}-r_{\infty} \leq \frac{8}{t}\Big(\|\mu\|_{\infty}^{2}+\sigma^{2}(1+4\log(n))\Big),$$

$$\textsc{mse}_{t}-\text{mse}_{\infty} \leq \widehat{q}_{1-\alpha}(t)$$

$$\psi(f) = \int_{\mathcal{X}\times\mathbb{R}}(y-f(x))^{2}\,d\nu(x,y),$$

$$\textsc{mse}_{t} = \psi(\bar{T}_{t}).$$

$$\vartheta(x) := \mathbb{E}[\bar{T}_{t}(x)\,|\,\mathcal{D}],$$

$$\textsc{mse}_{t}-\text{mse}_{\infty} = \psi(\bar{T}_{t})-\psi(\vartheta).$$

$$\textsc{mse}_{t}^{*}-\textsc{mse}_{t} := \psi(\bar{T}_{t}^{*})-\psi(\bar{T}_{t}),$$

$$\widehat{\psi}(\bar{T}_{t}) = \frac{1}{m}\sum_{j=1}^{m}\big(\tilde{Y}_{j}-\bar{T}_{t}(\tilde{X}_{j})\big)^{2}.$$

$$\widehat{\psi}_{\textsc{o}}(\bar{T}_{t}) = \frac{1}{n}\sum_{j=1}^{n}\big(Y_{j}-\bar{T}_{t,\textsc{o}}(X_{j})\big)^{2},$$

$$\bar{T}_{t,\textsc{o}}(X_{j}) = \frac{1}{|\textsc{oob}(X_{j})|}\sum_{i\in\textsc{oob}(X_{j})}T_{i}(X_{j}),$$

$$\overline{\textsc{vi}}_{t} = \frac{1}{t}\sum_{i=1}^{t}\textsc{vi}_{i}.$$

$$\varepsilon_{t} := \max_{1\leq l\leq p}\big|\overline{\textsc{vi}}_{t}(l)-\textup{vi}_{\infty}(l)\big|,$$

$${\tt{q}}_{1-\alpha}(t) := \inf\Big\{q\in[0,\infty)\ \Big|\ \mathbb{P}\big(\varepsilon_{t}\leq q\,\big|\,\mathcal{D}\big)\geq 1-\alpha\Big\}.$$

$$\textsc{mse}_{t}-\text{mse}_{\infty} \leq \widehat{q}_{1-\alpha}(t)$$

$$\langle g,h\rangle = \int_{\mathcal{X}\times\mathbb{R}}g(x,y)\,h(x,y)\,d\nu(x,y),$$

$$\zeta = 2\langle\vartheta-y,\,T_{1}-\vartheta\rangle,$$

$$\sigma(\mathcal{D}) = \operatorname{var}(\zeta\,|\,\mathcal{D}),$$

$$\beta_{k}(\mathcal{D}) = \max\Big\{\big(\mathbb{E}\big[\|T_{1}-y\|_{L_{2}}^{2k}\,\big|\,\mathcal{D}\big]\big)^{1/k},\ \big(\mathbb{E}\big[\|T_{1}-\vartheta\|_{L_{2}}^{2k}\,\big|\,\mathcal{D}\big]\big)^{1/k}\Big\},$$

$$\beta_{k}(\mathcal{D}) \leq 4M(\mathcal{D})^{2},$$

$$\delta_{t,k,B}(\mathcal{D}) := \frac{k^{2}}{\sqrt{t}}\Big(\frac{\beta_{3k}(\mathcal{D})}{\sigma(\mathcal{D})}\Big)^{3} + e^{-k/2} + \sqrt{\frac{\log(B)}{B}}.$$

$$\mathbb{P}\Big(\textsc{mse}_{t}-\text{mse}_{\infty}\leq\widehat{q}_{1-\alpha}(t)\,\Big|\,\mathcal{D}\Big) \geq 1-\alpha-c_{0}\,\delta_{t,k,B}(\mathcal{D}).$$

$$k = \lceil\log(t)-4\log\log(t)\rceil,$$

$$e^{-k/2} \leq \frac{\log(t)^{2}}{\sqrt{t}} \quad\text{and}\quad \frac{k^{2}}{\sqrt{t}} \leq \frac{c_{0}\log(t)^{2}}{\sqrt{t}},$$

$$\delta_{t,k,B}(\mathcal{D}) \leq \frac{c(\mathcal{D})\log(t)^{2}}{\sqrt{t}} + \sqrt{\frac{\log(B)}{B}},$$

$$B = O(p\cdot d),$$

$$B = O(n\cdot d),$$


Full text

Measuring the Algorithmic Convergence of Randomized Ensembles:

The Regression Setting

Miles E. Lopes, Suofei Wu, Thomas C. M. Lee

University of California, Davis

Abstract

When randomized ensemble methods such as bagging and random forests are implemented, a basic question arises: Is the ensemble large enough? In particular, the practitioner desires a rigorous guarantee that a given ensemble will perform nearly as well as an ideal infinite ensemble (trained on the same data). The purpose of the current paper is to develop a bootstrap method for solving this problem in the context of regression — which complements our companion paper in the context of classification (Lopes, 2019). In contrast to the classification setting, the current paper shows that theoretical guarantees for the proposed bootstrap can be established under much weaker assumptions. In addition, we illustrate the flexibility of the method by showing how it can be adapted to measure algorithmic convergence for variable selection. Lastly, we provide numerical results demonstrating that the method works well in a range of situations.

MSC classification: 62F40, 65B05, 68W20, 60G25.

Keywords: random forests, bagging, bootstrap, randomized algorithms.

Miles E. Lopes was supported in part by NSF grant DMS 1613218. Thomas C. M. Lee was supported in part by NSF grants DMS 1811405 and DMS 1811661.

1 Introduction

Ensemble methods are a fundamental approach to prediction, based on the principle that accuracy can be enhanced by aggregating a diverse collection of prediction functions. Two of the most widely used methods in this class are random forests and bagging, which rely on randomization as a general way to diversify an ensemble (Breiman, 1996, 2001). For these types of randomized ensembles, it is generally understood that the predictive accuracy improves and eventually stabilizes as the ensemble size becomes large. Likewise, in the theoretical analysis of randomized ensembles, it is common to focus on the idealized case of an infinite ensemble (Bühlmann and Yu, 2002; Hall and Samworth, 2005; Biau et al., 2008; Biau, 2012; Scornet et al., 2015). However, in practice, the user does not know the true relationship between accuracy and ensemble size, and as a result, it is difficult to know if an ensemble is sufficiently large.

The purpose of the current paper is to develop a solution to this problem for random forests, bagging, and related methods in the context of regression. More specifically, we offer a bootstrap method for estimating how far the prediction error of a finite ensemble is from the ideal prediction error of an infinite ensemble (trained on the same data). A precise description of the setup and problem formulation is given as follows.

1.1 Background and setup

To fix some basic notation for the regression setting, let $\mathcal{D}=\{(X_{j},Y_{j})\}_{j=1}^{n}$ denote a set of training data in a space $\mathcal{X}\times\mathbb{R}$ , where each $Y_{j}$ is the scalar response variable associated to $X_{j}$ , and the space $\mathcal{X}$ is arbitrary. Also, an ensemble of $t$ regression functions trained on $\mathcal{D}$ is denoted as $T_{i}:\mathcal{X}\to\mathbb{R}$ , where $i=1,\dots,t$ , and the number $t$ is referred to as the ensemble size.

Randomized regression ensembles.

For the purpose of understanding our setup, it is helpful to quickly review the methods of bagging and random forests. The method of bagging works by generating random sets $\mathcal{D}_{1}^{*},\dots,\mathcal{D}_{t}^{*}$ , each of size $n$ , by sampling with replacement from $\mathcal{D}$ . Next, a standard “base” regression algorithm is used to train a regression function $T_{i}$ on $\mathcal{D}_{i}^{*}$ for each $i=1,\dots,t$ . For instance, it is especially common to apply a decision tree algorithm like CART (Breiman et al., 1984) to each set $\mathcal{D}_{i}^{*}$ . In turn, future predictions are made by using the averaged regression function, which is defined for each $x\in\mathcal{X}$ by

$$\bar{T}_{t}(x) := \frac{1}{t}\sum_{i=1}^{t}T_{i}(x). \tag{1.1}$$

Much like bagging, the method of random forests uses sampling with replacement to generate the same type of random sets $\mathcal{D}_{1}^{*},\dots,\mathcal{D}_{t}^{*}$ . However, random forests adds an additional source of randomness when the base regression algorithm is applied to each $\mathcal{D}_{i}^{*}$ . Namely, in the standard case when $\mathcal{X}\subset\mathbb{R}^{p}$ and CART is the base regression algorithm, random forests uses randomly chosen subsets of the $p$ features when “split points” are selected for the CART regression trees. Likewise, random forests also uses the average (1.1) when making final predictions. A more detailed description may be found in Friedman et al. (2001).

In order to unify the methods of bagging and random forests within a common theoretical framework, our analysis will consider a more general class of randomized ensembles. This class consists of regression functions $T_{1},\dots,T_{t}$ that can be represented in the abstract form

$$T_{i}(x) = \varphi(x;\mathcal{D},\xi_{i}), \tag{1.2}$$

where $\xi_{1},\dots,\xi_{t}$ are i.i.d. “randomizing parameters” generated independently of $\mathcal{D}$ , and $\varphi$ is a deterministic function that does not depend on $n$ or $t$ . In particular, the representation (1.2) implies that the random functions $T_{1},\dots,T_{t}$ are conditionally i.i.d., given $\mathcal{D}$ . To see why bagging is representable in this form, note that $\xi_{i}$ can be viewed as a random vector that specifies which points in $\mathcal{D}$ are randomly sampled into $\mathcal{D}_{i}^{*}$ . Similarly, in the case of random forests, each $\xi_{i}$ encodes the points in $\mathcal{D}_{i}^{*}$ , as well as randomly chosen sets of features used for training $T_{i}$ . More generally, the representation (1.2) is relevant to other types of randomized ensembles, such as those based on random rotations (Blaser and Fryzlewicz, 2016), random projections (Cannings and Samworth, 2017), or posterior sampling (Ng and Jordan, 2001; Chipman et al., 2010).
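To make the representation (1.2) concrete for bagging, the following minimal sketch treats each randomizing parameter as the list of indices drawn with replacement from the training set. The base learner (a 1-nearest-neighbour rule on scalar inputs), the toy data, and the names `phi`, `draw_xi`, and `ensemble_average` are illustrative assumptions, not part of the paper.

```python
import random

def phi(x, data, xi):
    """Deterministic map phi(x; D, xi): a 1-nearest-neighbour regressor
    fitted on the bootstrap set selected by the index vector xi.
    (The base learner is an illustrative assumption.)"""
    boot = [data[j] for j in xi]                      # the resampled set D*_i
    nearest = min(boot, key=lambda pt: abs(pt[0] - x))
    return nearest[1]

def draw_xi(n, rng):
    """Randomizing parameter xi_i: n indices sampled with replacement."""
    return [rng.randrange(n) for _ in range(n)]

def ensemble_average(x, data, xis):
    """Averaged prediction: the mean of T_i(x) = phi(x; D, xi_i)."""
    return sum(phi(x, data, xi) for xi in xis) / len(xis)

rng = random.Random(0)
# Toy training set D with scalar inputs (hypothetical data).
data = [(j / 20, (j / 20) ** 2 + rng.gauss(0, 0.05)) for j in range(20)]
xis = [draw_xi(len(data), rng) for _ in range(50)]    # t = 50
pred = ensemble_average(0.5, data, xis)
```

Once `data` and the parameters `xis` are fixed, every call is deterministic, which mirrors the fact that all algorithmic randomness enters through the $\xi_{i}$.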

Algorithmic convergence.

In our analysis of algorithmic convergence, we will focus on quantifying how the mean-squared error (MSE) of an ensemble behaves as the ensemble size $t$ becomes large. To define this measure of error in more precise terms, let $\boldsymbol{\xi}_{t}:=(\xi_{1},\dots,\xi_{t})$ denote the randomizing parameters of the ensemble, and let $\nu=\mathcal{L}(X,Y)$ denote the joint distribution of a test point $(X,Y)\in\mathcal{X}\times\mathbb{R}$ , which is drawn independently of $\mathcal{D}$ and $\boldsymbol{\xi}_{t}$ . Accordingly, we define

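$$\textsc{mse}_{t} := \int_{\mathcal{X}\times\mathbb{R}}\big(y-\bar{T}_{t}(x)\big)^{2}\,d\nu(x,y) = \mathbb{E}\Big[(Y-\bar{T}_{t}(X))^{2}\,\Big|\,\boldsymbol{\xi}_{t},\mathcal{D}\Big], \tag{1.3}$$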

where the expectation on the right is only over the test point $(X,Y)$ . In this definition, it is important to notice that $\textsc{mse}_{t}$ is a random variable that depends on both $\boldsymbol{\xi}_{t}$ and $\mathcal{D}$ . However, due to the fact that the algorithmic fluctuations of $\textsc{mse}_{t}$ arise only from $\boldsymbol{\xi}_{t}$ , we will view the set $\mathcal{D}$ as a fixed input to the training algorithm, and likewise, our analysis will always be conditional on $\mathcal{D}$ . Indeed, the conditioning on $\mathcal{D}$ is motivated by the fact that the user would like to assess convergence for the particular set $\mathcal{D}$ that they actually have, and this approach has been adopted in several other analyses of algorithmic convergence for randomized ensembles (Ng and Jordan, 2001; Lopes, 2016; Scornet, 2016a; Cannings and Samworth, 2017; Lopes, 2019).

As a way of illustrating algorithmic convergence, Figure 1 shows how $\textsc{mse}_{t}$ evolves when the random forests method is applied to a fixed training set $\mathcal{D}$ . More specifically, if $\text{mse}_{\infty}$ denotes the limit of $\textsc{mse}_{t}$ as $t\to\infty$ , then the left panel displays successive values of the convergence gap $\textsc{mse}_{t}-\text{mse}_{\infty}$ as decision trees are added during a single run of random forests, from $t=1$ up to $t=2,\!000$ . After this entire process is repeated 1,000 times on the same set $\mathcal{D}$ , we obtain many overlapping sample paths, as shown in the right panel of Figure 1. (Note also that none of these curves are observable in practice, and the figure is given only for illustration.)

From a practical standpoint, the user would like to know the size of the convergence gap $\textsc{mse}_{t}-\text{mse}_{\infty}$ as a function of $t$ . For this purpose, it is useful to consider the $(1-\alpha)$ -quantile of $\textsc{mse}_{t}-\text{mse}_{\infty}$ , which is defined for any $\alpha\in(0,1)$ by

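$$q_{1-\alpha}(t) := \inf\Big\{q\in\mathbb{R}\,\Big|\,\mathbb{P}\big(\textsc{mse}_{t}-\text{mse}_{\infty}\leq q\,\big|\,\mathcal{D}\big)\geq 1-\alpha\Big\}.$$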

In other words, the value $q_{1-\alpha}(t)$ is the tightest possible upper bound on the gap that holds with probability at least $1-\alpha$ , conditionally on the set $\mathcal{D}$ . This interpretation of $q_{1-\alpha}(t)$ can also be understood from the right panel of Figure 1, where we have plotted $q_{1-\alpha}(t)$ in gray, with $\alpha=1/10$ .
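As an illustration of what $q_{1-\alpha}(t)$ measures, the sketch below approximates it by brute force in a toy model where both $\textsc{mse}_{t}$ and $\text{mse}_{\infty}$ can be computed exactly. This repeated-runs calculation is unavailable in practice, which is the reason an estimator is needed. The toy model (a finite test set, with each $T_{i}$ equal to a fixed function plus standard Gaussian noise) and all names are assumptions made for this example.

```python
import math
import random

def empirical_quantile(values, level):
    """Smallest q such that at least a `level` fraction of values are <= q."""
    ordered = sorted(values)
    k = math.ceil(level * len(ordered))      # how many values must be <= q
    return ordered[max(k, 1) - 1]

def gap_one_run(theta, y, t, rng):
    """mse_t - mse_inf for one simulated run of a toy ensemble.

    Toy assumption: T_i(x_k) = theta[k] + N(0, 1) noise, so the infinite
    ensemble predicts theta[k] at the k-th test point."""
    K = len(theta)
    avg = [theta[k] + sum(rng.gauss(0, 1) for _ in range(t)) / t for k in range(K)]
    mse_t = sum((y[k] - avg[k]) ** 2 for k in range(K)) / K
    mse_inf = sum((y[k] - theta[k]) ** 2 for k in range(K)) / K
    return mse_t - mse_inf

rng = random.Random(1)
theta = [0.0, 1.0, 2.0]      # infinite-ensemble predictions (hypothetical)
y = [0.3, 0.8, 2.4]          # test responses (hypothetical)
runs = 300                   # repeated runs on the same fixed "data"
q_small = empirical_quantile([gap_one_run(theta, y, 10, rng) for _ in range(runs)], 0.9)
q_large = empirical_quantile([gap_one_run(theta, y, 400, rng) for _ in range(runs)], 0.9)
```

In this toy model the quantile for $t=400$ is far smaller than for $t=10$, matching the shrinking gray curve described for Figure 1.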

The problem to be solved.

Although it is clear that the quantile $q_{1-\alpha}(t)$ represents a precise measure of algorithmic convergence, this function is unknown in practice. This leads to the problem of estimating $q_{1-\alpha}(t)$ , which we propose to solve.

Beyond the fact that $q_{1-\alpha}(t)$ is unknown, it is also important to keep in mind that estimating $q_{1-\alpha}(t)$ involves some additional constraints. First, the user would like to be able to assess convergence from the output of a single run of the ensemble method, whereas the function $q_{1-\alpha}(t)$ describes the fluctuations of $\textsc{mse}_{t}-\text{mse}_{\infty}$ over repeated runs, as illustrated in the right panel of Figure 1. Hence, at first sight, it is not obvious that the output of a single run provides enough information to successfully estimate $q_{1-\alpha}(t)$. Second, the method for estimating $q_{1-\alpha}(t)$ should be computationally inexpensive, so that the cost of checking convergence is manageable in comparison to the cost of training the ensemble itself. Later on, we will show that the proposed method is able to handle both of these constraints, in Sections 2 and 4 respectively.

1.2 Related work and contributions

The general problem of measuring the algorithmic convergence of randomized ensembles has attracted sustained interest over the past two decades. In particular, there have been numerous empirical studies of algorithmic convergence for both classification and regression (e.g. Latinne et al., 2001; Basilico et al., 2011; Schwing et al., 2011; Oshiro et al., 2012; Probst and Boulesteix, 2018).

With regard to the theoretical analysis of convergence, we will now review the existing results for classification and regression separately. In the setting of classification, much of the literature has studied convergence in terms of the misclassification probability for majority voting, denoted $\textsc{err}_{t}$ (a counterpart of $\textsc{mse}_{t}$), which is viewed as a random variable that depends on $\boldsymbol{\xi}_{t}$ and $\mathcal{D}$. For this measure of error, the convergence of $\mathbb{E}[\textsc{err}_{t}|\mathcal{D}]$ and $\operatorname{var}(\textsc{err}_{t}|\mathcal{D})$ as $t\to\infty$ has been analyzed in the papers (Ng and Jordan, 2001; Lopes, 2016; Cannings and Samworth, 2017), which have developed asymptotic formulas for $\mathbb{E}[\textsc{err}_{t}|\mathcal{D}]$, as well as bounds for $\operatorname{var}(\textsc{err}_{t}|\mathcal{D})$. Related results for a different measure of error can also be found in Hernández-Lobato et al. (2013). More recently, our companion paper (Lopes, 2019) has developed a bootstrap method for measuring the convergence of $\textsc{err}_{t}$, which is able to circumvent some of the limitations of analytical results.

In the setting of regression, algorithmic convergence results on $\textsc{mse}_{t}$ are scarce in comparison to those for $\textsc{err}_{t}$ . Instead, much more attention in the regression literature has focused on how the size of $t$ influences the variance of point predictions $\bar{T}_{t}(x)$ , with $x\in\mathcal{X}$ held fixed (e.g., Sexton and Laake, 2009; Arlot and Genuer, 2014; Wager et al., 2014; Mentch and Hooker, 2016; Scornet, 2016a). To the best of our knowledge, the only paper that has systematically studied algorithmic convergence in terms of an error measure is (Scornet, 2016a), which considers the risk $r_{t}:=\mathbb{E}[(\bar{T}_{t}(X)-\mu(X))^{2}]$ , where $\mu(x):=\mathbb{E}[Y|X=x]$ is the true regression function, and the expectation in the definition of $r_{t}$ is over $(X,\mathcal{D},\boldsymbol{\xi}_{t})$ . In particular, the paper (Scornet, 2016a) develops an elegant non-asymptotic bound on the gap between $r_{t}$ and its limiting value $r_{\infty}$ as $t\to\infty$ . Under the assumption of a Gaussian regression model with $\mathcal{X}=[0,1]^{p}$ , this bound has the form

$$r_{t}-r_{\infty} \leq \frac{8}{t}\Big(\|\mu\|_{\infty}^{2}+\sigma^{2}(1+4\log(n))\Big), \tag{1.4}$$

where $\sigma^{2}=\operatorname{var}(Y)$ , and $\|\mu\|_{\infty}:=\sup_{x\in\mathcal{X}}|\mu(x)|$ . In addition to this bound, the paper (Scornet, 2016a) gives further insight into algorithmic convergence by developing a precise uniform central limit theorem for $\bar{T}_{t}$ as $t\to\infty$ , with $\mathcal{D}$ held fixed. More specifically, this limit theorem demonstrates that under certain conditions, the standardized process $\sqrt{t}(\bar{T}_{t}(\cdot)-\mathbb{E}[\bar{T}_{t}(\cdot)|\mathcal{D}])$ converges in distribution (conditionally on $\mathcal{D}$ ) to a Gaussian process on $\mathcal{X}$ .
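For a sense of scale, the right-hand side of the bound on $r_{t}-r_{\infty}$, namely $\frac{8}{t}\big(\|\mu\|_{\infty}^{2}+\sigma^{2}(1+4\log(n))\big)$, can be evaluated directly once values are supplied for its parameters. In the short sketch below, the function name and the plugged-in numbers are hypothetical; in applications $\|\mu\|_{\infty}$ and $\sigma$ are unknown.

```python
import math

def risk_gap_bound(t, n, mu_sup, sigma):
    """Evaluate (8/t) * (mu_sup^2 + sigma^2 * (1 + 4*log(n))).

    mu_sup stands for the sup-norm of the regression function; both it and
    sigma are unknown in practice, so the inputs here are assumptions."""
    return (8.0 / t) * (mu_sup ** 2 + sigma ** 2 * (1.0 + 4.0 * math.log(n)))

# Hypothetical values: n = 1000 training points, mu_sup = sigma = 1.
bound_500 = risk_gap_bound(t=500, n=1000, mu_sup=1.0, sigma=1.0)
bound_1000 = risk_gap_bound(t=1000, n=1000, mu_sup=1.0, sigma=1.0)
```

Doubling $t$ halves the bound, so the guarantee decays at rate $1/t$, but its absolute size depends entirely on the assumed constants.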

Contributions.

From a methodological standpoint, the approach taken here differs in several ways from previous works in the regression setting. Most notably, our work looks at algorithmic convergence in terms of an error measure that is conditional on $\mathcal{D}$ . (For instance, this differs from the analysis of $r_{t}$ , which averages over $\mathcal{D}$ .) In particular, we provide a quantile estimate $\widehat{q}_{1-\alpha}(t)$ , such that the bound

$$\textsc{mse}_{t}-\text{mse}_{\infty} \leq \widehat{q}_{1-\alpha}(t)$$

holds with a probability that is effectively $1-\alpha$ , conditionally on $\mathcal{D}$ . This conditioning is especially important from the viewpoint of the user, who is typically interested in convergence with respect to the actual dataset at hand. Another distinct feature of our method is that it provides the user with a direct numerical estimate of convergence, whereas formula-based results are more likely to involve conservative constants, or depend on unknown parameters, such as $\|\mu\|_{\infty}$ or $\sigma$ in the bound (1.4).

In addition, the scope of the proposed method goes beyond $\textsc{mse}_{t}$, and in Section 2.2 we will show how the bootstrap method is flexible enough that it can also be applied to variable selection. In this context, the ensemble provides a ranking of variables according to an “importance measure”, and this ranking typically stabilizes as $t\to\infty$. However, the notion of convergence is somewhat subtle, because it is possible that the importance measure for some variables may converge more slowly than for others, which can distort the overall ranking of variables. As far as we know, this issue has not been addressed in the literature, and the method proposed in Section 2.2 provides a way to check that convergence has been achieved uniformly across all variables, so that they can be compared fairly.

With regard to theory, the most important aspect of our analysis is that it is based on very mild assumptions. To place our assumptions into context, it is worth emphasizing that most analyses of randomized ensembles deal with specialized types of prediction functions $T_{1},\dots,T_{t}$ that are much simpler than the ones used in practice (e.g. Lin and Jeon, 2006; Arlot and Genuer, 2014; Biau et al., 2008; Biau, 2012; Scornet et al., 2015; Scornet, 2016a, b; Lopes, 2019). By contrast, our current results for regression only rely on the representation (1.2) and basic moment assumptions (to be detailed in Section 3). In particular, the crucial ingredient that enables us to handle general types of prediction functions is a version of Rosenthal’s inequality due to Talagrand (1989), which is applicable to sums of independent Banach-valued random variables. Moreover, this allows our analysis to be fully non-asymptotic.

Outline.

The remainder of the paper is organized as follows. The proposed methods are described in Section 2, and our main result on bootstrap consistency is presented in Section 3. Next, the computational cost of the methods is assessed in Section 4, and numerical experiments are given in Section 5. Finally, all proofs are given in the supplementary material.

2 Methodology

Below, we present our core method for measuring algorithmic convergence with respect to $\textsc{mse}_{t}$ in Section 2.1. Next, we show how this approach can be extended to measuring convergence with respect to variable importance in Section 2.2.

2.1 Measuring convergence with respect to mean-squared error

The intuition for the proposed method is based on two main considerations. First, the definition of $\textsc{mse}_{t}$ in equation (1.3) shows that it can be interpreted as a functional of $\bar{T}_{t}$ . More specifically, if we let $f:\mathcal{X}\to\mathbb{R}$ denote a generic function, then we define the functional $\psi$ according to

$$\psi(f) = \int_{\mathcal{X}\times\mathbb{R}}(y-f(x))^{2}\,d\nu(x,y), \tag{2.1}$$

and it follows that $\textsc{mse}_{t}$ can be written as

$$\textsc{mse}_{t} = \psi(\bar{T}_{t}). \tag{2.2}$$

Second, it is a general principle that bootstrap methods are well-suited to approximating distributions derived from smooth functionals of sample averages — which is precisely what the representation (2.2) entails.

To make a more direct connection between these general ideas and the problem of estimating $q_{1-\alpha}(t)$, recall that we actually need to approximate the distribution of the difference $\textsc{mse}_{t}-\text{mse}_{\infty}$, rather than just $\textsc{mse}_{t}$ itself. Fortunately, the limiting value $\text{mse}_{\infty}$ can be linked with $\psi$ through the function

$$\vartheta(x) := \mathbb{E}[\bar{T}_{t}(x)\,|\,\mathcal{D}], \tag{2.3}$$

where the expectation is only over the algorithmic randomness in $\bar{T}_{t}$ (i.e. over the random vector $\boldsymbol{\xi}_{t}$ ). More specifically, when the functions $T_{1},\dots,T_{t}$ satisfy the representation (1.2), the law of large numbers implies $\text{mse}_{\infty}=\psi(\vartheta)$ under basic integrability assumptions, which leads to the relation

$$\textsc{mse}_{t}-\text{mse}_{\infty} = \psi(\bar{T}_{t})-\psi(\vartheta). \tag{2.4}$$

This relation is the technical foundation for the proposed method, since it suggests that in order to mimic the fluctuations of $\textsc{mse}_{t}-\text{mse}_{\infty}$ , we can develop a bootstrap method by viewing the functions $T_{1},\dots,T_{t}$ as “observations”, and viewing $\bar{T}_{t}$ as an estimator of $\vartheta$ . In other words, if we sample $t$ functions $T_{1}^{*},\dots,T_{t}^{*}$ with replacement from $T_{1},\dots,T_{t}$ , then we can formally define a bootstrap sample of $\textsc{mse}_{t}-\text{mse}_{\infty}$ according to

$$\textsc{mse}_{t}^{*}-\textsc{mse}_{t} := \psi(\bar{T}_{t}^{*})-\psi(\bar{T}_{t}), \tag{2.5}$$

where $\bar{T}_{t}^{*}:=\frac{1}{t}\sum_{i=1}^{t}T_{i}^{*}$ . In turn, after generating a collection of such bootstrap samples, we can use their empirical $(1-\alpha)$ -quantile as an estimate of $q_{1-\alpha}(t)$ . However, as a technical point, it should be noted that (2.5) is a “theoretical” bootstrap sample of $\textsc{mse}_{t}-\text{mse}_{\infty}$ , because the functional $\psi$ depends on the unknown distribution of the test point $\mathcal{L}(X,Y)$ . Nevertheless, the same reasoning can still be applied by replacing $\psi$ with an estimate $\widehat{\psi}$ , which will be explained in detail later in this subsection. Altogether, the method is summarized by the following algorithm.
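The resampling steps described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a transcription of the authors' Algorithm 1: each trained function $T_{i}$ is represented only by its predictions on $m$ hold-out points, $\widehat{\psi}$ is taken to be the hold-out average of squared errors, and the names `psi_hat` and `bootstrap_quantile`, the default number of bootstrap samples `B`, and the toy data are all hypothetical.

```python
import math
import random

def psi_hat(rows, y):
    """Hold-out estimate of psi for the average of the given functions.

    rows[i][j] is the prediction of the i-th function at hold-out point j."""
    t, m = len(rows), len(y)
    avg = [sum(row[j] for row in rows) / t for j in range(m)]
    return sum((y[j] - avg[j]) ** 2 for j in range(m)) / m

def bootstrap_quantile(preds, y, alpha=0.1, B=200, rng=None):
    """Estimate q_{1-alpha}(t) by resampling the t trained functions."""
    if rng is None:
        rng = random.Random()
    t = len(preds)
    base = psi_hat(preds, y)                                  # psi_hat(T-bar_t)
    gaps = []
    for _ in range(B):
        star = [preds[rng.randrange(t)] for _ in range(t)]    # T*_1, ..., T*_t
        gaps.append(psi_hat(star, y) - base)                  # bootstrap sample of the gap
    gaps.sort()
    return gaps[max(math.ceil((1 - alpha) * B), 1) - 1]       # empirical (1-alpha)-quantile

# Toy inputs: t = 100 noisy predictors evaluated at m = 30 hold-out points.
rng = random.Random(0)
y = [rng.gauss(0, 1) for _ in range(30)]
preds = [[yj + rng.gauss(0, 1) for yj in y] for _ in range(100)]
q_hat = bootstrap_quantile(preds, y, alpha=0.1, B=200, rng=rng)
```

Only stored predictions are resampled, so no function is retrained, which is what keeps the cost of the check small relative to training the ensemble.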

Using hold-out or out-of-bag samples.

To complete our discussion of Algorithm 1, it remains to clarify how the functional $\psi$ can be estimated from either hold-out samples, or so-called “out-of-bag” (oob) samples. With regard to the first case, suppose a set of $m$ labeled samples $\tilde{\mathcal{D}}=\{(\tilde{X}_{1},\tilde{Y}_{1}),\dots,(\tilde{X}_{m},\tilde{Y}_{m})\}$ has been held out from the training set $\mathcal{D}$ . Using this set, the estimate $\widehat{\psi}(\bar{T}_{t})$ in Algorithm 1 can be easily obtained as

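$$\widehat{\psi}(\bar{T}_{t}) = \frac{1}{m}\sum_{j=1}^{m}\big(\tilde{Y}_{j}-\bar{T}_{t}(\tilde{X}_{j})\big)^{2}.$$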

Analogously, we may also obtain $\widehat{\psi}(\bar{T}_{t}^{*})$ by using $\bar{T}_{t}^{*}$ instead of $\bar{T}_{t}$ in the formula above.

If the regression functions $T_{1},\dots,T_{t}$ are trained via bagging or random forests, it is possible to avoid the use of a hold-out set by taking advantage of oob samples, which are a unique attribute of these methods. To define the notion of an oob sample, recall that these methods train each function $T_{i}$ using a random set $\mathcal{D}_{i}^{*}$ obtained from $\mathcal{D}$ by sampling with replacement. Due to this sampling mechanism, it follows that each set $\mathcal{D}_{i}^{*}$ is likely to exclude approximately $(1-\frac{1}{n})^{n}\approx 37\%$ of the training points in $\mathcal{D}$ . So, as a matter of terminology, if a particular training point $X_{j}$ does not appear in $\mathcal{D}_{i}^{*}$ , we say that $X_{j}$ is “out-of-bag” for the function $T_{i}$ . Also, we write $\textsc{oob}(X_{j})\subset\{1,\dots,t\}$ to denote the index set corresponding to the functions for which $X_{j}$ is oob.

From a statistical point of view, oob samples are important because they serve as “effective” hold-out points. (That is, if $X_{j}$ is oob for $T_{i}$ , then the function $T_{i}$ “never touched” the point $X_{j}$ during the training process.) Hence, it is natural to consider the following alternative estimate of $\psi$ based on oob samples,

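$$\widehat{\psi}_{\textsc{o}}(\bar{T}_{t}) = \frac{1}{n}\sum_{j=1}^{n}\big(Y_{j}-\bar{T}_{t,\textsc{o}}(X_{j})\big)^{2},$$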

where we define $\bar{T}_{t,\textsc{o}}(X_{j})$ to be the average over the functions for which $X_{j}$ is oob,

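$$\bar{T}_{t,\textsc{o}}(X_{j}) = \frac{1}{|\textsc{oob}(X_{j})|}\sum_{i\in\textsc{oob}(X_{j})}T_{i}(X_{j}),$$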

and $|\cdot|$ refers to the cardinality of a set. Similarly, we define $\widehat{\psi}_{\textsc{o}}(\bar{T}_{t}^{*})$ by replacing each function $T_{i}$ above with $T_{i}^{*}$ . Lastly, in the case when $\textsc{oob}(X_{j})$ is empty, we arbitrarily define $\bar{T}_{t,\textsc{o}}(X_{j})=Y_{j}$ , but this occurs very rarely. In fact, it can be checked that for a given point $X_{j}$ , the set $\textsc{oob}(X_{j})$ is empty with probability approximately equal to $(0.63)^{t}$ .
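The oob estimate can be computed from two tables: the predictions of each function at each training point, and an indicator of which points were drawn into each bootstrap set. The sketch below is one way to do this; the names `oob_average`, `psi_hat_oob`, `preds`, and `in_bag` are assumptions for illustration.

```python
def oob_average(preds, in_bag, y):
    """Out-of-bag averaged prediction for every training point.

    preds[i][j]  = T_i(X_j)
    in_bag[i][j] = True if X_j was drawn into the i-th bootstrap set."""
    t, n = len(preds), len(y)
    out = []
    for j in range(n):
        oob = [i for i in range(t) if not in_bag[i][j]]   # index set oob(X_j)
        if oob:
            out.append(sum(preds[i][j] for i in oob) / len(oob))
        else:
            out.append(y[j])                              # convention when oob(X_j) is empty
    return out

def psi_hat_oob(preds, in_bag, y):
    """oob estimate of psi: mean squared error of the oob averages."""
    avg = oob_average(preds, in_bag, y)
    return sum((y[j] - avg[j]) ** 2 for j in range(len(y))) / len(y)

# Tiny example with t = 2 functions and n = 2 training points.
preds = [[1.0, 2.0], [3.0, 4.0]]
in_bag = [[True, False], [False, False]]
y = [2.0, 3.0]
value = psi_hat_oob(preds, in_bag, y)
```

For the bootstrap version, the rows of `preds` and `in_bag` would be resampled together with the same indices, so that each resampled function keeps its own oob set.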

2.2 Measuring convergence with respect to variable importance

In addition to their broad application in prediction problems, randomized ensembles have been very popular for the task of variable selection (e.g. Díaz-Uriarte and De Andres, 2006; Strobl et al., 2008; Ishwaran, 2007; Genuer et al., 2010; Louppe et al., 2013; Genuer et al., 2015; Gregorutti et al., 2017). Although a variety of procedures have been proposed for variable selection in this context, they are generally based on a common approach of ranking the variables according to a measure of averaged variable importance (VI). Under this approach, the averaged VI assigned to each variable typically converges to a limiting value as the ensemble becomes large. However, in practice, the user does not know how this convergence depends on the ensemble size — much like we have seen already for $\textsc{mse}_{t}$ .

Uniform convergence across variables.

Before moving on to the details of our extended method, it is worth emphasizing an extra subtlety of measuring algorithmic convergence for VI. Specifically, we must keep in mind that because variable selection is based on ranking, it is important that algorithmic convergence is reached for all variables. In other words, if the VI for some variables converges more slowly than for others, then the ranking of variables will be distorted by purely algorithmic effects. For this reason, our extended method will provide a way to ensure that algorithmic convergence is achieved uniformly across all variables.

Setup for variable importance.

To describe algorithmic convergence for VI in detail, let $T_{1},\dots,T_{t}$ be a randomized ensemble that satisfies the representation (1.2), and consider a situation where the training samples have $p\geq 1$ variables (i.e. the space $\mathcal{X}$ is $p$ -dimensional). Also, suppose that for each function $T_{i}$ , we have a rule for assigning an importance value to each variable $l\in\{1,\dots,p\}$ . Due to the fact that $T_{i}$ is a random function, it follows that the importance value is a random variable, denoted by $\textsc{vi}_{i}(l)$ . (Choices for computing this will be discussed shortly.) Likewise, the vector of such values associated with $T_{i}$ is denoted $\textsc{vi}_{i}=(\textsc{vi}_{i}(1),\dots,\textsc{vi}_{i}(p))$ , and the averaged vector of importance measures is denoted as

$$\overline{\textsc{vi}}_{t} := \frac{1}{t}\sum_{i=1}^{t}\textsc{vi}_{i}.$$

Hence, by comparing the entries of the vector $\overline{\textsc{vi}}_{t}=(\overline{\textsc{vi}}_{t}(1),\dots,\overline{\textsc{vi}}_{t}(p))$ , the user is then able to rank the variables, and this is commonly done using a built-in option from the standard random forests software package (Liaw and Wiener, 2002).

Up to this point, we have not specified a particular rule for computing the values $\textsc{vi}_{i}(l)$ , but several choices are available. For instance, two of the prevailing choices for regression are based on the notions of “node impurity” (for regression trees) or “random permutations” (for general regression functions). However, from an abstract point of view, our proposed method does not depend on the underlying details of these rules, and so we refer to the book (Friedman et al., 2001, Sec 15.3.2) for additional background. Indeed, our proposed method is applicable to any VI rule, provided that the random vectors $\textsc{vi}_{1},\dots,\textsc{vi}_{t}$ are conditionally i.i.d. given $\mathcal{D}$ . In particular, this property is satisfied by both of the mentioned rules when $T_{1},\dots,T_{t}$ follow the representation (1.2).

When the conditional i.i.d. property for $\textsc{vi}_{1},\dots,\textsc{vi}_{t}$ holds and $\mathcal{D}$ is held fixed, the average $\overline{\textsc{vi}}_{t}$ will generally converge to a limiting vector $\textup{vi}_{\infty}\in\mathbb{R}^{p}$ as $t\to\infty$ . In order to measure this convergence uniformly across $l\in\{1,\dots,p\}$ , we will focus on the (unknown) random variable

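$$\varepsilon_{t} := \max_{1\leq l\leq p}\big|\overline{\textsc{vi}}_{t}(l)-\textup{vi}_{\infty}(l)\big|,$$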

and our goal will be to estimate its $(1-\alpha)$ -quantile, denoted as

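$${\tt{q}}_{1-\alpha}(t) := \inf\Big\{q\in[0,\infty)\ \Big|\ \mathbb{P}\big(\varepsilon_{t}\leq q\,\big|\,\mathcal{D}\big)\geq 1-\alpha\Big\}.$$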

The bootstrap method for variable importance.

By analogy with our method for estimating the quantiles of $\textsc{mse}_{t}-\text{mse}_{\infty}$ , we propose to construct bootstrap samples of $\varepsilon_{t}$ by resampling the vectors $\textsc{vi}_{1},\dots,\textsc{vi}_{t}$ , and then estimating ${\tt{q}}_{1-\alpha}(t)$ with the empirical $(1-\alpha)$ -quantile. In algorithmic form, the procedure works as follows.
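A minimal Python sketch of this resampling procedure is given here. It assumes the importance vectors $\textsc{vi}_{1},\dots,\textsc{vi}_{t}$ have already been computed by the ensemble method and stored as the rows of an array; the function name `vi_bootstrap_quantile` and its arguments are hypothetical and are not taken from any package.

```python
import numpy as np

def vi_bootstrap_quantile(vi, alpha=0.1, B=50, rng=None):
    """Bootstrap estimate of the (1 - alpha)-quantile of eps_t.

    vi : array of shape (t, p); row i is the importance vector vi_i of the
         i-th regression function (assumed to be precomputed).
    """
    rng = np.random.default_rng(rng)
    vi = np.asarray(vi, dtype=float)
    t = vi.shape[0]
    vi_bar = vi.mean(axis=0)                  # averaged VI vector of the ensemble
    eps_star = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, t, size=t)      # resample t vectors with replacement
        vi_bar_star = vi[idx].mean(axis=0)    # bootstrap version of the average
        # Largest deviation over all p variables (uniform across variables).
        eps_star[b] = np.max(np.abs(vi_bar_star - vi_bar))
    return np.quantile(eps_star, 1 - alpha)   # empirical (1 - alpha)-quantile
```

Taking the maximum over coordinates inside the loop is what makes the resulting estimate control all $p$ variables simultaneously, rather than one variable at a time.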

Numerical results illustrating the performance of this algorithm, as well as Algorithm 1, are given in Section 5.

3 Main result

In this section, we develop the main theoretical result of the paper, which guarantees that a bootstrap estimate of $q_{1-\alpha}(t)$ serves its intended purpose. Namely, if this estimate is denoted by $\widehat{q}_{1-\alpha}(t)$ , then we will show that for a fixed set $\mathcal{D}$ , the inequality

$$\textsc{mse}_{t}-\text{mse}_{\infty}\ \leq\ \widehat{q}_{1-\alpha}(t)$$

holds with a probability that is effectively $1-\alpha$ .

To establish this result, we will rely on a type of simplification that is commonly used in the analysis of bootstrap methods, which is to exclude sources of error beyond the resampling process itself. More specifically, we will focus on bootstrap samples of the form $\textsc{mse}_{t}^{*}-\textsc{mse}_{t}$ (defined in equation (2.5)), since these are not affected by the extraneous error from estimating the functional $\psi$ . A key benefit of this choice is that it clarifies how the performance of the bootstrap is related to the characteristics of the ensemble. Meanwhile, even with such a simplification, the proof of the result is still quite involved. Likewise, this choice was also used in our previous analysis of the classification setting for the same reasons (Lopes, 2019). Apart from this detail, the analysis in the current paper is entirely different.

With regard to the ensemble, the only assumptions used in our analysis are that it satisfies the representation (1.2), as well as some basic moment conditions. From the standpoint of existing theory for randomized ensembles, these assumptions are very mild — because the representation (1.2) is always satisfied by bagging and random forests. By contrast, it is much more common in the theoretical literature to work with ensembles that are simpler than the ones used in practice; and indeed, our previous work in the classification setting relied on a highly specialized type of ensemble. Furthermore, the moment parameters in our current result are guaranteed to be finite in the important case when $T_{1},\dots,T_{t}$ are trained by CART, as will be explained shortly. Finally, it is notable that our result is fully non-asymptotic, whereas much existing work on the convergence of randomized ensembles has taken an asymptotic approach that does not always provide explicit rates of convergence.

Notation.

If $g$ and $h$ are real-valued functions on $\mathcal{X}\times\mathbb{R}$ , we denote their inner product with respect to the test point distribution $\nu=\mathcal{L}(X,Y)$ as

$$\langle g,h\rangle := \int_{\mathcal{X}\times\mathbb{R}} g(x,y)\,h(x,y)\,d\nu(x,y),$$

and accordingly, we write $\|g\|_{L_{2}}=\sqrt{\langle g,g\rangle}$ . In addition, recall the function $\vartheta(x)=\mathbb{E}[T_{1}(x)|\mathcal{D}]$ from equation (2.3), and define the random variable

$$\zeta := 2\,\langle T_{1}-\vartheta,\ \vartheta-y\rangle,$$

where the expression $\vartheta-y$ is interpreted as the function $(x,y)\mapsto\vartheta(x)-y$ . When the random variable $\zeta$ is conditioned on $\mathcal{D}$ , we denote its standard deviation by

$$\sigma(\mathcal{D}) := \sqrt{\operatorname{var}(\zeta\,|\,\mathcal{D})},$$

and the finiteness of this quantity will follow from assumption A2 below. Also, all expressions involving $1/\sigma(\mathcal{D})$ will be understood as $\infty$ in the exceptional case when $\sigma(\mathcal{D})=0$ . Lastly, for each positive integer $k$ , we define the moment parameter

[TABLE]

which provides a convenient way to quantify the tail behavior of the random variables $\|T_{1}-y\|_{L_{2}}$ and $\|T_{1}-\vartheta\|_{L_{2}}$ .

Assumptions.

With the above notation in place, we can state the two assumptions needed for our main result.

A1. The ensemble $T_{1},\dots,T_{t}$ can be represented in the form (1.2).

A2. There is at least one integer $k\geq 2$ such that $\beta_{3k}(\mathcal{D})<\infty$ .

To interpret these assumptions, recall that A1 is always satisfied by bagging and random forests, as explained in Section 1.1. Regarding the finiteness of $\beta_{3k}(\mathcal{D})$ in A2, it is noteworthy that this condition is satisfied for arbitrarily large values of $k$ whenever the functions $T_{1},\dots,T_{t}$ are trained by the standard method of CART. This is because the range of the functions is determined by the training labels $Y_{1},\dots,Y_{n}$ . In particular, if we put $M(\mathcal{D})\!:=\max_{1\leq i\leq n}|Y_{i}|$ , then every tree $T_{i}$ satisfies the bound $\sup_{x\in\mathcal{X}}|T_{i}(x)|\ \leq\ M(\mathcal{D})$ , which implies

[TABLE]

for every $k$ . The same reasoning also applies beyond CART to any other method whose predictions fall within the range of the training labels. We now state the main result of the paper.

Theorem 3.1.

Suppose that A1 and A2 hold. In addition, let $k\geq 2$ be as in A2, and let $\widehat{q}_{1-\alpha}(t)$ denote the empirical $(1-\alpha)$ -quantile of $B$ bootstrap samples of the form (2.5). Lastly, define the quantity

[TABLE]

Then, there is an absolute constant $c_{0}>0$ such that $\widehat{q}_{1-\alpha}(t)$ satisfies

[TABLE]

Remarks.

In essence, the result shows that $\widehat{q}_{1-\alpha}(t)$ bounds the unknown convergence gap $\textsc{mse}_{t}-\text{mse}_{\infty}$ with a probability that is not much less than the ideal value of $1-\alpha$ . To comment on some further aspects of the result, note that the inequality (3.5) has the desirable property of being scale-invariant with respect to the labels $Y_{1},\dots,Y_{n}$ and the functions $T_{1},\dots,T_{t}$ . More precisely, if we were to change the units of the labels and functions by a scale factor $c>0$ , it can be checked that both sides of (3.5) would remain unchanged.

Another important aspect of Theorem 3.1 deals with the dependence of $\delta_{t,k,B}(\mathcal{D})$ on the value of $k$ . Specifically, it is interesting to develop a bound on $\delta_{t,k,B}(\mathcal{D})$ that simplifies the role of $k$ . To do this, we now consider the situation when the regression functions are trained by CART, or more generally, when the boundedness condition $\beta_{k}(\mathcal{D})\leq 4M(\mathcal{D})^{2}$ holds for every $k\geq 1$ , as in (3.3). In such cases, we may evaluate the particular choice

[TABLE]

which leads to the following bounds,

[TABLE]

for some absolute constant $c_{0}>0$ and all $t\geq 2$ . These bounds imply that there is a number $c(\mathcal{D})>0$ not depending on $t$ , $k$ , or $B$ , such that

[TABLE]

which considerably simplifies the interpretation of $\delta_{t,k,B}(\mathcal{D})$ . Hence, at a high level, this indicates that as long as the regression functions have well-behaved moments, then for a fixed set $\mathcal{D}$ , the quantity $\delta_{t,k,B}(\mathcal{D})$ converges to 0 at nearly parametric rates with respect to both $t$ and $B$ .

4 Computation and speedups

In order for the proposed method to be a practical tool for checking algorithmic convergence, its computational cost should be manageable in comparison to training the ensemble itself. Below, in Section 4.1, we offer a quantitative comparison, showing that under simple conditions, Algorithms 1 and 2 are not a bottleneck in relation to training $t$ regression functions with CART. Additionally, we show in Section 4.2 how an extrapolation technique from our previous work on classification can be improved in our current setting with a bias correction rule.

4.1 Cost comparison

Because the CART method is based on a greedy iterative algorithm, the exact computational cost of training a regression tree is difficult to describe. For this reason, the authors of CART analyzed its cost in the simplified situation where each node of a regression tree is split into exactly 2 child nodes (except for the leaves). To be more precise, suppose $\mathcal{X}\subset\mathbb{R}^{p}$ , and let $d\geq 2$ denote the “depth” of the tree, so that there are $2^{d}$ leaves. In addition, suppose that when the algorithm splits a given node, it searches over $\lceil p/3\rceil$ candidate variables that are randomly chosen from $\{1,\dots,p\}$ , which is the default rule when CART is used by random forests (Liaw and Wiener, 2002). Based on these assumptions, the analysis in the book (Breiman et al., 1984, p.166) shows that the number of operations involved in training $t$ such trees is at least of order $\Omega(t\cdot p\cdot d\cdot n)$ .

The cost of Algorithm 1.

To determine the cost of Algorithm 1, it is important to clarify that when bagging and random forests are used in practice, the prediction error of the ensemble is typically estimated automatically using either hold-out or oob samples. As a result, the predicted values of each tree on these samples can be regarded as being pre-computed by the ensemble method. Once these values are available, the subsequent cost of Algorithm 1 is simple to measure. Specifically, in the case of hold-out samples, equation (2.6) shows that the cost to obtain $\widehat{\psi}(\bar{T}_{t})-\widehat{\psi}(\bar{T}_{t}^{*})$ for each bootstrap sample is $\mathcal{O}(t\cdot m)$ , which leads to an overall cost that is $\mathcal{O}(B\cdot t\cdot m)$ . Similarly, for the case of oob samples, the overall cost is $\mathcal{O}(B\cdot t\cdot n)$ . Altogether, this leads to the conclusion that the cost of Algorithm 1 does not exceed that of training the ensemble if the number of bootstrap samples satisfies the very mild condition

$$B=\mathcal{O}(p\cdot d), \tag{4.1}$$

and this applies to either the hold-out or oob cases, provided $m=\mathcal{O}(n)$ . Moreover, our discussion in Section 4.2 will show that the condition (4.1) can even be further relaxed via extrapolation.

Beyond the fact that Algorithm 1 compares well with the cost of training an ensemble, there are several other favorable aspects worth mentioning. First, the algorithm only relies on predicted labels for its input, and it never needs to access any points in the space $\mathcal{X}$ . In particular, this means that the cost of the algorithm is independent of the dimension of $\mathcal{X}$ . Second, the bootstrap samples in Algorithm 1 are simple to compute in parallel, which means that the cost of the algorithm can be reduced approximately by a factor of $B$ .
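To make the $\mathcal{O}(B\cdot t\cdot m)$ count concrete, the following Python sketch mimics the hold-out version of the bootstrap using only a matrix of pre-computed predictions. It is a sketch in the spirit of Algorithm 1 rather than a transcription of it: the names `mse_bootstrap_quantile`, `preds`, and `y` are hypothetical, and the sign convention follows the bootstrap samples $\textsc{mse}_{t}^{*}-\textsc{mse}_{t}$ of equation (2.5).

```python
import numpy as np

def mse_bootstrap_quantile(preds, y, alpha=0.1, B=50, rng=None):
    """Bootstrap estimate of the (1 - alpha)-quantile of the convergence gap.

    preds : array of shape (t, m); preds[i, j] is the prediction of the i-th
            regression function on the j-th hold-out point (precomputed).
    y     : array of shape (m,) holding the hold-out labels.
    """
    rng = np.random.default_rng(rng)
    preds = np.asarray(preds, dtype=float)
    y = np.asarray(y, dtype=float)
    t = preds.shape[0]
    # Hold-out error of the full ensemble average.
    mse_t = np.mean((preds.mean(axis=0) - y) ** 2)
    z = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, t, size=t)          # resample t functions with replacement
        # Averaging t rows of length m costs O(t * m) per bootstrap sample.
        mse_star = np.mean((preds[idx].mean(axis=0) - y) ** 2)
        z[b] = mse_star - mse_t
    return np.quantile(z, 1 - alpha)
```

Note that the function never touches the feature space $\mathcal{X}$, and that the $B$ iterations of the loop are independent of one another, so they can be distributed across processors.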

The cost of Algorithm 2.

Many of the previous considerations for Algorithm 1 also apply to Algorithm 2, but it turns out that the cost of Algorithm 2 can be much less when $n$ is large. Because each bootstrap sample in Algorithm 2 requires forming an average of $t$ vectors in $\mathbb{R}^{p}$ , it is straightforward to check that the overall cost is $\mathcal{O}(B\cdot t\cdot p)$ , where we view the vectors $\textsc{vi}_{1},\dots,\textsc{vi}_{t}$ as being pre-computed by the ensemble method. In particular, it is worth emphasizing that the cost of the algorithm is independent of $n$ , and is thus highly scalable. Furthermore, under the setup of our earlier cost comparison with CART, the cost of Algorithm 2 does not exceed the cost of training the ensemble if

$$B=\mathcal{O}(d\cdot n),$$

which allows for plenty of bootstrap samples in practice. In fact, our numerical experiments show that even $B=50$ can work well when $n$ is on the order of $10^{4}$ , indicating that Algorithm 2 is quite inexpensive in comparison to training.

4.2 Further reduction of cost by extrapolation

The basic idea of extrapolation is to check algorithmic convergence for a small “initial” ensemble, say of size $t_{0}$ , and then use this information to “look ahead” and predict convergence for a larger ensemble of size $t>t_{0}$ . This general technique has a long history in the development of numerical algorithms, and further background can be found in (Bickel and Yahav, 1988; Brezinski and Zaglia, 2013; Sidi, 2003) as well as references therein. In the remainder of this section, we first summarize how extrapolation was previously developed in our companion paper (Lopes, 2019), and then explain how that approach can be improved with a bias correction for oob samples.

A basic version of extrapolation.

At a technical level, our use of extrapolation is based on the central limit theorem, which suggests that the fluctuations of $\textsc{mse}_{t}-\text{mse}_{\infty}$ should scale like $1/\sqrt{t}$ as a function of $t$ . As a result, we expect that the quantile $q_{1-\alpha}(t)$ should behave like

$$q_{1-\alpha}(t)\ \approx\ \frac{\kappa}{\sqrt{t}}$$

for some quantity $\kappa$ that may depend on all problem parameters except $t$ .

To take advantage of this heuristic scaling property, suppose that we train an initial ensemble of size $t_{0}$ , and run Algorithm 1 to obtain an estimate $\widehat{q}_{1-\alpha}(t_{0})$ . We can then extract an estimate of $\kappa$ by defining

$$\widehat{\kappa} := \sqrt{t_{0}}\,\widehat{q}_{1-\alpha}(t_{0}).$$

Next, we can rapidly estimate $q_{1-\alpha}(t)$ for all subsequent $t\geq t_{0}$ by defining the extrapolated estimate

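$$\widehat{q}_{1-\alpha}^{\ \text{ext}}(t) := \frac{\sqrt{t_{0}}\,\widehat{q}_{1-\alpha}(t_{0})}{\sqrt{t}}. \tag{4.2}$$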

In particular, there are two crucial benefits of this estimate: (1) It is much faster to apply Algorithm 1 to a small initial ensemble of size $t_{0}$ than to a large one of size $t$ . (2) If we would like $\textsc{mse}_{t}$ to be within some tolerance $\epsilon>0$ of the limit $\text{mse}_{\infty}$ , then we can use the condition

$$\widehat{q}_{1-\alpha}^{\ \text{ext}}(t)\ \leq\ \epsilon$$

to dynamically predict how large $t$ must be chosen to reach that tolerance, namely $t\geq(\sqrt{t_{0}}\widehat{q}_{1-\alpha}(t_{0})/\epsilon)^{2}$ .
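The two uses of extrapolation described above amount to a few lines of arithmetic. The Python sketch below assumes an initial estimate $\widehat{q}_{1-\alpha}(t_{0})$ is already available (for example from Algorithm 1) and applies the $1/\sqrt{t}$ scaling heuristic; the function names are hypothetical.

```python
import math

def extrapolate_quantile(q_hat_t0, t0, t):
    """Extrapolated estimate of q_{1-alpha}(t) from an estimate at size t0,
    based on the heuristic that the quantile scales like 1 / sqrt(t)."""
    return math.sqrt(t0) * q_hat_t0 / math.sqrt(t)

def required_ensemble_size(q_hat_t0, t0, eps):
    """Smallest integer t with extrapolated quantile at most eps, i.e.
    t >= (sqrt(t0) * q_hat_t0 / eps)^2, written as t0 * (q_hat_t0 / eps)^2."""
    return math.ceil(t0 * (q_hat_t0 / eps) ** 2)
```

For instance, halving the tolerance $\epsilon$ multiplies the predicted ensemble size by 4, which is the usual price of a $1/\sqrt{t}$ rate.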

Bias-corrected extrapolation.

If the initial estimate $\widehat{q}_{1-\alpha}(t_{0})$ is obtained by implementing Algorithm 1 with oob samples, it turns out to be a biased estimate of $q_{1-\alpha}(t_{0})$ . Fortunately however, it is possible to correct for this bias in a simple way, as we now explain.

To understand the source of the bias, recall that for each point $X_{j}$ in the training set, we write $\textsc{oob}(X_{j})\subset\{1,\dots,t\}$ to index the regression functions for which $X_{j}$ is oob. Also, it is simple to check that for an initial ensemble of size $t_{0}$ , the expected cardinality of $\textsc{oob}(X_{j})$ is given by

$$\tau_{n}(t_{0}) := \big(1-\tfrac{1}{n}\big)^{n}\, t_{0}.$$

In other words, this means that when an ensemble of size $t_{0}$ makes a prediction on an oob point, the “effective” size of the ensemble is $\tau_{n}(t_{0})$ , rather than $t_{0}$ . As a result, if we implement Algorithm 1 using oob samples with an initial ensemble of size $t_{0}$ , then the output $\widehat{q}_{1-\alpha}(t_{0})$ should really be viewed as an estimate of $q_{1-\alpha}(\tau_{n}(t_{0}))$ , rather than $q_{1-\alpha}(t_{0})$ .

Based on this reasoning, we can adjust our previous definition of the estimate $\widehat{q}_{1-\alpha}^{\ \text{ext}}(t)$ in (4.2) by using

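$$\widehat{q}_{1-\alpha}^{\ \text{ext,o}}(t) := \frac{\sqrt{\tau_{n}(t_{0})}\,\widehat{q}_{1-\alpha}(t_{0})}{\sqrt{t}}. \tag{4.4}$$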

Later on, in Section 5 we will demonstrate that this simple adjustment works quite well in practice.
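The effect of the correction can be illustrated numerically. The Python sketch below assumes that each regression function is trained on $n$ points drawn with replacement from the training set, so that a given point is oob for a given function with probability $(1-1/n)^{n}\approx e^{-1}$; under that assumption the effective ensemble size is roughly $0.37\,t_{0}$. The function names are hypothetical.

```python
import math

def effective_oob_size(t0, n):
    """Expected number of regression functions (out of t0) for which a fixed
    training point is oob, assuming bootstrap samples of size n drawn with
    replacement: each function misses the point with probability (1 - 1/n)^n."""
    return (1.0 - 1.0 / n) ** n * t0

def extrapolate_quantile_oob(q_hat_t0, t0, n, t):
    """Bias-corrected extrapolation: treat the oob-based estimate computed at
    size t0 as an estimate at the effective size tau_n(t0), then rescale by
    the 1 / sqrt(t) heuristic."""
    tau = effective_oob_size(t0, n)
    return math.sqrt(tau) * q_hat_t0 / math.sqrt(t)
```

Because $\tau_{n}(t_{0})<t_{0}$, the corrected estimate is always smaller than the uncorrected one at the same $t$, by a factor of about $\sqrt{e^{-1}}\approx 0.61$ when $n$ is large.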

Remark.

As a clarification, it should be noted that the definition (4.4) is only to be used when Algorithm 1 is implemented with oob samples, and the basic rule (4.2) should be used in the case of hold-out samples. Also, the basic rule (4.2) can be easily adapted to extrapolate the estimate produced by Algorithm 2, and so we omit the details in the interest of brevity.

5 Numerical results

We now demonstrate the bootstrap’s numerical accuracy at the tasks of measuring algorithmic convergence with respect to both mean-squared error and variable importance. Overall, our results show that the extrapolated oob estimate is accurate at predicting the effect of increasing $t$ . In fact, the results show that extrapolation succeeds at predicting what will happen when $t$ is increased by a factor of 4 beyond $t_{0}$ , and possibly much farther.

5.1 Organization of experiments

Data preparation.

Our experiments were based on several natural datasets that were each randomly partitioned in the following way. Letting $\mathcal{F}$ denote the full set of observation pairs $(X_{1},Y_{1}),(X_{2},Y_{2}),\dots$ for a given dataset, we evenly split $\mathcal{F}$ into a disjoint union $\mathcal{F}=\mathcal{D}\sqcup\mathcal{T}$ , where the set $\mathcal{D}$ was used for training, and the set $\mathcal{T}$ was used to approximate the true quantile curves (namely $q_{1-\alpha}(t)$ or ${\tt{q}}_{1-\alpha}(t)$ ) for assessing algorithmic convergence.

Since Algorithm 1 relies on a hold-out set, we also used a relatively small subset $\mathcal{H}\subset\mathcal{T}$ for that purpose. Specifically, the hold-out set $\mathcal{H}$ was chosen so that its cardinality satisfied $|\mathcal{H}|/(|\mathcal{H}|+|\mathcal{D}|)\approx 1/6$ . This reflects a practical situation where the user can only afford to allocate $1/6$ of the available data for the hold-out set. In other words, the idea is to think of the user as only having access to $\mathcal{D}\sqcup\mathcal{H}$ , with the set $\mathcal{T}$ as being used externally to establish “ground truth” for the rate of algorithmic convergence.
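A minimal Python sketch of this partitioning scheme is shown below, working with index sets only. The function name `partition_indices` is hypothetical, and the rule $|\mathcal{H}|=|\mathcal{D}|/5$ is simply the ratio $|\mathcal{H}|/(|\mathcal{H}|+|\mathcal{D}|)=1/6$ solved for $|\mathcal{H}|$.

```python
import numpy as np

def partition_indices(N, rng=None):
    """Randomly split {0, ..., N-1} into a training set D and a ground-truth
    set T of (nearly) equal size, and pick a hold-out set H inside T with
    |H| / (|H| + |D|) approximately 1/6."""
    rng = np.random.default_rng(rng)
    perm = rng.permutation(N)
    half = N // 2
    d_idx, t_idx = perm[:half], perm[half:]
    h_size = round(len(d_idx) / 5)    # |H| / (|H| + |D|) = 1/6  <=>  |H| = |D| / 5
    h_idx = t_idx[:h_size]            # H is a subset of T, disjoint from D
    return d_idx, t_idx, h_idx
```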

Each of the full datasets is briefly summarized below.

• Diamond: This dataset is available in the package ggplot2 (Wickham, 2016), and has been downsampled to 10,000 observations. Each observation contains 9 measured features of a distinct diamond, and the features are used to predict the diamond’s price.

• Housing: This dataset originates from the 1990 California census and is available as part of the online supplement to the book (Géron, 2017). The observations correspond to 20,640 homes, and for each home there are 9 features for predicting the home’s price.

• Music: This dataset consists of 1,059 audio recordings (observations) described by 68 features that are used to predict the geographic latitude of the recording, as described in (Zhou et al., 2014). The dataset is available at the UCI repository (Dua and Graff, 2017) under the title Geographical Origin of Music Data Set.

• Protein: This dataset was collected from the fifth through ninth series of CASP experiments (Moult et al., 2011), and is available at the UCI repository (Dua and Graff, 2017) under the title Physicochemical Properties of Protein Tertiary Structure Data Set. The 45,730 observations correspond to artificially generated conformations of proteins (known as decoys) that are described by 9 biophysical features. Each decoy can be thought of as a perturbation of an associated “target” protein, and the features are used to predict how far the decoy is from its target.

Computing the true quantile curves $q_{1-\alpha}(t)$ and ${\tt{q}}_{1-\alpha}(t)$ .

Once a full dataset $\mathcal{F}$ was partitioned as above, we ran the random forests algorithm 1,000 times on the associated set $\mathcal{D}$ , using the R package randomForest (Liaw and Wiener, 2002). The overall process was a serious computational undertaking, because $2,\!000$ regression trees were trained during every run, and hence a total of $1,\!000\times 2,\!000=2\times 10^{6}$ trees were trained on each dataset.

During each run, as the ensemble size increased from $t=1$ to $t=2,\!000$ , the corresponding true values of $\textsc{mse}_{t}$ were approximated with the ensemble’s error rate on $\mathcal{T}$ . Also, the true value of $\text{mse}_{\infty}$ was approximated with the average of the 1,000 realizations of $\textsc{mse}_{2,000}$ . In this way, the collection of runs produced 1,000 approximate sample paths of $\textsc{mse}_{t}-\text{mse}_{\infty}$ , similar to those illustrated in the right panel of Figure 1. Finally, the quantile curve $q_{.90}(t)$ was extracted by using the empirical 90% quantile of the 1,000 values of $\textsc{mse}_{t}-\text{mse}_{\infty}$ at each $t=1,\dots,2,\!000$ .
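The last two steps of this procedure can be sketched in Python as follows, assuming the test-set errors from all runs have been collected into a single array. The function name `true_quantile_curve` and the array layout are hypothetical choices made for illustration.

```python
import numpy as np

def true_quantile_curve(mse_paths, level=0.90):
    """Approximate the quantile curve of mse_t - mse_inf from repeated runs.

    mse_paths : array of shape (runs, T); entry [r, t-1] is the test-set error
                of run r when the ensemble has size t.
    """
    mse_paths = np.asarray(mse_paths, dtype=float)
    # mse_inf is approximated by averaging the error at the final size T
    # over all runs.
    mse_inf = mse_paths[:, -1].mean()
    gaps = mse_paths - mse_inf                # sample paths of mse_t - mse_inf
    # Empirical quantile across runs, separately at each ensemble size t.
    return np.quantile(gaps, level, axis=0)
```

With 1,000 runs and $T=2{,}000$, the input array has shape `(1000, 2000)` and the output is a vector of length 2,000.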

To handle the setting of variable importance, essentially the same steps were used. Specifically, we computed the vector $\overline{\textsc{vi}}_{t}\in\mathbb{R}^{p}$ at every value $t=1,\dots,2,\!000$ , for each of the 1,000 runs mentioned above. In addition, we approximated the vector $\text{vi}_{\infty}\in\mathbb{R}^{p}$ with the average of the 1,000 realizations of $\overline{\textsc{vi}}_{2,000}$ . Altogether, these computations provided us with 1,000 approximate sample paths of $\varepsilon_{t}=\max_{1\leq l\leq p}|\overline{\textsc{vi}}_{t}(l)-\text{vi}_{\infty}(l)|$ , and then we used the empirical 90% quantile at each $t=1,\dots,2,\!000$ to approximate ${\tt{q}}_{.90}(t)$ .

Applying the bootstrap algorithms with extrapolation.

For each of the described 1,000 runs of random forests, we applied the extrapolated versions of Algorithms 1 and 2 at the initial ensemble size of $t_{0}=500$ , using a small number of $B=50$ bootstrap samples. Hence, this provided us with 1,000 realizations of each type of the proposed estimates, which allows for an assessment of their variability.

Below, in Sections 5.2 and 5.3, we will show the results obtained by extrapolating to the final ensemble size of $t=2,\!000$ . In addition, for Algorithm 1, we implemented both of the hold-out and oob versions, including the bias correction for the oob samples described in equation (4.4).

5.2 Numerical results for mean-squared error

Organization of the plots.

The two types of estimates for $q_{.90}(t)$ are illustrated in Figures 4 through 6, with the hold-out estimator in green, and the oob estimator in blue. More specifically, these curves represent the averages of the estimates over the 1,000 runs described above, and the error bars display the fluctuations of the estimates over repeated runs, corresponding to the 10th and 90th percentiles of the estimates. For the values of $t$ between the endpoints, we omit the error bars for clarity. Also, it is important to emphasize that these error bars should not be interpreted as confidence intervals for $q_{.90}(t)$ , and are only intended to show that the estimates have low variance.

With regard to computation, another point to mention is that the estimates were only computed for the initial ensemble size $t_{0}=500$ , and the rest of the green and blue curves were obtained essentially for free by extrapolation. Lastly, as a clarification, it should be noted that the blue oob curve is shifted to the left of the green hold-out curve because of the bias correction rule (4.4) for oob samples.

Remarks on performance.

The main point to take away from the plots is that the oob estimate performs quite well overall, and can be much more accurate than the hold-out estimate (cf. Figures 6 and 6). Furthermore, the oob estimate has an extra advantage because it does not require the user to hold out any data. For these reasons, we recommend the oob estimate in practice.

Another conclusion to draw from the plots is that the bias correction plays a significant role in the extrapolation of the oob estimate. If the bias correction were not used, this would be equivalent to shifting the blue curve so that it starts at the same point as the green curve, which would clearly lead to a loss in accuracy. Also, it is remarkable that the extrapolated oob estimator continues to be accurate at a final ensemble size of $t=2,\!000$ that is 4 times larger than the initial ensemble size $t_{0}=500$ . Hence, this provides the user with a very inexpensive way to predict how quickly the ensemble will converge. Moreover, even in the cases where the extrapolation starts from a mediocre initial estimate, the accuracy tends to improve as $t$ becomes larger.

To explain the inferior performance of the hold-out estimate, recall that it uses the small set $\mathcal{H}$ in order to estimate $\textsc{mse}_{t}$ . As a result, the estimates of $\textsc{mse}_{t}$ using $\mathcal{H}$ have much more variability, which inflates the upper extremes of the estimator’s sampling distribution, and thus leads to a larger estimate of $q_{.90}(t)$ . On the other hand, the oob estimator is able to take advantage of the oob samples in the much larger set $\mathcal{D}$ , which reduces this detrimental effect.

5.3 Numerical results for variable importance

The results in the setting of variable importance are simpler to describe, since there is only one type of estimate for ${\tt{q}}_{.90}(t)$ . In Figures 8 through 10, we plot the average of the 1,000 realizations of the estimates using a blue curve, while the error bars at the endpoints represent the 10% and 90% empirical quantiles of the estimates. In addition, the extrapolation procedure was based on an initial ensemble size of $t_{0}=500$ , as in the previous subsection. From the four plots, it is clear that the extrapolated estimate displays excellent overall performance, with its bias and variance both being very small.

Outline of proofs.

The key points of the proof of Theorem 3.1 are explained in Appendix A, and the primary lemmas are given in Appendix B. These lemmas rely on secondary technical results and background facts which are given in Appendices C and D respectively.

Notation and conventions.

To simplify presentation, letters such as $c,c_{0},c_{1},$ etc., will be re-used to refer to positive absolute constants, not depending on $t$ , $B$ , or $k$ , and likewise, these letters may take a different value at each occurrence. Regarding the quantity $\delta_{t,k,B}(\mathcal{D})$ defined in equation (3.4) of Theorem 3.1, we will omit the subscripts and write $\delta(\mathcal{D})$ in order to lighten notation. In addition, if $C\geq 1$ is an absolute constant, we may assume without loss of generality that

$$\delta(\mathcal{D})<\frac{1}{C},$$

because if the constant $c_{0}$ in Theorem 3.1 is chosen to satisfy $c_{0}\geq C$ , then the result is clearly true when $\delta(\mathcal{D})\geq\frac{1}{C}$ . Next, we will often make use of the following basic moment relations involving quantities defined on page 3,

[TABLE]

These relations are straightforward to verify using the Cauchy-Schwarz and Jensen inequalities, and hence the details are omitted. Furthermore, under the above condition $\delta(\mathcal{D})<\frac{1}{C}$ , these relations and the definition of $\delta(\mathcal{D})$ in (3.4) imply that

[TABLE]

which will be useful in simplifying some expressions. Next, recall that the quantile function $G^{-1}$ associated with a generic distribution function $G$ is defined as

$$G^{-1}(r) := \inf\big\{s\in\mathbb{R}\ \big|\ G(s)\geq r\big\}$$

for any $r\in(0,1)$ . Lastly, the supremum norm of a function $h:\mathbb{R}\to\mathbb{R}$ is written as $\|h\|_{\infty}=\sup_{s\in\mathbb{R}}|h(s)|$ .

Appendix A High-level proof of Theorem 3.1

Define the following distribution functions at any $s\in\mathbb{R}$ ,

[TABLE]

and

[TABLE]

where each $\textsc{mse}_{t,l}^{*}-\textsc{mse}_{t}$ is an independent copy of the bootstrap sample (2.5), conditionally on $\mathcal{D}$ and $\boldsymbol{\xi}_{t}$ .

In Proposition A.1 below, we will show there is an absolute constant $c_{1}>0$ such that

[TABLE]

which is the most substantial part of the proof. Next, recall that $q_{1-\alpha}(t)$ and $\widehat{q}_{1-\alpha}(t)$ are defined to satisfy

[TABLE]

and let $\mathcal{E}$ be an event defined by

[TABLE]

By intersecting the event $\{\textsc{mse}_{t}-\text{mse}_{\infty}>\widehat{q}_{1-\alpha}(t)\}$ with $\mathcal{E}$ and $\mathcal{E}^{c}$ , it follows that

[TABLE]

In turn, observe that if the event $\{\|\widehat{F}-F\|_{\infty}\leq c_{1}\delta(\mathcal{D})\}$ holds, then

[TABLE]

which implies that the event $\mathcal{E}$ contains $\{\|\widehat{F}-F\|_{\infty}\leq c_{1}\delta(\mathcal{D})\}$ . In other words, the bound (A.3) implies

[TABLE]

Combining this with (A.5) gives

[TABLE]

Finally, it is clear that there is an absolute constant $c_{2}>0$ such that $4e^{-k/2}+2/B^{2}\leq c_{2}\delta(\mathcal{D})$ , and so the proof is complete. ∎

Proposition A.1.

Suppose the conditions of Theorem 3.1 hold. Then, there is an absolute constant $c_{1}>0$ such that

[TABLE]

Proof.

For any fixed $s\in\mathbb{R}$ , define the distribution function

[TABLE]

Clearly,

[TABLE]

The proof amounts to bounding the two terms on the right. To consider the first term $\|\widehat{F}-\tilde{F}\|_{\infty}$ , note that $\widehat{F}$ is the empirical distribution function based on $B$ i.i.d. samples from $\tilde{F}$ . Therefore, we may apply the Dvoretzky-Kiefer-Wolfowitz inequality (Lemma D.2) conditionally on $\mathcal{D}$ and $\boldsymbol{\xi}_{t}$ , and then take the expectation over $\boldsymbol{\xi}_{t}$ to obtain

[TABLE]
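For orientation, recall the classical form of the Dvoretzky-Kiefer-Wolfowitz inequality with Massart's constant: if $\widehat{G}_{B}$ is the empirical distribution function of $B$ i.i.d. samples from a distribution function $G$, then a deviation bound holds for every $\epsilon>0$. The symbols $G$ and $\widehat{G}_{B}$, and the particular choice of $\epsilon$ below, are illustrative assumptions rather than notation taken from Lemma D.2; the choice is made only because it produces a term of the form $2/B^{2}$, like the one appearing at the end of the proof of Theorem 3.1.

$$\mathbb{P}\Big(\sup_{s\in\mathbb{R}}\big|\widehat{G}_{B}(s)-G(s)\big|>\epsilon\Big)\ \leq\ 2e^{-2B\epsilon^{2}}, \qquad\text{so that}\qquad \epsilon=\sqrt{\tfrac{\log(B)}{B}}\ \ \Longrightarrow\ \ 2e^{-2\log(B)}=\frac{2}{B^{2}}.$$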

Handling the second term $\|\tilde{F}-F\|_{\infty}$ is much more involved. To do this, we consider two random variables $Z$ and $Z^{*}$ , to be defined later, which allow the distance $\|\tilde{F}-F\|_{\infty}$ to be bounded in three parts:

[TABLE]

Specifically, each of the terms on the right side will be handled in Lemmas B.2, B.3, and B.1 respectively. Combining the results of those lemmas shows that there is an absolute constant $c>0$ such that

[TABLE]

Finally, the proof is completed by combining the inequalities (A.12) and (A.14).∎

Appendix B Primary lemmas

This section contains the three essential lemmas for proving Proposition A.1.

Lemma B.1.

Suppose that the conditions of Theorem 3.1 hold. Let $Z$ be a Gaussian random variable generated conditionally on $\mathcal{D}$ as $Z\sim N(0,\sigma^{2}(\mathcal{D}))$ . Also, for any $s\in\mathbb{R}$ , define $F_{Z}(s)=\mathbb{P}(Z\leq s\,|\,\mathcal{D})$ . Then, there is an absolute constant $c>0$ , such that

[TABLE]

Proof.

A bit of algebra gives the relation

[TABLE]

where we define the random variables

[TABLE]

Also, for each $i\in\{1,\dots,t\}$ , define the random variable

[TABLE]

which differs from the previous definition of $\zeta$ in (3.2) only through the dependence on $T_{i}$ . The proof consists in showing that $Z_{t}$ can be approximated by a Gaussian distribution, and that $R_{t}$ is negligible. Observe that $Z_{t}$ can be written as

[TABLE]

and note that the summands $\zeta_{1},\dots,\zeta_{t}$ are centered, and are i.i.d. conditionally on $\mathcal{D}$ . If we define $F_{Z_{t}}(s)=\mathbb{P}(Z_{t}\leq s|\mathcal{D})$ for any $s\in\mathbb{R}$ , then Lemma D.3 implies that the following inequality holds for any $r>0$ ,

[TABLE]

where we note that $R_{t}$ is non-negative. Hence, it remains to bound the first and third terms on the right side, and then select a value of $r$ . The first term satisfies the Berry-Esseen bound

[TABLE]

where $\rho(\mathcal{D}):=(\mathbb{E}[|\zeta_{1}|^{3}|\mathcal{D}])^{1/3}$ . Next, the third term $\mathbb{P}(R_{t}>r|\mathcal{D})$ is handled in Lemma C.1, which shows that if we take

[TABLE]

for some absolute constant $c>0$ , then

[TABLE]

Combining the three previous bounds gives

[TABLE]

Finally, we use the following bounds from (A0.2)

[TABLE]

and then the stated result follows from (B.7) after simplifying.∎
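As an illustration of the Berry-Esseen step used above, the following Python sketch computes the exact Kolmogorov distance between a normalized sum of centered Bernoulli variables and its Gaussian limit, and compares it with $C\rho^{3}/(\sigma^{3}\sqrt{t})$. The Bernoulli summands, the function names, and the numerical value $C=0.56$ (a published admissible constant for i.i.d. summands) are assumptions made for illustration only; the paper's summands $\zeta_{i}$ and its absolute constant $c$ are not reproduced here.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def kolmogorov_distance(t: int, p: float) -> float:
    """Exact sup_s |P(Z_t <= s) - P(Z <= s)| for Z_t = t^{-1/2} * sum_i (B_i - p),
    where B_i ~ Bernoulli(p) are i.i.d. and Z ~ N(0, p(1-p))."""
    sigma = math.sqrt(p * (1.0 - p))
    cdf = 0.0
    worst = 0.0
    for j in range(t + 1):
        s = (j - t * p) / math.sqrt(t)  # j-th atom of Z_t
        gauss = normal_cdf(s / sigma)
        worst = max(worst, abs(cdf - gauss))  # left limit at the atom
        cdf += math.comb(t, j) * p**j * (1.0 - p) ** (t - j)
        worst = max(worst, abs(cdf - gauss))  # value at the atom
    return worst


def berry_esseen_bound(t: int, p: float, const: float = 0.56) -> float:
    """const * rho^3 / (sigma^3 * sqrt(t)), with rho^3 = E|B - p|^3."""
    sigma = math.sqrt(p * (1.0 - p))
    rho_cubed = p * (1.0 - p) * ((1.0 - p) ** 2 + p**2)
    return const * rho_cubed / (sigma**3 * math.sqrt(t))


if __name__ == "__main__":
    for t in (25, 100, 400):
        print(t, kolmogorov_distance(t, 0.3), berry_esseen_bound(t, 0.3))
```

Both columns shrink at the rate $1/\sqrt{t}$, which is the behavior the first term of the bound relies on.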

Remark.

For the statement and proof of the next lemma, define the random variables

[TABLE]

for each $i\in\{1,\dots,t\}$ , which are conditionally i.i.d. given $(\mathcal{D},\boldsymbol{\xi}_{t})$ , with mean zero. Likewise, define the moments

[TABLE]

Lastly, recall that $\tilde{F}$ is the distribution function of $\sqrt{t}(\textsc{mse}_{t}^{*}-\textsc{mse}_{t})$ given $(\mathcal{D},\boldsymbol{\xi}_{t})$ , as defined in (A.10).

Lemma B.2.

Suppose that the conditions of Theorem 3.1 hold. Let $Z^{*}$ be a Gaussian random variable, generated conditionally on $\mathcal{D}$ and $\boldsymbol{\xi}_{t}$ according to $Z^{*}\sim N(0,\widehat{\sigma}^{2}(\mathcal{D},\boldsymbol{\xi}_{t}))$ . Also, for any $s\in\mathbb{R}$ , let $F_{Z^{*}}(s)=\mathbb{P}(Z^{*}\leq s|\mathcal{D},\boldsymbol{\xi}_{t})$ . Then, there is an absolute constant $c>0$ , such that

[TABLE]

Proof.

The proof can be viewed as the bootstrap counterpart to the proof of Lemma B.1. It is straightforward to verify the relation

[TABLE]

where we define the random variables

[TABLE]

Also, observe that $Z_{t}^{*}$ can be written as

[TABLE]

Next, for any $s\in\mathbb{R}$ , define the conditional distribution function

[TABLE]

In turn, Lemma D.3 gives the following bound for any realization of $\mathcal{D}$ and $\boldsymbol{\xi}_{t}$ , and any fixed $r>0$ ,

[TABLE]

The first term on the right satisfies the Berry-Esseen bound,

[TABLE]

Furthermore, the quantities $\widehat{\rho}(\mathcal{D},\boldsymbol{\xi}_{t})$ and $\widehat{\sigma}(\mathcal{D},\boldsymbol{\xi}_{t})$ can be controlled with the help of the following tail bounds, which are direct consequences of Lemmas C.3 and C.2,

[TABLE]

and

[TABLE]

Next, to use an alternative notation for the third term on the right side of (B.15), let

[TABLE]

and also write its expectation with respect to $\boldsymbol{\xi}_{t}$ as

[TABLE]

Then, Markov’s inequality gives

[TABLE]

In Lemma C.4, we show that if $r$ is chosen as

[TABLE]

for a sufficiently large absolute constant $c>0$ , then the bound

[TABLE]

holds for any realization of $\mathcal{D}$ . Combining the ingredients above, we have

[TABLE]

Finally, the term involving $1/\sqrt{t}$ can be simplified by making use of the simple inequalities

[TABLE]

This leads to the stated result.∎
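For orientation, the bootstrap quantity $\sqrt{t}(\textsc{mse}_{t}^{*}-\textsc{mse}_{t})$ whose conditional distribution is analyzed in Lemma B.2 can be simulated by resampling the $t$ trained base learners with replacement. The Python sketch below is an illustration under stated assumptions, not the paper's algorithm: it supposes that each learner's predictions on a hold-out set are stored in a nested list (`preds_by_tree`, a hypothetical layout), and that the error is the average squared error of the ensemble mean on that set. All function names are hypothetical.

```python
import math
import random


def ensemble_mse(preds_by_tree, y):
    """Average squared error of the ensemble mean on a hold-out set.

    preds_by_tree[i][j] is learner i's prediction at hold-out point j
    (hypothetical layout chosen for this sketch).
    """
    t = len(preds_by_tree)
    total = 0.0
    for j, target in enumerate(y):
        avg = sum(tree[j] for tree in preds_by_tree) / t
        total += (target - avg) ** 2
    return total / len(y)


def bootstrap_gaps(preds_by_tree, y, num_boot=200, seed=0):
    """Draws of sqrt(t) * (mse*_t - mse_t), resampling learners with replacement."""
    rng = random.Random(seed)
    t = len(preds_by_tree)
    mse_t = ensemble_mse(preds_by_tree, y)
    gaps = []
    for _ in range(num_boot):
        resampled = [preds_by_tree[rng.randrange(t)] for _ in range(t)]
        gaps.append(math.sqrt(t) * (ensemble_mse(resampled, y) - mse_t))
    return gaps


if __name__ == "__main__":
    rng = random.Random(1)
    y = [rng.gauss(0.0, 1.0) for _ in range(30)]
    # Synthetic "learners": noisy copies of the targets (illustrative data only).
    trees = [[v + rng.gauss(0.0, 0.5) for v in y] for _ in range(50)]
    gaps = sorted(bootstrap_gaps(trees, y))
    print("approx. 90% quantile of the bootstrap gaps:", gaps[int(0.9 * len(gaps))])
```

The empirical distribution of the returned draws plays the role of $\tilde{F}$ for one realization of the data and the ensemble.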

Lemma B.3.

Suppose that the conditions of Theorem 3.1 hold. Let $F_{Z}$ and $F_{Z^{*}}$ be as defined in the statements of Lemmas B.1 and B.2. Then, there is an absolute constant $c>0$ such that

[TABLE]

Proof.

Recall that $F_{Z}$ and $F_{Z^{*}}$ correspond to centered Gaussian distributions. It is a basic fact about the function $\Phi$ that the following bound holds for any positive numbers $\sigma_{1}$ and $\sigma_{2},$

[TABLE]

where $c>0$ is an absolute constant. Since the respective variances of $F_{Z}$ and $F_{Z^{*}}$ are $\sigma^{2}(\mathcal{D})$ and $\widehat{\sigma}^{2}(\mathcal{D},\boldsymbol{\xi}_{t})$ , this means

[TABLE]

Combining this inequality with Lemma C.2 (below) completes the proof. ∎
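To make the comparison of two centered Gaussian distribution functions concrete, here is a small numerical sketch in Python. It checks one elementary bound of this type, $\sup_{s}|\Phi(s/\sigma_{1})-\Phi(s/\sigma_{2})|\leq|\sigma_{1}-\sigma_{2}|/(\sqrt{2\pi e}\,\min(\sigma_{1},\sigma_{2}))$, which follows from the mean value theorem together with $\sup_{u}u\phi(u)=1/\sqrt{2\pi e}$. The exact form and constant used in the paper may differ; the grid-based supremum and the function names are illustrative assumptions.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def sup_gap(sigma1: float, sigma2: float, grid: int = 20001, span: float = 10.0) -> float:
    """Grid approximation (from below) of sup_s |Phi(s/sigma1) - Phi(s/sigma2)|."""
    scale = span * max(sigma1, sigma2)
    worst = 0.0
    for i in range(grid):
        s = -scale + 2.0 * scale * i / (grid - 1)
        worst = max(worst, abs(normal_cdf(s / sigma1) - normal_cdf(s / sigma2)))
    return worst


def variance_mismatch_bound(sigma1: float, sigma2: float) -> float:
    """|sigma1 - sigma2| / (sqrt(2*pi*e) * min(sigma1, sigma2))."""
    return abs(sigma1 - sigma2) / (math.sqrt(2.0 * math.pi * math.e) * min(sigma1, sigma2))


if __name__ == "__main__":
    for s1, s2 in [(1.0, 1.1), (1.0, 2.0), (0.5, 0.4)]:
        print(s1, s2, sup_gap(s1, s2), variance_mismatch_bound(s1, s2))
```

The point of such a bound is that the distance between the two Gaussian laws is controlled by the relative error in the standard deviation, so a tail bound on $\widehat{\sigma}$ translates directly into a bound on $\|F_{Z}-F_{Z^{*}}\|_{\infty}$.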

Appendix C Secondary lemmas

Remark.

Recall that $R_{t}$ is defined in (B.3) as $R_{t}=\sqrt{t}\|\bar{T}_{t}-\vartheta\|_{L_{2}}^{2}.$

Lemma C.1.

Suppose the conditions of Theorem 3.1 hold. Then, there is an absolute constant $c>0$ , such that

[TABLE]

Proof.

The proof is based on the inequality

[TABLE]

with a suitably chosen number $s>0$ . In order to control $\mathbb{E}[R_{t}^{k}|\mathcal{D}]$ , we will use a version of Rosenthal’s inequality that is applicable to sums of independent Banach-valued random variables, as given in Lemma D.1. Specifically, this lemma shows that

[TABLE]

where $c>0$ is an absolute constant. Regarding the first term on the right, we may use the fact that $T_{1},\dots,T_{t}$ are conditionally i.i.d. given $\mathcal{D}$ to obtain

[TABLE]

The second term on the right side of (C.2) can be bounded as

[TABLE]

Recalling the prefactor of $\sqrt{t}$ in the definition of $R_{t}$ , as well as the fact that $\beta_{1}(\mathcal{D})\leq\beta_{k}(\mathcal{D})$ , it follows that the previous work can be combined as

[TABLE]

Hence, if we take

[TABLE]

in the inequality (C.1), then the proof is complete.∎
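The proof above rests on a moment-based tail inequality of Chebyshev or Markov type. A generic form of that device is $\mathbb{P}(X\geq s)\leq\mathbb{E}[X^{k}]/s^{k}$ for a non-negative random variable $X$; whether the paper's display (C.1) is normalized in exactly this way is not visible here. The Python sketch below checks the generic inequality for an Exp(1) variable, where both sides are available in closed form. The choice of distribution and the function names are assumptions for illustration.

```python
import math


def exp_tail(s: float) -> float:
    """P(X >= s) for X ~ Exp(1)."""
    return math.exp(-s)


def moment_bound(s: float, k: int) -> float:
    """E[X^k] / s^k for X ~ Exp(1), using E[X^k] = k!."""
    return math.factorial(k) / s**k


def best_moment_bound(s: float, k_max: int = 50) -> float:
    """Smallest moment bound over k = 1, ..., k_max."""
    return min(moment_bound(s, k) for k in range(1, k_max + 1))


if __name__ == "__main__":
    for s in (2.0, 5.0, 10.0):
        print(s, exp_tail(s), moment_bound(s, 1), best_moment_bound(s))
```

Optimizing over $k$ gives a much smaller bound than the first-moment case when $s$ is large, which is why the argument works with a general moment order $k$.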

Lemma C.2.

Suppose that the conditions of Theorem 3.1 hold. Then, there are absolute constants $c_{0},c_{1}>0$ such that

[TABLE]

and

[TABLE]

Proof.

Note that the second bound (C.7) follows from the first bound (C.6) due to the inequality

[TABLE]

as well as the condition (A0.1). In order to prove (C.6), the main idea is to derive a quantity $b(\mathcal{D})$ satisfying

[TABLE]

and then Chebyshev’s inequality gives

[TABLE]

To derive $b(\mathcal{D})$ , first recall that

[TABLE]

Simple algebra gives the relation

[TABLE]

where we put

[TABLE]

This allows $\widehat{\sigma}^{2}(\mathcal{D},\boldsymbol{\xi}_{t})$ to be written as

[TABLE]

and so the triangle inequality for the conditional $L_{k}$ norm $(\mathbb{E}[|\cdot|^{k}|\mathcal{D}])^{1/k}$ gives

[TABLE]

where the terms on the right are defined as

[TABLE]

To handle the term $A_{1}(\mathcal{D})$ , a straightforward calculation based on the bound $(\mathbb{E}[|\zeta_{1}|^{k}|\mathcal{D}])^{\frac{1}{k}}\leq 2\beta_{k}(\mathcal{D})$ and Rosenthal’s inequality (Lemma D.1) shows that

[TABLE]

where $c>0$ is an absolute constant. Next, using the triangle and Cauchy-Schwarz inequalities, it is simple to check that the second term $A_{2}(\mathcal{D})$ satisfies

[TABLE]

To complete the proof, it suffices to bound the quantity $\big{(}\mathbb{E}\big{[}|\Delta_{1}|^{k}\big{|}\mathcal{D}\big{]}\big{)}^{\frac{1}{k}}$ for general $k$ . Using steps analogous to the ones in the bound (C.15), we obtain

[TABLE]

Next, recall that the argument following the bound (C.2) in the proof of Lemma C.1 leads to

[TABLE]

for some absolute constant $c>0$ . In addition, if we apply a discrete version of Jensen’s inequality

[TABLE]

and use Assumption A2 to get

[TABLE]

then

[TABLE]

To combine the work above, recall the condition (A0.3), and note that we must replace $k$ with $2k$ when relating the bound (C.17) to $\big{(}\mathbb{E}[|\Delta_{1}|^{2k}|\mathcal{D}]\big{)}^{\frac{1}{2k}}$ . Altogether, we conclude

[TABLE]

Hence, if we define $b(\mathcal{D})$ to be the right side above, then the bound (C.8) completes the proof.∎

Lemma C.3.

Suppose the conditions of Theorem 3.1 hold. Then, there is an absolute constant $c>0$ such that

[TABLE]

Proof.

Recall that

[TABLE]

Also, if we recall the relation (C.9)

[TABLE]

with $\Delta_{i}$ as defined in (C.10), then

[TABLE]

To derive a high probability upper bound on $\widehat{\rho}(\mathcal{D},\boldsymbol{\xi}_{t})$ , it is enough to use Chebyshev’s inequality

[TABLE]

in conjunction with a bound on $(\mathbb{E}[\widehat{\rho}(\mathcal{D},\boldsymbol{\xi}_{t})^{3k}|\mathcal{D}])^{1/k}$ . By the triangle inequality for the conditional $L_{k}$ norm $(\mathbb{E}[|\cdot|^{k}|\mathcal{D}])^{1/k}$ , we have

[TABLE]

It is straightforward to check that the first term on the right satisfies

[TABLE]

Meanwhile, the following crude (but adequate) bound for the second term on the right side of (C.22) can be obtained directly from (C.17) and the condition (A0.3),

[TABLE]

Altogether, we have

[TABLE]

and so the stated result follows from the Chebyshev bound (C.21). ∎

Remark.

Recall that $R_{t}^{*}$ is defined in equation (B.13) as

[TABLE]

Lemma C.4.

Suppose the conditions of Theorem 3.1 hold. Then, there is an absolute constant $c>0$ such that

[TABLE]

Proof.

The proof is similar to that of Lemma C.1, and proceeds by developing a bound on the conditional moment

[TABLE]

To begin, note that $\bar{T}_{t}^{*}-\bar{T}_{t}$ is a sum of i.i.d., zero-mean Banach-valued random variables, conditionally on $\mathcal{D}$ and $\boldsymbol{\xi}_{t}$ . So, if we apply Lemma D.1 with the inequality $(a+b)^{2k}\leq 2^{2k}(a^{2k}+b^{2k})$ , then

[TABLE]

where $c>0$ is an absolute constant. Direct calculation shows that the first term on the right satisfies

[TABLE]

where Jensen’s inequality has been used in the last step. Likewise, the second term in (C.23) satisfies

[TABLE]

Hence, if we integrate with respect to $\boldsymbol{\xi}_{t}$ , then (C.23) leads to

[TABLE]

The last factor on the right can be decomposed as

[TABLE]

Combining the last two displays, it follows after some simplification that

[TABLE]

and this completes the proof by using Chebyshev’s inequality in the same manner as in the proof of Lemma C.1.∎

Appendix D Background results

The following inequality is a modified version of the main result in Talagrand (1989); see also Johnson et al. (1985) and Kwapień et al. (1991).

Lemma D.1.

Let $W_{1},\dots,W_{m}$ be independent and zero-mean elements of a Banach space with norm $\|\cdot\|$ . Then, there is an absolute constant $c>0$ , such that for any $r\geq 1$ ,

[TABLE]

In particular, if $W_{1},\dots,W_{m}$ are scalar random variables, and $\|\cdot\|_{r}=(\mathbb{E}[|\cdot|^{r}])^{1/r}$ , then

[TABLE]

The next lemma is the Dvoretzky-Kiefer-Wolfowitz inequality (Dvoretzky et al., 1956; Massart, 1990).

Lemma D.2.

Let $\xi_{1},\dots,\xi_{m}$ be independent random variables with a common distribution function $G$ . Also, for any $s\in\mathbb{R}$ , let

[TABLE]

Then, for any fixed $x>0$ ,

[TABLE]
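In its standard form with Massart's constant, the Dvoretzky-Kiefer-Wolfowitz inequality reads $\mathbb{P}(\sup_{s}|G_{m}(s)-G(s)|>x)\leq 2e^{-2mx^{2}}$, where $G_{m}$ denotes the empirical distribution function of the sample (this notation is assumed here). The Python sketch below estimates the left side by Monte Carlo for uniform samples and compares it with the right side. The uniform distribution, the sample sizes, and the function names are illustrative assumptions.

```python
import math
import random


def sup_deviation_uniform(sample):
    """Exact sup_s |G_m(s) - G(s)| when G is the Uniform(0, 1) distribution function."""
    xs = sorted(sample)
    m = len(xs)
    worst = 0.0
    for i, x in enumerate(xs, start=1):
        # The empirical distribution function jumps from (i - 1)/m to i/m at x.
        worst = max(worst, i / m - x, x - (i - 1) / m)
    return worst


def dkw_bound(m: int, x: float) -> float:
    """Right side of the inequality: 2 * exp(-2 * m * x^2)."""
    return 2.0 * math.exp(-2.0 * m * x * x)


def exceedance_frequency(m: int, x: float, reps: int = 2000, seed: int = 0) -> float:
    """Monte Carlo estimate of P(sup_s |G_m(s) - G(s)| > x)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [rng.random() for _ in range(m)]
        if sup_deviation_uniform(sample) > x:
            hits += 1
    return hits / reps


if __name__ == "__main__":
    for m, x in [(20, 0.2), (50, 0.2), (200, 0.1)]:
        print(m, x, exceedance_frequency(m, x), dkw_bound(m, x))
```

The estimated frequencies carry Monte Carlo error, so they should be read as a sanity check rather than a verification of the inequality.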

Lemma D.3.

Fix any $\tau>0$ . Let $U,\,V,\,W,$ and $R$ be random variables, satisfying $U=V+R$ , and $W\sim N(0,\tau^{2})$ . Also let $F_{U},\,F_{V},$ and $F_{W}$ denote the distribution functions of the first three variables. Then, for any $r>0$ ,

[TABLE]

Proof.

It is straightforward to check that the following inequalities hold for any $s\in\mathbb{R}$ ,

[TABLE]

and so

[TABLE]

Note also that for any fixed $r>0$ ,

[TABLE]

Here, the first probability on the right side can be bounded as

[TABLE]

Since $W\sim N(0,\tau^{2})$ , its distribution function is Lipschitz with parameter $\frac{1}{\sqrt{2\pi}\tau}$ , and so we have

[TABLE]

for every $s\in\mathbb{R}$ . Combining the last several steps with the bound (D.5) gives

[TABLE]

In turn, adding $\|F_{V}-F_{W}\|_{\infty}$ to both sides leads to the stated bound on $\|F_{U}-F_{W}\|_{\infty}$ .∎
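The Lipschitz property of the Gaussian distribution function used in this proof can be checked in a simple special case. Take $V=W\sim N(0,\tau^{2})$ and a deterministic perturbation $R=\delta>0$, so that $U=W+\delta$ and $\mathbb{P}(|R|>r)=0$ when $r=\delta$. Then $\|F_{U}-F_{W}\|_{\infty}=2\Phi(\delta/(2\tau))-1$, which is at most $\delta/(\sqrt{2\pi}\tau)$. The Python sketch below confirms this numerically; the special case, the grid, and the function names are assumptions chosen for illustration.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def shift_gap(delta: float, tau: float, grid: int = 20001, span: float = 10.0) -> float:
    """Grid approximation of sup_s |F_U(s) - F_W(s)| for U = W + delta, W ~ N(0, tau^2)."""
    worst = 0.0
    for i in range(grid):
        s = -span * tau + 2.0 * span * tau * i / (grid - 1)
        worst = max(worst, abs(normal_cdf((s - delta) / tau) - normal_cdf(s / tau)))
    return worst


def exact_shift_gap(delta: float, tau: float) -> float:
    """Closed form 2 * Phi(delta / (2 * tau)) - 1, attained at s = delta / 2."""
    return 2.0 * normal_cdf(delta / (2.0 * tau)) - 1.0


def lipschitz_bound(r: float, tau: float) -> float:
    """r / (sqrt(2 * pi) * tau), the Lipschitz constant of F_W times r."""
    return r / (math.sqrt(2.0 * math.pi) * tau)


if __name__ == "__main__":
    for delta, tau in [(0.1, 1.0), (0.5, 1.0), (0.2, 2.0)]:
        print(delta, tau, shift_gap(delta, tau), lipschitz_bound(delta, tau))
```

For small $\delta/\tau$ the two quantities nearly coincide, which indicates that the term proportional to $r/\tau$ in the lemma cannot be improved by more than a constant factor.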

Bibliography

1. Arlot, S. and Genuer, R. (2014) Analysis of purely random forests bias. preprint arXiv:1407.3939.
2. Basilico, J., Munson, M., Kolda, T., Dixon, K. and Kegelmeyer, W. (2011) Comet: A recipe for learning and using large ensembles on massive data. In Data Mining (ICDM), 2011 IEEE 11th International Conference on, 41–50. IEEE.
3. Biau, G. (2012) Analysis of a random forests model. Journal of Machine Learning Research, 13, 1063–1095.
4. Biau, G., Devroye, L. and Lugosi, G. (2008) Consistency of random forests and other averaging classifiers. Journal of Machine Learning Research, 9, 2015–2033.
5. Bickel, P. J. and Yahav, J. A. (1988) Richardson extrapolation and the bootstrap. Journal of the American Statistical Association, 83, 387–393.
6. Blaser, R. and Fryzlewicz, P. (2016) Random rotation ensembles. The Journal of Machine Learning Research, 17, 126–151.
7. Breiman, L. (1996) Bagging predictors. Machine Learning, 24, 123–140.
8. Breiman, L. (2001) Random forests. Machine Learning, 45, 5–32.
