Two-stage Best-scored Random Forest for Large-scale Regression

Hanyuan Hang; Yingyi Chen; Johan A.K. Suykens

arXiv:1905.03438·stat.ML·May 10, 2019

Open Access

TL;DR

This paper introduces a two-stage best-scored random forest method for large-scale regression that achieves near-optimal learning rates, supports parallel computation, and integrates various regression strategies, validated by extensive experiments.

Contribution

The paper presents a novel two-stage random forest framework with best-scored selection, enhancing accuracy, efficiency, and flexibility for large-scale regression tasks.

Findings

01. Achieves almost optimal learning rates.

02. Enables parallel computing for efficiency.

03. Outperforms state-of-the-art methods on large datasets.

Abstract

We propose a novel method designed for large-scale regression problems, namely the two-stage best-scored random forest (TBRF). "Best-scored" means to select one regression tree with the best empirical performance out of a certain number of purely random regression tree candidates, and "two-stage" means to divide the original random tree splitting procedure into two: In stage one, the feature space is partitioned into non-overlapping cells; in stage two, child trees grow separately on these cells. The strengths of this algorithm can be summarized as follows: First of all, the pure randomness in TBRF leads to the almost optimal learning rates, and also makes ensemble learning possible, which resolves the boundary discontinuities long plaguing the existing algorithms. Secondly, the two-stage procedure paves the way for parallel computing, leading to computational efficiency. Last but not…

Tables

Table 1: Mean Squared Error / Training Time (seconds) on Test Data

Datasets	$(n, d)$	TBRF MSE	TBRF Time	PK MSE	PK Time	VP-SVM MSE	VP-SVM Time
TCO*	$(48331, 2)$	8.94	40.99	10.77	111.19	10.85	48.65
PTS	$(45730, 9)$	13.43	27.48	18.76	255.31	14.21	60.06
SARCOS	$(48933, 21)$	1.20	32.26	2.04	158.69	2.94	27.94
AEP	$(19735, 27)$	5261.9	9.44	8194.3	109.49	7037.9	11.79
HPP	$(22784, 8)$	1226.7	14.41	1299.3	195.62	1273.8	18.29
MSD*	$(515345, 90)$	81.11	326.85	$-, -$	$\geq 36$ h	85.10	419.29
MSD*	$(515345, 90)$	80.33	3474.6	$-, -$	$\geq 36$ h	85.10	419.29

Equations

$$\mathcal{R}_{L,\mathrm{P}}(f) := \int_{\mathcal{X}\times\mathcal{Y}} L(Y, f(X))\, d\mathrm{P}(X,Y),$$

$$\mathcal{R}_{L,\mathrm{D}}(f) := \frac{1}{n}\sum_{i=1}^{n} L(Y_{i}, f(X_{i})),$$

$$\mathcal{R}^{*}_{L,\mathrm{P}} := \inf\{\mathcal{R}_{L,\mathrm{P}}(f) \mid f:\mathcal{X}\to\mathcal{Y} \text{ measurable}\}.$$

$$f^{*}_{L,\mathrm{P}} = \mathbb{E}_{\mathrm{P}}(Y \mid X)$$

$$I_{j} := \{i\in\{1,\ldots,n\} : x_{i}\in V_{j}\}, \qquad j=1,\ldots,m,$$

$$D_{j} := \{(x_{i},y_{i})\in D : i\in I_{j}\}, \qquad j=1,\ldots,m.$$

$$L_{j}(Y, f(X)) := \mathbf{1}_{V_{j}}(X)\, L(Y, f(X))$$

$$h_{Z,p}(X) := \frac{\sum_{i=1}^{n} Y_{i}\mathbf{1}_{\{X_{i}\in A_{Z,p}(X)\}}}{\sum_{i=1}^{n}\mathbf{1}_{\{X_{i}\in A_{Z,p}(X)\}}}\,\mathbf{1}_{E_{Z,p}(X)},$$

$$E_{Z,p}(X) = \Bigl\{\sum_{i=1}^{n}\mathbf{1}_{\{X_{i}\in A_{Z,p}(X)\}} \neq 0\Bigr\}.$$

$$\mathcal{T}_{j} := \Bigl\{\sum_{i=0}^{p} c_{i}\mathbf{1}_{A_{i}} : p\in\mathbb{N},\ c_{i}\in[-M,M],\ \bigcup_{i=0}^{p}A_{i}=V_{j},\ A_{s}\cap A_{s'}=\emptyset,\ s\neq s'\Bigr\}.$$

$$\mathcal{T}_{Z_{js},p} := \Bigl\{\sum_{i=0}^{p} c_{i}\mathbf{1}_{A_{i}} : c_{i}\in[-M,M],\ A_{i}\in\mathcal{A}_{Z_{js},p}\Bigr\},$$

$$g(X) := \begin{cases} h(X), & X\in V_{j}, \\ 0, & X\notin V_{j}, \end{cases}$$

$$\hat{\mathcal{T}}_{j} := \{g : h\in\mathcal{T}_{j}\}.$$

$$\hat{\mathcal{T}}_{Z_{js},p} := \{g : h\in\mathcal{T}_{Z_{js},p}\}.$$

$$\mathcal{R}_{L,\mathrm{D}}(f_{D}) + \Omega(f_{D}) = \inf_{f\in\mathcal{T}} \mathcal{R}_{L,\mathrm{D}}(f) + \Omega(f)$$

$$\min_{p\in\mathbb{N}}\ \min_{g\in\hat{\mathcal{T}}_{Z_{js},p}} \lambda_{j}p^{2} + \mathcal{R}_{L_{j},\mathrm{D}}(g), \qquad s=1,\ldots,k_{j}.$$

$$\min_{p\in\mathbb{N}}\ \min_{g\in\hat{\mathcal{T}}_{Z_{js},p}} \lambda_{j}p^{2} + \mathcal{R}_{L_{j},\mathrm{D}}(g) \leq \mathcal{R}_{L_{j},\mathrm{D}}(0) \leq M^{2},$$

$$(g_{Z_{js}}, p_{Z_{js}}) = \operatorname*{arg\,min}_{p\in\mathbb{N}}\ \operatorname*{arg\,min}_{g\in\hat{\mathcal{T}}_{Z_{js},p}} \lambda_{j}p^{2} + \mathcal{R}_{L_{j},\mathrm{D}}(g), \qquad s=1,\ldots,k_{j},$$

$$(g^{*}_{Z_{js}}, p^{*}_{Z_{js}}) = \operatorname*{arg\,min}_{p\in\mathbb{N}}\ \operatorname*{arg\,min}_{g\in\hat{\mathcal{T}}_{Z_{js},p}} \lambda_{j}p^{2} + \mathcal{R}_{L_{j},\mathrm{P}}(g), \qquad s=1,\ldots,k_{j}.$$

$$(h_{L,D_{j},Z_{js}}, p_{L,D_{j},Z_{js}}) := \operatorname*{arg\,min}_{p\in\mathbb{N}}\ \operatorname*{arg\,min}_{h\in\mathcal{T}_{Z_{js},p}} \tilde{\lambda}_{j}p^{2} + \mathcal{R}_{L,\mathrm{D}_{j}}(h), \qquad s=1,\ldots,k_{j}.$$

$$(g_{Z_{j}}, p_{Z_{j}}) = \operatorname*{arg\,min}_{s=1,\ldots,k_{j}} \lambda_{j}p_{Z_{js}}^{2} + \mathcal{R}_{L_{j},\mathrm{D}}(g_{Z_{js}}),$$

$$\hat{\mathcal{T}}_{Z_{j}} := \bigcup_{s=1}^{k_{j}} \hat{\mathcal{T}}_{Z_{js}}.$$

$$\min_{g\in\hat{\mathcal{T}}_{Z_{j}}} \lambda_{j}p^{2}(g) + \mathcal{R}_{L_{j},\mathrm{D}}(g) := \min_{s=1,\ldots,k_{j}}\ \min_{p\in\mathbb{N}}\ \min_{g\in\hat{\mathcal{T}}_{Z_{js},p}} \lambda_{j}p^{2} + \mathcal{R}_{L_{j},\mathrm{D}}(g).$$

$$(g^{*}_{Z_{j}}, p^{*}_{Z_{j}}) = \operatorname*{arg\,min}_{s=1,\ldots,k_{j}} \lambda_{j}(p^{*}_{Z_{js}})^{2} + \mathcal{R}_{L_{j},\mathrm{P}}(g^{*}_{Z_{js}}).$$

$$g_{Z}(X) := \sum_{j=1}^{m} g_{Z_{j}}(X),$$

$$g(X) := \begin{cases} h(X), & X\in V, \\ 0, & X\in\mathcal{X}\setminus V. \end{cases}$$

$$p_{\hat{\mathcal{T}}_{V}} := p_{\mathcal{T}_{V}}.$$

$$\mathcal{T} := \hat{\mathcal{T}}_{V} \oplus \hat{\mathcal{T}}_{W}$$

$$p_{\mathcal{T}} := \bigl(\lambda_{V}p_{\hat{\mathcal{T}}_{V}}^{2} + \lambda_{W}p_{\hat{\mathcal{T}}_{W}}^{2}\bigr)^{1/2}.$$

$$\mathcal{T}_{Z_{J}} := \bigoplus_{j\in J}\hat{\mathcal{T}}_{Z_{j}} = \Bigl\{g=\sum_{j\in J}g_{j} : g_{j}\in\hat{\mathcal{T}}_{Z_{j}}\ \text{for all}\ j\in J\Bigr\},$$

Taxonomy

Topics: Machine Learning and Data Classification · Face and Expression Recognition · Neural Networks and Applications

Full text

Two-stage Best-scored Random Forest for Large-scale Regression

Hanyuan Hang

Yingyi Chen

Institute of Statistics and Big Data

Renmin University of China

100872 Beijing, China

Johan A.K. Suykens

Department of Electrical Engineering, ESAT-STADIUS, KU Leuven

Kasteelpark Arenberg 10, Leuven, B-3001, Belgium

Abstract

We propose a novel method designed for large-scale regression problems, namely the two-stage best-scored random forest (TBRF). Best-scored means to select one regression tree with the best empirical performance out of a certain number of purely random regression tree candidates, and two-stage means to divide the original random tree splitting procedure into two: In stage one, the feature space is partitioned into non-overlapping cells; in stage two, child trees grow separately on these cells. The strengths of this algorithm can be summarized as follows: First of all, the pure randomness in TBRF leads to the almost optimal learning rates, and also makes ensemble learning possible, which resolves the boundary discontinuities long plaguing the existing algorithms. Secondly, the two-stage procedure paves the way for parallel computing, leading to computational efficiency. Last but not least, TBRF can serve as an inclusive framework where different mainstream regression strategies such as linear predictor and least squares support vector machines (LS-SVMs) can also be incorporated as value assignment approaches on leaves of the child trees, depending on the characteristics of the underlying data sets. Numerical assessments on comparisons with other state-of-the-art methods on several large-scale real data sets validate the promising prediction accuracy and high computational efficiency of our algorithm.

Keywords: large-scale regression, purely random tree, random forest, ensemble learning, regularized empirical risk minimization, learning theory

1 Introduction

The ever-increasing scale of modern scientific and technological data sets raises urgent requirements for learning algorithms that not only maintain desirable prediction accuracy but also have high computational efficiency (Wen et al., 2018; Guo et al., 2018; Thomann et al., 2017; Hsieh et al., 2014). However, a major challenge is that data analysis and learning algorithms suitable for modest-sized data sets often encounter difficulties, or even become infeasible, on large-volume data sets, which has led to the currently popular research direction of large-scale regression (Collobert and Bengio, 2001; Raskutti and Mahoney, 2016). In the literature, efforts have been made to tackle large-scale regression problems, and each method has its merits in its own regime. Typically, the mainstream solutions come in two flavors: horizontal methods and vertical methods. The essence of horizontal methods, also called distributed learning, is to partition the data set into subsets, store them on multiple machines, and let these machines train in parallel, each processing its local data to give a local predictor. The local predictors are then synthesized into a final predictor (Zhang et al., 2013, 2015; Lin et al., 2017; Guo et al., 2017). Nevertheless, horizontal methods face problems of their own. Specifically, what is originally needed is a global predictor defined on the whole feature space and trained on all the training data via the chosen regression algorithm. The local predictors, although also defined on the whole feature space, are trained only on the information provided by their data subsets. Consequently, chances are high that each local predictor, let alone the synthesized final predictor, differs substantially from the desired global predictor.

The other category of methods for large-scale regression problems is the vertical methods. The main idea is to first partition the whole feature space (i.e. the input domain) into multiple non-overlapping cells, where different partition methods (Suykens et al., 2002; Espinoza et al., 2006; Bennett and Blue, 1998; Wu et al., 1999; Chang et al., 2010) can be employed. Then, for each of the resulting cells, a predictor is trained on the samples falling into that cell via regression strategies such as Gaussian process regression (Park et al., 2011; Park and Huang, 2016; Park and Apley, 2018), support vector machines (Meister and Steinwart, 2016; Thomann et al., 2017), etc. However, boundary discontinuities have long troubled vertical methods by degrading the regression accuracy, and the literature has been committed to settling this problem. For example, Park et al. (2011) first apply Gaussian process regression to each cell of the decomposed domain with equal boundary constraints at merely a finite number of locations. After finding that this method cannot essentially solve the boundary discontinuities, Park and Huang (2016) propose a solution specifically for this issue, which constrains the predictions of local regressions to share a common boundary. To further mitigate the boundary discontinuities, Park and Apley (2018) recently propose Patch Kriging (PK), which improves on previous work by adding pseudo-observations to the boundaries. However, the boundaries where two adjacent Gaussian processes are joined are artificially chosen, which may have a great impact on the final predictor. Moreover, their approach is fundamentally different in algorithm structure from the original Gaussian process, which is a global method. Additionally, their method may not be well suited to parallel computing. Another vertical method, the Voronoi partition support vector machine (VP-SVM) (Meister and Steinwart, 2016), is amenable to parallel computing, but boundary discontinuities are not demonstrably solved. Besides, this method also no longer shares the spirit of the original global algorithm LS-SVMs (Suykens et al., 2002). To the best of our knowledge, up till now there is no algorithm that both overcomes the boundary discontinuities that have long plagued vertical methods and takes full advantage of the huge parallel computing resources brought by the big data era to obtain results that are both efficient and effective.

Aiming at solving these tough problems, in this paper we propose a novel vertical algorithm named the two-stage best-scored random forest, which is well suited to large-scale regression problems. To be specific, in stage one, the feature space is partitioned into a number of cells following an adaptive random splitting criterion, which paves the way for parallel computing. In stage two, splits are conducted further on each cell separately, following a purely random splitting criterion. Due to the inherent randomness of this splitting criterion, for each cell we are able to establish different regression trees under different partitions, and then pick the one with the best empirical performance to be the child best-scored random tree of that cell. Accordingly, we name this selection strategy the “best-scored” method. Subsequently, the concatenation of the child best-scored random trees from all cells forms a parent best-scored random tree. By repeating the above construction procedure, we are able to establish a number of different parent best-scored random trees, whose ensemble is the two-stage best-scored random forest. The prominent strengths of our algorithm over other vertical methods can be demonstrated from the following perspectives:

(i) In most of the existing vertical methods, the feature space is artificially partitioned into non-overlapping cells and the original algorithm is then applied to each of these cells. In the original algorithm, the prediction at any point in the feature space is influenced by the information of all the sample points, whereas in the corresponding vertical methods, the prediction at any point may be affected only by the sample points in the cell it belongs to. This usually leads to an essential change of the algorithm structure: the global smoothness of the original method is jeopardized and only the smoothness within each cell can be guaranteed, often resulting in the boundary discontinuity problem. In contrast, this is never a problem for our two-stage best-scored random forest (TBRF) method, since random forest (RF) is intrinsically an ensemble method, which brings asymptotic smoothness. In our two-stage random forest method, we only divide the original splitting process of one tree into two stages for the sake of parallelism. This does not change the nature of TBRF as an RF method.

(ii) Owing to the two-stage structure of our proposed algorithm and the architecture of random forests, TBRF achieves satisfying performance in terms of computational efficiency and prediction accuracy, both of which are of great significance in the big data era. Specifically, the gain in computational efficiency is twofold. First, the algorithm can be significantly sped up by leveraging parallel computing in both stages. Since the parent trees in the forest require different adaptive random partitions of the feature space, which are conducted in stage one, we can assign each adaptive partition to a different core for acceleration. This is a direct advantage of the parallelism brought by the ensemble learning residing in the random forest. Moreover, the establishment of the child best-scored random trees, whose total number equals the total number of cells in all parent trees, can also be assigned to different cores, so that the computational burden is decentralized. Second, the adaptive random partition in stage one is completely data-driven, and this splitting mechanism makes the number of samples falling into each cell more evenly distributed. It therefore increases the number of effective splits and further reduces the training time under parallel computing. As for prediction accuracy, we manage to incorporate some existing mainstream regression algorithms as value assignment methods into our random forest architecture. Beyond assigning a constant to each terminal node of the trees, we employ a few alternatives, such as fitting linear regression functions for low-dimensional data and utilizing a Gaussian kernel for high-dimensional data, because these strategies perform differently on data of different dimensions. Numerical experiments further demonstrate the effectiveness of choosing appropriate assignment strategies for different data. Moreover, the asymptotic smoothness brought by ensemble learning and the availability of many tunable hyperparameters further contribute to the improvement of accuracy.

(iii) The satisfactory performance of the two-stage best-scored random forest is supported by compact theoretical analysis under the framework of regularized empirical risk minimization. To be specific, by decomposing the error term into data-free and data-dependent error terms which are dealt with by techniques from approximation theory and empirical process theory, respectively, we establish the almost optimal learning rates for both parent best-scored random trees and their ensemble forest under certain mild assumptions on the smoothness of the target functions.

The paper is organized as follows: Section 2 is dedicated to the explanation on the algorithm architecture. We present the main results and statements on the almost optimal learning rates in Section 3 with the corresponding error analysis lucidly demonstrated in Section 4. Architecture analysis and empirical assessments of comparisons between different vertical methods based on real data sets are provided in Sections 5 and 6. For the sake of clarity, all the proofs of Section 3 and Section 4 are presented in Section 7. Finally, we conclude this paper in Section 8.

2 Establishment of the Main Algorithm

In this section, we propose a new random forest method for regression which combines the advantages of vertical methods and ensemble learning. For a clear illustration, we break the algorithm down into four steps. First, we adopt an adaptive random partition method to split the feature space into several cells in stage one. Second, by building the best-scored random tree for regression on each cell in stage two and gathering these trees together, we obtain a parent random tree. Third, due to the intrinsic randomness of the partition method, we are able to establish a certain number of parent random trees under different partitions of the feature space. Last but not least, by combining these parent random trees to form an ensemble, we obtain the Two-stage Best-scored Random Forest.

2.1 Notations

The goal in a supervised learning problem is to predict the value of an unobserved output variable $Y$ after observing the value of an input variable $X$. To be exact, we need to derive a predictor $f$ which maps the observed input value of $X$ to a prediction $f(X)$ of the unobserved output value of $Y$. The choice of predictor should be based on the training data $D:=((X_{1},Y_{1}),\ldots,(X_{n},Y_{n}))$ of i.i.d. observations, which have the same distribution as the generic pair $(X,Y)$ and are drawn from an unknown probability measure $\mathrm{P}$ on $\mathcal{X}\times\mathcal{Y}$. We assume that $\mathcal{X}\subset\mathbb{R}^{d}$ is non-empty, $\mathcal{Y}:=[-M,M]$ for some $M>0$, and $\mathrm{P}_{X}$ is the marginal distribution of $X$.

According to the learning target, it is legitimate to consider the least squares loss $L=L_{LS}:\mathcal{X}\times\mathcal{Y}\to[0,\infty)$ defined by $L(Y,f(X)):=(Y-f(X))^{2}$. Then, for a measurable decision function $f:\mathcal{X}\to\mathcal{Y}$, the risk is defined by

$$\mathcal{R}_{L,\mathrm{P}}(f) := \int_{\mathcal{X}\times\mathcal{Y}} L(Y, f(X))\, d\mathrm{P}(X,Y),$$

and the empirical risk is defined by

$$\mathcal{R}_{L,\mathrm{D}}(f) := \frac{1}{n}\sum_{i=1}^{n} L(Y_{i}, f(X_{i})),$$

where $\mathrm{D}:=\frac{1}{n}\sum_{i=1}^{n}\delta_{(X_{i},Y_{i})}$ is the empirical measure associated to the data and $\delta_{(X_{i},Y_{i})}$ is the Dirac measure at $(X_{i},Y_{i})$. The Bayes risk, which is the minimal risk with respect to $\mathrm{P}$ and $L$, is given by

$$\mathcal{R}^{*}_{L,\mathrm{P}} := \inf\{\mathcal{R}_{L,\mathrm{P}}(f) \mid f:\mathcal{X}\to\mathcal{Y} \text{ measurable}\}.$$

In addition, a measurable function $f^{*}_{L,\mathrm{P}}:\mathcal{X}\to\mathcal{Y}$ with $\mathcal{R}_{L,\mathrm{P}}(f^{*}_{L,\mathrm{P}})=\mathcal{R}^{*}_{L,\mathrm{P}}$ is called a Bayes decision function. By minimizing the risk, the Bayes decision function is

$$f^{*}_{L,\mathrm{P}} = \mathbb{E}_{\mathrm{P}}(Y \mid X),$$

which is a $\mathrm{P}_{X}$-almost surely $[-M,M]$-valued function.

In order to achieve our two-stage random forest for regression, we first consider the development of the parent random forest under one specific feature space partition. Therefore, we assume that $(V_{j})_{j=1,\ldots,m}$ is a partition of $\mathcal{X}$ such that none of its cells is empty, that is, $V_{j}\neq\emptyset$ for every $j\in\{1,\ldots,m\}$. To present our approach rigorously, we need to introduce some more definitions and notations. First of all, the index set is defined as

$$I_{j} := \{i\in\{1,\ldots,n\} : x_{i}\in V_{j}\}, \qquad j=1,\ldots,m,$$

which indicates the samples of $D$ contained in $V_{j}$, and the corresponding data set is

$$D_{j} := \{(x_{i},y_{i})\in D : i\in I_{j}\}, \qquad j=1,\ldots,m.$$

Additionally, for every $j\in\{1,\ldots,m\}$, the loss $L_{j}:\mathcal{X}\times\mathcal{Y}\to[0,\infty)$ on the corresponding cell $V_{j}$ is defined by

$$L_{j}(Y, f(X)) := \mathbf{1}_{V_{j}}(X)\, L(Y, f(X)),$$

where $L(Y,f(X))$ is the least squares loss for our regression problem.
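
To make the notation concrete, the following minimal sketch computes the empirical risk, the index sets $I_{j}$, and the empirical risk of the cell-restricted loss $L_{j}$ on toy data. The function names and the encoding of cell membership as integer labels `cell_ids` are assumptions made for illustration; they do not come from the paper.

```python
import numpy as np

def empirical_risk(y, y_pred):
    """Least squares empirical risk: (1/n) * sum_i (Y_i - f(X_i))^2."""
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y - y_pred) ** 2))

def cell_index_sets(cell_ids, m):
    """I_j = {i : x_i in V_j}, with cells labelled 0, ..., m-1 (hypothetical encoding)."""
    cell_ids = np.asarray(cell_ids)
    return [np.flatnonzero(cell_ids == j) for j in range(m)]

def cell_risk(y, y_pred, cell_ids, j):
    """Empirical risk of L_j = 1_{V_j}(X) * L(Y, f(X)); normalised by n, not |I_j|."""
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    in_cell = np.asarray(cell_ids) == j
    return float(np.sum(in_cell * (y - y_pred) ** 2) / len(y))

# Toy data: four samples spread over two cells.
y = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 1.0, 3.0, 3.0]
cell_ids = [0, 0, 1, 1]
total = empirical_risk(y, y_pred)
per_cell = [cell_risk(y, y_pred, cell_ids, j) for j in range(2)]
```

Because the cells partition $\mathcal{X}$, the cell-wise risks in `per_cell` add up to the overall empirical risk `total`.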

2.2 Best-scored Random Trees

One crucial step of the two-stage best-scored random forest algorithm is building parent best-scored random trees under certain partitions of the feature space. Therefore, we first focus on the development of one parent tree, which is the sum of its child trees. An appropriate approach to splitting the feature space is indispensable for building a tree, so we introduce the random partition method used in our case.

2.2.1 Purely Random Partition

The purely random forest put forward by Breiman (2000) is an algorithm parallel to forests based on well-known splitting criteria such as information gain (Quinlan, 1986), information gain ratio (Quinlan, 1993) and Gini index (Breiman et al., 1984). Since it is widely acknowledged that forests built with the latter three criteria are not universally consistent, whereas consistency can be obtained with the purely random criterion, we base our forest on the purely random splitting criterion.

A clear illustration of the splitting mechanism at the $i$-th step of one possible random tree construction requires a random vector $Q_{i}:=(L_{i},R_{i},S_{i})$. The first term $L_{i}$ in the triplet denotes the leaf to be split at the $i$-th step, chosen from all the leaves present after the $(i-1)$-th step. The second term $R_{i}\in\{1,\ldots,d\}$ represents the dimension along which leaf $L_{i}$ is split. Moreover, $\{R_{i},i\in\mathbb{N}_{+}\}$ are i.i.d. multinomial random variables, with every dimension having equal probability of being chosen. The third term $S_{i}$ is a proportional factor: the ratio of the length in the $R_{i}$-th dimension of the newly generated leaf after the $i$-th split to the length in the $R_{i}$-th dimension of leaf $L_{i}$. In this manner, the length in the $R_{i}$-th dimension of the newly generated leaf is obtained by multiplying the length in the $R_{i}$-th dimension of leaf $L_{i}$ by the proportional factor $S_{i}$. We mention here that $\{S_{i},i\in\mathbb{N}_{+}\}$ are independent and identically distributed according to $\mathcal{U}(0,1)$.

To provide more insight into the above mathematical formulation of the splitting process of the purely random tree, we take the tree construction on $A=[0,1]^{d}$ as a simple example; the construction on $V_{j}$ is the same. One specific construction procedure is shown in Figure 1, where we take $d=2$. First of all, we pick a dimension out of the $d$ candidates at random, and then choose a cut-point uniformly at random along that dimension. The resulting split, a $(d-1)$-dimensional axis-parallel hyperplane, partitions $A$ into two leaves, say $A_{1,1}$ and $A_{1,2}$. Next, a leaf is chosen uniformly at random, e.g. $A_{1,1}$, and we again pick the dimension and the cut-point uniformly at random to implement the second split, which leads to a partition of $A$ into $A_{2,1},A_{2,2},A_{1,2}$. When conducting the third split, we again randomly select one leaf present after the last step, e.g. $A_{2,2}$, and split it as before. The resulting partition of $A$ then becomes $A_{2,1},A_{3,1},A_{3,2},A_{1,2}$. This recursive process continues until the number of splits $p$ reaches the desired value. The splitting procedure thus gives rise to a partition variable, namely $Z:=(Q_{1},\ldots,Q_{p},\ldots)$, which takes values in a space $\mathcal{Z}$. From now on, $\mathrm{P}_{Z}$ stands for the probability measure of $Z$.
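
The splitting procedure described above can be sketched in a few lines. This is a minimal illustration on the box $[0,1]^{d}$, assuming the leaf, the dimension, and the proportional factor are all drawn uniformly at random as in the example; the function name and the representation of a leaf as a pair of corner arrays are hypothetical choices, not the authors' implementation.

```python
import numpy as np

def purely_random_partition(p, d, rng, lower=0.0, upper=1.0):
    """Perform p purely random splits of the box [lower, upper]^d.

    Returns a list of p + 1 leaves, each stored as (lo, hi) corner arrays.
    """
    leaves = [(np.full(d, lower), np.full(d, upper))]
    for _ in range(p):
        idx = rng.integers(len(leaves))   # L_i: leaf to split, uniform over leaves
        dim = rng.integers(d)             # R_i: split dimension, uniform over 0..d-1
        s = rng.uniform(0.0, 1.0)         # S_i ~ U(0, 1): proportional factor
        lo, hi = leaves.pop(idx)
        cut = lo[dim] + s * (hi[dim] - lo[dim])
        left_hi = hi.copy()
        left_hi[dim] = cut                # new leaf has length s * (old length)
        right_lo = lo.copy()
        right_lo[dim] = cut
        leaves.append((lo, left_hi))
        leaves.append((right_lo, hi))
    return leaves

rng = np.random.default_rng(0)
leaves = purely_random_partition(p=5, d=2, rng=rng)
```

Each split replaces one leaf by two, so $p$ splits always yield $p+1$ leaves whose volumes sum to the volume of the original box.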

It is legitimate to assume that any specific partition variable $Z\in\mathcal{Z}$ can be recognized as a latent splitting criterion. To be specific, if we consider a $p$-split procedure carried out by following $Z$, then the collection of the resulting non-overlapping leaves can be denoted by $\mathcal{A}_{(Q_{1}\ldots Q_{p})}$, further abbreviated as $\mathcal{A}_{Z,p}$. Now, if we focus on the partition of a certain cell $V_{j}$, for example, then we have $\mathcal{A}_{Z,0}:=V_{j}$. Moreover, any point $X\in V_{j}$ falls into some leaf of the partition, which is denoted by $A_{Z,p}(X)$.

Here, we introduce a map $h_{Z,p}:V_{j}\to\mathcal{Y}$ defined by

$$h_{Z,p}(X) := \frac{\sum_{i=1}^{n} Y_{i}\mathbf{1}_{\{X_{i}\in A_{Z,p}(X)\}}}{\sum_{i=1}^{n}\mathbf{1}_{\{X_{i}\in A_{Z,p}(X)\}}}\,\mathbf{1}_{E_{Z,p}(X)}, \tag{1}$$

where the event set $E_{Z,p}(X)$ is defined by

$$E_{Z,p}(X) = \Bigl\{\sum_{i=1}^{n}\mathbf{1}_{\{X_{i}\in A_{Z,p}(X)\}} \neq 0\Bigr\}.$$

Formula (1) is called the random tree decision rule for regression on $V_{j}$.
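
As a rough illustration of this decision rule, the sketch below predicts the average response of the training samples that share a leaf with the query point, and returns 0 when that leaf is empty. It assumes leaves are axis-parallel boxes given as (lo, hi) corner arrays; the helper names are hypothetical, and points lying exactly on a shared boundary are simply assigned to the first matching leaf.

```python
import numpy as np

def in_box(points, lo, hi):
    """Boolean mask of the rows of `points` lying in the closed box [lo, hi]."""
    return np.all((points >= lo) & (points <= hi), axis=1)

def tree_predict(x, leaves, X_train, y_train):
    """Average of the responses in the leaf containing x; 0 if that leaf is empty."""
    x = np.asarray(x, dtype=float).reshape(1, -1)
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    for lo, hi in leaves:
        if in_box(x, lo, hi)[0]:
            mask = in_box(X_train, lo, hi)
            # The indicator of E_{Z,p}(x) switches the prediction to 0 on empty leaves.
            return float(y_train[mask].mean()) if mask.any() else 0.0
    raise ValueError("x lies outside the partitioned region")

# Two leaves splitting [0, 1]^2 at 0.5 along the first dimension.
leaves = [
    (np.array([0.0, 0.0]), np.array([0.5, 1.0])),
    (np.array([0.5, 0.0]), np.array([1.0, 1.0])),
]
X_train = np.array([[0.1, 0.2], [0.3, 0.9]])
y_train = np.array([1.0, 3.0])
left_value = tree_predict([0.2, 0.5], leaves, X_train, y_train)
right_value = tree_predict([0.8, 0.5], leaves, X_train, y_train)
```

Here both training points fall in the left leaf, so a query there receives their mean, while a query in the empty right leaf receives 0.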

2.2.2 Child Best-scored Random Tree

In this subsection, we consider the procedure for establishing a child best-scored random tree defined on the feature space $\mathcal{X}$. Specifically, the child random tree is originally developed on $V_{j}$ and then extended to $\mathcal{X}$. Given that the performance of a tree obtained from a single random partition may not be desirable, we improve on it by choosing the tree with the best performance out of $k_{j}$ candidates on $V_{j}$. The selected tree is called the child best-scored random tree. Therefore, when analyzing the behavior of the $k_{j}$ trees on $V_{j}$, we suppose that the splitting procedures they follow are represented by independent and identically distributed random variables $\{Z_{j1},\ldots,Z_{jk_{j}}\}$ taking values in $\mathcal{Z}$.

To make the theoretical analysis clearer, we first give the definitions of some function sets. We assume that $\mathcal{T}_{j}$ is a function set containing all the possible partitions of a random tree over $V_{j}$, which is defined as follows:

$$\mathcal{T}_{j} := \Bigl\{\sum_{i=0}^{p} c_{i}\mathbf{1}_{A_{i}} : p\in\mathbb{N},\ c_{i}\in[-M,M],\ \bigcup_{i=0}^{p}A_{i}=V_{j},\ A_{s}\cap A_{s'}=\emptyset,\ s\neq s'\Bigr\}.$$

Here, $p\in\mathbb{N}$ is the number of splits, and the resulting leaves $A_{0},A_{1},\ldots,A_{p}$ form a $p$-split partition of $V_{j}$. Note that $c_{i}$ is the value assigned to leaf $A_{i}$. Without loss of generality, in this paper we only consider cells of the shape $A_{i}=\bigtimes_{\ell=1}^{d}[a_{i\ell},b_{i\ell}]$. Moreover, for $s\in\{1,\ldots,k_{j}\}$, we derive the function set induced by the splitting policy $Z_{js}$ as

$$\mathcal{T}_{Z_{js},p} := \Bigl\{\sum_{i=0}^{p} c_{i}\mathbf{1}_{A_{i}} : c_{i}\in[-M,M],\ A_{i}\in\mathcal{A}_{Z_{js},p}\Bigr\},$$

where $\mathcal{A}_{Z_{js},p}$ represents the $p$-split partition of $V_{j}$ obtained by following the splitting policy $Z_{js}$. Note that $\mathcal{T}_{Z_{js},p}$ is a subset of $\mathcal{T}_{j}$.

However, we should notice that every function $h\in\mathcal{T}_{j}$ is only defined on $V_{j}$, while a random tree function from $\mathcal{X}$ to $\mathcal{Y}$ is ultimately needed. To this end, for every $h\in\mathcal{T}_{j}$, we define the zero-extension $g:\mathcal{X}\to\mathcal{Y}$ by

$$g(X) := \begin{cases} h(X), & X\in V_{j}, \\ 0, & X\notin V_{j}, \end{cases}$$

which is equipped with the same number of splits $p$ as the decision tree $h$. Then, the function set defined only on $V_{j}$ can also be extended to $\mathcal{X}$, that is,

$$\hat{\mathcal{T}}_{j} := \{g : h\in\mathcal{T}_{j}\}.$$

Moreover, the extension of the function set $\mathcal{T}_{Z_{js},p}$ can be obtained in the same manner, which is

$$\hat{\mathcal{T}}_{Z_{js},p} := \{g : h\in\mathcal{T}_{Z_{js},p}\}.$$

Furthermore, we denote $\hat{\mathcal{T}}_{Z_{js}}:=\cup_{p\in\mathbb{N}}\hat{\mathcal{T}}_{Z_{js},p}$.

In order to find an appropriate random tree decision rule under policy $Z_{js}$, denoted by $g_{L_{j},\mathrm{D},Z_{js}}$, we need to solve an optimization problem. To this end, we conduct our analysis under the framework of regularized empirical risk minimization, a learning method that prepares us for the more involved analysis of our specific random forest. Let $L:\mathcal{X}\times\mathcal{Y}\to[0,\infty)$ be a loss, $\mathcal{T}\subset\mathcal{L}_{0}(\mathcal{X})$ be a non-empty set, where $\mathcal{L}_{0}(\mathcal{X})$ is the set of measurable functions on $\mathcal{X}$, and $\Omega:\mathcal{T}\to[0,\infty)$ be a function. A learning method whose decision function $f_{D}$ satisfies

$$\mathcal{R}_{L,\mathrm{D}}(f_{D}) + \Omega(f_{D}) = \inf_{f\in\mathcal{T}} \mathcal{R}_{L,\mathrm{D}}(f) + \Omega(f)$$

for all $n\geq 1$ and $D\in(\mathcal{X}\times\mathcal{Y})^{n}$ is called regularized empirical risk minimization.

In this paper, we propose that the number of splits $p$ is what we should penalize on. By penalizing on $p$ , we are able to give some constraints on the complexity of the function set so that the set will have a finite VC dimension (Vapnik and Chervonenkis, 1971), and therefore make the algorithm PAC learnable (Valiant, 1984). Besides, it can also refrain the learning results from overfitting. With data set $D$ , the above regularized empirical risk minimization problem with respect to each function set $\hat{\mathcal{T}}_{Z_{js}}$ turns into

[TABLE]

It is worth mentioning that, since the exponent of $p$ does not influence the performance of the selection procedure, we penalize $p^{2}$ to obtain better convergence properties.

Observe that the regularized empirical risk under any policy can be bounded simply by considering the case where no split is applied to $V_{j}$ . Consequently, we have

[TABLE]

where $\mathcal{R}_{L_{j},\mathrm{D}}(0)$ stands for the empirical risk for taking $g(x)=0$ for all $x\in\mathcal{X}$ with $p=0$ . Therefore, from the above inequality, we obtain that the number of splits $p$ is upper bounded by $M\lambda_{j}^{-1/2}$ . Accordingly, the capacity of the underlying function set can be largely reduced, and from now on all function sets carry the additional condition $p\leq M\lambda_{j}^{-1/2}$ .
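The bound on $p$ can be spelled out in one chain. The following is a sketch under the assumptions stated later in Theorem 5, namely $\mathcal{Y}=[-M,M]$ and the least squares loss, so that the empirical risk of the zero function is at most $M^{2}$ . The symbols $g_{\mathrm{D}}$ and $p_{\mathrm{D}}$ , for a regularized empirical risk minimizer and its number of splits, are shorthand introduced here for illustration only.

$$\lambda_{j}p_{\mathrm{D}}^{2}\;\leq\;\lambda_{j}p_{\mathrm{D}}^{2}+\mathcal{R}_{L_{j},\mathrm{D}}(g_{\mathrm{D}})\;\leq\;\lambda_{j}\cdot 0^{2}+\mathcal{R}_{L_{j},\mathrm{D}}(0)\;\leq\;M^{2}\quad\Longrightarrow\quad p_{\mathrm{D}}\leq M\lambda_{j}^{-1/2}.$$

The first inequality uses that the risk is non-negative, and the second uses that the zero function with $p=0$ is an admissible competitor in the minimization.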

To establish the random tree decision rule for regression on $\mathcal{X}$ , we zero-extend (1) to the whole feature space. Clearly, our random tree decision rule on $\mathcal{X}$ induced by $V_{j}$ is the solution to the optimization problem (7), and we denote it by

[TABLE]

where $p_{Z_{js}}$ is the number of splits of the decision function $g_{Z_{js}}$ . Its population version is presented by

[TABLE]

It is necessary to note that our primary idea is to conduct the regularized empirical risk minimization problem using $\mathcal{T}_{Z_{js}}$ and $D_{j}$ , which is

[TABLE]

It can be observed that when we take $\tilde{\lambda}_{j}:=n\lambda_{j}/\lvert I_{j}\rvert$ , the solution of the optimization problem (9) coincides with (8) on $V_{j}$ . Since the following analysis will be carried out on $\mathcal{X}$ , we can directly optimize (8). Furthermore, it is easy to verify that if a Bayes decision function $f_{L,\mathrm{P}}^{*}$ w.r.t. $L$ and $\mathrm{P}$ exists, it is also a Bayes decision function w.r.t. $L_{j}$ and $\mathrm{P}$ .

Now, we focus on establishing the best-scored random tree on $\mathcal{X}$ induced by $V_{j}$ , also called the child best-scored random tree, which is chosen from $k_{j}$ candidates. The main principle is to retain only the tree yielding the minimal regularized empirical risk, which is

[TABLE]

where $p_{Z_{j}}$ is the number of splits of $g_{Z_{j}}$ and $Z_{j}=\{Z_{j1},\ldots,Z_{jk_{j}}\}$ . Apparently, $g_{Z_{j}}$ is the regularized empirical risk minimizer with respect to the random function set

[TABLE]

Put another way, $g_{Z_{j}}$ is the solution to the regularized empirical risk minimization problem

[TABLE]

Similarly, we denote by $g^{*}_{Z_{j}}$ the solution of the population version of the regularized minimization problem in the set $\hat{\mathcal{T}}_{Z_{j}}$

[TABLE]

We mention here that $p^{*}_{Z_{j}}$ is the corresponding number of splits of $g^{*}_{Z_{j}}$ .
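The best-scored selection described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation, and it makes several assumptions that are not in the text: the cell $V_{j}$ is taken to be the unit cube $[0,1]^{d}$ , each split picks a uniformly random cell, axis and cut point, leaf values are sample means, and the risk is the plain mean squared error on the cell's data. All function names are hypothetical.

```python
import random

def split_random_cell(cells, rng):
    """One purely random split: random cell, random axis, uniform cut point."""
    cell = cells.pop(rng.randrange(len(cells)))
    axis = rng.randrange(len(cell))
    lo, hi = cell[axis]
    cut = rng.uniform(lo, hi)
    left, right = list(cell), list(cell)
    left[axis], right[axis] = (lo, cut), (cut, hi)
    cells.extend([left, right])

def find_cell(cells, x):
    """Index of the first cell whose box contains x."""
    for idx, cell in enumerate(cells):
        if all(lo <= xi <= hi for xi, (lo, hi) in zip(x, cell)):
            return idx
    raise ValueError("x lies outside the unit cube")

def leaf_means(cells, X, y):
    """Assumed value assignment: mean response per cell (0.0 for empty cells)."""
    sums, counts = [0.0] * len(cells), [0] * len(cells)
    for x, yi in zip(X, y):
        idx = find_cell(cells, x)
        sums[idx] += yi
        counts[idx] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]

def regularized_risk(cells, means, X, y, lam, p):
    """Empirical squared-error risk plus the penalty lam * p^2."""
    mse = sum((yi - means[find_cell(cells, x)]) ** 2 for x, yi in zip(X, y)) / len(y)
    return mse + lam * p ** 2

def best_scored_tree(X, y, k, p_max, lam, seed=0):
    """Among k random policies and p = 0..p_max splits, keep the minimal regularized risk."""
    rng = random.Random(seed)
    d = len(X[0])
    best = None  # (risk, cells, leaf values, number of splits)
    for _ in range(k):                      # k candidate policies
        cells = [[(0.0, 1.0)] * d]
        for p in range(p_max + 1):          # evaluate every prefix of the policy
            if p > 0:
                split_random_cell(cells, rng)
            means = leaf_means(cells, X, y)
            risk = regularized_risk(cells, means, X, y, lam, p)
            if best is None or risk < best[0]:
                best = (risk, [list(c) for c in cells], means, p)
    return best
```

The inner loop corresponds to minimizing over $p$ for a fixed policy, and the outer loop to retaining only the best of the $k$ candidates.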

2.2.3 Parent Best-scored Random Tree

In this subsection, we first build the parent random tree by adding all the child ones. After that, in order to show that our parent random tree is indeed a solution of a usual random tree algorithm on the feature space, we need to consider the indicator function sets defined on $\mathcal{X}$ of a child random tree and direct sums of the indicator function sets of several trees.

First of all, adding all child best-scored random trees generated by (10) together leads to the parent best-scored random tree, which is defined by

$$g_{Z}:=\sum_{j=1}^{m}g_{Z_{j}},\tag{13}$$

where $Z:=\{Z_{1},\ldots,Z_{m}\}$ denotes the splitting criteria on $\{V_{j}\}_{j=1}^{m}$ .
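As a small illustration of how zero-extended child trees add up to a parent tree, consider the following sketch. The two cells and the constant child predictors are hypothetical, and the helper names are introduced here only for illustration.

```python
def zero_extend(h, in_cell):
    """Extend h, defined on one cell, by zero outside that cell."""
    return lambda x: h(x) if in_cell(x) else 0.0

def parent_tree(children):
    """Sum of zero-extended child trees; on disjoint cells at most one term is non-zero."""
    return lambda x: sum(g(x) for g in children)

# Hypothetical 1-D example with two cells V_1 = [0, 0.5) and V_2 = [0.5, 1].
g1 = zero_extend(lambda x: 2.0, lambda x: 0.0 <= x < 0.5)
g2 = zero_extend(lambda x: -1.0, lambda x: 0.5 <= x <= 1.0)
g = parent_tree([g1, g2])
```

Because the cells are disjoint, evaluating `g` at a point reproduces the value of the single child tree whose cell contains that point.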

Recall that we have mentioned the process of extending the indicator function set of a tree on $V\subsetneq\mathcal{X}$ to an indicator function set on $\mathcal{X}$ in (4) and (5). We now give a formal description in the following proposition.

Proposition 1

Let $V\subset\mathcal{X}$ and $\mathcal{T}_{V}$ be an indicator function space of the form (2) on $V$ . Denote by $g$ the zero-extension of $h\in\mathcal{T}_{V}$ to $\mathcal{X}$ defined by

$$g(x):=\begin{cases}h(x)&\text{if }x\in V,\\ 0&\text{if }x\in\mathcal{X}\setminus V.\end{cases}$$

Then, the set $\hat{\mathcal{T}}_{V}:=\{g:h\in\mathcal{T}_{V}\}$ is still an indicator function set on $\mathcal{X}$ . We define that the number of splits of the decision tree on $\hat{\mathcal{T}}_{V}$ is the same as the number of splits on $\mathcal{T}_{V}$ , which is

[TABLE]

Based on this proposition, we are now able to construct an indicator function set by a direct sum of indicator function sets $\hat{\mathcal{T}}_{V}$ and $\hat{\mathcal{T}}_{W}$ with $V,W\subset\mathcal{X}$ and $V\cap W=\emptyset$ .

Proposition 2

For $V,W\subset\mathcal{X}$ such that $V\cap W=\emptyset$ and $V\cup W\subset\mathcal{X}$ , let $\mathcal{T}_{V}$ and $\mathcal{T}_{W}$ be indicator function sets of the form (2) on $V$ and $W$ , respectively. Furthermore, let $\hat{\mathcal{T}}_{V}$ and $\hat{\mathcal{T}}_{W}$ be the indicator function sets of all functions of $\mathcal{T}_{V}$ and $\mathcal{T}_{W}$ extended to $\mathcal{X}$ in the sense of Proposition 1. Let $p_{\hat{\mathcal{T}}_{V}}$ and $p_{\hat{\mathcal{T}}_{W}}$ given by (14) be the associated numbers of splits. Then $\hat{\mathcal{T}}_{V}\cap\hat{\mathcal{T}}_{W}=\{0\}$ and hence the direct sum

$$\mathcal{T}:=\hat{\mathcal{T}}_{V}\oplus\hat{\mathcal{T}}_{W}$$

exists. The direct sum $\mathcal{T}$ is also an indicator function set of random trees. For $\lambda_{V},\lambda_{W}>0$ and $g\in\mathcal{T}$ , let $g_{V}\in\hat{\mathcal{T}}_{V}$ and $g_{W}\in\hat{\mathcal{T}}_{W}$ be the unique functions such that $g=g_{V}+g_{W}$ . Then, we define the number of splits on the direct sum space by

[TABLE]

To relate Proposition 1 and Proposition 2 to (13), we need some more notation. For pairwise disjoint $V_{1},\ldots,V_{m}\subset\mathcal{X}$ with $\cup_{j=1}^{m}V_{j}=\mathcal{X}$ , let $\hat{\mathcal{T}}_{Z_{j}}$ be the best-scored function space (11) induced by $V_{j}$ for every $j\in\{1,\ldots,m\}$ based on Proposition 1. A joined indicator function space of $\hat{\mathcal{T}}_{Z_{1}},\ldots,\hat{\mathcal{T}}_{Z_{m}}$ can therefore be designed analogously to Proposition 2. Specifically, for an arbitrary index set $J\subset\{1,\ldots,m\}$ and a vector $\boldsymbol{\lambda}=(\lambda_{j})_{j\in J}\in(0,\infty)^{\lvert J\rvert}$ , the direct sum

$$\mathcal{T}_{Z_{J}}:=\bigoplus_{j\in J}\hat{\mathcal{T}}_{Z_{j}},$$

where $Z_{J}:=\{Z_{j}:j\in J\}$ , is still an indicator function space of random trees with squared number of splits

[TABLE]

If $J=\{1,\ldots,m\}$ , we simply write $\mathcal{T}_{Z}:=\mathcal{T}_{Z_{J}}$ . Note that $\mathcal{T}_{Z}$ contains, inter alia, $g_{Z}$ given by (13).

Here, we briefly investigate the regularized empirical risk of $g_{Z}=\sum_{j=1}^{m}g_{Z_{j}}$ . For arbitrary $g\in\mathcal{T}_{Z}$ , we have

[TABLE]

The first equality follows from $\mathcal{R}_{L,\mathrm{D}}(g)=\sum_{j=1}^{m}\mathcal{R}_{L_{j},\mathrm{D}}(g)$ (Meister and Steinwart, 2016). The second equality holds because the risk of $g_{Z}$ on $V_{j}$ equals that of $g_{Z_{j}}$ . The inequality is a direct result of (10), where the number of splits $p(g)$ for arbitrary $g\in\mathcal{T}_{Z}$ according to Proposition 2 is defined by $p^{2}(g):=\sum_{j=1}^{m}\lambda_{j}p^{2}(\boldsymbol{1}_{V_{j}}g)$ , and $p(\boldsymbol{1}_{V_{j}}g)$ is the corresponding number of splits of $g$ on $V_{j}$ . The last two equalities hold in the same way as the first two.

Judging from (16), $g_{Z}$ is the random tree function with respect to $\mathcal{T}_{Z}$ and $L$ with regularization parameter $\lambda=1$ . In other words, the best-scored random tree derived from $\operatornamewithlimits{arg\,min}_{g\in\mathcal{T}_{Z}}\mathcal{R}_{L,\mathrm{D}}(g)+p^{2}(g)$ equals our parent best-scored random tree (13).

For the sake of clarity, we summarize some assumptions for the joined best-scored function sets as follows:

Assumption 3 (Joined best-scored decision tree spaces)

For pairwise disjoint subsets $V_{1},\ldots,V_{m}$ of $\mathcal{X}$ , let $\hat{\mathcal{T}}_{Z_{j}}$ be the best-scored random tree function sets induced by $V_{j}$ . Consequently, for $\boldsymbol{\lambda}:=(\lambda_{1},\ldots,\lambda_{m})\in(0,\infty)^{m}$ , we define the joined best-scored function space $\mathcal{T}_{Z}:=\bigoplus_{j=1}^{m}\hat{\mathcal{T}}_{Z_{j}}$ and equip it with the number of splits (15).

2.3 Two-stage Best-scored Random Forest

Having developed the parent random tree under one specific partition of the feature space, it is natural to ask whether we can devise an ensemble of trees by injecting randomness into the feature partition in stage one. To this end, we propose a data splitting approach called the adaptive random partition and establish the Two-stage Best-scored Random Forest by ensemble learning.

2.3.1 Adaptive Random Partition of the Feature Space

To describe the above two-stage random forest algorithm, $(V_{j})_{j=1,\ldots,m}$ only has to be some partition of $\mathcal{X}$ . Nevertheless, for the theoretical investigation of the learning rates of our new algorithm, we need to further specify the partition. For this purpose, we define a series of balls $B_{1},\ldots,B_{m}$ with radii $r_{j}>0,j=1,\ldots,m$ and mutually distinct centers $z_{1},\ldots,z_{m}\in\mathcal{X}$ by

$$B_{j}:=B_{r_{j}}(z_{j}):=\{x\in\mathbb{R}^{d}:\|x-z_{j}\|_{2}\leq r_{j}\},\qquad j=1,\ldots,m,$$

where $\|\cdot\|_{2}$ is the Euclidean norm in $\mathbb{R}^{d}$ . Furthermore, we can choose $r_{1},\ldots,r_{m}$ and $z_{1},\ldots,z_{m}$ such that $\mathcal{X}\subset\bigcup_{j=1}^{m}B_{j}$ .

Considering how large the sample size may be and how the sample density may vary over the feature space $\mathcal{X}$ , we propose an adaptive random partition approach. This method serves as a preprocessing step that partitions the feature space into cells containing fewer data, which facilitates the subsequent regression on the cells. Moreover, the randomness residing in the partition paves the way for ensemble learning. A considerable advantage of this proposal over the purely random partition is that it takes the sample information into account. To be precise, since the construction of the purely random partition is independent of the data set, it may over-split sample-sparse areas and under-split sample-dense areas. In contrast, the adaptive random partition uses sample information in a relatively simple way and still fulfills the objective of dividing the space into small cells. The specific partition procedure is similar to the process proposed in Section 2.2.1, the difference being how the to-be-split cell is chosen.

In the purely random partition, $L_{i}$ in the random vector $Q_{i}:=(L_{i},R_{i},S_{i})$ denotes the randomly chosen cell to be split at the $i$ -th step of tree construction. Here, when choosing a to-be-split cell, we first randomly select $t$ sample points from the training data set and label each of them by the cell it belongs to. We then choose as $L_{i}$ the cell that wins the majority vote among the $t$ sample labels. The idea is that, when sample points are picked at random from the whole training data set, cells with more samples are more likely to be selected, while cells with fewer samples are less likely to be chosen. In this manner, we may obtain feature space partitions where the sample sizes of the resulting cells are more evenly distributed.
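The cell selection rule just described can be sketched as follows. This is an illustrative sketch under assumptions not fixed by the text: the $t$ points are drawn without replacement, ties in the vote are broken in favor of the label drawn first, and the function name and the list `cell_of_sample` (the current cell label of every training point) are hypothetical.

```python
import random
from collections import Counter

def choose_cell_to_split(cell_of_sample, t, rng):
    """Return the cell label that wins a majority vote among t random samples."""
    # Assumption: sampling without replacement; the text does not specify this.
    drawn = rng.sample(range(len(cell_of_sample)), t)
    votes = Counter(cell_of_sample[i] for i in drawn)
    # most_common is stable, so ties go to the label that was drawn first.
    return votes.most_common(1)[0][0]
```

For instance, if 95 of 100 training points lie in cell 0 and 5 in cell 1, a vote among $t=11$ points always selects cell 0, since at most 5 of the drawn points can carry the label 1.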

2.3.2 Ensemble Forest

We now construct the two-stage best-scored random forest based on the average of $T$ parent best-scored random trees. Due to the intrinsic randomness residing in the partition method, we are able to construct several different parent best-scored random trees under different partitions of the feature space. To be specific, each of these trees is generated according to the procedure in (13) under a different input partition $V^{t}:=\{V_{j}^{t}\}_{j=1}^{m}$ , $t=1,\ldots,T$ . The splitting criterion for each tree in the forest is denoted by $Z_{t}=\{Z_{1t},\ldots,Z_{mt}\}$ , $t=1,\ldots,T$ , where $Z_{jt}$ is the splitting criterion corresponding to the child best-scored random tree for the $t$ -th tree on its $V_{j}^{t}$ . Moreover, we denote the parent best-scored trees in the forest by $g_{Z_{t}},\ 1\leq t\leq T$ . As usual, we average to obtain the two-stage best-scored random forest decision rule

$$f_{Z}:=\frac{1}{T}\sum_{t=1}^{T}g_{Z_{t}},\tag{17}$$

where $Z=\{Z_{1},\ldots,Z_{T}\}$ denotes the collection of all splitting criteria of trees in the forest. Finally, we establish our large-scale regression predictor, the two-stage best-scored random forest $f_{Z}$ .
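The averaging step can be illustrated with a toy example. The sketch below assumes each parent tree is available as a callable mapping a point to a prediction; the three step functions with different jump locations are hypothetical stand-ins for trees built on different random partitions.

```python
def forest_predict(trees, x):
    """Average the predictions of T parent trees, each a callable x -> float."""
    return sum(tree(x) for tree in trees) / len(trees)

# Hypothetical trees: step functions whose single boundary differs from tree to tree.
trees = [lambda x, c=c: 1.0 if x >= c else 0.0 for c in (0.4, 0.5, 0.6)]
```

Each individual tree jumps by 1 at its own boundary, whereas the averaged predictor rises in three steps of size 1/3, which gives a first impression of how averaging over random partitions reduces the jump at any single boundary.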

3 Main Results and Statements

In this section, we present main results on the oracle inequalities and learning rates for the random trees and forests.

3.1 Fundamental Assumption

In this paper, we are interested in the ground-truth functions that satisfy the following restrictions on their smoothness:

Assumption 4

The Bayes decision function $f_{L,\mathrm{P}}^{*}:\mathcal{X}\to\mathcal{Y}$ is $\alpha$ -Hölder continuous with respect to $L_{1}$ -norm $\|x\|_{1}:=\sum_{i=1}^{d}|x_{i}|$ . That is, there exists a constant $c_{\alpha}>0$ such that

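$$\bigl|f_{L,\mathrm{P}}^{*}(x)-f_{L,\mathrm{P}}^{*}(x^{\prime})\bigr|\leq c_{\alpha}\|x-x^{\prime}\|_{1}^{\alpha},\qquad x,x^{\prime}\in\mathcal{X}.$$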

3.2 Oracle Inequality for Parent Best-scored Random Trees

We now establish an oracle inequality for parent best-scored random trees based on the least squares loss and best-scored function space.

Theorem 5

Let $\mathcal{Y}:=[-M,M]$ for $M>0$ , $L:\mathcal{X}\times\mathcal{Y}\to[0,\infty)$ be the least squares loss, $\mathrm{P}_{X\times Y}:=\mathrm{P}$ be the probability measure on $\mathcal{X}\times\mathcal{Y}$ and $\mathrm{P}_{Z}$ be the probability measure induced by the splitting criterion $Z$ . Then for all $\tau>0$ , $\lambda:=(\lambda_{1},\ldots,\lambda_{m})>0$ and $\delta\in(0,1)$ , the parent best-scored random tree (13) satisfies

[TABLE]

with probability $\mathrm{P}_{(X\times Y)|Z}$ at least $1-3e^{-\tau}$ , where $c_{d\delta M}$ is a constant depending on $d$ , $\delta$ and $M$ . The result holds for every splitting criterion $Z$ of the parent best-scored random tree.

3.3 Learning Rates for Parent Best-scored Random Trees

We now state our main result on the learning rates for parent best-scored random trees based on the established oracle inequality.

Theorem 6

Let $L:\mathcal{X}\times\mathcal{Y}\to[0,\infty)$ be the least squares loss, $\mathrm{P}_{X\times Y}:=\mathrm{P}$ be the probability measure on $\mathcal{X}\times\mathcal{Y}$ and $\mathrm{P}_{Z}$ be the probability measure induced by the splitting criterion $Z$ . Let $\{V_{j}\subset B_{j}\}_{j=1}^{m}$ be a partition of $\mathcal{X}$ and $k_{j}$ be the number of candidate trees on $V_{j}$ . Suppose that the Bayes decision function $f_{L,\mathrm{P}}^{*}:\mathcal{X}\to\mathcal{Y}$ satisfies Assumption 4 with exponent $\alpha$ . Then for all $\tau>0$ and $\delta\in(0,1)$ , with probability $\mathrm{P}_{(X\times Y)\otimes Z}$ at least $1-4e^{-\tau}$ , there holds for the parent best-scored random tree (13) that

[TABLE]

where $c_{T}=0.22$ and $C$ is a constant depending on $\alpha,\tau,\delta,d,m,M$ and $\{r_{j},k_{j},\mathrm{P}_{X}(V_{j})\}_{j=1}^{m}$ .

3.4 Learning Rates for Two-stage Best-scored Random Forest

We now present the main result on the learning rates for the two-stage best-scored random forest in (17). This diverse and accurate ensemble forest is based on the collection of parent best-scored random trees generated by different feature space partitions.

Theorem 7

Let $L:\mathcal{X}\times\mathcal{Y}\to[0,\infty)$ be the least squares loss, $\mathrm{P}_{X\times Y}:=\mathrm{P}$ be the probability measure on $\mathcal{X}\times\mathcal{Y}$ and $\mathrm{P}_{Z}$ be the probability measure induced by the splitting criterion $Z$ . Let the collection of $T$ different partitions that generate the ensemble be $V:=\{V^{t}\}_{t=1}^{T}:=\{\{V_{j}^{t}\}_{j=1}^{m}\}_{t=1}^{T}$ and $k_{j}^{t}$ be the number of candidate trees on $V_{j}^{t}$ . Suppose that the Bayes decision function $f_{L,\mathrm{P}}^{*}:\mathcal{X}\to\mathcal{Y}$ satisfies Assumption 4 with exponent $\alpha$ . Then, for all $\tau>0$ and $\delta\in(0,1)$ , with probability $\mathrm{P}_{(X\times Y)\otimes Z}$ at least $1-4e^{-\tau}$ , there holds

[TABLE]

where $C$ is a constant depending on $\alpha,\tau,\delta,d,m,M,T,\{\{r_{j}^{t},k_{j}^{t},\mathrm{P}_{X}(V_{j}^{t})\}_{j=1}^{m}\}_{t=1}^{T}$ and $c_{T}=0.22$ .

According to the proof of Theorem 7, the coefficient $C$ may decrease as the number of trees $T$ in the forest increases. In other words, in theory, more trees may lead to a smoother forest predictor and therefore to better learning rates. This phenomenon is also supported by the experimental results shown later in Figure 4, where the predictor becomes smoother and fits better as $T$ increases.

3.5 Comments and Discussions

In this subsection, we present some comments and discussions on the obtained theoretical results on the oracle inequality, learning rates for the parent random trees and then for the two-stage best-scored random forest.

We highlight that our two-stage best-scored random forest algorithm aims at regression problems with an enormous amount of data. To begin with, in the literature, vertical methods for large-scale regression problems have gained popularity owing to their capability of parallel computing. In this paper, we adopt a decision-tree-like feature space splitting criterion, the adaptive random partition, to define the partition in stage one. The subsequent partitions, which build random trees on the cells resulting from stage one, are called the partitions in stage two, and they follow a purely random splitting criterion. In the literature, classical splitting criteria such as information gain, information gain ratio and Gini index have been scrutinized mostly from the perspective of experimental performance, while only a few works address theoretical learning rates, such as Biau (2012) and Scornet et al. (2015). However, the conditions under which their learning rates are derived are too strong to be verified in practice. Compared to these classical splitting criteria, our purely random splitting criterion achieves satisfactory learning rates under assumptions on the smoothness of the Bayes decision function only.

Second, we propose a novel idea in our model selection process, which we call the best-scored method. Choosing the random tree with the best regression performance out of several candidates helps to improve the accuracy of the base predictors. For a certain order of the number of splits $p$ , when the number of candidates $k$ is large enough, the function space generated by those trees will also be large enough to cover sufficiently many possible partition results. Consequently, with high probability we choose a random tree with good performance, which leads to a remarkably small approximation error.

Third, the learning rate of one parent best-scored random tree is $O(n^{-c_{T}\alpha/(c_{T}\alpha(1+\delta)+2d)})$ and the learning rate of the two-stage best-scored random forest is of the same order. Here, we should notice that due to the intrinsic randomness of our splitting criterion, for a $p$ -split random tree, the effective number of splits for each dimension is approximately $c_{T}\log p$ rather than $\log p$ , where we take $c_{T}=0.22$ . Moreover, since $\delta$ is concerned with the capacity of the partition function space and our function space is not that large, we can take $\delta$ as small as possible, even close to $0$ .
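To get a feeling for the size of the exponent in this rate, one can simply evaluate it. The helper below is a hypothetical convenience function that computes $c_{T}\alpha/(c_{T}\alpha(1+\delta)+2d)$ with the value $c_{T}=0.22$ stated above; the parameter choices in the usage note are arbitrary examples.

```python
def rate_exponent(alpha, d, delta, c_T=0.22):
    """Exponent r in the rate O(n^{-r}), r = c_T*alpha / (c_T*alpha*(1+delta) + 2*d)."""
    return c_T * alpha / (c_T * alpha * (1 + delta) + 2 * d)
```

For example, `rate_exponent(1.0, 1, 0.0)` equals $0.22/2.22\approx 0.099$ , and the exponent shrinks as the dimension $d$ grows, as expected from the $2d$ term in the denominator.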

In the machine learning literature, various vertical and horizontal regression methods have been studied extensively. For example, a vertical-like method mixing $k$ -NN and SVM for regression is theoretically scrutinized by Hable (2013). In his paper, for every testing point, the SVM is applied to the $k$ nearest neighbors instead of to the whole training data. Moreover, universal risk-consistency is established. Meister and Steinwart (2016) derive the learning rate $O(n^{-2\alpha/(2\alpha+d)+\xi})$ of the localized SVM when the Bayes decision function is in a Besov-like space with $\alpha$ -degrees of smoothness. As for large-scale regression with horizontal methods, Zhang et al. (2015) propose a divide and conquer kernel ridge regression and provide learning rates with respect to different kernels. With the Bayes decision function in the corresponding reproducing kernel Hilbert space (RKHS), they obtain a learning rate of $O(r/n)$ for kernels with finite rank $r$ and a learning rate of $O(n^{-2\nu/(2\nu+1)})$ for kernels in a Sobolev space with $\nu$ -degrees of smoothness. Both of these rates are minimax-optimal. Lin et al. (2017) study distributed learning with the least squares regularization scheme in an RKHS and obtain almost optimal learning rates in expectation, which are $O(n^{-2\alpha r/(4\alpha r+1)})$ . The learning rates are established under a smoothness assumption with respect to the $r$ -th power of the integral operator $L_{k}$ and an $\alpha$ -related capacity assumption. Guo et al. (2017) focus on distributed regression with a bias corrected regularization kernel network and also obtain a learning rate of order $O(n^{-2r/(2r+\beta)})$ , where $\beta$ is the capacity related parameter. In light of the above, the work presented in our study is not only innovative but also has complete theoretical support.

4 Error Analysis

In this section, we give error analysis by bounding the approximation error term and the sample error term, respectively.

4.1 Bounding the Approximation Error Term

Denote the population version of the parent best-scored random tree as

[TABLE]

with $g^{*}_{Z_{j}}$ as in (12). The following theoretical result on bounding the approximation error term shows that, under smoothness assumptions for the Bayes decision function, the regularized approximation error possesses a polynomial decay with respect to each regularization parameter $\lambda_{j}$ .

Proposition 8

Let $L:\mathcal{X}\times\mathcal{Y}\to[0,\infty)$ be the least squares loss, $\mathrm{P}_{X\times Y}:=\mathrm{P}$ be the probability measure on $\mathcal{X}\times\mathcal{Y}$ with marginal distribution $\mathrm{P}_{X}$ , and $\mathrm{P}_{Z}$ be the probability measure induced by the splitting criterion $Z$ . Assume that $\{V_{j}\subset B_{j}\}_{j=1}^{m}$ is a partition of $\mathcal{X}$ and $k_{j}$ is the number of candidate trees on each $V_{j}$ . Suppose that the Bayes decision function $f_{L,\mathrm{P}}^{*}:\mathcal{X}\to\mathcal{Y}$ satisfies Assumption 4 with exponent $\alpha$ . Then, for any fixed $\tau>0$ and $\lambda:=(\lambda_{1},\ldots,\lambda_{m})>0$ , with probability $\mathrm{P}_{Z}$ at least $1-e^{-\tau}$ , there holds that

[TABLE]

where $c_{\alpha d}$ is a constant depending on $\alpha$ and $d$ , $c_{T}=0.22$ and $K$ is a universal constant.

4.2 Bounding the Sample Error Term

To establish the bounds on the sample error, we give four descriptions of the capacity of the function set in Definition 9, Definition 10, Definition 11 and Definition 12. Then, we analyze the complexity of the regression function set so as to derive the sample error bounds. More specifically, the complexity of the random forest function set comes from two sources: one induced by the feature space partition and the other induced by the value assignment.

Firstly, we consider the complexity induced by partition. In that case, we might scrutinize the situation where there is a binary value assignment, i.e. $\{-1,1\}$ . Particularly, we need to focus on its VC dimension (Lemma 13), covering numbers (Lemma 14) and entropy numbers (Lemma 15). Secondly, there exists a relationship in terms of empirical Rademacher average between the binary value assignment induced complexity and the continuous value assignment induced complexity. Therefore, we are able to derive the empirical Rademacher average for regression in Lemma 16.

Definition 9 (VC dimension)

Let $\mathcal{B}$ be a class of subsets of $\mathcal{X}$ and $A\subset\mathcal{X}$ be a finite set. The trace of $\mathcal{B}$ on $A$ is defined by $\{B\cap A:B\in\mathcal{B}\}$ . Its cardinality is denoted by $\Delta^{\mathcal{B}}(A)$ . We say that $\mathcal{B}$ shatters $A$ if $\Delta^{\mathcal{B}}(A)=2^{\#(A)}$ , that is, if for every $\tilde{A}\subset A$ , there exists a $B\in\mathcal{B}$ such that $\tilde{A}=B\cap A$ . For $k\in\mathbb{N}$ , let

$$m^{\mathcal{B}}(k):=\sup_{A\subset\mathcal{X},\,\#(A)=k}\Delta^{\mathcal{B}}(A).$$

Then, the set $\mathcal{B}$ is a Vapnik-Chervonenkis class if there exists $k<\infty$ such that $m^{\mathcal{B}}(k)<2^{k}$ . The minimal such $k$ is called the VC dimension of $\mathcal{B}$ and abbreviated as $\mathrm{VC}(\mathcal{B})$ .

Definition 10 (Covering Numbers)

Let $(X,d)$ be a metric space, $A\subset X$ and $\varepsilon>0$ . We call $\tilde{A}\subset A$ an $\varepsilon$ -net of $A$ if for all $x\in A$ there exists an $\tilde{x}\in\tilde{A}$ such that $d(x,\tilde{x})\leq\varepsilon$ . Moreover, the $\varepsilon$ -covering number of $A$ is defined as

[TABLE]

where $B_{d}(x,\varepsilon)$ denotes the closed ball in $X$ centered at $x$ with radius $\varepsilon$ .

Definition 11 (Entropy Numbers)

Let $(X,d)$ be a metric space, $A\subset X$ and $n\geq 1$ be an integer. The $n$ -th entropy number of $(A,d)$ is defined as

[TABLE]

Definition 12 (Empirical Rademacher Average)

Let $\{\varepsilon_{i}\}_{i=1}^{n}$ be a Rademacher sequence with respect to some distribution $\nu$ , that is, a sequence of i.i.d. random variables such that $\nu(\varepsilon_{i}=1)=\nu(\varepsilon_{i}=-1)=1/2$ . The $n$ -th empirical Rademacher average of $\mathcal{F}$ is defined as

[TABLE]

Here, we first analyze the complexity of the function set in the binary value assignment case. For fixed $p\in\mathbb{N}$ , we denote the collection of trees with number of splits $p$ as

[TABLE]

where we emphasize that all trees in (18) must follow our specific construction procedure described in Sections 2.2.1 and 2.3.1. It can be verified that the nested relation $\tilde{\mathcal{T}}_{q}\subset\tilde{\mathcal{T}}_{p}$ for $q\leq p$ holds. In the following analysis, for convenience, we reformulate the definition of $\tilde{\mathcal{T}}_{p}$ . Let $p\in\mathbb{N}$ be fixed. Let $\pi$ be a partition of $\mathcal{X}$ with number of splits $p$ and $\pi_{p}$ denote the family of all such partitions $\pi$ . Moreover, we define

[TABLE]

Then, for all $g\in\tilde{\mathcal{T}}_{p}$ , there exists some $B\in\mathcal{B}_{p}$ such that $g$ can be written as $g=\boldsymbol{1}_{B}-\boldsymbol{1}_{B^{c}}$ . Therefore, $\tilde{\mathcal{T}}_{p}$ can be equivalently defined as

[TABLE]

Now, we are able to give the VC dimension of $\mathcal{B}_{p}$ as follows:

Lemma 13

The VC dimension of $\mathcal{B}_{p}$ can be upper bounded by $dp+2$ .

After establishing the bound of the VC dimension, we can then give the covering numbers of (19) and (20). Let $\mathcal{B}$ be a class of subsets of $\mathcal{X}$ , denote $\boldsymbol{1}_{\mathcal{B}}$ as the collection of the indicator functions of all $B\in\mathcal{B}$ , that is, $\boldsymbol{1}_{\mathcal{B}}:=\{\boldsymbol{1}_{B}:B\in\mathcal{B}\}$ . Moreover, as usual, for any probability measure $Q$ , $L_{2}(Q)$ is denoted as the $L_{2}$ space with respect to $Q$ equipped with the norm $\|\cdot\|_{L_{2}(Q)}$ .

Lemma 14

Let $\mathcal{B}_{p}$ and $\tilde{\mathcal{T}}_{p}$ be defined as in (19) and (20) respectively. Then, for all $0<\varepsilon<1$ , there exists a universal constant $K$ such that

[TABLE]

and

[TABLE]

hold for any probability measure $Q$ .

For the parent best-scored random tree function set $\tilde{\mathcal{T}}_{Z}$ where the partition follows the same splitting criterion as in Assumption 3 but in the binary value case, we denote

[TABLE]

We emphasize at this point that the functions in $\tilde{\mathcal{T}}_{Z}$ take values in $\{-1,1\}$ . For $\tilde{r}>r^{*}$ ,

[TABLE]

where $L\circ g$ denotes the least squares loss of $g$ . Moreover, we define $r^{*}$ , $\mathcal{T}_{r}$ and $\mathcal{H}_{r}$ similarly for the function set $\mathcal{T}_{Z}$ , respectively.

Lemma 15

Let $\tilde{\mathcal{H}}_{r}$ be defined as in (25). Then, for all $\delta\in(0,1)$ , the $i$ -th entropy number of $\tilde{\mathcal{H}}_{r}$ satisfies

[TABLE]

According to the above Lemma 15, we are able to deduce the Rademacher average of the binary value case and then the Rademacher average of our regression case.

Lemma 16

Let $\tilde{\mathcal{H}}_{r}$ be the function set defined as in (25). For any $\delta\in(0,1)$ , there holds that

[TABLE]

where $c_{\delta,1}$ and $c_{\delta,2}$ are constants depending only on $\delta$ that will be stated in the proof. And

[TABLE]

5 Architecture Analysis

In this section, we first summarize the two-stage best-scored random forest algorithm in Section 5.1, and then explain in Section 5.2 why our strategy is a subtle match for large-scale regression. In Section 5.3, we emphasize that our two-stage forest algorithm is an inclusive framework in which different mainstream regression approaches can be incorporated as value assignment strategies for the leaves of trees. To further illustrate this inclusive framework, analyses of certain real data sets are provided as examples in Section 5.4.

5.1 Algorithm Construction

The construction of the two-stage best-scored random forest (TBRF) predictors demonstrated in Section 2 can be summarized by pseudocode in Algorithm 1.

It is worth noting that in the experiments, the partition of the feature space at stage one and the subsequent partitions on cells at stage two all follow the adaptive random partition described in Section 2.3.1. In this way, by increasing the effective number of splits, the empirical performance can be further improved. Moreover, for computational convenience, we let the number of candidates be equal in each cell, that is, $k^{t}_{1}=k^{t}_{2}=\ldots=k^{t}_{m}=k$ , $t=1,\ldots,T$ .

To assess the empirical performance in terms of prediction accuracy, an appropriate measure is needed. In this paper, we adopt the ubiquitous Mean Squared Error (MSE) throughout all experiments:

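$$\mathrm{MSE}(\hat{f}):=\frac{1}{n_{\text{test}}}\sum_{i=1}^{n_{\text{test}}}\bigl(Y_{i}-\hat{f}(X_{i})\bigr)^{2},$$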

where $\hat{f}$ represents the predictor and $\{(X_{i},Y_{i})\}_{i=1}^{n_{\text{test}}}$ are the test samples.
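The test MSE is straightforward to compute. The following sketch assumes the predictor is a callable and the test samples are given as two equally long sequences; the function name is introduced here for illustration.

```python
def mean_squared_error(predictor, X_test, y_test):
    """Average squared difference between test responses and predictions."""
    return sum((y - predictor(x)) ** 2 for x, y in zip(X_test, y_test)) / len(y_test)
```

For instance, a predictor that always returns 0 evaluated on responses 1, 2 and 3 has MSE $(1+4+9)/3$ .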

5.2 A Subtle Match for Large-scale Regression

This subsection explains why, compared to some published vertical methods, the two-stage best-scored random forest is a subtle match for large-scale regression problems. The essence of our algorithm is that, by separating the splitting process of the feature space into two stages, TBRF significantly speeds up the computation via parallel computing without changing the algorithmic structure of the trees in the random forest. Moreover, since the randomness residing in the splitting criterion makes ensemble learning available, the boundary discontinuities that have long plagued other vertical strategies are naturally resolved. We mention here that partitions in all stages of TBRF in the experiments follow the adaptive random splitting criterion to increase the effective number of splits.

• PK: Patchwork kriging (PK), proposed by Park and Apley (2018), is an approach to Gaussian process (GP) regression for large datasets in which the well-known discontinuity problem is mitigated by adding pseudo-observations to the boundaries. Previous Gaussian process vertical methods put forward in Park et al. (2011) and Park and Huang (2016) have tried to join up the boundaries of adjacent local GP models, and PK is the most recent method of this kind. Although a spatial tree is employed in PK to generate data partitions of uniform sizes when data is unevenly distributed, the boundaries are still artificially selected, just as in the mesh partitioning proposed in previous papers. Moreover, although efforts have been made to overcome the boundary discontinuities, this method in turn faces other challenges. For example, for data with high dimension and large volume, more pseudo-observations need to be added to the boundaries in order to achieve better prediction accuracy, which leads to a significant growth in computational complexity. In addition, the discontinuities sometimes still exist, see Figure 2. Last but not least, PK has not been implemented with parallel computing and therefore takes a long time to produce accurate results.

•

VP-SVM: Support vector machines for regression, being a global algorithm, are impeded in large-scale applications by computational requirements that are super-linear in the number of training samples. To address this, Meister and Steinwart (2016) employ a spatially oriented method to generate the chunks in feature space, and fit LS-SVMs for each local region using the training data belonging to that region. This is called the Voronoi partition support vector machine (VP-SVM). However, the boundaries are artificially selected and the boundary discontinuities do exist.

•

TBRF: When facing large-scale regression problems, our two-stage best-scored random forest (TBRF) divides the partition process of each tree in the random forest into two steps so that parallel computing can be exploited. As mentioned in Section 2, in stage one, the feature space is partitioned into non-overlapping cells following an adaptive random splitting criterion which is completely data-driven, and then random partitions are conducted on each cell separately in stage two. Careful observation of the overall partition shows that the two-stage splitting result is equivalent to that of carrying out uninterrupted splits on the feature space directly. Therefore, the computation can be significantly accelerated by assigning the cells to different cores without changing the structure of the trees in the random forest. Recall that the boundary discontinuities inevitable for the existing vertical methods come from the fact that they all apply a globally smooth algorithm independently to the cells of an artificial partition of the feature space. In this manner, compared to predictors obtained by applying the algorithm to the whole input space, predictors of the existing vertical methods can only maintain smoothness within cells, while discontinuities appear on those artificially chosen boundaries. Different from those vertical methods with fixed partitions, the randomness residing in our partition paves the way for ensemble learning. Even though each tree in the random forest is inherently a piecewise smooth predictor, by taking full advantage of ensemble learning, the resulting forest achieves an asymptotic smoothness, which naturally blurs or even almost eliminates the boundary discontinuities of each tree. Therefore, there is no need to underscore the concept of boundaries anymore. 
As a result, instead of adding continuity constraints to the boundaries, we settle the well-known discontinuities via ensemble learning within the random forest architecture.
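The two-stage idea can be sketched in a few lines. The following is a minimal one-dimensional illustration, not the implementation used in the experiments: the function names (`random_partition`, `grow_child`, `two_stage_tree`, `predict`) are hypothetical, candidates are scored by plain training squared error (an assumption that is only reasonable here because all candidates on a cell use the same number of splits), the number of stage-two splits is taken as `int(pro * n_j)`, and purely random rather than adaptive splits are used. A thread pool is used only to show that the per-cell tasks are independent; for CPU-bound work in CPython a process pool would be the natural choice.

```python
import bisect
import random
from concurrent.futures import ThreadPoolExecutor


def random_partition(lo, hi, n_splits, rng):
    """Purely random 1-D partition: pick a cell uniformly, cut it at a uniform point."""
    edges = [lo, hi]
    for _ in range(n_splits):
        i = rng.randrange(len(edges) - 1)
        edges.insert(i + 1, rng.uniform(edges[i], edges[i + 1]))
    return edges


def leaf_index(edges, x):
    """Index of the leaf [edges[i], edges[i+1]) containing x, clamped to valid leaves."""
    return min(max(bisect.bisect_right(edges, x) - 1, 0), len(edges) - 2)


def grow_child(task):
    """Stage two on one cell: keep the best of k random candidate trees."""
    lo, hi, xs, ys, k, n_splits, seed = task
    rng = random.Random(seed)
    cell_mean = sum(ys) / len(ys) if ys else 0.0
    best = None
    for _ in range(k):
        edges = random_partition(lo, hi, n_splits, rng)
        sums = [0.0] * (len(edges) - 1)
        counts = [0] * (len(edges) - 1)
        for x, y in zip(xs, ys):
            i = leaf_index(edges, x)
            sums[i] += y
            counts[i] += 1
        # Piecewise constant leaf values; empty leaves fall back to the cell mean.
        values = [s / c if c else cell_mean for s, c in zip(sums, counts)]
        sse = sum((y - values[leaf_index(edges, x)]) ** 2 for x, y in zip(xs, ys))
        if best is None or sse < best[0]:
            best = (sse, edges, values)
    return best[1], best[2]


def two_stage_tree(xs, ys, lo, hi, m, k, pro, seed=0, workers=4):
    """Stage one: m random cells. Stage two: independent child trees, one task per cell."""
    rng = random.Random(seed)
    cells = random_partition(lo, hi, m - 1, rng)
    buckets = [([], []) for _ in range(m)]
    for x, y in zip(xs, ys):
        j = leaf_index(cells, x)
        buckets[j][0].append(x)
        buckets[j][1].append(y)
    tasks = [
        (cells[j], cells[j + 1], bx, by, k, int(pro * len(bx)), seed + 1 + j)
        for j, (bx, by) in enumerate(buckets)
    ]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        children = list(pool.map(grow_child, tasks))
    return cells, children


def predict(tree, x):
    """Route x to its stage-one cell, then to its leaf in that cell's child tree."""
    cells, children = tree
    edges, values = children[leaf_index(cells, x)]
    return values[leaf_index(edges, x)]
```

Because each task carries its own seed, the result does not depend on how the tasks are scheduled across workers, which mirrors the statement that the two-stage result matches uninterrupted splitting.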

To provide a more intuitive explanation of how our forest model smooths the boundaries, we present the following simulation. We generate a dataset of $50,000$ noisy observations from $y_{i}=f(x_{i})+\epsilon_{i}$ for $i=1,\ldots,50000$ , where $f(x)=\sin x$ , and $x_{i}\sim\mathcal{U}(0,10)$ and $\epsilon_{i}\sim\mathcal{N}(0,0.2)$ are independently sampled. Comparisons are conducted between PK, VP-SVM and TBRF with results shown in Figures 2, 3 and 4, respectively.
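For readers who wish to regenerate a comparable dataset, a minimal sketch follows. The function name `simulate` and the seed are hypothetical, and the second argument of $\mathcal{N}(0,0.2)$ is read as the variance, which is an assumption; if it is meant as the standard deviation, pass `noise_var=0.04`.

```python
import math
import random


def simulate(n=50_000, noise_var=0.2, seed=0):
    """Draw x ~ U(0, 10) and y = sin(x) + eps with eps ~ N(0, noise_var)."""
    rng = random.Random(seed)
    sd = math.sqrt(noise_var)  # ASSUMPTION: 0.2 in N(0, 0.2) is the variance
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0.0, sd) for x in xs]
    return xs, ys
```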

It can be seen from Figure 2 that even though PK tries to join up the boundaries, the resulting predictor at the connecting point is somewhat artificial (upper-right image), or is sometimes still discontinuous (middle-right and lower-right images). The boundary discontinuities of VP-SVM can be easily observed in Figure 3, while our TBRF achieves asymptotic smoothness as the number of trees in the forest $T$ increases in Figure 4. Note that the above satisfactory performance of TBRF also has strong theoretical support:

(i) Although our algorithm is obtained by conducting local computations (e.g. stage two), we establish the global convergence and, in particular, the good global learning rates in Theorem 7. One fact is that if we conduct purely random splits on the feature space recursively, the resulting tree will converge. Now, our algorithm only divides the original recursive splits into two stages, and in terms of the resulting partition it is intrinsically the same as the original splitting procedure. The fact that the random tree structure is unchanged is the reason why TBRF maintains global convergence even under local computations.

(ii) The TBRF achieves asymptotic smoothness. Since we employ the purely random splitting criterion, for any point in the feature space, the probability of its being on the boundaries of one tree is zero in theory. Moreover, the probability of its being on the boundaries of several trees simultaneously remains zero. Therefore, for any point in the feature space, the estimation function is smooth with high probability at that point. Even if this point happens to lie on the boundaries of certain trees, it lies on the boundaries of the others only with low probability. Consequently, as the number of trees in the forest increases, the discontinuities occurring at this point will be asymptotically smoothed by other trees.
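The smoothing effect in (ii) can be illustrated with a small noise-free experiment. This is only a sketch under simplifying assumptions and does not reproduce Figure 4: each tree is a purely random one-dimensional partition of $[0,10]$, each leaf takes the exact mean of $\sin x$ over the leaf (so that only the effect of the partition is visible), and the names `random_tree`, `build_forest` and `max_jump` are hypothetical. The quantity `max_jump` is the largest change of the forest predictor between neighbouring points of a fine grid.

```python
import bisect
import math
import random


def random_tree(lo, hi, n_splits, rng):
    """Purely random 1-D tree whose leaf value is the exact mean of sin over the leaf."""
    edges = [lo, hi]
    for _ in range(n_splits):
        i = rng.randrange(len(edges) - 1)
        edges.insert(i + 1, rng.uniform(edges[i], edges[i + 1]))
    values = [
        (math.cos(a) - math.cos(b)) / (b - a) if b > a else math.sin(a)
        for a, b in zip(edges[:-1], edges[1:])
    ]
    return edges, values


def build_forest(T, n_splits=20, seed=0):
    """T independent random trees on [0, 10]."""
    rng = random.Random(seed)
    return [random_tree(0.0, 10.0, n_splits, rng) for _ in range(T)]


def forest_predict(forest, x):
    """Average of the piecewise constant tree predictions at x."""
    total = 0.0
    for edges, values in forest:
        i = min(max(bisect.bisect_right(edges, x) - 1, 0), len(values) - 1)
        total += values[i]
    return total / len(forest)


def max_jump(forest, lo=0.0, hi=10.0, n_grid=2000):
    """Largest change of the forest predictor between adjacent grid points."""
    grid = [lo + (hi - lo) * i / n_grid for i in range(n_grid + 1)]
    preds = [forest_predict(forest, x) for x in grid]
    return max(abs(b - a) for a, b in zip(preds, preds[1:]))
```

Comparing `max_jump(build_forest(1))` with `max_jump(build_forest(200))` shows the jumps of a single tree being averaged down, since a given boundary belongs to only one of the $T$ trees and its jump enters the forest with weight $1/T$.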

5.3 An Inclusive Framework

The TBRF investigated so far hinges on random trees whose leaf values are assigned according to the random tree decision rule shown in (1). More precisely, trees in the forest are piecewise constant predictors. However, experiments on high-dimensional data reveal that the piecewise constant random forest might not be adequate to provide accurate prediction. This imperfection actually serves as an opportunity for us to explore how inclusive and versatile the framework of the TBRF can be. Specifically, by incorporating some standard regression algorithms as alternative value assignment methods into our two-stage random forest framework, the prediction accuracy is substantially improved. Accordingly, the TBRF model branches into three different modeling paths according to which leaf value assignment approach is chosen:

•

TBRF-C: If the value of each leaf node of the trees is assigned the average of the output values of samples falling into that leaf, then the trees in the forest are piecewise constant predictors. We name this branch of our algorithm the piecewise constant TBRF, abbreviated as TBRF-C. This simple model captures the constant models and shows great efficiency when dealing with low-dimensional data.

•

TBRF-L: If the value of each leaf node of the trees is calculated via the linear regression based on the output values of samples falling into that leaf, then the trees in the forest are piecewise linear predictors. This model is named the piecewise linear TBRF. For the convenience of computation, we adopt the LS-SVMs for regression with linear kernel $K(X,X^{\prime})=X^{T}X^{\prime}$ where $X$ and $X^{\prime}$ are vectors in the input space. The assumption of this underlying linear model may improve the prediction accuracy in the low-dimensional examples.

•

TBRF-G: The prediction accuracy of the above two branches of the TBRF decreases when encountering high-dimensional data sets. To address this, we propose to use the LS-SVMs with Gaussian RBF kernel $K(X,X^{\prime})=\exp(-\gamma\|X-X^{\prime}\|^{2})$ , $\gamma>0$ for regression as the value assignment strategy. This piecewise Gaussian TBRF (TBRF-G) shows great superiority when facing data with high dimension.
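The three value assignment strategies can be sketched on the samples of a single leaf as follows. This is an illustrative sketch rather than the code used in the paper: `fit_constant`, `fit_lssvm`, `solve`, `linear_kernel` and `gaussian_kernel` are hypothetical names, the LS-SVM is written in its standard dual form with a bias term $b$ and regularization parameter $C$, and a plain Gaussian elimination is used so that the example needs no external library (it is only meant for the small sample sizes found in a leaf).

```python
import math


def solve(A, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


def linear_kernel(u, v):
    """K(X, X') = X^T X'."""
    return sum(a * b for a, b in zip(u, v))


def gaussian_kernel(gamma):
    """K(X, X') = exp(-gamma * ||X - X'||^2)."""
    return lambda u, v: math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def fit_constant(X, y):
    """TBRF-C style leaf: predict the mean response of the leaf samples."""
    mean = sum(y) / len(y)
    return lambda x: mean


def fit_lssvm(X, y, kernel, C):
    """TBRF-L / TBRF-G style leaf: LS-SVM regression, dual system with bias term."""
    n = len(y)
    A = [[0.0] + [1.0] * n]
    for i in range(n):
        row = [kernel(X[i], X[j]) + (1.0 / C if i == j else 0.0) for j in range(n)]
        A.append([1.0] + row)
    sol = solve(A, [0.0] + list(y))
    b, alpha = sol[0], sol[1:]
    return lambda x: b + sum(a * kernel(xi, x) for a, xi in zip(alpha, X))
```

With `linear_kernel` the fitted leaf predictor is affine in the input, and with `gaussian_kernel(gamma)` it is a smooth nonlinear function, which corresponds to the piecewise linear and piecewise Gaussian branches described above.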

By incorporating different mainstream regression strategies into the framework of our two-stage best-scored random forest, we are not only able to accelerate the algorithm via parallel computing, but also obtain predictions with high accuracy. Consequently, our TBRF is an inclusive and versatile framework that is well suited to large-scale regression problems.

5.4 Illustrative Examples

In this section, we design an illustrative real data analysis to determine which assignment strategy should be combined with the two-stage best-scored random forest framework when handling different types of data sets. Specifically, the analysis is based on the following three real data sets.

The first real data set is the TCO, which contains data collected by the NIMBUS-7/TOMS satellite to measure the total column of ozone over the globe on Oct 1, 1988. The data consist of $48,331$ two-dimensional samples representing the locations recorded by longitudes and latitudes. The observations are uniformly spread over the range of the longitude and latitude, and the learning goal is to predict the total column of ozone at any unobserved location. We randomly split this data set into a training set containing $70\%$ of the total observations and a test set containing the remaining $30\%$ of the observations.

The second real data set is the Physicochemical Properties of Protein Tertiary Structure Data Set ( $\tt{PTS}$ ) on UCI data sets, which contains $45,730$ samples of dimension $9$ . Similar to other high-dimensional data, the measurements are embedded in a low-dimensional subspace of the entire domain. The ratio of the number of samples in the training set to that in the testing set is $7:3$ .

The third real data set is the SARCOS, containing $44,484$ training and $4,449$ testing observations from a seven degrees-of-freedom SARCOS anthropomorphic robot arm. The $21$ -dimensional input data represent attributes such as positions, moving velocities, accelerations of the joints of the robot arm, etc. As for outputs, we use the first one out of the seven response variables for numerical study. The main learning task is to predict one of the joint torques in a robot arm when the input observation is available.

We now conduct empirical comparisons on MSE and training time among TBRF-C, TBRF-L and TBRF-G on these three data sets where $n$ and $d$ denote sample size and dimension, respectively. Since all three branches of the TBRF are based on Algorithm 1, we now discuss how to conduct the leaf values assignment:

•

TBRF-C: We take the average output of the samples falling into a given leaf as the corresponding leaf value.

•

TBRF-L: We utilize LS-SVMs with linear kernel based on the samples falling into a given leaf to derive a regression function on that leaf. To select the appropriate hyperparameter $C$ , which trades off the accuracy on the training data against the simplicity of the decision function, we randomly divide the samples falling into the leaf into two parts, where 70% of the samples are used for training and the rest for validation. With the selected hyperparameter, we derive the regression function on that leaf based on all the samples in it.

•

TBRF-G: We utilize LS-SVMs with Gaussian kernel based on the samples falling into a given leaf to derive a regression function on that leaf. In order to select the appropriate hyperparameter pair $(C,\gamma)$ , we randomly divide the samples falling into the leaf in the ratio $7:3$ , where each candidate pair on the grid is trained on $70\%$ of the samples and the remaining $30\%$ are used to choose among the candidate pairs. After that, regression with the chosen hyperparameters is conducted based on all samples falling in that leaf.
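The per-leaf selection procedure described for TBRF-L and TBRF-G (split the leaf samples $7:3$, score every candidate hyperparameter on the held-out part, then refit on all leaf samples) can be sketched generically. The names `select_and_refit`, `fit_ridge_1d` and `mse` are hypothetical, and the one-dimensional ridge learner is only a lightweight stand-in for the LS-SVM so that the example stays self-contained.

```python
import random


def fit_ridge_1d(xs, ys, C):
    """Stand-in learner: 1-D ridge regression with intercept, w = Sxy / (Sxx + 1/C)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    w = sxy / (sxx + 1.0 / C)
    return lambda x: ybar + w * (x - xbar)


def mse(model, xs, ys):
    """Mean squared error of a fitted model on (xs, ys)."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def select_and_refit(xs, ys, candidates, fit, seed=0, train_frac=0.7):
    """Pick the candidate with the lowest validation MSE, then refit on all samples."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    cut = int(train_frac * len(idx))
    tr, va = idx[:cut], idx[cut:]

    def score(param):
        model = fit([xs[i] for i in tr], [ys[i] for i in tr], param)
        return mse(model, [xs[i] for i in va], [ys[i] for i in va])

    best = min(candidates, key=score)
    return best, fit(xs, ys, best)  # final fit uses every sample in the leaf
```

For TBRF-G, `candidates` would be a list of $(C,\gamma)$ pairs and `fit` would wrap the Gaussian-kernel LS-SVM; the selection logic itself is unchanged.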

From now on, to train models for sufficiently large data sets, we use a professional compute server equipped with eight Intel(R) Xeon(R) CPU E7-8860 v3 (2.20GHz) 16-core processors, 64 GB RAM. For the sake of fairness, other methods for comparison in the following work are also trained on this server.

As illustrated in Algorithm 1, after partitioning the feature space into $m$ non-overlapping cells, we need to keep splitting each cell to build a child tree. However, considering that the numbers of samples falling into different cells differ, it is improper to apply the same number of splits to all cells. To address this, we let the number of splits on a cell be proportional to the number of samples falling into that cell, where the proportional coefficient $\tt{pro}$ is tunable. For all three data sets TCO, PTS and SARCOS, we adopt the same parameter configuration for grid search:

•

TBRF-C: $T\in\{20,50\}$ , $m\in\{20,50,200\}$ , $k\in\{10,100\}$ and ${\tt pro}\in\{0.2,0.5,0.8\}$ ;

•

TBRF-L: $T\in\{20,50\}$ , $m\in\{30,50,70\}$ , $k\in\{1,5,20\}$ and ${\tt pro}\in\{0.2,0.5,0.7\}$ ;

•

TBRF-G: $T\in\{20,50\}$ , $m\in\{5,20,40\}$ , $k\in\{1,5\}$ and ${\tt pro}\in\{0.005,0.05,0.2\}$ .
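The three grids above can be enumerated mechanically. The sketch below only encodes the values listed in the text; the dictionary layout and the helper name `configurations` are illustrative choices.

```python
from itertools import product

# Hyperparameter grids as listed above, one per TBRF variant.
GRIDS = {
    "TBRF-C": {"T": [20, 50], "m": [20, 50, 200], "k": [10, 100], "pro": [0.2, 0.5, 0.8]},
    "TBRF-L": {"T": [20, 50], "m": [30, 50, 70], "k": [1, 5, 20], "pro": [0.2, 0.5, 0.7]},
    "TBRF-G": {"T": [20, 50], "m": [5, 20, 40], "k": [1, 5], "pro": [0.005, 0.05, 0.2]},
}


def configurations(grid):
    """All combinations of the grid values, each as a dict of hyperparameters."""
    keys = list(grid)
    return [dict(zip(keys, combo)) for combo in product(*(grid[k] for k in keys))]
```

Counting the combinations gives $2\cdot 3\cdot 2\cdot 3=36$ configurations for TBRF-C, $2\cdot 3\cdot 3\cdot 3=54$ for TBRF-L and $2\cdot 3\cdot 2\cdot 3=36$ for TBRF-G.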

Figure 5 shows the main results. It can be observed that for the low-dimensional data, TBRF-C is both faster and more accurate than TBRF-L and TBRF-G, with TBRF-G taking longer to reach satisfactory results. For PTS, which consists of $9$ -dimensional data, TBRF-G is competitive with TBRF-C. For SARCOS, a higher-dimensional data set, TBRF-G shows its advantages over the other two methods in terms of test error and training time. Through these experiments, we find that TBRF-C is enough to handle low-dimensional data while TBRF-G is suited to high-dimensional data. Therefore, we treat TBRF-C and TBRF-G as the two main variants of the TBRF.

6 Numerical Evaluation

This section is concerned with the implementation issues and empirical assessments of the two-stage best-scored random forest. In order to further enhance the prediction accuracy of the forest predictor, we propose some improvement techniques in Section 6.1. Parameter analyses for the two main variants TBRF-C and TBRF-G are conducted separately in Section 6.2. Last but not least, comparison experiments on more real-world data are designed to test the sharpness of our theoretical predictions.

6.1 Experimental Improvements

This subsection introduces two experimental improvements available for ameliorating the prediction accuracy of TBRF predictors. According to the type of data at hand, we can selectively use these methods for better experimental results.

6.1.1 Adaptive Oblique Random Partition

So far, the partition processes considered have only been performed in an axis-parallel manner. Nevertheless, there is one caveat: if the underlying concept is defined by a polygonal space partitioning, then the axis-parallel partition may not be that accurate, for it can only approximate the correct model with a staircase-like structure. In contrast, the oblique partitioning approaches proposed in the literature (Breiman et al., 1984; Utgoff and Brodley, 1991; Brodley and Utgoff, 1992; Murthy et al., 1994) serve as a proper alternative.

Similar to the axis-parallel random splitting criterion demonstrated in Section 2.2.1, we can also formalize one possible building process following the oblique random splitting criterion. To be concrete, a random vector $O_{i}:=(L_{i},G_{i},W_{i})$ is introduced to describe the splitting mechanism at the $i$ -th step. Let $L_{i}$ denote the to-be-split cell at the $i$ -th step that is chosen from all cells presented at the $(i-1)$ -th step uniformly at random, $G_{i}\in\mathrm{R}^{d}$ represent the barycenter of cell $L_{i}$ and $W_{i}$ be the normal vector of the split hyperplane at that step. Note that, since all partitions conducted in the experiments follow the adaptive criterion, we now give the operation of the adaptive random oblique partition. First of all, $t$ samples are randomly selected from the training data set, each of which is certain to fall into one of the cells formed in the $(i-1)$ -th step. The cell into which most of these samples fall is assigned to be the to-be-split cell $L_{i}$ . Then, since the coordinates of the samples falling into cell $L_{i}$ among the $t$ samples are recorded, we can replace the barycenter $G_{i}$ by their centroid $X_{i}^{c}$ . Next, the actual split performed on $L_{i}$ is the part of the chosen hyperplane $W_{i}^{T}X+b_{i}=0$ , $X\in L_{i}$ . To realize the random splitting rule, we let the normal vectors of the hyperplanes $\{W_{i},\ i\in\mathbb{N}_{+}\}$ be i.i.d. draws from $\mathcal{U}[-1,1]^{d}$ and set $b_{i}:=-W_{i}^{T}X_{i}^{c}$ . Now that we have finished the operation of the $i$ -th step, the random tree with oblique partitions can be constructed by following this procedure recursively. An example of the construction process of the oblique partition is shown in Figure 6.
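One step of the adaptive oblique random partition can be sketched as follows. The representation is an illustrative choice and not taken from the paper: a cell is stored as a list of half-space constraints `(w, b, sign)`, the empty list stands for the whole feature space, points with $W^{T}X+b=0$ are assigned to the non-negative side, and the function names are hypothetical.

```python
import random


def side(w, b, x):
    """+1 if w.x + b >= 0, else -1 (points on the hyperplane go to the + side)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


def in_cell(cell, x):
    """A cell is the intersection of the half-spaces recorded along its path."""
    return all(side(w, b, x) == s for w, b, s in cell)


def oblique_split_step(cells, data, t, rng):
    """One adaptive oblique split: returns the new list of cells."""
    sample = rng.sample(data, t)
    members = [[x for x in sample if in_cell(cell, x)] for cell in cells]
    # To-be-split cell L_i: the cell containing most of the t sampled points.
    i = max(range(len(cells)), key=lambda j: len(members[j]))
    pts = members[i]
    d = len(pts[0])
    centroid = [sum(p[j] for p in pts) / len(pts) for j in range(d)]  # X_i^c
    w = [rng.uniform(-1.0, 1.0) for _ in range(d)]                    # W_i ~ U[-1,1]^d
    b = -sum(wi * ci for wi, ci in zip(w, centroid))                  # b_i = -W_i^T X_i^c
    children = [cells[i] + [(w, b, 1)], cells[i] + [(w, b, -1)]]
    return cells[:i] + children + cells[i + 1:]
```

Starting from `cells = [[]]` and calling `oblique_split_step` repeatedly grows an oblique tree; by construction the hyperplane of each split passes through the centroid of the sampled points in the chosen cell.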

6.1.2 Vacancy Filling

Since we employ a random partition, after completing the two-stage partition, we need to deal with the situation where there are child cells with no samples falling in. (In the previous sections, we name the results $V_{1},\ldots,V_{m}$ of the feature partition at stage one as cells. Since the splits continue to be applied to partition these cells into even smaller cells at stage two, the resulting small cells are called the child cells here for clarity.) Therefore, we come up with two different solutions to label the empty child cells, namely, to fill the vacancy:

•

Mean-based solution: For an empty child cell, the mean value of the samples contained in the cell where this empty child cell is located is assigned to it. This is a generally applicable vacancy filling method, available for both axis-parallel and oblique partitions. Moreover, this method has already been used in the illustrative examples.

•

$1$ -NN-based solution: For an empty child cell, we assign the value of the closest non-empty child cell to it where we take the Euclidean distance between the geometric centers of child cells as child cell distance. Note that this vacancy filling can only be applied to partitions with regular shapes, e.g. axis-parallel partition in our case.
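The two vacancy filling rules can be sketched as below. The data layout is an assumption made for illustration: `child_values` holds one value per child cell with `None` marking an empty child cell, `parent_samples_y` holds the responses of the stage-one cell containing those child cells, and each axis-parallel child cell is described by a list of `(lo, hi)` intervals, one per dimension. The function names are hypothetical.

```python
import math


def fill_mean_based(child_values, parent_samples_y):
    """Empty child cells receive the mean response of their stage-one cell."""
    parent_mean = sum(parent_samples_y) / len(parent_samples_y)
    return [parent_mean if v is None else v for v in child_values]


def fill_1nn_based(child_values, child_boxes):
    """Empty child cells copy the value of the nearest non-empty child cell,
    with distance measured between the geometric centers of the boxes."""
    centers = [[(lo + hi) / 2 for lo, hi in box] for box in child_boxes]
    donors = [i for i, v in enumerate(child_values) if v is not None]
    filled = list(child_values)
    for i, v in enumerate(child_values):
        if v is None:
            j = min(donors, key=lambda d: math.dist(centers[i], centers[d]))
            filled[i] = child_values[j]
    return filled
```

The mean-based rule needs only one pass over the child cells, whereas the $1$ -NN-based rule compares each empty child cell with every non-empty one, which is one plausible reason for the difference in running time.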

Experiments in the following work illustrate that our vertical method combined with the mean-based solution runs faster than that with the $1$ -NN-based solution; however, the $1$ -NN-based one provides more accurate results. Therefore, considering that there exists a trade-off between training time and accuracy, we choose the solution according to the data set. To be specific, if the two solutions lead to almost the same accuracy, we choose the one with shorter training time; if the accuracies of the two solutions are very different, we pick the more accurate one.

6.2 Parameter Analysis

The satisfying results of the illustrative examples in Section 5.4 suggest that the multiple tunable hyperparameters in our model may be the reason why better prediction accuracy is available. To verify this idea, we design the following experiments concentrating on the MSE and training time of the two main variants of the TBRF, namely TBRF-C and TBRF-G, separately for different hyperparameter values. Since the main focus is on the analysis of the two-stage forest structure of TBRF, we omit the hyperparameters in the value assignment strategies here. More specifically, the target hyperparameters include the number of trees in the forest $T$ , the number of cells in the feature space partition at stage one $m$ , the proportional coefficient related to the number of splits at stage two $\tt{pro}$ , and the number of candidates for each child tree $k$ . Considering that TBRF-C performs better when handling low-dimensional data, its analysis is carried out on the TCO data set, and we set the ratio of the training set to the testing set to $7:3$ . As for TBRF-G, since it shows its advantages on high-dimensional data, we introduce a fourth data set for its analysis.

The fourth data set is the Year Prediction MSD Data Set (MSD) available on UCI. This data contains $463,715$ training samples and $51,630$ testing samples with $90$ attributes describing timbre average and timbre covariance. Each example is a song (track) released between 1922 and 2011. The main task is to predict the year in which a song was released based on audio features associated with the song.

The parameter analysis of TBRF-C on TCO is shown in Figure 7. Experiments here are repeated $50$ times. Across all subfigures, increasing either $T$ or $k$ significantly improves the accuracy of our forest algorithm at the cost of only some additional training time. Moreover, for each of the different $(T,k)$ pairs under a fixed $m$ , there exists an optimal ${\tt pro}$ in stage two with regard to the test error.

The parameter analysis of TBRF-G on MSD is shown in Figure 8. Since the sample volume is relatively large, experiments here are repeated $10$ times. For experiments in the first row, we fix the proportional coefficient ${\tt pro}$ in stage two to $0.0001$ . It can be observed that for fixed $T$ , as $m$ or ${\tt pro}$ increases, which is approximately equivalent to partitioning the whole feature space into more child cells, the MSE increases. This is because more partitions lead to more discontinuous boundaries, so that the ensemble predictors are also less smooth than before. However, the increase in $m$ or ${\tt pro}$ is beneficial for the training time. Specifically, when applying LS-SVMs with Gaussian kernel for value assignment in each child cell, the number of samples per cell is reduced, which in turn speeds up the training procedure of the LS-SVMs and hence the whole algorithm. Moreover, as the number of trees in the forest $T$ increases, the ensemble predictor becomes smoother so that lower MSE can be obtained. The analysis of the experiments in the second row is nearly the same, only with fixed $m=299$ . To conclude, for TBRF-G, fewer splits on the feature space bring fewer discontinuous boundaries so that the ensemble predictor is smoother and lower MSE can be achieved, even though the training time is longer.

6.3 Real Data Comparisons

The previous sections have defined TBRF and established satisfying learning rates. Based on these, we ask whether the theoretical advantages of our algorithm in terms of accuracy are preserved in practice compared to other state-of-the-art vertical strategies. Moreover, we explore through experiments whether our vertical method can also save computational costs. Taking these into consideration, comparisons are made between our method and the other vertical methods previously listed.

In addition to the four data sets TCO, PTS, SARCOS and MSD considered earlier in Sections 5.4 and 6.2, two other data sets are introduced to give more comprehensive comparisons between approaches.

The fifth data set is the Appliances energy prediction Data Set (AEP) on UCI containing $19,735$ samples of dimension $27$ with attribute “date” removed from the original data set. The data is used to predict the appliances energy use in a low energy building.

The last real data set is the House-Price-8H prototask (HPP) of the Census-house Data Set available from the DELVE repository, which contains $22,784$ samples of dimension $8$ . Note that, for the sake of clarity, all house prices in the original data set have been rescaled to use one thousand dollars as the unit.

Note that except for the MSD and SARCOS data sets, we divide each of the other four data sets randomly into a training set and a testing set with $70\%$ and $30\%$ of the total number of samples, respectively. Furthermore, all experiments conducted here are repeated $10$ times. Now, we summarize the comparison results of TBRF, PK and VP-SVM in Table 1.

As can be observed from Table 1, compared to the other vertical methods, our TBRF has the lowest MSE on all data sets, and by taking full advantage of parallel computing, it also achieves close to the shortest training time. By contrast, PK cannot be parallelized, leading to the longest training time.

It is well worth mentioning that, in order to significantly enhance the prediction accuracy, appropriate variants of the TBRF are applied to the data sets listed in Table 1. For TCO, considering that it is only a $2$ -dimensional data set, we directly utilize the simplest and fastest TBRF-C to deal with the problem. We mention that we adopt the $1$ -NN-based TBRF-C for this data set since the MSE of mean-based TBRF-C is not smaller than those of PK and VP-SVM. For PTS, since the prediction accuracy of TBRF-L is not much higher than that of TBRF-C, while its training speed is much lower, we adopt the mean-based TBRF-C for this case. It is the same for SARCOS, AEP and HPP. However, the MSD data set is different from the others for it not only has large volume but, more importantly, has high dimensionality. To address this, we employ the mean-based TBRF-G, which is well suited to high-dimensional data sets, since the other value assignment approaches are mainly fit for low-dimensional data. Moreover, adaptive oblique random partition is also employed to obtain better results. Having witnessed the better performance of TBRF over VP-SVM, especially on MSD, we wonder about the reasons for this phenomenon. It may be because the boundary discontinuities prevent VP-SVM from being globally smooth, while our TBRF-G is able to achieve an asymptotic global smoothness thanks to the ensemble learning within the forest structure. This viewpoint is supported by the experimental results. For example, the lower part of Table 1 demonstrates that we obtain a much smaller MSE of $81.11$ by utilizing TBRF-G and it only takes $326.85$ seconds to run the process. An even more accurate result of $80.33$ as MSE can be achieved if we sacrifice more time.

Experimental results presented so far are those obtained with our preliminary tuning. More accurate results can be obtained if we sacrifice more training time, which distinguishes our method from the others, whose accuracy is hard to improve further. Readers who are interested in these experiments can try larger $T$ , $k$ and appropriate $m$ , ${\tt pro}$ to obtain even lower test errors.

7 Proof

To prove Proposition 8, we need the following result which follows from Lemma 6.2 in Devroye (1986).

Lemma 17

For a binary search tree with $n$ nodes, denote the saturation level $S_{n}$ as the number of full levels of nodes in the tree. Then for $k\geq 1$ , $\log n>k+\log(k+1)$ , there holds

[TABLE]

**Proof** [of Proposition 8] Obviously we have

[TABLE]

Therefore, in order to analyze the global approximation error of $g^{*}_{Z}$ , it suffices to consider the local approximation error of $g^{*}_{Z_{j}}$ corresponding to the cell $V_{j}$ , $j\in\{1,\ldots,m\}$ . For this purpose, we start by studying the behavior of the candidate best-scored random tree $g^{*}_{Z_{js}}$ for fixed $s\in\{1,\ldots,k_{j}\}$ .

Denote $g^{*}_{Z_{js},p}$ as the function that minimizes $\mathcal{R}_{L_{j},\mathrm{P}}(g)-\mathcal{R}_{L_{j},\mathrm{P}}^{*}$ in the function set $\hat{\mathcal{T}}_{Z_{js},p}$ defined in (5). Elementary calculation shows that

[TABLE]

where $\mathcal{A}:=\{A_{i}\}_{i=0}^{p}$ are the rectangular cells generated by $Z_{js}$ which forms a partition of $V_{j}$ and $\mathrm{P}_{X}(A_{i})$ is the measure of $A_{i}$ with respect to the marginal distribution of $X$ . Then

[TABLE]

where we decompose the error by the diameter of the cells in $\mathcal{A}$ . That is

[TABLE]

where $\mathrm{diam}(A)$ is the diameter of $A$ . In the following proof, we take the $L_{1}$ -norm into consideration which leads to the definition of the diameter of the cell as $\mathrm{diam}(A):=\sum_{i=1}^{d}V_{i}(A)$ , where $V_{i}(A)$ denotes the length of the $i$ -th dimension of the rectangle cell $A$ .

Let us now consider the first term in the decomposition (7). For $A\in\mathcal{A}_{1}$ , the diameter of the cell is less than $h$ . Then for any $x,z\in A\in\mathcal{A}_{1}$ , the distance between the two points satisfies $\|x-z\|_{1}\leq h$ . Using Assumption (4), we get

[TABLE]

For the second term in the decomposition (7), elementary considerations imply that

[TABLE]

Then by Markov’s inequality, we obtain

[TABLE]

As is mentioned previously, $Z$ is defined by $(Q_{1},\ldots,Q_{p},\ldots)$ where $Q_{i}=(L_{i},R_{i},S_{i})$ , $i=0,1,\ldots$ in Section 2.2.1. From these we find that the randomness of $Z$ is the result of three factors, which are randomness in selecting leaves, randomness in picking dimensions, and randomness in determining cut points. Next, in order to calculate the expectation with respect to $Z$ in (7), we conduct the following analysis, supposing that the tree has already been built. To be specific, for each dimension, we only need to consider the one cell that has the longest side length in that dimension. Additionally, since there is symmetry between dimensions, it suffices to first concentrate on one dimension. For example, we consider the $i$ -th dimension and denote the length of the $i$ -th dimension of the corresponding cell as $\max_{A\in\mathcal{A}_{Z}}V_{i}(A)=:V_{Z}$ . We do not have to know the exact construction procedure of the entire tree to calculate $\mathbb{E}_{Z}(V_{Z})$ . Instead, we consider three aspects which correspond intrinsically to the factors stated above, but from a different view: the total number of splits that generate that specific rectangular cell during the construction, $T_{Z}$ ; the number of those $T_{Z}$ splits that come from the $i$ -th dimension, $K_{Z}$ , which follows the binomial distribution $\mathcal{B}(T_{Z},1/d)$ ; and the proportional factors $U_{1},U_{2},\ldots,U_{K_{Z}}$ , which are independent and identically distributed from $\mathcal{U}[0,1]$ . Accordingly, the expectation with regard to $Z$ can be decomposed as $\mathbb{E}_{Z}=\mathbb{E}_{T_{Z}}\mathbb{E}_{K_{Z}|T_{Z}}\mathbb{E}_{U_{1}\ldots U_{K_{Z}}|K_{Z}}$ . Moreover, since $V_{j}$ is contained in a ball of radius $r_{j}$ and $\mathcal{A}$ forms a partition of $V_{j}$ , without loss of generality, we assume that the partition procedure is performed on a cube with side-length $2r_{j}$ . 
According to the above analysis, the expectation in the last step in (7) can be further analyzed as follows:

[TABLE]
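The averaging over $K_{Z}$ in this decomposition can be checked numerically under a simplifying reading that is an assumption of this sketch, not a statement of the paper's display: suppose each split along the $i$ -th dimension multiplies the side length by an independent random factor with mean $\mu$ . Then, conditionally on $T_{Z}=T$ , the expected shrinkage is $\mathbb{E}[\mu^{K}]$ with $K\sim\mathcal{B}(T,1/d)$ , which equals $(1-(1-\mu)/d)^{T}$ and is bounded by $\exp(-T(1-\mu)/d)$ . If the retained piece is the longer one, $\max(U,1-U)$ with $U\sim\mathcal{U}[0,1]$ , then $\mu=3/4$ and the factor becomes $(1-1/(4d))^{T}$ , which would be consistent with the exponent $c_{T}/(4d)$ appearing later. The function names are hypothetical.

```python
from math import comb, exp


def expected_shrink(T, d, mu):
    """E[mu**K] for K ~ Binomial(T, 1/d), summed exactly over K = 0..T."""
    q = 1.0 / d
    return sum(comb(T, k) * q**k * (1 - q) ** (T - k) * mu**k for k in range(T + 1))


def closed_form(T, d, mu):
    """Binomial theorem: E[mu**K] = (1 - (1 - mu)/d)**T."""
    return (1.0 - (1.0 - mu) / d) ** T


def exponential_bound(T, d, mu):
    """Upper bound from 1 - y <= exp(-y) with y = (1 - mu)/d."""
    return exp(-T * (1.0 - mu) / d)
```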

Note that when the underlying partition rule $Z$ has number of splits $p$ , the partition tree is statistically related to a random binary search tree with $p+1$ external nodes and $p$ internal nodes. Then, Lemma 17 states that for $k\geq 1$ and $\log(2p+1)>k+\log(k+1)$ ,

[TABLE]

where $S_{2p+1}$ is the saturation level. In our specific setting, $S_{2p+1}$ can be viewed as the minimal number of splits that generates $A\in\mathcal{A}$ . Now taking $k=\lfloor c_{T}\log(2p+1)\rfloor$ where $c_{T}<1$ and $c_{T}(1+\log(2e/c_{T}))<1$ , a simple calculation gives that

[TABLE]

where $C^{\prime}$ and $C$ are universal constants. As a result, we have

[TABLE]

where the last inequality follows from the fact that $1-1/x<e^{-1/x}$ for all $x>1$ . Since the function $f(c_{T})=1-c_{T}(1+\log(2e/c_{T}))-c_{T}/(4d)$ is monotone decreasing on $(0,1)$ for all $d$ , numerical computation shows that the largest constant for which $1-c_{T}(1+\log(2e/c_{T}))>c_{T}/(4d)$ holds for all $d\geq 1$ cannot be greater than $0.22563$ . Therefore, taking $c_{T}=0.22$ and $K=2C+2$ , there holds $\mathbb{E}_{Z}V_{Z}\leq Kr_{j}p^{-c_{T}/(4d)}$ . Hence, we obtain that

[TABLE]
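The numerical constant $0.22563$ quoted above can be reproduced with a short bisection. The sketch uses the function $f(c_{T})=1-c_{T}(1+\log(2e/c_{T}))-c_{T}/(4d)$ exactly as stated in the text, with the natural logarithm; the helper name `largest_admissible_c` and the tolerance are illustrative choices. Since $f$ increases with $d$ for fixed $c_{T}$ , the binding case is $d=1$ .

```python
import math


def f(c, d):
    """f(c) = 1 - c * (1 + log(2e / c)) - c / (4d), as defined in the text."""
    return 1.0 - c * (1.0 + math.log(2.0 * math.e / c)) - c / (4.0 * d)


def largest_admissible_c(d, tol=1e-10):
    """Root of f(., d) on (0, 1) by bisection; f is decreasing there."""
    lo, hi = 1e-9, 1.0  # f(lo, d) > 0 and f(hi, d) < 0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid, d) > 0:
            lo = mid
        else:
            hi = mid
    return lo
```

For $d=1$ the root is approximately $0.22563$ , so the choice $c_{T}=0.22$ keeps $f$ positive for every $d\geq 1$ .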

Combining (7), (7) and (30), we have

[TABLE]

In other words, with probability at least $1-Kr_{j}dh^{-1}p^{-c_{T}/(4d)}$ , the second term in the error decomposition (7) vanishes.

Now, the estimation (31) together with (27) yields that

[TABLE]

holds with probability at least $1-Kr_{j}dh^{-1}p^{-c_{T}/(4d)}$ . With $e^{-\theta}:=Kr_{j}dh^{-1}p^{-c_{T}/(4d)}$ , simple calculation shows that with probability $\mathrm{P}_{Z}$ at least $1-e^{-\theta}$ , there holds

[TABLE]

By minimizing the right-hand side of (32), we obtain that

[TABLE]

with the constant $c_{\alpha d}$ concerning only $\alpha$ and $d$ .

Then, the definition of $g^{*}_{Z_{j}}$ in (12) together with the independence of the $k_{j}$ trials implies that with probability at least $1-e^{-\theta}$ there holds

[TABLE]

Using the union bound, we obtain that

[TABLE]

with probability at least $1-me^{-\theta}$ . Once again with variable transformation $e^{-\tau}:=me^{-\theta}$ , we get the final result that

[TABLE]

with probability $\mathrm{P}_{Z}$ at least $1-e^{-\tau}$ .

**Proof** [of Lemma 13] This proof is conducted from the perspective of geometric constructions. Firstly, we concentrate on partitions with the number of splits $p=1$ . Because the dimension of the feature space is $d$ , the smallest number of sample points that cannot be divided by $p=1$ split is $d+2$ . Concretely, owing to the fact that $d$ points can be used to form $d-1$ independent vectors and hence a hyperplane in a $d$ -dimensional space, we might take the following case into consideration: There is a hyperplane consisting of $d$ points all from one class, say class $A$ , and two points $p_{1}^{B}$ , $p_{2}^{B}$ from the opposite class $B$ located on the opposite sides of this hyperplane, respectively. We denote this hyperplane by $H_{1}^{A}$ . In this case, points from two classes cannot be separated by one split (since the positions are $p_{1}^{B},H_{1}^{A},p_{2}^{B}$ ), so that we have $\mathrm{VC}(\mathcal{B}_{1})\leq d+2$ .

Next, when the partition has $p=2$ splits, we argue in a similar way by slightly extending the above case. We pick either of the two single sample points located on opposite sides of $H_{1}^{A}$ and add $d-1$ more points from class $B$ to it. Together, they form a hyperplane $H_{2}^{B}$ parallel to $H_{1}^{A}$. After that, we place one more sample point from class $A$ on the far side of this newly constructed hyperplane $H_{2}^{B}$. In this case, the positions of the two single points and the two hyperplanes are $p_{1}^{B},H_{1}^{A},H_{2}^{B},p_{2}^{A}$. Clearly, $p=2$ splits cannot separate these $2d+2$ points. As a result, we have $\mathrm{VC}(\mathcal{B}_{2})\leq 2d+2$.

Inductively, the above analysis extends to the general case of $p\in\mathbb{N}$ splits. We keep adding points to form $p$ mutually parallel hyperplanes, where any two adjacent hyperplanes are constructed from different classes. Without loss of generality, we consider the case $p=2k+1$, $k\in\mathbb{N}$, where two points (denoted by $p_{1}^{B}$, $p_{2}^{B}$) from class $B$ and $2k+1$ alternating hyperplanes are arranged as $p_{1}^{B},H_{1}^{A},H_{2}^{B},H_{3}^{A},H_{4}^{B},\ldots,H_{(2k+1)}^{A},p_{2}^{B}$. Accordingly, the smallest number of points that cannot be divided by $p$ splits is $dp+2$, leading to $\mathrm{VC}(\mathcal{B}_{p})\leq dp+2$.

Moreover, the hyperplanes can be generated either vertically or obliquely, as the proof requires. This completes the proof.
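As a small sanity check of the counting in this proof, the one-dimensional case $d=1$ can be verified by brute force. The sketch below rests on a simplifying assumption that is not taken from the paper: points are distinct and sorted on a line, $p$ splits are $p$ cut points, and each of the resulting intervals may receive either label. The function names `realizable` and `is_shattered` are hypothetical.

```python
from itertools import product

def realizable(labels, p):
    """A labeling of sorted, distinct points on a line can be produced with at
    most p cut points iff the label sequence changes sign at most p times."""
    sign_changes = sum(a != b for a, b in zip(labels, labels[1:]))
    return sign_changes <= p

def is_shattered(n_points, p):
    """True if every +/-1 labeling of n_points sorted points is realizable."""
    return all(realizable(labels, p)
               for labels in product((-1, 1), repeat=n_points))

for p in range(1, 5):
    # p + 1 points can still be shattered, p + 2 = dp + 2 points (d = 1) cannot.
    print(p, is_shattered(p + 1, p), is_shattered(p + 2, p))
```

Under this assumption, $p+2=dp+2$ is indeed the smallest number of points that $p$ splits fail to divide, in agreement with the bound for $d=1$.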

**Proof** [of Lemma 14] The inequality (21) follows directly from Lemma 13 and Theorem 9.2 in Kosorok (2008).

For the inequality (22), denote the covering number of $\boldsymbol{1}_{\mathcal{B}_{p}}$ with respect to $\|\cdot\|_{L_{2}(Q)}$ as $\mathcal{N}(\varepsilon):=\mathcal{N}\bigl{(}\boldsymbol{1}_{\mathcal{B}_{p}},\|\cdot\|_{L_{2}(Q)},\varepsilon\bigr{)}$ . Then, there exist $\boldsymbol{1}_{B_{1}},\ldots,\boldsymbol{1}_{B_{\mathcal{N}(\varepsilon)}}\in\boldsymbol{1}_{\mathcal{B}_{p}}$ such that the function set $\{\boldsymbol{1}_{B_{1}},\ldots,\boldsymbol{1}_{B_{\mathcal{N}(\varepsilon)}}\}$ is an $\varepsilon$ -net of $\boldsymbol{1}_{\mathcal{B}_{p}}$ with respect to $\|\cdot\|_{L_{2}(Q)}$ . This implies that for any $\boldsymbol{1}_{B}\in\boldsymbol{1}_{\mathcal{B}_{p}}$ , there exists a $j\in\{1,\ldots,\mathcal{N}(\varepsilon)\}$ such that $\|\boldsymbol{1}_{B}-\boldsymbol{1}_{B_{j}}\|_{L_{2}(Q)}\leq\varepsilon$ . Now, for all $g\in\tilde{\mathcal{T}}_{p}$ , the equivalent definition (20) of $\tilde{\mathcal{T}}_{p}$ tells us that there exists a $\boldsymbol{1}_{B}\in\boldsymbol{1}_{\mathcal{B}_{p}}$ such that $g$ can be written as $g=\boldsymbol{1}_{B}-\boldsymbol{1}_{B^{c}}=2\boldsymbol{1}_{B}-1$ . The above discussion yields that there exists a $j\in\{1,\ldots,\mathcal{N}(\varepsilon)\}$ such that for $g_{j}:=2\boldsymbol{1}_{B_{j}}-1$ , there holds

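$$\|g-g_{j}\|_{L_{2}(Q)}=\|(2\boldsymbol{1}_{B}-1)-(2\boldsymbol{1}_{B_{j}}-1)\|_{L_{2}(Q)}=2\|\boldsymbol{1}_{B}-\boldsymbol{1}_{B_{j}}\|_{L_{2}(Q)}\leq 2\varepsilon.$$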

This implies that $\{g_{1},\ldots,g_{\mathcal{N}(\varepsilon)}\}$ is a $2\varepsilon$ -net of $\tilde{\mathcal{T}}_{p}$ with respect to $\|\cdot\|_{L_{2}(Q)}$ . Consequently, we obtain

[TABLE]

We have thus proved the assertion.

**Proof** [of Lemma 15] First, we notice that for all $g\in\tilde{\mathcal{T}}_{r}$, the number of splits $p$ can be upper bounded by $q:=\sum_{j=1}^{m}\lceil(r/\lambda_{j})^{1/2}\rceil$. Then, the nested relation implies that $\tilde{\mathcal{T}}_{r}\subset\tilde{\mathcal{T}}_{q}$. Therefore, Lemma 14 implies that the covering number of $\tilde{\mathcal{T}}_{r}$ with respect to $L_{2}(\mathrm{D})$ satisfies

[TABLE]

For the least squares loss $L$, we have for any $h,\tilde{h}\in\tilde{\mathcal{H}}_{r}$,

[TABLE]

Therefore, as in the proof of Lemma 14, we obtain

[TABLE]

where the latter inequality follows from the estimate (33). Elementary calculations show that for any $0<\varepsilon<1/\max\{e,K\}$ and $q\geq 1$, there holds

[TABLE]

Consequently, for all $\delta\in(0,1)$ , we have

[TABLE]

For any fixed $\delta\in(0,1)$ , simple analysis shows that the right hand side of (34) is maximized at $\varepsilon^{*}=e^{-1/(2\delta)}$ and consequently we obtain

[TABLE]
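The maximizer $\varepsilon^{*}=e^{-1/(2\delta)}$ quoted above can be checked numerically. Since the display (34) is not reproduced here, the sketch below assumes, purely for illustration, that the $\varepsilon$-dependent factor on its right hand side has the form $\varepsilon^{2\delta}\log(1/\varepsilon)$; this form is an assumption, chosen because its maximizer coincides with the stated $\varepsilon^{*}$. The names `f` and `grid_argmax` are hypothetical.

```python
import math

def f(eps, delta):
    # Assumed stand-in for the eps-dependent factor in (34); not from the paper.
    return eps ** (2 * delta) * math.log(1 / eps)

def grid_argmax(delta, n=100_000):
    # Evaluate f on an equispaced grid strictly inside (0, 1) and keep the best point.
    grid = [(i + 1) / (n + 1) for i in range(n)]
    return max(grid, key=lambda eps: f(eps, delta))

delta = 0.25
eps_star = math.exp(-1 / (2 * delta))   # closed-form maximizer e^{-1/(2 delta)}
eps_grid = grid_argmax(delta)           # numerical maximizer on the grid
max_value = f(eps_star, delta)          # equals 1 / (2 e delta) for this f
```

For this assumed factor, setting the derivative to zero gives $\log(1/\varepsilon)=1/(2\delta)$, and the maximal value is $1/(2e\delta)$.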

Then Exercise 6.8 in Steinwart and Christmann (2008) implies that the entropy number bound of $\tilde{\mathcal{H}}_{r}$ with respect to $L_{2}(\mathrm{D})$ satisfies

[TABLE]

Obviously this bound holds for $\mathbb{E}_{\mathrm{D}\sim\mathrm{P}}\,e_{i}(\tilde{\mathcal{H}}_{r},\|\cdot\|_{L_{2}(\mathrm{D})})$ as well. The proof is finished.

**Proof** [of Lemma 16] First notice that for all $h\in\tilde{\mathcal{H}}_{r}$, there holds

[TABLE]

Now, since $\|h\|_{\infty}\leq 4=:B$ and $a:=\big(75d/(2e\delta)\sum_{j=1}^{m}(r/\lambda_{j})^{1/2}\big)^{1/(2\delta)}\geq B$ in Lemma 15, Theorem 7.16 in Steinwart and Christmann (2008) yields that

[TABLE]

where

[TABLE]

with

[TABLE]

We thus derive that

[TABLE]

where $c_{\delta,1}$ and $c_{\delta,2}$ are the same as defined above.

**Proof** [of Theorem 5] For the least squares loss $L$, the supremum bound

[TABLE]

holds for $B=4M^{2}$ and the variance bound

[TABLE]

holds for $V=16M^{2}$ and $\vartheta=1$ . Moreover, Lemma 16 implies that the expected empirical Rademacher average of $\mathcal{H}_{r}$ defined in (25) can be upper bounded by the function $\varphi_{n}(r)$ as

[TABLE]

where the constants $c_{\delta,1}$ and $c_{\delta,2}$ are defined as in the proof of Lemma 16. It can be easily concluded that for this $\varphi_{n}$ , the condition that $\varphi_{n}(4r)\leq 2\sqrt{2}\varphi_{n}(r)$ is satisfied. This implies that the statements of the Peeling Theorem 7.7 in Steinwart and Christmann (2008) still hold for $\varphi_{n}(4r)\leq 2\sqrt{2}\varphi_{n}(r)$ . Accordingly, the assumption concerning $\varphi_{n}$ and $r$ in Theorem 7.20 in Steinwart and Christmann (2008) should be modified to $\varphi_{n}(4r)\leq 2\sqrt{2}\varphi_{n}(r)$ and

[TABLE]

respectively. Some elementary calculations show that the condition $r\geq 75\varphi_{n}(r)$ is satisfied if

[TABLE]

where the constant $c_{d\delta}$ depends only on $d$, $\delta$ and $M$. Finally, from definition (23) we have $r^{*}\leq p^{2}(g_{L,D,Z})+\mathcal{R}_{L,\mathrm{P}}(g_{L,\mathrm{P},Z})-\mathcal{R}_{L,\mathrm{P}}^{*}$, and the assertion follows from Theorem 7.20 in Steinwart and Christmann (2008).

**Proof** [of Theorem 6] Theorem 5 together with Proposition 8 implies that with probability $\mathrm{P}_{(X\times Y)\otimes Z}$ at least $1-4e^{-\tau}$, there holds

[TABLE]

where $c_{d\delta}$ and $c_{\alpha d}$ are the constants defined in Theorem 5 and Proposition 8 respectively. To minimize the right hand side of (7), we choose the regularization parameter as

[TABLE]

where $c_{j}$ is a constant depending on $\alpha,\tau,m,d,\delta,M,r_{j},k_{j}$ and $\mathrm{P}_{X}(V_{j})$ . Therefore, we obtain that

[TABLE]

with the constant $C$ depending on $\alpha,\tau,\delta,d,m,M$ and $\{r_{j},k_{j},\mathrm{P}_{X}(V_{j})\}_{j=1}^{m}$ .

**Proof** [of Theorem 7] For the least squares loss $L$, there holds

[TABLE]

Consequently, combining this with the Cauchy-Schwarz inequality, we have for the two-stage best-scored random forest (17)

[TABLE]
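The step above rests on a standard fact for the least squares loss: the squared $L_{2}$ distance between an average of $T$ predictors and a target is at most the average of the individual squared distances. The following is a minimal numerical illustration on synthetic values; the data, the noise level and the helper name `excess_risk` are hypothetical and serve only to show the inequality.

```python
import random

def excess_risk(pred, target):
    # Empirical squared L2 distance between predictions and target values.
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

rng = random.Random(0)
target = [rng.uniform(-1, 1) for _ in range(200)]

# T = 10 synthetic "trees": the target perturbed by independent Gaussian noise.
trees = [[t + rng.gauss(0, 0.5) for t in target] for _ in range(10)]

# The "forest" averages the tree predictions pointwise.
forest = [sum(column) / len(trees) for column in zip(*trees)]

forest_risk = excess_risk(forest, target)
mean_tree_risk = sum(excess_risk(tree, target) for tree in trees) / len(trees)
print(forest_risk <= mean_tree_risk)
```

The inequality holds for any collection of predictors, not only for noisy copies of the target, since it is a consequence of the convexity of the square.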

The union bound together with Theorem 6 states that

[TABLE]

where $C$ is as in Theorem 6. As a result, with probability $\mathrm{P}_{(X\times Y)\otimes Z}$ at least $1-4e^{-\tau}$ , there holds

[TABLE]

where $C$ is a constant depending on $\alpha,\tau,\delta,d,m,M,T$ and $\{\{r_{j}^{t},k_{j}^{t},\mathrm{P}_{X}(V_{j}^{t})\}_{j=1}^{m}\}_{t=1}^{T}$.

8 Conclusion

In this paper, we proposed and studied, from a statistical learning perspective, a new vertical method for large-scale regression called the two-stage best-scored random forest (TBRF). The method suits the big data era because it is computationally efficient and takes full advantage of parallel computing. More importantly, it is designed to resolve the boundary discontinuity that has long been a problem for existing vertical strategies. "Two-stage" refers to dividing the original random tree splitting procedure into two. In stage one, we adopt an adaptive random partition method to split the feature space into non-overlapping cells; this stage serves as preprocessing for parallel computing. In stage two, we grow child best-scored random trees for regression on the cells and gather them to form a parent tree. Here, "best-scored" means selecting the candidate tree with the best empirical performance. The randomness in the partition makes ensemble learning natural, which smooths the discontinuous boundaries, so that the resulting forest achieves excellent asymptotic smoothness. Moreover, TBRF can be regarded as an inclusive and versatile framework in which mainstream regression strategies such as support vector regression can be incorporated as value assignment approaches for the leaves of the trees. Consequently, many effective and efficient variants of TBRF are available, to be chosen according to the data set at hand. Numerical experiments on synthetic and real data provide insight into TBRF, and comparisons with other state-of-the-art vertical methods further verify the effectiveness and efficiency of our random forest model.
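The two-stage procedure summarized above can be sketched in a few dozen lines. This is a simplified illustration under stated assumptions, not the authors' implementation: splits are axis-parallel, the stage-one partition is purely random rather than adaptive, every candidate tree uses the same number of splits so no regularization term is needed, leaves predict sample means, and all function and parameter names are hypothetical.

```python
import math
import random

def random_partition(lo, hi, n_splits, rng):
    """Split the box [lo, hi] into n_splits + 1 axis-parallel cells at random."""
    cells = [(list(lo), list(hi))]
    for _ in range(n_splits):
        c_lo, c_hi = cells.pop(rng.randrange(len(cells)))  # random cell
        dim = rng.randrange(len(c_lo))                      # random coordinate
        cut = rng.uniform(c_lo[dim], c_hi[dim])             # random cut point
        left_hi, right_lo = list(c_hi), list(c_lo)
        left_hi[dim], right_lo[dim] = cut, cut
        cells += [(c_lo, left_hi), (right_lo, c_hi)]
    return cells

def find_cell(cells, x):
    """Index of the first cell containing x (boundary ties go to the first hit)."""
    for i, (lo, hi) in enumerate(cells):
        if all(l <= v <= h for l, v, h in zip(lo, x, hi)):
            return i
    raise ValueError("x lies outside the partitioned box")

def fit_random_tree(X, y, lo, hi, n_splits, rng):
    """One purely random tree on [lo, hi]; each leaf predicts its sample mean."""
    cells = random_partition(lo, hi, n_splits, rng)
    fallback = sum(y) / len(y) if y else 0.0   # value for leaves without data
    sums, counts = [0.0] * len(cells), [0] * len(cells)
    leaf_of = [find_cell(cells, x) for x in X]
    for leaf, t in zip(leaf_of, y):
        sums[leaf] += t
        counts[leaf] += 1
    values = [s / c if c else fallback for s, c in zip(sums, counts)]
    sse = sum((values[leaf] - t) ** 2 for leaf, t in zip(leaf_of, y))
    return (cells, values), sse

def fit_best_scored_tree(X, y, lo, hi, n_splits, k, rng):
    """Best-scored child tree: lowest training error among k random candidates."""
    candidates = [fit_random_tree(X, y, lo, hi, n_splits, rng) for _ in range(k)]
    return min(candidates, key=lambda c: c[1])[0]

def fit_parent_tree(X, y, lo, hi, m, n_splits, k, rng):
    """Stage one: m random cells. Stage two: one best-scored child per cell."""
    stage_one = random_partition(lo, hi, m - 1, rng)
    owner = [find_cell(stage_one, x) for x in X]
    children = []
    for j, (c_lo, c_hi) in enumerate(stage_one):  # cells are independent tasks
        Xj = [x for x, o in zip(X, owner) if o == j]
        yj = [t for t, o in zip(y, owner) if o == j]
        children.append(fit_best_scored_tree(Xj, yj, c_lo, c_hi, n_splits, k, rng))
    return stage_one, children

def predict_parent(tree, x):
    stage_one, children = tree
    cells, values = children[find_cell(stage_one, x)]
    return values[find_cell(cells, x)]

def fit_forest(X, y, lo, hi, T, m, n_splits, k, seed=0):
    rng = random.Random(seed)
    return [fit_parent_tree(X, y, lo, hi, m, n_splits, k, rng) for _ in range(T)]

def predict_forest(forest, x):
    """Ensemble prediction: average of the T parent trees."""
    return sum(predict_parent(tree, x) for tree in forest) / len(forest)

# Toy usage on synthetic one-dimensional data (hypothetical settings).
data_rng = random.Random(1)
X = [[data_rng.random()] for _ in range(300)]
y = [math.sin(2 * math.pi * x[0]) + data_rng.gauss(0, 0.1) for x in X]
forest = fit_forest(X, y, lo=[0.0], hi=[1.0], T=10, m=4, n_splits=8, k=5)
```

Because the loop over stage-one cells touches disjoint subsets of the data, each child tree could be fitted by a separate worker. Averaging parent trees built on different random stage-one partitions is what softens the jumps at cell boundaries.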

Acknowledgments

The authors are grateful to Professor Ingo Steinwart for his valuable comments and suggestions. Hanyuan Hang and Yingyi Chen are supported by the fund for building world-class universities (disciplines) of Renmin University of China. Johan Suykens acknowledges the support of ERC Advanced Grant E-DUALITY (787960), Research Council KU Leuven C14/18/068, and FWO GOA4917N. The corresponding author is Yingyi Chen.

Bibliography

1. Kristin P. Bennett and Jennifer A. Blue. A support vector machine approach to decision trees. In Neural Networks Proceedings, 1998. IEEE World Congress on Computational Intelligence. The 1998 IEEE International Joint Conference on, volume 3, pages 2396–2401, 1998.
2. Gérard Biau. Analysis of a random forests model. The Journal of Machine Learning Research, 13:1063–1095, 2012.
3. Leo Breiman. Some infinite theory for predictor ensembles. University of California at Berkeley Papers, 2000.
4. Leo Breiman, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone. Classification and regression trees. Wadsworth Statistics/Probability Series. Wadsworth Advanced Books and Software, Belmont, CA, 1984.
5. Carla E. Brodley and Paul E. Utgoff. Multivariate versus univariate decision trees. University of Massachusetts, Department of Computer and Information Science, Amherst, MA, 1992.
6. Fu Chang, Chien-Yang Guo, Xiao-Rong Lin, and Chi-Jen Lu. Tree decomposition for large-scale SVM problems. The Journal of Machine Learning Research, 11:2935–2972, 2010.
7. Ronan Collobert and Samy Bengio. SVMTorch: Support vector machines for large-scale regression problems. The Journal of Machine Learning Research, 1(Feb):143–160, 2001.
8. Luc Devroye. A note on the height of binary search trees. Journal of the Association for Computing Machinery, 33(3):489–498, 1986.
