Cost-complexity pruning of random forests

Kiran Bangalore Ravi; Jean Serra

arXiv:1703.05430·stat.ML·July 20, 2017


TL;DR

This paper explores using out-of-bag samples for post-pruning decision trees within random forests, aiming to reduce model complexity while maintaining accuracy, based on empirical results from UCI datasets.

Contribution

It introduces a novel approach to improve random forest generalization by applying cost-complexity pruning using out-of-bag samples.

Findings

01 Reduced forest size without significant accuracy loss

02 Consistent improvement across multiple datasets

03 Effective post-pruning method for random forests

Abstract

Random forests perform bootstrap-aggregation by sampling the training samples with replacement. This enables the evaluation of out-of-bag error, which serves as an internal cross-validation mechanism. Our motivation lies in using the unsampled training samples to improve each decision tree in the ensemble. We study the effect of using the out-of-bag samples to improve the generalization error first of the decision trees and second of the random forest by post-pruning. A preliminary empirical study on four UCI repository datasets shows a consistent decrease in the size of the forests without considerable loss in accuracy.

Equations

\hat{p}_{tk} = \frac{1}{n_{t}} \sum_{x_{i} \in R_{t}} I(y_{i} = k)

l(y, \hat{y}) = \frac{1}{n_{t}} \sum_{i \in R_{t}} I(y_{i} \neq \hat{y}_{i}) = 1 - \max_{k} \hat{p}_{tk}

\sum_{k \neq k'} \hat{p}_{tk} \hat{p}_{tk'} = \sum_{k=1}^{K} \hat{p}_{tk} (1 - \hat{p}_{tk})

R_{\alpha}(T) = R(T) + \alpha \cdot |\text{Leaves}(T)|

R(T) = \sum_{t \in \text{Leaves}(T)} r(t) \cdot p(t) = \sum_{t \in \text{Leaves}(T)} R(t)

g(t) = \frac{R(t) - R(T_{t})}{|\text{Leaves}(T_{t})| - 1}

\mathcal{T}^{\ast}_{j} = \operatornamewithlimits{argmin}_{\alpha \in \mathcal{A}_{j}} \mathbb{E}\bigg[ \| Y_{\text{OOB}} - \mathcal{T}_{j}^{(\alpha)}(X^{j}_{\text{OOB}}) \|^{2} \bigg]

\{\mathcal{T}^{\ast}_{j}\}_{j=1}^{M} = \operatornamewithlimits{argmin}_{\alpha \in \cup_{j} \mathcal{A}_{j}} \mathbb{E}\bigg[ \| Y_{\text{train}} - \frac{1}{M} \sum_{j=1}^{M} \mathcal{T}_{j}^{(\alpha)}(X^{j}_{\text{OOB}}) \|^{2} \bigg]


Code & Models

Repositories

beedotkiran/randomforestpruning-ismm-2017 (official)


Taxonomy

Topics: Machine Learning and Data Classification · Neural Networks and Applications · Face and Expression Recognition

Full text

Cost-complexity pruning of random forests

B Ravi Kiran

CRIStAL Lab, UMR 9189, Université Charles de Gaulle, Lille 3

Jean Serra

Université Paris-Est, A3SI-ESIEE LIGM

Abstract

Random forests perform bootstrap-aggregation by sampling the training samples with replacement. This enables the evaluation of out-of-bag error, which serves as an internal cross-validation mechanism. Our motivation lies in using the unsampled training samples to improve each decision tree in the ensemble. We study the effect of using the out-of-bag samples to improve the generalization error first of the decision trees and second of the random forest by post-pruning. A preliminary empirical study on four UCI repository datasets shows a consistent decrease in the size of the forests without considerable loss in accuracy. (Footnote: a previous version appeared in the proceedings of ISMM 2017.)

Keywords: Random Forests, Cost-complexity Pruning, Out-of-bag

1 Introduction

Random forests [5] are an ensemble method that predicts by averaging over multiple instances of classifiers/regressors created by randomized feature selection and bootstrap aggregation (bagging). The model is one of the most consistently performing predictors in many real-world applications [6]. Random forests use CART decision tree classifiers [2] as weak learners and combine two methods: bootstrap aggregation [3] (subsampling input samples with replacement) and random subspace [11] (subsampling the variables without replacement). There has been continued work during the last decade on new randomized ensembles of trees. In extremely randomized trees [9], instead of choosing the best split among a subset of variables by searching for maximum information gain, a random split is chosen, which improves the prediction accuracy. To further the understanding of random forests, [7] split the training set points into structure points, which decide split points but are not involved in prediction, and estimation points, which are used for estimation. The partition into the two sets is done randomly to keep the consistency of the classifier.

Over-fitting occurs when the statistical model fits noise or misleading points in the input distribution, leading to poor generalization error and performance. When individual decision tree classifiers are grown deep, until each input sample fits into a leaf, the predictions generalize poorly on unseen data points. To handle this, decision trees are pruned. There has been a decade of study on the different pruning methods, error functions and measures [14], [16]. The common procedure is as follows: 1. Generate a set of “interesting trees”, 2. Estimate the true performance of each of these trees, 3. Choose the best tree. This is called post-pruning since we grow complete decision trees and then generate a set of interesting trees. CART uses cost-complexity pruning by associating with each cost-complexity parameter a nested subtree [10].

Though there has been extensive study of the different error functions used to perform post-pruning [13], [17], there have been very few studies on pruning random forests and tree ensembles. In practice, random forests are quite stable with respect to the number of tree estimators. They are shown to converge asymptotically to the true mean value of the distribution. [10] (page 596) perform an elementary study showing the effect of tree size on prediction performance by fixing the minimum node size (the smaller it is, the deeper the tree). This choice of the minimum node size is difficult to justify in practice for a given application. Furthermore, [10] discuss that the rarity of over-fitting in random forests is only a claim, and state that this asymptotic limit can over-fit the dataset; the average of fully grown trees can result in too rich a model and incur unnecessary variance. [15] demonstrates small gains in performance by controlling the depths of the individual trees grown in a random forest.

Finally, random forests and tree ensembles are generated by multiple randomization methods. There is no optimization of an explicit loss function. The core principle in these methods might be interpolation, as shown in the excellent study [18], though another important principle is the use of non-parametric density estimation in these recursive procedures [1].

In this paper we are primarily motivated by the internal cross-validation mechanism of random forests. The out-of-bag (OOB) samples are the set of data points that were not sampled during the creation of the bootstrap samples used to build the individual trees. Our main contribution is the evaluation of the predictive performance of cost-complexity pruning on random forests and other tree ensembles under two scenarios:

Setting the cost-complexity parameter by minimizing the individual tree prediction error on OOB samples for each tree.
Setting the cost-complexity parameter by minimizing average OOB prediction error by the forest on all training samples.

In this paper we do not study ensemble pruning, where the idea is to prune complete instances of decision trees away if they do not improve the accuracy on unseen data.

1.1 Notation and Formulation

Let $Z=\{\mathbf{x}^{i},y^{i}\}_{N}$ be a set of $N$ (input, output) pairs to be used in the creation of a predictor. Supervised learning consists of two types of prediction tasks: regression and classification, where in the former we predict continuous target variables, while in the latter we predict categorical variables. The inputs are assumed to belong to the space $X:=\mathbb{R}^{d}$, while $Y:=\mathbb{R}$ for regression and $Y:=\{C_{i}\}_{K}$ with $K$ different abstract classes for classification. A supervised learning problem aims to infer a function $f:X\to Y$ from the empirical samples $Z$ that “generalizes” well.

Decision trees fundamentally perform data-adaptive non-parametric density estimation to achieve classification and regression tasks. Decision trees evaluate the density function of the joint distribution $P(X,Y)$ by recursively and greedily splitting the feature space $X$, such that after each split the values of $Y$ in the children become “concentrated”, or in some sense well partitioned. The best split is chosen by evaluating the information gain (change in entropy) produced by a split. Finally, at the leaves of the decision tree one is able to evaluate the class/value by observing the subspace (like the bin for histograms) and predicting the majority class [8].

Given a classification target variable with classes $C_{k}\in\{1,2,3,\dots,K\}$, we denote the proportion of class $k$ in node $t$ as:

$$\hat{p}_{tk}=\frac{1}{n_{t}}\sum_{x_{i}\in R_{t}}I(y_{i}=k) \tag{1}$$

which represents the proportion of class $k$ observations in node $t$, whose decision region $R_{t}$ contains $n_{t}$ observations. The prediction in the case of classification is performed by taking the majority vote in a leaf, i.e. $\hat{y}_{i}=\operatornamewithlimits{argmax}_{k}\hat{p}_{tk}$ . The misclassification error is given by:

$$l(y,\hat{y})=\frac{1}{n_{t}}\sum_{i\in R_{t}}I(y_{i}\neq\hat{y}_{i})=1-\max_{k}\hat{p}_{tk} \tag{2}$$

The general idea in classification trees is the recursive partition of $\mathbb{R}^{d}$ by axis-parallel splits while minimizing the Gini index:

$$\sum_{k\neq k'}\hat{p}_{tk}\hat{p}_{tk'}=\sum_{k=1}^{K}\hat{p}_{tk}(1-\hat{p}_{tk}) \tag{3}$$
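As a small worked illustration of the node quantities introduced in this subsection (class proportions, misclassification error and Gini index), the sketch below evaluates them for a single node. The function name `node_stats` and the toy labels are assumptions made for this example and do not come from the paper's code.

```python
from collections import Counter


def node_stats(labels):
    """Class proportions, misclassification error and Gini index of one node.

    `labels` holds the class labels of the n_t training samples that fall
    into the node's decision region (hypothetical toy input).
    """
    n_t = len(labels)
    counts = Counter(labels)
    # proportion of each class k among the samples of the node
    p_hat = {k: c / n_t for k, c in counts.items()}
    # majority vote prediction, so the error is one minus the largest proportion
    misclassification = 1.0 - max(p_hat.values())
    # Gini index written as sum_k p_k * (1 - p_k)
    gini = sum(p * (1.0 - p) for p in p_hat.values())
    return p_hat, misclassification, gini


# Toy node with 8 samples from 3 classes.
p_hat, err, gini = node_stats([0, 0, 0, 1, 1, 2, 2, 2])
print(p_hat, err, gini)
```

A pure node (all labels equal) gives zero for both the misclassification error and the Gini index, which is why deep trees end up with leaves that fit very few points.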

In a decision split the parameters are the split dimension, denoted by $j$, and the split threshold $c$ . Given an input decision region $S$ we are looking for the dimension (here in three dimensions) that minimizes entropy. Since we are splitting along $d$ unordered variables, there are $2^{d-1}-1$ possible partitions of the $d$ values into two groups (splits), and an exhaustive search is computationally prohibitive. We therefore greedily decide the best split on a subset of variables, and apply this procedure iteratively until the termination condition is met.

As shown in figure 1, the set of splits over which the splitting measure is minimized is determined by the coordinates of the training set points. The number of variables or dimension $d$ can be very large (100s-1000 in bio-informatics). Most frequently in CART one considers the sorted coordinates, takes from them the split points where the class $y$ changes, and finally picks the split that best minimizes the impurity measure.
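The split search described here can be sketched as follows: for each variable, sort the coordinates, consider midpoints where the class label changes, and keep the threshold with the lowest weighted Gini impurity of the two children. This is a minimal illustration under stated assumptions, not the CART or scikit-learn implementation; the names `gini`, `best_split_1d` and `best_split` and the toy data are hypothetical, and all variables are searched rather than a random subset.

```python
def gini(labels):
    """Gini impurity 1 - sum_k p_k^2 (equal to sum_k p_k (1 - p_k))."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(k) / n) ** 2 for k in set(labels))


def best_split_1d(values, labels):
    """Best threshold on one variable as (threshold, weighted child impurity)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    xs = [values[i] for i in order]
    ys = [labels[i] for i in order]
    n = len(xs)
    best = (None, gini(ys))  # no split: impurity of the parent node
    for i in range(1, n):
        # only midpoints between distinct coordinates where the class changes
        if xs[i] == xs[i - 1] or ys[i] == ys[i - 1]:
            continue
        threshold = 0.5 * (xs[i] + xs[i - 1])
        left, right = ys[:i], ys[i:]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best


def best_split(X, y):
    """Greedy search over all variables, returns (dimension j, threshold c, score)."""
    best = (None, None, float("inf"))
    for j in range(len(X[0])):
        c, score = best_split_1d([row[j] for row in X], y)
        if c is not None and score < best[2]:
            best = (j, c, score)
    return best


X = [[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0]]
y = [0, 0, 1, 1]
split = best_split(X, y)
print(split)
```

In a random forest the loop over `j` would run over a random subset of the variables at each node, which is the random subspace step mentioned in the introduction.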

1.2 Cost-Complexity Pruning

The decision splits near the leaves often provide pure nodes with very narrow decision regions that over-fit to a small set of points. This over-fitting problem is resolved in decision trees by performing pruning [2]. There are several ways to perform pruning; we study cost-complexity pruning here. Pruning is usually not performed in decision tree ensembles, for example in random forests, since bagging takes care of the variance produced by unstable decision trees. Random subspace produces decorrelated decision tree predictions, which explore different sets of predictor/feature interactions.

The basic idea of cost-complexity pruning is to calculate a cost function for each internal node. The internal nodes are all nodes that are neither leaves nor the root node of a tree. The cost function is given by [10]:

$$R_{\alpha}(T)=R(T)+\alpha\cdot|\text{Leaves}(T)| \tag{4}$$

where

$$R(T)=\sum_{t\in\text{Leaves}(T)}r(t)\cdot p(t)=\sum_{t\in\text{Leaves}(T)}R(t) \tag{5}$$

$R(T)$ is the training error, $\text{Leaves}(T)$ gives the leaves of tree $T$ , $r(t)=1-\max_{k}p(C_{k})$ is the misclassification rate and $p(t)=n_{t}/N$ is the ratio of the number of samples $n_{t}$ in node $t$ to the total number of training samples $N$. The variation in cost complexity is given by $R_{\alpha}(T-T_{t})-R_{\alpha}(T)$ , where $T$ is the complete tree, $T_{t}$ is the subtree with root at node $t$ , and the tree pruned at node $t$ is $T-T_{t}$ . An ordering on the internal nodes for pruning is calculated by equating the cost-complexity function $R_{\alpha}$ of the pruned subtree $T-T_{t}$ to that of the branch at node $t$ :

$$g(t)=\frac{R(t)-R(T_{t})}{|\text{Leaves}(T_{t})|-1} \tag{6}$$

The final step is to choose the weakest link to prune by calculating $\operatornamewithlimits{argmin}g(t)$ . This calculation of $g(t)$ in equation (6) and then pruning the weakest link is repeated until we are left with the root node. This provides a sequence of nested trees $\mathcal{T}$ and associated cost-complexity parameters $\mathcal{A}$ .
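The weakest-link procedure can be sketched on a toy tree as below. For each non-leaf node the code computes $g(t)$ as the increase in training risk from collapsing the branch, divided by the number of leaves removed, prunes the node with the smallest value, and repeats until only the root is left. The dictionary representation, the function names and the node risks are assumptions made for illustration. The root is treated as a candidate here so that the sequence can end at the root, as in the text.

```python
def leaves_and_risk(node):
    """Return (number of leaves, summed leaf risk R(T_t)) of the branch at node."""
    if "children" not in node:
        return 1, node["R"]
    n_leaves, risk = 0, 0.0
    for child in node["children"]:
        n, r = leaves_and_risk(child)
        n_leaves += n
        risk += r
    return n_leaves, risk


def weakest_link(node, best=None):
    """Return (g, node) for the non-leaf node with the smallest g(t)."""
    if "children" not in node:
        return best
    n_leaves, branch_risk = leaves_and_risk(node)
    # g(t) = (R(t) - R(T_t)) / (|Leaves(T_t)| - 1)
    g = (node["R"] - branch_risk) / (n_leaves - 1)
    if best is None or g < best[0]:
        best = (g, node)
    for child in node["children"]:
        best = weakest_link(child, best)
    return best


def pruning_sequence(root):
    """Prune weakest links in place; return [(alpha, number of leaves), ...]."""
    sequence = [(0.0, leaves_and_risk(root)[0])]  # the unpruned tree
    while "children" in root:
        g, node = weakest_link(root)
        del node["children"]  # collapse the branch into a leaf
        sequence.append((g, leaves_and_risk(root)[0]))
    return sequence


# Toy tree: each node stores R(t) = r(t) * p(t); values are made up.
tree = {
    "R": 0.625,
    "children": [
        {"R": 0.1875, "children": [{"R": 0.0625}, {"R": 0.0625}]},
        {"R": 0.25},
    ],
}
sequence = pruning_sequence(tree)
print(sequence)
```

Each entry pairs a cost-complexity value with the size of the nested subtree it produces, which corresponds to the sets $\mathcal{A}$ and $\mathcal{T}$ in the text.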

In figure 3 we plot the training error and test (cross-validation) error on 5 folds (usually 20 folds are used; this is only for visualization). We observe a deterioration in both training and test performance. The smallest tree within 1 SE (standard error) of the minimum cross-validation error is usually chosen as the optimal subtree. In our studies we use the simpler option, which chooses the smallest tree with the smallest cross-validation (CV) error.
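The two selection rules mentioned here can be compared with a short sketch. Given per-fold CV errors for each subtree in the nested sequence, the minimum rule keeps the smallest tree that attains the lowest mean error, while the 1 SE rule keeps the smallest tree whose mean error is within one standard error of that minimum. The function name `select_subtree` and all numbers below are made up for illustration.

```python
from math import sqrt
from statistics import mean, stdev


def select_subtree(cv_errors, sizes, one_se=False):
    """Index of the selected subtree.

    cv_errors[i] is the list of per-fold errors of subtree i and
    sizes[i] is its number of leaves (hypothetical inputs).
    """
    means = [mean(errors) for errors in cv_errors]
    best = min(means)
    threshold = best
    if one_se:
        folds = cv_errors[means.index(best)]
        # standard error of the mean CV error of the best subtree
        threshold = best + stdev(folds) / sqrt(len(folds))
    candidates = [i for i, m in enumerate(means) if m <= threshold]
    # among admissible subtrees, keep the one with the fewest leaves
    return min(candidates, key=lambda i: sizes[i])


sizes = [8, 5, 3, 1]
cv_errors = [
    [0.20, 0.22, 0.18, 0.20, 0.20],
    [0.18, 0.20, 0.16, 0.18, 0.18],
    [0.185, 0.205, 0.165, 0.185, 0.185],
    [0.40, 0.42, 0.38, 0.40, 0.40],
]
by_min = select_subtree(cv_errors, sizes)
by_one_se = select_subtree(cv_errors, sizes, one_se=True)
print(sizes[by_min], sizes[by_one_se])
```

On these toy numbers the 1 SE rule accepts a slightly worse mean error in exchange for a smaller tree, whereas the minimum rule used in this study does not.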

2 Out-of-Bag (OOB) Cost-Complexity Pruning

In random forests, for each tree grown, about $\frac{1}{e}N$ samples are not selected in the bootstrap, and are called out-of-bag (OOB) samples. The value $\frac{1}{e}$ is the probability that a given sample is out-of-bag when $N\to\infty$ . The OOB samples are used to provide an improved estimate of node probabilities and node error rate in decision trees. They are also a good proxy for generalization error in bagging predictors [4]. OOB data is usually used to get an unbiased estimate of the classification error as trees are added to the forest.
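The $\frac{1}{e}$ figure follows from the fact that a given sample is missed by all $N$ draws of a bootstrap with probability $(1-\frac{1}{N})^{N}$, which tends to $e^{-1}\approx 0.368$. The sketch below checks this numerically; the sample size, the number of repetitions and the seed are arbitrary choices for this illustration.

```python
import math
import random


def oob_fraction(n_samples, rng):
    """Fraction of samples never drawn in one bootstrap of size n_samples."""
    drawn = {rng.randrange(n_samples) for _ in range(n_samples)}
    return 1.0 - len(drawn) / n_samples


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 10_000
# average OOB fraction over 20 simulated bootstrap samples
simulated = sum(oob_fraction(n, rng) for _ in range(20)) / 20
# probability that one sample is missed by all n draws
exact = (1.0 - 1.0 / n) ** n
print(simulated, exact, 1.0 / math.e)
```

So each tree leaves roughly 37% of the training set unused, and this is the portion that serves as its validation set in the pruning schemes studied here.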

The out-of-bag (OOB) error is the average error over the training set $Z$, where each sample is predicted only by the trees $\{T_{j}\}$ for which it lies in the OOB set $Z\setminus Z_{j}$, i.e. the trees whose bootstrap sample did not contain it. These trees predict as an ensemble, using majority voting (the sum of their class probabilities).

In our study (see figure 4) we use the OOB samples corresponding to a given tree $T_{j}$ in the random forest ensemble, to calculate the optimal subtree $T^{\ast}_{j}$ by cross-validation. There are two ways we propose to evaluate the optimal cost-complexity parameter, and thus the optimal subtree :

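• Independent tree pruning: calculate the optimal subtree by evaluating

$$\mathcal{T}^{\ast}_{j}=\operatornamewithlimits{argmin}_{\alpha\in\mathcal{A}_{j}}\mathbb{E}\bigg[\|Y_{\text{OOB}}-\mathcal{T}_{j}^{(\alpha)}(X^{j}_{\text{OOB}})\|^{2}\bigg] \tag{7}$$

where $X^{j}_{\text{OOB}}=X_{\text{train}}\setminus X_{j}$ , and $X_{j}$ are the samples used in the creation of tree $j$ .

• Global threshold pruning: calculate the optimal subtree by evaluating

$$\{\mathcal{T}^{\ast}_{j}\}_{j=1}^{M}=\operatornamewithlimits{argmin}_{\alpha\in\cup_{j}\mathcal{A}_{j}}\mathbb{E}\bigg[\|Y_{\text{train}}-\frac{1}{M}\sum_{j=1}^{M}\mathcal{T}_{j}^{(\alpha)}(X^{j}_{\text{OOB}})\|^{2}\bigg] \tag{8}$$

where the cross-validation uses the out-of-bag prediction error to evaluate the optimal $\{\alpha_{j}\}$ values. This considers a single threshold on the cost-complexity parameters, and each threshold selects a forest of subtrees. The optimal threshold is calculated by cross-validating over the training set.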
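The two selection schemes can be sketched on precomputed predictions. The code below assumes that, for each tree, we already have its sorted cost-complexity values (starting at 0 for the unpruned tree), the class predicted by each nested subtree for every training sample, and a mask marking which samples are out-of-bag for that tree. It uses the 0-1 error and hard votes instead of the squared norm and class-probability sums, and breaks ties towards the most pruned option. All names and toy values are assumptions for illustration and are not taken from the paper's repository.

```python
from collections import Counter


def independent_pruning(alphas, preds, oob, y):
    """Per tree, pick the alpha with the lowest error on that tree's OOB samples."""
    chosen = []
    for a_j, p_j, oob_j in zip(alphas, preds, oob):
        idx = [i for i, flag in enumerate(oob_j) if flag]
        errors = [sum(p[i] != y[i] for i in idx) / len(idx) for p in p_j]
        best = min(errors)
        # ties go to the largest alpha, i.e. the smallest subtree
        chosen.append(max(a for a, e in zip(a_j, errors) if e == best))
    return chosen


def subtree_at(a_j, threshold):
    """Index of the most pruned subtree whose alpha does not exceed threshold."""
    return max(i for i, a in enumerate(a_j) if a <= threshold)


def global_pruning(alphas, preds, oob, y):
    """Pick one threshold minimizing the OOB error of the whole forest."""
    thresholds = sorted({a for a_j in alphas for a in a_j})
    best = None
    for thr in thresholds:
        wrong = counted = 0
        for i, target in enumerate(y):
            # only trees for which sample i is out-of-bag get a vote
            votes = Counter(
                p_j[subtree_at(a_j, thr)][i]
                for a_j, p_j, oob_j in zip(alphas, preds, oob)
                if oob_j[i]
            )
            if votes:
                counted += 1
                wrong += votes.most_common(1)[0][0] != target
        err = wrong / counted
        if best is None or err <= best[0]:  # ties go to the larger threshold
            best = (err, thr)
    return best[1]


y = [0, 1, 0, 1]
alphas = [[0.0, 0.1, 0.3], [0.0, 0.2]]
preds = [
    [[0, 1, 0, 0], [0, 1, 0, 1], [0, 0, 0, 0]],  # tree 0, one row per subtree
    [[0, 1, 1, 1], [0, 1, 0, 1]],                # tree 1
]
oob = [[False, False, True, True], [True, True, False, False]]
per_tree = independent_pruning(alphas, preds, oob, y)
shared = global_pruning(alphas, preds, oob, y)
print(per_tree, shared)
```

The independent scheme returns one value per tree, while the global scheme returns a single threshold that is then applied to every tree's nested sequence.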

The independent tree pruning and global threshold pruning are demonstrated in algorithmic form in figure 4 as functions, BestTree_byCrossValidation_Tree and BestTree_byCrossValidation_Forest. The main difference between them lies in the cross-validation samples and predictor (tree vs forest) used.

The decision function of the decision tree $T_{j}$ (denoted by the same symbol) maps an input vector $\mathbf{x}\in\mathbb{R}^{d}$ to one of the $C_{k}$ classes to be predicted. To perform the prediction for a given sample, we follow the nodes hit by the sample until it reaches a leaf in the DT, and predict the majority vote of that leaf. The class-probability weighted majority vote across trees is frequently used since it provides a confidence score on the majority vote in each node across the different trees.
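The difference between a plain majority vote and a class-probability weighted vote can be seen on a small example. In the sketch below each tree contributes the class proportions of the leaf reached by the sample; the function names and the probabilities are made up for illustration.

```python
from collections import Counter


def hard_vote(tree_probas):
    """Majority vote over the per-tree predicted classes."""
    votes = Counter(max(range(len(p)), key=p.__getitem__) for p in tree_probas)
    return votes.most_common(1)[0][0]


def soft_vote(tree_probas):
    """Class with the largest summed probability, plus its mean probability."""
    n_classes = len(tree_probas[0])
    totals = [sum(p[k] for p in tree_probas) for k in range(n_classes)]
    winner = max(range(n_classes), key=totals.__getitem__)
    return winner, totals[winner] / len(tree_probas)


# Leaf class proportions from three trees for one sample (toy values).
probas = [[0.6, 0.4], [0.6, 0.4], [0.1, 0.9]]
hard = hard_vote(probas)
soft, confidence = soft_vote(probas)
print(hard, soft, confidence)
```

Two weakly confident trees outvote one confident tree under the plain majority vote, while the weighted vote follows the confident tree and also yields a score that can be read as the ensemble's confidence.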

In algorithm (3) we evaluate the cost-complexity pruning across the $M$ different trees $\{T_{j}\}$ in the ensemble, and obtain the optimal subtrees $\{T_{j}^{\ast}\}$ which minimize the prediction error on the OOB sample set $Z\setminus Z_{j}$ .

One of the dangers of using the OOB set to evaluate optimal subtrees individually is that in small datasets the OOB samples might no longer be representative of the original training sample distribution, and might produce large cross-validation errors. It also remains to be studied whether using the OOB samples as a cross-validation set effectively reduces the generalization error of the forest, even if we observe reasonable performance.

3 Experiments and evaluation

Here we evaluate the Random Forest (RF), Extremely randomized trees (ExtraTrees, ET) and Bagged Trees (Bagger, BTs) models from scikit-learn on datasets from the UCI machine learning repository [12]. Datasets of different sizes are chosen: Fisher’s Iris (150 samples, 4 features, 3 classes), red wine (1599 samples, 11 features, 6 classes), white wine (4898 samples, 11 features, 6 classes) and the digits dataset (1797 samples, 64 features, 10 classes). Code for the pruning experiments is available on GitHub (https://github.com/beedotkiran/randomforestpruning-ismm-2017).

In figure 6 we demonstrate the effect of pruning RFs, BTs and ETs on the different datasets. We observe that random forests and extra trees are often compressed by factors of 0.6 of the original size while maintaining test accuracy, which is not the case with BTs. To understand the effect of pruning, we plot in figure 7 the values $\mathcal{A}_{j}$ for the different trees in each of the ensembles. We observe that more randomization in RFs and ETs provides a larger set of potential subtrees to cross-validate over.

Another important observation is seen in figure 5: as we prune the forest globally, the forest’s accuracy on the training set does not decrease monotonically (as it does in the case of a single decision tree). As we prune the forest, some trees may improve their predictions while others degrade.

4 Conclusions

In this preliminary study of pruning of forests, we studied cost-complexity pruning of decision trees in bagged trees, random forest and extremely randomized trees. In our experiments we observe a reduction in the size of the forest which is dependent on the distribution of points in the dataset. ETs and RFs were shown to perform better than BTs, and were observed to provide a larger set of subtrees to cross-validate. This is the main observation and contribution of the paper.

Our study shows that the out-of-bag samples are a possible candidate to set the cost-complexity parameter and thus determine the best subtree for all DTs within the ensemble. This combines the two ideas originally introduced by Breiman, OOB estimates [4] and bagging predictors [5], while using the internal cross-validation OOB score of random forests to set the optimal cost-complexity parameters for each tree.

The speed of calculation of the forest of subtrees is an issue. In the calculation of the forest of subtrees $\{\mathcal{T}\}_{j=1}^{M}$ we evaluate the predictions at Unique( $\cup_{j}\{\mathcal{A}_{j}\}$ ) different values of the cost-complexity parameter, which represents the number of subtrees in the forest. In future work we propose to calculate the cost-complexity parameter for the forest instead of for individual trees.

Though these performance results are marginal, the future scope and goal of this study is to identify the sources of over-fitting in random forests and reduce them by post-pruning. This idea might not be incompatible with the smooth-spiked averaged decision function provided by random forests [18].

Bibliography

[1] Biau, G., Devroye, L., Lugosi, G.: Consistency of random forests and other averaging classifiers. JMLR 9, 2015–2033 (2008)
[2] Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J.: Classification and Regression Trees. Chapman and Hall, New York (1984)
[3] Breiman, L.: Bagging predictors. Machine Learning 24(2), 123–140 (1996)
[4] Breiman, L.: Out-of-bag estimation. Tech. rep., Statistics Department, University of California Berkeley (1996)
[5] Breiman, L.: Random forests. Machine Learning 45(1), 5–32 (2001)
[6] Criminisi, A., Shotton, J., Konukoglu, E.: Decision forests: A unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning. Found. Trends Comput. Graph. Vis. 7, 81–227 (Feb 2012)
[7] Denil, M., Matheson, D., De Freitas, N.: Narrowing the gap: Random forests in theory and in practice. In: ICML. pp. 665–673 (2014)
[8] Devroye, L., Györfi, L., Lugosi, G.: A Probabilistic Theory of Pattern Recognition. Applications of Mathematics, Springer, New York, Berlin, Heidelberg (1996)