TL;DR
The paper introduces Decision Stream, a novel deep graph-based decision model that merges similar nodes to address overfitting and complexity issues in traditional decision trees, demonstrating significant performance improvements across various tasks.
Contribution
It proposes a new architecture that merges nodes based on similarity, creating a deep directed acyclic graph instead of a tree, improving over standard decision tree methods.
Findings
Outperforms standard decision trees with up to 35% error reduction
Effective on diverse tasks including classification and regression
Creates deep decision graphs with hundreds of levels
Table 1: Error (%) of single models.
| Model | Credit scoring | Tweet sentiments | Aileron control | MNIST | CIFAR-10 |
|---|---|---|---|---|---|
| DT | 9.73 | 45.2 | 25.5 | 12.5 | 13.9 |
| DS (merging disabled) | 6.33 | 39.9 | 18.2 | 25.0 | 19.7 |
| DS | 6.36 | 38.8 | 16.4 | 10.3 | 13.8 |
Table 2: Lowest error (%) achieved by DT and DS ensembles.
| Model | Ensemble method | Credit scoring | Tweet sentiments | CIFAR-10 | Aileron control | MNIST |
|---|---|---|---|---|---|---|
| DT | Gradient boosting or random forest | 7.62 | 30.4 | 13.2 | 23.9 | 2.91 |
| DS | Extremely randomized trees | 6.31 | 38.8 | 13.0 | 15.0 | 2.66 |
Table 3: Training time and error of DT and DS on synthetic data.
| Task | Condition | Model | Samples | Time, s | Error, % |
|---|---|---|---|---|---|
| Classification | Same number of samples | DT | | 39.2 ± 2.31 | 18.2 ± 2.33 |
| | | DS | | 62.4 ± 3.14 | 0.38 ± 0.08 |
| | | Ratio | 1 | 0.63 | 48 |
| | Similar time | DT | | 22.7 ± 1.62 | 19.9 ± 2.43 |
| | | DS | 5×10⁴ | 16.7 ± 1.12 | 6.36 ± 0.83 |
| | | Ratio | 5 | 1.36 | 3 |
| | Same accuracy | DT | | 22.7 ± 1.62 | 19.8 ± 2.40 |
| | | DS | 10⁴ | 3.82 ± 0.23 | 19.6 ± 2.28 |
| | | Ratio | 25 | 6 | 1 |
| Regression | Same number of samples | DT | | 28.3 ± 3.43 | 12.2 ± 1.28 |
| | | DS | | 60.1 ± 2.99 | 1.32 ± 0.12 |
| | | Ratio | 1 | 0.47 | 9 |
| | Similar time | DT | | 28.3 ± 3.43 | 12.2 ± 1.28 |
| | | DS | 5×10⁴ | 19.9 ± 1.29 | 6.68 ± 0.72 |
| | | Ratio | 20 | 1.42 | 1.83 |
| | Same accuracy | DT | | 17.1 ± 2.15 | 13.2 ± 0.13 |
| | | DS | 10⁴ | 5.6 ± 0.31 | 13.1 ± 0.11 |
| | | Ratio | 25 | 3 | 1 |
Decision Stream: Cultivating Deep Decision Trees
Dmitry Ignatov (1) and Andrey Ignatov (2)
(1) Russian Research Center of Huawei Technologies, Russia
(2) ETH Zurich, Switzerland
Email: [email protected], [email protected]
Abstract
Various modifications of decision trees have been extensively used in recent years due to their high efficiency and interpretability. Tree node splitting based on relevant feature selection is a key step of decision tree learning and, at the same time, its major shortcoming: recursive node partitioning leads to a geometric reduction of data quantity in the leaf nodes, which causes excessive model complexity and data overfitting. In this paper, we present a novel architecture, the Decision Stream, aimed at overcoming this problem. Instead of building a tree structure during the learning process, we propose merging nodes from different branches based on their similarity, estimated with two-sample test statistics, which leads to the generation of a deep directed acyclic graph of decision rules that can consist of hundreds of levels. To evaluate the proposed solution, we test it on several common machine learning problems: credit scoring, Twitter sentiment analysis, aircraft flight control, MNIST and CIFAR image classification, and synthetic data classification and regression. Our experimental results reveal that the proposed approach significantly outperforms the standard decision tree learning methods on both regression and classification tasks, yielding a prediction error decrease of up to 35 %.
Keywords: decision tree, data fusion, two-sample test statistic, distributed machine learning.
1 Introduction
With the recent growth in the amount of data available for analysis and exploration, there is an inevitable need for comprehensive and automated methods of intelligent data processing. The decision tree (DT) is one of the most popular techniques in this area, and due to its robustness and efficiency this prediction model has become a standard tool for machine learning and big data problems. The idea behind the method is to separate one complex decision rule into a union of primitive rules, which leads to another crucial property: a DT can be interpreted by a human more easily than many other machine learning techniques.
DT construction is performed by recursive data partitioning. At each stage the best splitting rule is determined, and the data from the current node is divided among child nodes according to the selected criterion. The same procedure is recursively applied to all new nodes in the generated tree until the stopping condition is met. While this is a fast and clear way of splitting data, the geometric reduction of data quantity in the nodes leads to their exhaustion and causes poor generalization and overfitting. Since repeated partitioning generates many nodes with the same or similar label distributions (especially in the lower layers), it is natural to merge such nodes to diminish the problem of data exhaustion and to continually increase the purity of the separated samples.
In this paper, we propose a novel method for regression and classification tasks, the Decision Stream (DS), in which decision branches are loosely split and merged like the natural streams of a waterfall (Fig. 1). In contrast to the classical decision tree algorithm, the proposed method builds a deep directed acyclic graph with a higher degree of connectivity by merging statistically indistinguishable nodes, which reduces the model width and improves generalization because the data samples in each node are more representative. The split and merge operations are combined in this approach and repeated at each step of the iterative learning process. The experiments demonstrate that the proposed method achieves notably better results than the standard decision tree approach, while also showing high computational performance during training in distributed systems. The data and software related to this paper are available on GitHub (https://github.com/aiff22/Decision-Stream).
The rest of the paper is organized as follows. Section 2 gives an overview of related work. Section 3 presents the proposed approach in detail, and Section 4 provides the experimental results obtained on real-world problems as well as on synthetic data. Section 5 summarizes our conclusions.
2 Related Work
Decision trees have been extensively studied, and a large number of their modifications have been developed in recent years. The proposed methods include the Iterative Dichotomiser 3 and its successor C4.5 [1], Classification and Regression Tree (CART) [2], Chi-squared Automatic Interaction Detection (CHAID) [3], Quick, Unbiased, Efficient, Statistical Tree (QUEST) [4] and various modifications of these algorithms [4]–[8]. Despite essential differences in the training procedure, they tend to show similar performance on many real-world regression and classification tasks [9]–[15].
The majority of these algorithms consider only node partitioning for decision tree construction, or use node merging as an auxiliary procedure that has no significant effect on the tree structure. For instance, the C4.5 and CART algorithms, as well as their modifications [4]–[8], perform only node splitting based on the selected features without any merging or fusion operations. The QUEST algorithm merges several classes into two superclasses to produce one binary split [16]. In [17], the number of terminal nodes is reduced by fusing the leaves with similar predictions after the training is finished. The CHAID algorithm merges data samples within a node, which is equivalent to using a modified splitting criterion. Data samples are fused based on the significance of their similarity estimated by test statistics: the χ² test [3] for categorical labels and the F-test [18] for continuous ones.
A fundamentally different approach to decision tree size reduction, based on the concept of Occam's razor, was proposed in [19]. There, a decision graph is constructed with a hill climbing heuristic by merging nodes from adjacent levels according to the minimum message length principle, with the goal of producing a model of minimum size while preserving or increasing its accuracy. This technique has demonstrated an advantage over standard decision trees in experiments [20]–[22].
In this work, we present the Decision Stream algorithm, which combines the classical decision tree learning method with a new procedure: statistically-based merging of nodes from the same and/or different levels of the DS. The predictive model grows until no improvements are achievable, considering different data recombinations, which results in a deep directed acyclic graph architecture and a statistically significant data partition.
3 Decision Stream
In this section, we describe the proposed Decision Stream algorithm. The main idea of the method is to merge similar nodes after each splitting iteration. The similarity is estimated using two-sample test statistics applied to compare the distributions of labels in each pair of nodes. The nodes are merged if the difference is statistically insignificant. This procedure eliminates the classical problem of decision trees, the progressive decrease of data quantity in the leaf nodes, and produces a more general structure, a directed acyclic graph (Fig. 1), which can be extremely deep. A more detailed explanation of the algorithm is provided below.
3.1 Node Merging with Two-Sample Test Statistics
An overview of the merging operation is shown in Fig. 2. After the classical decision tree branching, the merging algorithm takes as input the leaf nodes generated at the current stage (Fig. 2(a)) as well as previously obtained unsplit leaves from the upper levels of the model, and fuses statistically similar nodes (Fig. 2(b-c)) using an input parameter, the significance threshold. Since the nodes are merged based on the similarity of their label distributions, the merging procedure can be regarded as statistically-based label clustering.
The merging procedure (Algorithm 1) consists of an outer and an inner loop. In the outer loop the leaves are sorted in ascending order according to the number of associated samples. The inner loop consists of the following three steps:
1. A leaf is picked from the head of the sorted collection.
2. For each pair formed by the picked leaf and another leaf in the collection, we compute the similarity of the two nodes and then take the leaf that corresponds to the highest value. The similarity is calculated by a function based on two-sample test statistics (Section 3.3). The function returns the significance level representing the probability that the mean values of the labels associated with these two nodes are identical.
3. If the obtained significance level is above the significance threshold, the two leaves are merged into a new leaf whose parents are the union of the parents of the merged nodes.
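The three steps above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the authors' Algorithm 1: leaves are plain lists of numeric labels, parent bookkeeping is omitted, the outer loop is assumed to repeat until a pass produces no merge, and a two-sample Z-test stands in for the similarity function of Section 3.3. The names `z_similarity` and `merge_leaves` are hypothetical.

```python
import math
from statistics import fmean, pvariance


def z_similarity(a, b):
    """Two-sided p-value of a two-sample Z-test on the label means (assumed similarity)."""
    se = math.sqrt(pvariance(a) / len(a) + pvariance(b) / len(b))
    if se == 0.0:
        # Both samples are constant: identical means are fully similar.
        return 1.0 if fmean(a) == fmean(b) else 0.0
    z = abs(fmean(a) - fmean(b)) / se
    return math.erfc(z / math.sqrt(2.0))


def merge_leaves(leaves, threshold, similarity=z_similarity):
    """Greedily fuse leaves whose label samples are statistically similar."""
    leaves = [list(leaf) for leaf in leaves]
    merged_any = True
    while merged_any and len(leaves) > 1:            # outer loop
        merged_any = False
        leaves.sort(key=len)                         # smallest leaves first
        pending, kept = leaves, []
        while pending:                               # inner loop
            head, rest = pending[0], pending[1:]     # step 1: pick the head leaf
            if not rest:
                kept.append(head)
                break
            # Step 2: find the remaining leaf most similar to the head.
            scores = [similarity(head, other) for other in rest]
            best = max(range(len(rest)), key=scores.__getitem__)
            if scores[best] > threshold:             # step 3: merge the pair
                rest[best] = head + rest[best]
                merged_any = True
            else:
                kept.append(head)
            pending = rest
        leaves = kept
    return leaves


example = merge_leaves([[0, 0, 1, 0], [0, 1, 0, 0], [5, 6, 5, 6]], threshold=0.05)
```

In this toy input the first two leaves have the same label mean and are fused into one leaf of eight samples, while the third leaf stays separate.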
3.2 Decision Stream Training
The whole DS training procedure is described in Algorithm 2, where each learning iteration consists of two steps. In the first step, the DS grows using the classical decision tree branching operation: new nodes are created by splitting all current non-terminal leaves [2, 4, 12]. In the second step, the leaves are merged using the procedure described in Algorithm 1. A leaf is marked as terminal if it cannot be split into statistically different child nodes. The pair of splitting and merging steps is performed iteratively until the stopping criterion is met: if all leaves are terminal or the prediction accuracy no longer improves, DS training is finished and Algorithm 2 returns a reference to the root node of the generated DS. To estimate the prediction accuracy, we use a cross-node Gini impurity measure calculated for $N$ leaf nodes and $K$ classes:
$$I_G = \sum_{i=1}^{N} \frac{n_i}{n} \sum_{k=1}^{K} p_{ik} \, (1 - p_{ik}), \tag{1}$$
where $n$ and $n_i$ are the numbers of samples in all leaves and in leaf node $i$, respectively, and $p_{ik}$ is the fraction of samples of class $k$ in leaf $i$.
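A minimal sketch of such a measure follows, assuming it is the usual Gini impurity of each leaf weighted by that leaf's share of all samples. Leaves are represented as non-empty lists of class labels, and the function name `cross_node_gini` is hypothetical.

```python
from collections import Counter


def cross_node_gini(leaves):
    """Sample-weighted Gini impurity over leaves given as lists of class labels."""
    total = sum(len(leaf) for leaf in leaves)
    impurity = 0.0
    for leaf in leaves:
        counts = Counter(leaf)
        # Gini impurity of one leaf: sum over classes of p * (1 - p).
        gini = sum((c / len(leaf)) * (1.0 - c / len(leaf)) for c in counts.values())
        # Weight the leaf by its share of all samples.
        impurity += len(leaf) / total * gini
    return impurity


pure = cross_node_gini([[0, 0], [1, 1]])    # every leaf holds a single class
mixed = cross_node_gini([[0, 1], [0, 1]])   # every leaf is an even two-class mix
```

Pure leaves give an impurity of 0, and evenly mixed two-class leaves give 0.5, so a lower value indicates a cleaner partition.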
3.3 Splitting/Merging Criteria
The splitting and merging operations are performed according to the significance threshold. We take as the null hypothesis that the labels of two nodes come from the same distribution and have the same mean value. The null hypothesis is rejected at the significance level given by the threshold, and in case of rejection we consider the nodes statistically different. The similarity is estimated by a function that uses a pair of two-sample test statistics. We use the Z-test or Student's t-test for labels with a presumably normal distribution. The choice between the two follows the rule in [23]: the Z-test is applied if the size of both data samples is greater than 30, and Student's t-test otherwise. For labels with a non-normal distribution we use the Kolmogorov-Smirnov or Mann-Whitney U test: the first is applied if the size of the data samples is greater than 2, the second otherwise. We prefer the Kolmogorov-Smirnov test over the Mann-Whitney U test since it is more sensitive to variations of both location and shape of the empirical cumulative distribution function [24], and it provides better prediction accuracy in our experiments.
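The selection rule can be written down as a small dispatch function. This sketch only returns the name of the test to apply. It assumes the size conditions must hold for both samples in every case, which the text states explicitly only for the Z-test, and the function name `choose_test` and its flag `labels_normal` are hypothetical.

```python
def choose_test(n_a, n_b, labels_normal):
    """Name the two-sample test to use for samples of sizes n_a and n_b."""
    if labels_normal:
        # Presumably normal labels: Z-test for large samples, t-test otherwise.
        return "z-test" if n_a > 30 and n_b > 30 else "t-test"
    # Non-normal labels: Kolmogorov-Smirnov unless a sample is tiny.
    return "kolmogorov-smirnov" if n_a > 2 and n_b > 2 else "mann-whitney-u"


large_normal = choose_test(200, 45, labels_normal=True)
small_other = choose_test(2, 10, labels_normal=False)
```

Here `large_normal` is `"z-test"` and `small_other` is `"mann-whitney-u"`.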
We propose two different versions of the split function bestSplit: one for relatively small datasets, where a precise selection of the split is crucial, and one for large-scale datasets, where a trade-off between accuracy and running time is important because of the large number of training samples.
3.4 Node Splitting for Non-Distributed Data
For non-distributed datasets the splitting is performed according to Algorithm 3, which takes as input the significance threshold and a particular node. First, binary splits of the data within the node are generated for each unique value of every feature. Then the similarity function is calculated for each split, and the split with the lowest significance of similarity is selected. If this significance is smaller than the input threshold, the selected split is returned; otherwise the split is rejected and the node becomes terminal. Though this method is rather computationally expensive, it provides the best split quality and is reasonable for compact datasets.
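The exhaustive search can be sketched as follows. This is an illustration, not the authors' Algorithm 3: it assumes numeric features, a split of the form "feature value <= candidate", and a two-sample Z-test as the similarity function. The names `p_value` and `best_split` are hypothetical.

```python
import math
from statistics import fmean, pvariance


def p_value(a, b):
    """Two-sided p-value of a two-sample Z-test on the label means (assumed similarity)."""
    se = math.sqrt(pvariance(a) / len(a) + pvariance(b) / len(b))
    if se == 0.0:
        return 1.0 if fmean(a) == fmean(b) else 0.0
    return math.erfc(abs(fmean(a) - fmean(b)) / se / math.sqrt(2.0))


def best_split(rows, labels, threshold, similarity=p_value):
    """Return (feature_index, value, significance) of the best split, or None."""
    best = None
    for f in range(len(rows[0])):
        # One candidate binary split per unique value of the feature.
        for value in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= value]
            right = [y for row, y in zip(rows, labels) if row[f] > value]
            if not left or not right:
                continue
            p = similarity(left, right)
            if best is None or p < best[2]:          # keep the least similar children
                best = (f, value, p)
    # Reject the split when the children are not statistically different.
    if best is not None and best[2] < threshold:
        return best
    return None


rows = [(x, x % 2) for x in range(1, 9)]
labels = [0.0, 0.1, 0.0, 0.1, 5.0, 5.1, 5.0, 5.1]
found = best_split(rows, labels, threshold=0.01)
```

In this toy input the labels jump between the fourth and fifth rows, so the search picks feature 0 at value 4. With constant labels every candidate has significance 1 and the function returns `None`, which corresponds to the node becoming terminal.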
3.5 Node Splitting for Distributed Data
Using the above algorithm on large-scale datasets is infeasible in most cases, so we propose a different way of split selection designed for big data solutions. Instead of the greedy search, we split the data on the feature that is most correlated with the label within a particular node [25]. Another difference is that the method attempts to produce multiple leaves for each node, as shown in Fig. 3, since the large number of samples makes such a split robust.
Algorithm 4 shows the body of the method. The procedure starts with the function corr, which selects the feature that is most correlated with the label. The obtained feature is then used to split the samples in the current node. If the feature is categorical, the samples are split by its categories, each one forming a leaf node. If the feature is continuous, all samples are first sorted according to the values of the feature and then divided into ranges, the number of which depends on the number of samples in the node. Samples from the same range are then associated with one leaf node (Fig. 3(a)). At the next step, the adjacent leaves are merged using Algorithm 1 with the significance threshold until all neighboring nodes are statistically distinguishable (Fig. 3(b-c)). Finally, as soon as splitting with regard to the categorical or continuous feature is finished, the obtained leaf nodes are merged again (this time not only adjacent ones) and the leaves providing statistically different predictions are returned.
The strength of correlation between the feature and the label is estimated by the function corr as described in Algorithm 5: if both the feature and the label are continuous, the correlation strength is calculated as the coefficient of determination; otherwise it is computed as the correlation ratio.
Since both coefficients measure the same characteristic in the continuous and discrete cases, we can compare the values obtained for different types of features to select the best one.
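Both measures can be computed with a few sums. The sketch below uses the standard textbook definitions, which are assumptions here rather than the paper's exact formulas: the squared Pearson correlation for two continuous variables, and the squared correlation ratio (between-group sum of squares over total sum of squares) for a categorical feature with a continuous label. The function names are hypothetical.

```python
from statistics import fmean


def r_squared(x, y):
    """Squared Pearson correlation of two continuous sequences."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0                      # a constant variable explains nothing
    return sxy * sxy / (sxx * syy)


def correlation_ratio_squared(categories, y):
    """Share of the variance of y explained by the per-category means."""
    my = fmean(y)
    total = sum((b - my) ** 2 for b in y)
    if total == 0:
        return 0.0
    groups = {}
    for c, b in zip(categories, y):
        groups.setdefault(c, []).append(b)
    between = sum(len(g) * (fmean(g) - my) ** 2 for g in groups.values())
    return between / total


linear = r_squared([1, 2, 3, 4], [2, 4, 6, 8])
separated = correlation_ratio_squared(["a", "a", "b", "b"], [1.0, 1.0, 3.0, 3.0])
```

Both values lie in [0, 1] and both describe the fraction of label variance explained by the feature, which is what makes scores for continuous and categorical features comparable.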
4 Experiments
In this section, we describe the experiments conducted to evaluate the performance of the proposed Decision Stream algorithm. The solution was tested on five common machine learning problems, and on large-scale synthetic classification/regression data.
4.1 Datasets
- Credit scoring (https://www.kaggle.com/c/GiveMeSomeCredit/data/): classification problem, 2 classes, 10 features, 100K training and 20K test samples.
- Twitter sentiment analysis (http://alt.qcri.org/semeval2015/task10/): classification problem, 3 classes (positive, negative, neutral), 500 features, 6500 training and 824 test samples. Features were generated using the bag-of-words model.
- F16 aircraft control problem, Ailerons (http://www.dcc.fc.up.pt/~ltorgo/Regression/DataSets.html): regression problem, 40 features, 7154 training and 6596 test samples.
- MNIST handwritten digits classification (http://yann.lecun.com/exdb/mnist/): 10 classes, 784 features, 60K training and 10K test samples.
- CIFAR-10 image classification (https://www.cs.toronto.edu/~kriz/cifar.html): 10 classes, 1024 features, 50K training and 10K test samples. Features were extracted from the last convolutional layer of the pre-trained ResNet-18 [26] CNN.
To tune model parameters, the training data for each problem was split into training (90 %) and validation (10 %) subsets. The same data was used for training and testing both decision tree and Decision Stream algorithms.
To get the baseline accuracy, we used the Scikit-learn (http://scikit-learn.org, v. 0.18.1) implementation of decision trees, which provides four splitting criteria (information gain, Gini impurity, variance reduction, mean absolute error) and supports a pre-pruning procedure. The parameters of the DT were tuned for each dataset, including selection of the best criterion and tree pre-pruning.
Additionally, the DS and DT algorithms were tested on large-scale synthetic classification and regression data generated on the fly by the Spark Performance Tests Suite (https://github.com/databricks/spark-perf/, v. 1.6). Each generated sample consisted of 500 features (125 binary, 125 categorical with 20 categories and 250 continuous within the interval [0, 1]) and represented binary classification and regression problems. The detailed classification and regression results are provided below.
4.2 Tuning the Significance Threshold
The significance threshold is the key parameter of the DS algorithm, and in the first experiment our goal was to estimate its optimal value for each problem. The threshold was tuned as follows: for each dataset we varied it over a range of values up to 0.5 and for each value estimated the accuracy of DS on the validation set. For synthetic data the similarity of labels was estimated by the unpaired two-sample Z-test and Student's t-test, for all other datasets by the Kolmogorov-Smirnov and Mann-Whitney U nonparametric tests. For classification problems we use the standard accuracy metric, for regression tasks the weighted absolute percentage error:
$$\mathrm{WAPE}(X, Y) = \frac{\sum_{x \in X} |y_x - \hat{y}(x)|}{\sum_{y \in Y} |y|} \cdot 100\,\%, \tag{4}$$
where $X$ and $Y$ are the validation samples and their corresponding labels, and $y_x$ and $\hat{y}(x)$ are the label and the prediction for sample $x$, respectively.
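As a quick illustration, the regression metric can be computed as below. The sketch assumes the common definition of the weighted absolute percentage error, the sum of absolute errors divided by the sum of absolute label values, expressed in percent. The function name `wape` is hypothetical.

```python
def wape(labels, predictions):
    """Weighted absolute percentage error, in percent."""
    absolute_error = sum(abs(y - p) for y, p in zip(labels, predictions))
    scale = sum(abs(y) for y in labels)
    return 100.0 * absolute_error / scale


# Absolute errors 2, 2 and 3 against a label total of 60.
score = wape([10.0, 20.0, 30.0], [12.0, 18.0, 33.0])
```

Because the errors are normalized by the total label magnitude rather than per sample, a few labels close to zero do not dominate the score.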
The results of the experiment are presented in Fig. 4. The best accuracy was achieved at the significance threshold that is equal to 0.005 for credit scoring, 0.05 for tweets, 0.02 for aileron control, 0.005 for MNIST, 0.01 for CIFAR-10 and 0.001 for synthetic data. The obtained values were used for DS training in the following experiments.
4.3 Classification and Regression Results for Non-Distributed Data
This section presents the results obtained using the Decision Stream implementation for non-distributed data. Along with the single DS and DT models, we train their ensembles generated using five methods: random forest [27], extremely randomized trees [28], gradient boosting [29] and bagging [30]. Table 1 shows the results for the single DT and DS models and for a DS with the merging phase disabled.
Note that a DS with merging disabled is not equivalent to a DT, since in this version node splitting is performed only if the resulting child nodes are statistically distinguishable. The results demonstrate that disabling the merging phase leads to substantially different accuracy: on the first dataset, which has relatively low complexity (2 classes, 10 features), it prevents minor overfitting, while for the other datasets with higher complexity (3–10 classes or a continuous label, 40–1024 features) it results in an oversimplified tree model. Enabling the merging operation changes the situation: the growth does not stop at the stage of a simple predictive model with many similar leaf nodes, because the merging operation fuses them and thus forces the training procedure to continue, which can result in very deep decision graphs. Fig. 5 illustrates this oscillating behavior: the merge operation is performed until no more statistically distinguishable nodes can be produced. Table 1 demonstrates that this leads to significantly higher accuracy compared to the standard decision tree architecture: the error on the first four datasets is reduced by 34 %, 14 %, 35 % and 17 %, respectively.
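The quoted reductions can be checked directly against the single-model errors in Table 1. The snippet below recomputes the relative error reduction of DS over DT for the first four datasets; the variable names are illustrative.

```python
# Single-model errors (%) taken from Table 1.
dt_error = {"Credit scoring": 9.73, "Tweet sentiments": 45.2,
            "Aileron control": 25.5, "MNIST": 12.5}
ds_error = {"Credit scoring": 6.36, "Tweet sentiments": 38.8,
            "Aileron control": 16.4, "MNIST": 10.3}

# Relative reduction of the DS error with respect to the DT error, in percent.
reduction = {name: 100.0 * (dt_error[name] - ds_error[name]) / dt_error[name]
             for name in dt_error}
whole_percent = {name: int(value) for name, value in reduction.items()}
```

The computed reductions are roughly 34.6, 14.2, 35.7 and 17.6 percent, which match the quoted figures when truncated to whole numbers.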
Fig. 6 illustrates the dependency between the size and the predictive error of ensembles constructed from decision trees and Decision Streams. The best results for all datasets are summarized in Table 2.
As one can see, in all cases the best performance of Decision Stream ensemble was obtained when using the extremely randomized trees algorithm. The explanation of this effect is the following: in contrast to decision trees, the construction of Decision Streams involves a large number of recombinations caused by continuously repeating splitting and merging operations. The chances that DS will find the optimal solution are therefore higher compared to DT, but at the same time the resulting Decision Streams tend to provide less diverse results. The power of ensemble significantly depends on the diversity of predictors, which is thus lower in case of Decision Streams. Extremely randomized trees method partially solves this problem by using random features for training the DS, and therefore it tends to provide better final results compared to other methods.
In almost all cases the best results for Decision Stream are achieved by ensembles of size 500, the only exception being the Twitter sentiment analysis problem. The greatest advantage of DS over DT is obtained on the credit scoring and aileron control tasks, where a single DS outperforms all DT ensembles. Overall, the Decision Stream based methods have shown the best results on four out of five datasets with an average advantage of 16 %.
4.4 Classification and Regression Results for Large-Scale Data
The next set of experiments is conducted using distributed implementations of the Decision Stream and decision tree algorithms based on Apache Spark (http://spark.apache.org, v. 1.6). For the latter, the open-source implementation from the MLlib machine learning library is used. To perform the distributed computations, the models were run on a computer cluster with 4 nodes (48 executors), with 12 cores and 50 GB of RAM per node. The algorithms were trained on synthetic data generated by the Spark Performance Tests Suite for classification and regression problems.
Fig. 7 shows the classification error, the regression weighted absolute percentage error (Eq. 4) and the training time for DT with a depth ranging from 3 to 15 levels, and for DS, whose depth is regulated automatically. According to the results, decision trees trained with the variance reduction metric and a depth restriction of 5 levels demonstrate the best accuracy in both classification and regression tasks and so are used in our further experiments.
The prediction error of the Decision Stream algorithm is 9 to 48 times lower than the error obtained by DT. One explanation of this significant difference is that the generated synthetic data had a distribution close to normal, so the Z-test/t-test pair was especially effective in this case. Another is that the better accuracy was obtained at the expense of a higher running time of the DS algorithm.
To find the time required for DS and DT to provide the same accuracy, and to compare the accuracy after corresponding training periods, experiments with different quantities of training data and different numbers of models in the ensembles were carried out (Fig. 8). According to the empirical results presented in Table 3, DS needs significantly less data and less training time than DT to provide the same quality of prediction in both classification and regression tasks; for comparable training time Decision Stream demonstrates significantly better accuracy.
Gradient boosting and random forest ensembles improve DT performance, though the minimal error of ensembles with 30 decision trees is still higher than the corresponding error of 30 Decision Streams: the difference reaches 46 to 48 times for classification and 5.9 to 8.3 times for regression tasks. Thus, the proposed modification of Decision Stream for large-scale data demonstrates faster training and better accuracy on both regression and classification tasks compared to the DT algorithm.
5 Conclusion
In this paper we presented a novel decision tree based algorithm — a Decision Stream, which avoids the problems of data exhaustion and formation of unrepresentative data samples in decision tree nodes by merging the leaves from the same and/or different levels of the predictive model structure. By increasing the number of samples in each node and reducing the tree width, the proposed algorithm preserves statistically representative data and allows extremely deep graph architecture that can consist of hundreds of levels. The main parameter of the algorithm — significance threshold, determines the results of each split/merge operation and automatically defines the depth of the Decision Stream model.
The experiments demonstrated that Decision Stream algorithm shows a strong advantage over the standard decision tree learning methods on both regression and classification tasks in both versions: non-distributed for relatively small datasets, where a precise selection of the best data splits is crucial; and distributed, where a balance between the accuracy and computational performance should be maintained.
References
- [1] J. R. Quinlan, "C4.5: Programs for machine learning," Mach. Learn., vol. 16, pp. 235–240, 1994.
- [2] L. Breiman, J. H. Friedman, R. A. Olshen and C. J. Stone, Classification and regression trees. Belmont, Wadsworth, 1984.
- [3] G. V. Kass, "An exploratory technique for investigating large quantities of categorical data," Appl. Stat., vol. 29, pp. 119–127, 1980.
- [4] W.-Y. Loh, "Fifty years of classification and regression trees," Intern. Stat. Review, vol. 82, pp. 329–348, 2014.
- [5] A. Panhalkar and D. Doye, "An outlook in some aspects of hybrid decision tree classification approach: a survey," in ICDECT, 2016, pp. 85–95.
- [6] K. Kyoungok, "A hybrid classification algorithm by subspace partitioning through semi-supervised decision tree," Pattern Recogn., vol. 60, pp. 157–163, 2016.
- [7] H. Zhao and X. Li, "A cost sensitive decision tree algorithm based on weighted class distribution with batch deleting attribute mechanism," Inform. Sciences, vol. 378, pp. 303–316, 2017.
- [8] J. Sanz, J. Fernandez, H. Bustince, C. Gradin, M. Fortún and T. Belzunegui, "A decision tree based approach with sampling techniques to predict the survival status of poly-trauma patients," IJCIS, vol. 10, pp. 440–455, 2017.
