HDI-Forest: Highest Density Interval Regression Forest

Lin Zhu; Jiaxing Lu; Yihong Chen

arXiv:1905.10101·cs.LG·July 23, 2019


TL;DR

HDI-Forest introduces a novel random forest-based method for high-quality prediction interval estimation that improves efficiency and accuracy over existing neural network and linear model approaches.

Contribution

It proposes HDI-Forest, a new method that reuses standard random forest trees for prediction interval estimation without additional training.

Findings

1. Reduces average PI width by over 20%
2. Achieves comparable or better coverage probability
3. Outperforms previous methods on benchmark datasets



Tables

Table 1: Characteristics of the datasets used in the experiments.

Dataset	Size	Dimensionality
Boston Housing	506	13
Parkinsons	5875	26
Wine Quality	1599	11
Forest Fires	517	13
Concrete Compression Strength	1030	9
Energy Efficiency	768	8
Naval Propulsion	11934	16
Combined Cycle Power Plant	9568	4
Protein Structure	45730	9
Communities	1994	128
Online News Popularity	39797	61

Table 2: Performance comparison of various methods. The results are averaged over 20 random runs, with best results in bold. Here, the best was chosen according to the strict rule that its performance should be equal to or better than all other methods measured by both PICP and MPIW. The last two rows show the average performance of all tested methods.

Dataset	Metrics	HDI-Forest	QRF	QR	${QR}_{GBDT}$	IntPred	QD-Ens
Boston Housing	PICP	0.93	0.92	0.92	0.92	0.92	0.91
Boston Housing	MPIW	1.02	1.19	2.28	1.70	1.79	1.16
Parkinsons	PICP	0.99	0.99	0.95	0.99	0.96	0.98
Parkinsons	MPIW	0.11	0.45	1.18	0.88	0.98	0.62
Wine Quality	PICP	0.92	0.92	0.91	0.92	0.92	0.92
Wine Quality	MPIW	1.11	3.33	2.64	2.47	2.54	2.33
Forest Fires	PICP	0.95	0.94	0.95	0.94	0.94	0.94
Forest Fires	MPIW	0.81	1.24	1.09	1.04	1.03	0.96
Concrete Compression Strength	PICP	0.94	0.94	0.94	0.94	0.94	0.94
Concrete Compression Strength	MPIW	1.18	1.22	2.22	2.23	1.87	1.09
Energy Efficiency	PICP	0.97	0.96	0.95	0.97	0.95	0.97
Energy Efficiency	MPIW	0.39	0.50	1.73	0.79	1.56	0.47
Naval Propulsion	PICP	0.99	0.96	0.98	0.98	0.98	0.98
Naval Propulsion	MPIW	0.24	0.68	0.89	1.34	0.73	0.28
Combined Cycle Power Plant	PICP	0.95	0.95	0.95	0.95	0.95	0.95
Combined Cycle Power Plant	MPIW	0.75	0.78	0.97	0.90	0.84	0.86
Protein Structure	PICP	0.95	0.94	0.95	0.95	0.95	0.95
Protein Structure	MPIW	1.77	1.82	2.76	2.36	2.15	2.27
Communities	PICP	0.92	0.92	0.89	0.92	0.92	0.87
Communities	MPIW	1.50	1.73	2.03	1.69	1.94	1.74
Online News Popularity	PICP	0.96	0.96	0.95	0.96	0.95	0.96
Online News Popularity	MPIW	1.18	1.98	1.27	1.72	1.38	1.60
Average Performance	PICP	0.95	0.95	0.94	0.95	0.94	0.94
Average Performance	MPIW	0.91	1.35	1.73	1.56	1.53	1.22

Equations

$$P\left(l\leq Y\leq u\mid X=x\right)\geq 1-\alpha \tag{1}$$

$$P\left(Y\leq Q_{\tau}\left(x\right)\mid X=x\right)=\tau \tag{2}$$

$$\min_{l,u}\; u-l\quad\text{s.t.}\quad P\left(l\leq Y\leq u\mid X=x\right)\geq 1-\alpha \tag{3}$$

$$\frac{\text{MPIW}}{r}\left(1+\exp\left(\lambda\max\left(0,\left(1-\alpha\right)-\text{PICP}\right)\right)\right) \tag{4}$$

$$\text{MPIW}_{\text{capt}}+\lambda\frac{n}{\alpha\left(1-\alpha\right)}\max\left(0,\left(1-\alpha\right)-\text{PICP}\right)^{2} \tag{5}$$

$$w\left(x_{i},x,\theta\right)=\frac{\mathbb{I}\left(x_{i}\in\mathcal{X}_{l\left(x,\theta\right)}\right)}{\left|\left\{j:x_{j}\in\mathcal{X}_{l\left(x,\theta\right)}\right\}\right|} \tag{6}$$

$$\hat{y}_{\text{single-tree}}\left(x,\theta\right)=\sum_{i=1}^{n}w\left(x_{i},x,\theta\right)y_{i} \tag{7}$$

$$\hat{y}\left(x\right)=\frac{1}{m}\sum_{i=1}^{m}\hat{y}_{\text{single-tree}}\left(x,\theta_{i}\right) \tag{8}$$

$$\hat{y}\left(x\right)=\sum_{i=1}^{n}w\left(x_{i},x\right)y_{i} \tag{9}$$

$$w\left(x_{i},x\right)=\frac{1}{m}\sum_{j=1}^{m}w\left(x_{i},x,\theta_{j}\right) \tag{10}$$

$$F\left(y\mid X=x\right)=E\left(\mathbb{I}\left(Y\leq y\right)\mid X=x\right) \tag{11}$$

$$\hat{F}\left(y\mid X=x\right)=\sum_{i=1}^{n}w\left(x_{i},x\right)\mathbb{I}\left(y_{i}\leq y\right) \tag{12}$$

$$\hat{P}\left(l\leq Y\leq u\mid X=x\right)=\sum_{i=1}^{n}w\left(x_{i},x\right)\mathbb{I}\left(l\leq y_{i}\leq u\right) \tag{13}$$

$$\min_{l,u}\; u-l\quad\text{s.t.}\quad\hat{P}\left(l\leq Y\leq u\mid X=x\right)\geq 1-\alpha \tag{14}$$

$$u,l\in\left\{y_{i}\right\}_{i=1}^{n} \tag{15}$$

$$\hat{P}\left(l\leq Y\leq u\mid X=x\right)\geq 1-\alpha \tag{16}$$

$$l_{\text{alt}}=\min_{i}\left\{y_{i}\mid y_{i}\geq l,\;1\leq i\leq n\right\} \tag{17}$$

$$u_{\text{alt}}=\max_{i}\left\{y_{i}\mid y_{i}\leq u,\;1\leq i\leq n\right\} \tag{18}$$

$$u_{\text{alt}}-l_{\text{alt}}<u-l \tag{19}$$

$$\begin{aligned}\hat{P}\left(l_{\text{alt}}\leq Y\leq u_{\text{alt}}\mid X=x\right)&=\sum_{i=1}^{n}w\left(x_{i},x\right)\mathbb{I}\left(l_{\text{alt}}\leq y_{i}\leq u_{\text{alt}}\right)\\&=\sum_{i=1}^{n}w\left(x_{i},x\right)\mathbb{I}\left(l\leq y_{i}\leq u\right)\\&=\hat{P}\left(l\leq Y\leq u\mid X=x\right)\\&\geq 1-\alpha\end{aligned} \tag{20}$$

$$\min_{i,j}\;\widetilde{y}_{j}-\widetilde{y}_{i}\quad\text{s.t.}\quad\sum_{k=i}^{j}w_{k}\geq 1-\alpha \tag{21}$$

$$w_{k}=\sum_{i=1}^{n}\mathbb{I}\left(y_{i}=\widetilde{y}_{k}\right)w\left(x_{i},x\right) \tag{22}$$

$$\min_{j}\;\widetilde{y}_{j}-\widetilde{y}_{i}\quad\text{s.t.}\quad\sum_{k=i}^{j}w_{k}\geq 1-\alpha \tag{23}$$

$$\min_{j}\; j\quad\text{s.t.}\quad\sum_{k=i}^{j}w_{k}\geq 1-\alpha \tag{24}$$

$$j_{\text{opt}}\left(i_{2}\right)\leq j_{\text{opt}}\left(i_{1}\right) \tag{25}$$


Code & Models

Repositories

chancejohnstone/piRF
none


Taxonomy

Topics: Anomaly Detection Techniques and Applications · Neural Networks and Applications · Machine Learning and Data Classification

Full text

Lin Zhu (contact author), Jiaxing Lu, Yihong Chen

Ctrip Travel Network Technology Co., Limited.

{zhulb, lujx, yihongchen}@Ctrip.com

Abstract

By seeking the narrowest prediction intervals (PIs) that satisfy the specified coverage probability requirements, the recently proposed quality-based PI learning principle can extract high-quality PIs that better summarize the predictive certainty in regression tasks, and has been widely applied to solve many practical problems. Currently, the state-of-the-art quality-based PI estimation methods are based on deep neural networks or linear models. In this paper, we propose Highest Density Interval Regression Forest (HDI-Forest), a novel quality-based PI estimation method that is instead based on Random Forest. HDI-Forest does not require additional model training, and directly reuses the trees learned in a standard Random Forest model. By utilizing the special properties of Random Forest, HDI-Forest could efficiently and more directly optimize the PI quality metrics. Extensive experiments on benchmark datasets show that HDI-Forest significantly outperforms previous approaches, reducing the average PI width by over 20% while achieving the same or better coverage probability.

1 Introduction

Let ${D}_{XY}$ be an unknown joint distribution over instances $x\in\mathcal{X}$ and responses $y\in\mathbb{R}$ , where X, Y denote random variables, and x, y are their instantiations. A common goal shared by many predictive tasks is to infer certain properties of the conditional distribution ${{D}_{\left.Y\right|X}}$ . For example, in a standard regression task, we are given a training set of $\left\{\left({{x}_{i}},{{y}_{i}}\right)\right\}_{i=1}^{n}$ sampled i.i.d. from ${D}_{XY}$ and a new test instance $x$ sampled from ${D}_{X}$ , the goal is to predict $E\left(Y\left|X=x\right.\right)$ , namely the conditional mean of Y at $x$ . Although estimation of the mean is highly useful in practice, it conveys no information about the predictive uncertainty, which can be very important for improving the reliability and robustness of the predictions Khosravi et al. (2011b); Pearce et al. (2018).

Prediction intervals (PIs) are one of the most widely used tools for quantifying and representing the uncertainty of predictions. For a specified significance level $\alpha$ , the goal of PI construction is to estimate the $100\left(1-\alpha\right)\%$ interval $\left[l,u\right]\in{{\mathbb{R}}^{2}}$ that will cover no less than $1-\alpha$ of the probability mass of ${{D}_{\left.Y\right|X=x}}$ , namely:

$$P\left(l\leq Y\leq u\mid X=x\right)\geq 1-\alpha. \tag{1}$$

PIs directly express uncertainty by providing lower and upper bounds for each prediction with specified coverage probability, and are more informative and useful for decision making than conditional means alone Pearce et al. (2018); Stine (1985).

A plethora of techniques have been proposed in the literature for the construction of PIs Rosenfeld et al. (2018). However, the majority of existing works consider only the coverage probability criterion (1) and ignore other crucial aspects of the elicited PIs Khosravi et al. (2011b). In particular, there exists a fundamental trade-off between PI width and coverage probability, and (1) can always be trivially fulfilled by a large enough yet useless interval Rosenfeld et al. (2018). This motivates the quality-based PI elicitation principle, which seeks the shortest PI that contains the required amount of probability Pearce et al. (2018). So far, quality-based PI estimation has been applied to many practical problems, such as the prediction of electricity prices Shrivastava et al. (2015), wind speed Lian et al. (2016), and solar energy Galván et al. (2017).

Although some promising results have been shown, existing approaches for quality-based PI construction still have some limitations. Firstly, most of these methods are built upon deep neural networks (DNNs) Khosravi et al. (2011b); Pearce et al. (2018), and yet currently the quality-based PI learning principle has primarily been applied to handle tabular data. Despite the tremendous success of DNNs for various domains such as image and text processing, it is known that for tabular data, tree-based ensembles such as Random Forest Breiman (2001) and Gradient Boosting Decision Tree (GBDT) Friedman (2001) often perform better, and are more widely used in practice Klambauer et al. (2017); Huang et al. (2015); Feng et al. (2018). Secondly, quality-based PI learning objectives are generally non-convex, non-differentiable, and even discontinuous, and are thus difficult to optimize. Although existing methods partially solved this problem by optimizing continuous and differentiable surrogate functions instead Pearce et al. (2018), the overall predictive performance may be improved by resolving such a mismatch between the objective functions used in training and the final evaluation metrics used in testing.

Motivated by the above considerations, in this paper we propose Highest Density Interval Regression Forest (HDI-Forest), a novel quality-based PI estimation method based on Random Forest. HDI-Forest does not require additional model training, and directly reuses the trees learned in a standard Random Forest model. By utilizing the special properties of Random Forest introduced in previous works Lin and Jeon (2006); Meinshausen (2006), HDI-Forest could efficiently and more directly optimize the PI quality evaluation metrics. Experiments on benchmark datasets show that HDI-Forest significantly outperforms previous approaches in terms of PI quality metrics.

The rest of the paper is organized as follows. In Section 2 we review related work on prediction intervals. In Section 3, the mechanism of Random Forest is introduced, along with its interpretation as an approximate nearest neighbor method Lin and Jeon (2006). Using this interpretation, HDI-Forest is introduced in Section 4 as a generalization of Random Forest. Encouraging numerical results for benchmark data sets are presented in Section 5.

2 Related Work

2.1 Quantile-based PI Estimation

For the pair of random variables $\left(X,Y\right)$ , the conditional quantile ${{Q}_{\tau}}\left(x\right)$ is the cut point that satisfies:

$$P\left(Y\leq Q_{\tau}\left(x\right)\mid X=x\right)=\tau. \tag{2}$$

It is easy to verify that the equal-tailed interval $\left[l,u\right]=\left[{{Q}_{\alpha/2}}\left(x\right),{{Q}_{1-\alpha/2}}\left(x\right)\right]$ is a valid solution to (1). Based on this insight, the classic approach for constructing PIs first estimates ${{D}_{\left.Y\right|X=x}}$ , and then takes its quantiles as solutions Rosenfeld et al. (2018). If ${{D}_{\left.Y\right|X=x}}$ is assumed to have some parametric form (e.g., Gaussian), it is often possible to compute the desired quantiles in closed form. However, explicit assumptions about the conditional distribution may be too restrictive for real-world data modeling Sharpe (1970), and considerable research has been devoted to tackling this limitation. For example, Quantile Regression Koenker and Hallock (2001) avoids explicit specification of the conditional distribution, and directly infers the conditional quantiles from data by minimizing an asymmetric variant of the absolute loss. On the other hand, re-sampling-based approaches train a number of models on different re-sampled versions of the training dataset, where commonly used re-sampling techniques include leave-one-out Steinberger and Leeb (2016) and bootstrap Stine (1985), and then use the point forecasts of the trained models to estimate the conditional quantiles. Because multiple models must be trained, a major disadvantage of re-sampling-based methods is their high computational cost on large-scale datasets Rivals and Personnaz (2000); Khosravi et al. (2011b). A notable exception is quantile regression forest Meinshausen (2006), which estimates quantiles by using the sets of local weights generated by Random Forest. Quantile regression forest is based on the alternative interpretation of Random Forest as an approximate nearest neighbor method Lin and Jeon (2006), the details of which will be reviewed in Section 3.1.
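The "asymmetric variant of the absolute loss" used by Quantile Regression is commonly called the pinball loss. The following is a minimal sketch of that loss, not code from the paper; the function name `pinball_loss` and the toy sample are assumptions made for illustration. It checks that, among the sample values, the constant prediction with the lowest loss is an empirical quantile.

```python
def pinball_loss(y_true, y_pred, tau):
    """Average quantile (pinball) loss at level tau in (0, 1)."""
    total = 0.0
    for y, q in zip(y_true, y_pred):
        diff = y - q
        # Under-prediction is weighted by tau, over-prediction by (1 - tau).
        total += tau * diff if diff >= 0 else (tau - 1.0) * diff
    return total / len(y_true)


# Toy sample (hypothetical). Search the sample values for the best constant prediction.
sample = [1.0, 2.0, 3.0, 4.0, 100.0]
best_median = min(sample, key=lambda q: pinball_loss(sample, [q] * len(sample), 0.5))
best_q90 = min(sample, key=lambda q: pinball_loss(sample, [q] * len(sample), 0.9))
```

With `tau = 0.5` the loss is half the mean absolute error, so the minimizer is the sample median; raising `tau` pushes the minimizer towards the upper tail.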

In recent years, motivated by the impressive successes of Deep Neural Network (DNN) models in miscellaneous machine learning tasks, there has been growing interest in enhancing DNN algorithms with uncertainty estimation capabilities. For instance, Mean Variance Estimation (MVE) Khosravi and Nahavandi (2014); Lakshminarayanan et al. (2017) assumes the conditional ${{D}_{\left.Y\right|X=x}}$ to be Gaussian, and jointly predicts its mean and variance via maximum likelihood estimation (MLE), while Gal and Ghahramani (2016) propose using Monte Carlo dropout Srivastava et al. (2014) to estimate predictive uncertainty. We refer the interested reader to Pearce et al. (2018) for an up-to-date overview of related techniques. Despite the differences in learning principles, most existing DNN-based approaches still adopt the traditional strategy of extracting quantile-based equal-tailed intervals.

2.2 Quality-based PI Estimation

As demonstrated in the previous section, existing studies related to PI construction have tended to focus on the intermediate problems of estimating conditional distributions and quantiles, and very little attention has been paid to quantitatively examining the quality of elicited PIs Khosravi et al. (2011b). The coverage criterion (1) could be adopted to evaluate the constructed intervals, but (1) alone is not sufficient for determining a meaningful PI. For example, if the intervals are set to be wide enough (e.g., $\left[-\infty,+\infty\right]$ ), the true response values could always be contained therein. This phenomenon emphasizes a fundamental trade-off between coverage probability and width of the PI Khosravi et al. (2011a), and motivates the following quality-based criterion for determining optimal PIs Pearce et al. (2018):

$$\min_{l,u}\; u-l\quad\text{s.t.}\quad P\left(l\leq Y\leq u\mid X=x\right)\geq 1-\alpha. \tag{3}$$

In other words, given any $0\leq\alpha\leq 1$ , we would like to find the shortest interval that covers the required probability mass. Such intervals are known as the highest density intervals in the statistical literature Box and Tiao (1973). An illustrative example that compares highest density and traditional quantile-based equal-tailed intervals is provided in Fig.1.
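To make the contrast between the two kinds of interval concrete, here is a small sketch on an unweighted, right-skewed sample. It is an illustration only: the function names, the quantile convention (smallest value whose empirical CDF reaches the level) and the sample are assumptions, not taken from the paper.

```python
import math


def equal_tailed_interval(sample, alpha):
    """Empirical [Q_{alpha/2}, Q_{1-alpha/2}] of an unweighted sample."""
    ys = sorted(sample)
    n = len(ys)

    def quantile(tau):
        # Smallest order statistic whose empirical CDF reaches tau
        # (the 1e-9 guards against float round-off in tau * n).
        k = max(1, math.ceil(tau * n - 1e-9))
        return ys[k - 1]

    return quantile(alpha / 2), quantile(1 - alpha / 2)


def shortest_interval(sample, alpha):
    """Narrowest window of sorted values holding at least (1 - alpha) of the points."""
    ys = sorted(sample)
    n = len(ys)
    k = max(1, math.ceil((1 - alpha) * n - 1e-9))  # points the window must contain
    start = min(range(n - k + 1), key=lambda i: ys[i + k - 1] - ys[i])
    return ys[start], ys[start + k - 1]


skewed = [1, 1, 2, 2, 2, 3, 3, 4, 8, 20]  # hypothetical right-skewed responses
equal_tailed = equal_tailed_interval(skewed, 0.2)
shortest = shortest_interval(skewed, 0.2)
```

On this sample the equal-tailed interval has to stretch into the long right tail, whereas the shortest interval stays in the dense region and is less than half as wide while still containing 80% of the points.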

So far, a number of methods have been proposed to predict quality-based PIs. The key idea shared by these approaches is to infer model parameters by minimizing a loss function based on (3). The loss function typically consists of two parts, which respectively measure the mean PI width (MPIW) and PI coverage probability (PICP) on the training set Pearce et al. (2018). For example, Lower Upper Bound Estimation (LUBE) Khosravi et al. (2011b) trains neural networks by optimizing

$$\frac{\text{MPIW}}{r}\left(1+\exp\left(\lambda\max\left(0,\left(1-\alpha\right)-\text{PICP}\right)\right)\right), \tag{4}$$

where $r$ is the numerical range of the response variable and $\lambda$ controls the trade-off between PICP and MPIW. A limitation of the LUBE loss (4) is that it is non-differentiable and hard to optimize. Quality-driven Neural Network Ensemble (QD-Ens) Pearce et al. (2018) instead minimizes

$$\text{MPIW}_{\text{capt}}+\lambda\frac{n}{\alpha\left(1-\alpha\right)}\max\left(0,\left(1-\alpha\right)-\text{PICP}\right)^{2}, \tag{5}$$

where n is the training set size, $\text{MPI}{{\text{W}}_{\text{capt}}}$ denotes MPIW of points that fall into the predicted intervals. Experiments show that QD-Ens significantly outperforms previous neural-network-based methods for eliciting quality-based PIs. On the other hand, Rosenfeld et al. (2018) propose IntPred for quality-based PI construction in the batch learning setting, where PIs for a set of test points are constructed simultaneously. To deal with the non-differentiability of PI quality metrics, both QD-Ens and IntPred instead adopt proxy losses that can be minimized efficiently using standard optimization techniques.
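As a rough illustration of how the two losses (4) and (5) behave, the sketch below evaluates them for already-computed values of MPIW and PICP. The function names and the numbers passed in are hypothetical; in the original methods PICP is computed from network outputs, and QD-Ens replaces the hard coverage count with a smooth approximation during training, which is not shown here.

```python
import math


def lube_loss(mpiw, picp, r, alpha, lam):
    """Eq. (4): range-normalised width multiplied by a coverage penalty factor."""
    shortfall = max(0.0, (1.0 - alpha) - picp)
    return (mpiw / r) * (1.0 + math.exp(lam * shortfall))


def qd_loss(mpiw_capt, picp, n, alpha, lam):
    """Eq. (5): captured width plus a squared penalty on the coverage shortfall."""
    shortfall = max(0.0, (1.0 - alpha) - picp)
    return mpiw_capt + lam * n / (alpha * (1.0 - alpha)) * shortfall ** 2


# Hypothetical values: a model that meets the 95% target and one that falls short.
lube_ok = lube_loss(mpiw=2.0, picp=1.0, r=4.0, alpha=0.05, lam=10.0)
lube_short = lube_loss(mpiw=2.0, picp=0.85, r=4.0, alpha=0.05, lam=10.0)
qd_ok = qd_loss(mpiw_capt=1.5, picp=1.0, n=100, alpha=0.05, lam=15.0)
qd_short = qd_loss(mpiw_capt=1.5, picp=0.85, n=100, alpha=0.05, lam=15.0)
```

Both losses depend on the data only through the width and the coverage shortfall, so once the target coverage is met the penalty term is constant and only the width matters.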

The proposed HDI-Forest model differs from existing quality-based PI estimation works in several respects. Firstly, existing methods mainly consider linear or DNN-based predictive functions, while HDI-Forest is built upon tree ensembles; secondly, HDI-Forest does not require local search heuristics or smooth/convex loss relaxation techniques, and can efficiently obtain the globally optimal solution of the non-differentiable objective function; finally, by exploiting the special property of Random Forest that will be discussed in the next section, HDI-Forest does not require model re-training for different trade-offs between PICP and MPIW.

3 Random Forest

A Random Forest is a predictor consisting of a collection of m randomized regression trees. Following the notations in Breiman (2001); Meinshausen (2006), each tree $T\left(\theta\right)$ in this collection is constructed based on a random parameter vector $\theta$ . In practice, $\theta$ could control various aspects of the tree growing process, such as the re-sampling of the input training set and the successive selections of variables for tree splitting. Once learned, the L leaves of $T\left(\theta\right)$ partition the input feature space $\mathcal{X}$ into L non-overlapping axis-parallel subspaces $\left\{{{\mathcal{X}}_{1}},{{\mathcal{X}}_{2}},\cdots,{{\mathcal{X}}_{L}}\right\}$ , so that for any $x\in\mathcal{X}$ there exists one and only one leaf $l\left(x,\theta\right)$ such that $x\in{{\mathcal{X}}_{l\left(x,\theta\right)}}$ . The prediction of $T\left(\theta\right)$ for x is given by averaging over the observations that fall into ${{\mathcal{X}}_{l\left(x,\theta\right)}}$ . Concretely, let $w\left({{x}_{i}},x,\theta\right),1\leq i\leq n$ be defined as:

$$w\left(x_{i},x,\theta\right)=\frac{\mathbb{I}\left(x_{i}\in\mathcal{X}_{l\left(x,\theta\right)}\right)}{\left|\left\{j:x_{j}\in\mathcal{X}_{l\left(x,\theta\right)}\right\}\right|}, \tag{6}$$

where $\mathbb{I}\left(\cdot\right)$ is the indicator function and $\left|\cdot\right|$ denotes the cardinality of a set, then

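$$\hat{y}_{\text{single-tree}}\left(x,\theta\right)=\sum_{i=1}^{n}w\left(x_{i},x,\theta\right)y_{i}. \tag{7}$$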

In Random Forest regression, the conditional mean of Y given $X=x$ is predicted as the average of predictions of m trees constructed with i.i.d. parameters ${{\theta}_{i}}$ , $1\leq i\leq m$ :

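$$\hat{y}\left(x\right)=\frac{1}{m}\sum_{i=1}^{m}\hat{y}_{\text{single-tree}}\left(x,\theta_{i}\right). \tag{8}$$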

3.1 Random Forest for Conditional Distribution Estimation

Although the original formulation of Random Forest only predicts the conditional mean, the learned trees could also be exploited to predict other interesting quantities Meinshausen (2006); Li and Martin (2017); Feng and Zhou (2018). For example, note that (8) can be rearranged as

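$$\hat{y}\left(x\right)=\sum_{i=1}^{n}w\left(x_{i},x\right)y_{i}, \tag{9}$$

where

$$w\left(x_{i},x\right)=\frac{1}{m}\sum_{j=1}^{m}w\left(x_{i},x,\theta_{j}\right). \tag{10}$$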
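The per-tree and forest-level weights can be sketched directly from leaf assignments. In the sketch below, the leaf indices are made-up inputs (in practice they would come from a fitted forest), the function names are hypothetical, and every leaf reached by the query is assumed to contain at least one training point.

```python
def tree_weights(train_leaves, query_leaf):
    """Per-tree weights: equal mass on training points sharing the query's leaf."""
    in_leaf = [leaf == query_leaf for leaf in train_leaves]
    size = sum(in_leaf)  # assumed to be at least 1
    return [1.0 / size if hit else 0.0 for hit in in_leaf]


def forest_weights(train_leaves_per_tree, query_leaf_per_tree):
    """Forest weights: the per-tree weights averaged over all m trees."""
    m = len(train_leaves_per_tree)
    n = len(train_leaves_per_tree[0])
    weights = [0.0] * n
    for train_leaves, query_leaf in zip(train_leaves_per_tree, query_leaf_per_tree):
        for i, w_i in enumerate(tree_weights(train_leaves, query_leaf)):
            weights[i] += w_i / m
    return weights


# Hypothetical leaf assignments: 2 trees, 4 training points.
train_leaves_per_tree = [[0, 0, 1, 1], [0, 1, 1, 1]]
query_leaf_per_tree = [0, 1]  # leaf reached by the query x in each tree
y = [1.0, 2.0, 3.0, 4.0]

w = forest_weights(train_leaves_per_tree, query_leaf_per_tree)
y_hat = sum(w_i * y_i for w_i, y_i in zip(w, y))  # weighted-average prediction
```

Here the first tree predicts 1.5 and the second predicts 3.0, so averaging the trees gives 2.25, the same value as the weighted average over training responses; the weights also sum to one.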

Therefore, (8) could be alternatively interpreted as the weighted average of the response values of all training instances, and the weight for a specific instance ${{x}_{i}}$ measures the frequency that ${{x}_{i}}$ and x are partitioned into the same leaf in all grown trees, which offers an intuitive measure of similarity between them. Theoretically, it can also be shown that ${{x}_{i}}$ tends to be weighted higher if the conditional distributions ${{D}_{\left.Y\right|X=x}}$ and ${{D}_{\left.Y\right|X={{x}_{i}}}}$ are similar Lin and Jeon (2006). Furthermore, note that the conditional cumulative distribution function of $Y$ given $X=x$ can be written as:

$$F\left(y\mid X=x\right)=E\left(\mathbb{I}\left(Y\leq y\right)\mid X=x\right). \tag{11}$$

It could be proven that under certain conditions, (11) can be estimated using the weights from (10) as Meinshausen (2006):

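$$\hat{F}\left(y\mid X=x\right)=\sum_{i=1}^{n}w\left(x_{i},x\right)\mathbb{I}\left(y_{i}\leq y\right). \tag{12}$$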

It has been demonstrated in Meinshausen (2006) that (12) can be exploited to accurately estimate conditional quantiles. In this work, we instead utilize it to perform quality-based PI estimation, as detailed in the next section.
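A minimal sketch of the weighted empirical CDF, and of the quantile read-off that a quantile regression forest would use, is given below. The weights are hypothetical stand-ins for the forest weights of a single query point, the function names are illustrative, and the quantile convention (smallest response whose estimated CDF reaches the level) is an assumption.

```python
def weighted_cdf(y_train, weights, y):
    """Estimated conditional CDF: total weight of training responses not exceeding y."""
    return sum(w for y_i, w in zip(y_train, weights) if y_i <= y)


def weighted_quantile(y_train, weights, tau):
    """Smallest training response whose estimated CDF reaches tau."""
    acc = 0.0
    for y_i, w in sorted(zip(y_train, weights)):
        acc += w
        if acc >= tau - 1e-12:  # small slack for float round-off
            return y_i
    return max(y_train)


# Hypothetical responses and forest weights for one query point.
y_train = [1.0, 2.0, 3.0, 4.0]
weights = [0.1, 0.4, 0.3, 0.2]

cdf_at_2_5 = weighted_cdf(y_train, weights, 2.5)
equal_tailed = (weighted_quantile(y_train, weights, 0.1),
                weighted_quantile(y_train, weights, 0.9))
```

The pair of quantiles gives the equal-tailed interval; the quality-based approach of the next section uses the same weights but searches for the narrowest interval instead.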

4 Highest Density Interval Regression Forest

In this section, we describe the proposed HDI-Forest algorithm. Concretely, we first use the standard Random Forest algorithm to infer a number of trees from the data, then based on (12), for any observation x, the probability that the associated response value would fall into interval $[l,u]$ can be estimated as:

$$\hat{P}\left(l\leq Y\leq u\mid X=x\right)=\sum_{i=1}^{n}w\left(x_{i},x\right)\mathbb{I}\left(l\leq y_{i}\leq u\right). \tag{13}$$

Using (13), the quality-based criterion (3) can be approximated as the following optimization problem:

$$\min_{l,u}\; u-l\quad\text{s.t.}\quad\hat{P}\left(l\leq Y\leq u\mid X=x\right)\geq 1-\alpha. \tag{14}$$

Note that the optimization problem in (14) is non-convex since its constraint function is piece-wise constant and discontinuous. However, its global optimal solution can still be efficiently obtained by exploiting the problem structure, as detailed below.

Firstly, we present Theorem 1, which shows that the optimal solution of (14) must exist in a pre-defined finite set:

Theorem 1.

The optimal solution of (14) satisfies the following conditions:

$$u,l\in\left\{y_{i}\right\}_{i=1}^{n}. \tag{15}$$

Proof.

Assume by contradiction that a pair of $[l,u]$ optimizes (14) and does not satisfy (15), then

$$\hat{P}\left(l\leq Y\leq u\mid X=x\right)\geq 1-\alpha. \tag{16}$$

Let ${{l}_{\text{alt}}}$ and ${{u}_{\text{alt}}}$ be defined as

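$$l_{\text{alt}}=\min_{i}\left\{y_{i}\mid y_{i}\geq l,\;1\leq i\leq n\right\}, \tag{17}$$

$$u_{\text{alt}}=\max_{i}\left\{y_{i}\mid y_{i}\leq u,\;1\leq i\leq n\right\}. \tag{18}$$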

Recall that $[l,u]$ does not satisfy (15), thus either ${{u}_{\text{alt}}}\neq u$ or ${{l}_{\text{alt}}}\neq l$ , and

$$u_{\text{alt}}-l_{\text{alt}}<u-l. \tag{19}$$

Meanwhile, by combining (16), (17), and (18), we have

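$$\begin{aligned}\hat{P}\left(l_{\text{alt}}\leq Y\leq u_{\text{alt}}\mid X=x\right)&=\sum_{i=1}^{n}w\left(x_{i},x\right)\mathbb{I}\left(l_{\text{alt}}\leq y_{i}\leq u_{\text{alt}}\right)\\&=\sum_{i=1}^{n}w\left(x_{i},x\right)\mathbb{I}\left(l\leq y_{i}\leq u\right)\\&=\hat{P}\left(l\leq Y\leq u\mid X=x\right)\\&\geq 1-\alpha.\end{aligned} \tag{20}$$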

Equations (19) and (20) mean that $[{{l}_{\text{alt}}},{{u}_{\text{alt}}}]$ is a feasible and better solution than $[l,u]$ , which contradicts the assumption that $[l,u]$ optimizes (14). ∎
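The argument of the proof can be checked numerically on a toy case: shrinking an arbitrary interval to the nearest training responses inside it leaves the estimated coverage unchanged while reducing the width. The data, weights and function names below are hypothetical.

```python
def coverage(y_train, weights, l, u):
    """Estimated probability mass of [l, u]: total weight of responses inside it."""
    return sum(w for y_i, w in zip(y_train, weights) if l <= y_i <= u)


def snap_to_responses(y_train, l, u):
    """Move each end point inwards to the nearest training response."""
    l_alt = min(y_i for y_i in y_train if y_i >= l)
    u_alt = max(y_i for y_i in y_train if y_i <= u)
    return l_alt, u_alt


y_train = [1.0, 2.0, 3.0, 4.0]
weights = [0.1, 0.4, 0.3, 0.2]

l, u = 1.5, 3.7  # end points that are not training responses
l_alt, u_alt = snap_to_responses(y_train, l, u)
same_coverage = coverage(y_train, weights, l, u) == coverage(y_train, weights, l_alt, u_alt)
```

Both intervals contain exactly the responses 2.0 and 3.0, so their estimated coverage is identical, but the snapped interval has width 1.0 instead of 2.2.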

Let the unique elements of $\left\{{{y}_{i}}\right\}_{i=1}^{n}$ be arranged in increasing order as ${{\widetilde{y}}_{1}}<{{\widetilde{y}}_{2}}<\cdots<{{\widetilde{y}}_{\widetilde{n}}}$ . Then based on Theorem 1, (14) can be equivalently reformulated as

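$$\min_{i,j}\;\widetilde{y}_{j}-\widetilde{y}_{i}\quad\text{s.t.}\quad\sum_{k=i}^{j}w_{k}\geq 1-\alpha, \tag{21}$$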

where

$$w_{k}=\sum_{i=1}^{n}\mathbb{I}\left(y_{i}=\widetilde{y}_{k}\right)w\left(x_{i},x\right). \tag{22}$$

Problem (21) can then be solved simply by enumerating and evaluating all pairs of elements from $\left\{{{\widetilde{y}}_{i}}\right\}_{i=1}^{\widetilde{n}}$ , which is nevertheless costly and takes $O\left({{\widetilde{n}}^{2}}\right)$ time per prediction. Fortunately, we can reduce the time complexity by rearranging the computations, so that the running time is only linear in $\widetilde{n}$ , i.e., $O\left(\widetilde{n}\right)$ . The method is described below.
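Before turning to the faster method, the exhaustive enumeration can be sketched as a reference implementation. This is an illustration under assumptions: the function name, the toy data and the small tolerance used in the feasibility test are not from the paper.

```python
def hdi_brute_force(y_train, weights, alpha):
    """Check every pair of distinct response values and keep the narrowest feasible one."""
    # Merge the weights of tied responses so that each distinct value appears once.
    merged = {}
    for y_i, w in zip(y_train, weights):
        merged[y_i] = merged.get(y_i, 0.0) + w
    ys = sorted(merged)
    ws = [merged[v] for v in ys]

    target = 1.0 - alpha - 1e-12  # small slack for float round-off
    best = None
    for i in range(len(ys)):
        mass = 0.0
        for j in range(i, len(ys)):
            mass += ws[j]  # total weight of ys[i..j]
            if mass >= target and (best is None or ys[j] - ys[i] < best[1] - best[0]):
                best = (ys[i], ys[j])
    return best  # None if no pair reaches the target coverage


# Hypothetical responses (with a tie at 2.0) and forest weights for one query.
interval = hdi_brute_force([1.0, 2.0, 2.0, 3.0, 10.0], [0.1, 0.3, 0.2, 0.3, 0.1], alpha=0.2)
```

The double loop visits every pair once, which matches the quadratic cost mentioned above and makes this version useful mainly as a correctness check.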

Firstly, (21) can be optimized using a two-stage approach instead: we start by solving the following optimization problem for each $1\leq i\leq\widetilde{n}$ :

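$$\min_{j}\;\widetilde{y}_{j}-\widetilde{y}_{i}\quad\text{s.t.}\quad\sum_{k=i}^{j}w_{k}\geq 1-\alpha. \tag{23}$$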

Then, let $\mathcal{I}\subseteq\left\{\left.i\right|i=1,2,\cdots,\widetilde{n}\right\}$ be the set of indices for which (23) has a feasible solution, and let the optimal solution of (23) for $i\in\mathcal{I}$ be denoted as ${{j}_{{\text{opt}}}}\left(i\right)$ . It is easy to verify that (21) is optimized by the pair $\left(i,{{j}_{\text{opt}}}\left(i\right)\right)$ that attains the smallest ${{\widetilde{y}}_{{{j}_{\text{opt}}}\left(i\right)}}-{{\widetilde{y}}_{i}}$ .

To compute ${{j}_{{\text{opt}}}}\left(i\right)$ for $i\in\mathcal{I}$ , we exploit the strict monotonicity of $\left\{{{\widetilde{y}}_{i}}\right\}_{i=1}^{\widetilde{n}}$ , and equivalently reformulate (23) as

$$\min_{j}\; j\quad\text{s.t.}\quad\sum_{k=i}^{j}w_{k}\geq 1-\alpha. \tag{24}$$

In other words, ${{j}_{{\text{opt}}}}\left(i\right)$ is simply the smallest index for which the constraint in (23) holds. Moreover, ${{j}_{{\text{opt}}}}\left(i\right)$ is monotonically non-decreasing with respect to $i$ :

Theorem 2.

For any $1\leq{{i}_{2}}<{{i}_{1}}\leq\widetilde{n}$ , we have

$$j_{\text{opt}}\left(i_{2}\right)\leq j_{\text{opt}}\left(i_{1}\right). \tag{25}$$

Proof.

Assume by contradiction that there exist ${{i}_{1}}$ and ${{i}_{2}}$ such that ${{j}_{\text{opt}}}\left({{i}_{2}}\right)>{{j}_{\text{opt}}}\left({{i}_{1}}\right)$ and ${{i}_{1}}>{{i}_{2}}$ , recall from the definitions in (6), (10) and (22) that ${{w}_{k}}\geq 0,1\leq k\leq\widetilde{n}$ , therefore

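$$\sum_{k=i_{2}}^{j_{\text{opt}}\left(i_{1}\right)}w_{k}\geq\sum_{k=i_{1}}^{j_{\text{opt}}\left(i_{1}\right)}w_{k}\geq 1-\alpha. \tag{26}$$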

On the other hand, based on the strict monotonicity of $\left\{{{\widetilde{y}}_{i}}\right\}_{i=1}^{\widetilde{n}}$ we have

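$$\widetilde{y}_{j_{\text{opt}}\left(i_{1}\right)}-\widetilde{y}_{i_{2}}<\widetilde{y}_{j_{\text{opt}}\left(i_{2}\right)}-\widetilde{y}_{i_{2}}. \tag{27}$$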

Equations (26) and (27) contradict the optimality of ${{{{j}_{{\text{opt}}}}\left({{i}_{2}}\right)}}$ and thereby we complete the proof. ∎

Based on the above analysis, in order to solve (23) for all $1\leq i\leq\widetilde{n}$ , we only need to walk down the sorted list of $\left\{{{\widetilde{y}}_{i}}\right\}_{i=1}^{\widetilde{n}}$ once to identify for each i the first index j such that $\sum\limits_{k=i}^{j}{{{w}_{k}}}\geq 1-\alpha$ . The whole algorithm is presented in Algorithm 1.
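The single pass over the sorted responses can be sketched as a two-pointer scan. This is a minimal illustration of the idea rather than a transcription of Algorithm 1: the function name, the tolerance and the toy inputs are assumptions, and the weights are taken as given (for example, forest weights for one query point).

```python
def hdi_interval(y_train, weights, alpha):
    """Narrowest [l, u] with estimated coverage >= 1 - alpha, found in one forward pass."""
    # Merge the weights of tied responses so that each distinct value appears once.
    merged = {}
    for y_i, w in zip(y_train, weights):
        merged[y_i] = merged.get(y_i, 0.0) + w
    ys = sorted(merged)
    ws = [merged[v] for v in ys]

    target = 1.0 - alpha - 1e-12  # small slack for float round-off
    best = None
    j = -1      # right end of the window; it never moves backwards
    mass = 0.0  # total weight of ys[i..j]
    for i in range(len(ys)):
        if j < i:
            mass = 0.0  # window is empty; clear any accumulated round-off
        # Advance j to the first index at which the window [i, j] has enough mass.
        while (j < i or mass < target) and j + 1 < len(ys):
            j += 1
            mass += ws[j]
        if mass < target:
            break  # infeasible for this i, hence for every larger i as well
        if best is None or ys[j] - ys[i] < best[1] - best[0]:
            best = (ys[i], ys[j])
        mass -= ws[i]  # drop the left end before moving on to i + 1
    return best


# Hypothetical responses (with a tie at 2.0) and forest weights for one query.
interval = hdi_interval([1.0, 2.0, 2.0, 3.0, 10.0], [0.1, 0.3, 0.2, 0.3, 0.1], alpha=0.2)
```

Because the right pointer only moves forward, the scan itself touches each distinct response at most twice; sorting the responses is a separate cost that, for a fixed training set, can be paid once rather than per prediction. Changing `alpha` only changes the target mass, so no forest re-training is involved.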

5 Experiments

5.1 Experimental Settings

5.1.1 Baseline Methods

Based on the survey of related works in Section 2, we adopted two types of baseline methods for comparison, including quantile-based methods and quality-based methods. Quantile-based methods include Quantile Regression Forest (QRF) Meinshausen (2006) (https://cran.r-project.org/web/packages/quantregForest/index.html), Quantile Regression (QR) Koenker and Hallock (2001) (https://cran.r-project.org/web/packages/quantreg/index.html), and Gradient Boosting Decision Tree with Quantile Loss ( $\text{QR}_{\text{GBDT}}$ ) implemented in the Scikit-learn package Pedregosa et al. (2011). On the other hand, quality-based PI methods include IntPred Rosenfeld et al. (2018) and Quality-Driven Ensemble (QD-Ens) Pearce et al. (2018) (https://github.com/TeaPearce/), the state-of-the-art approach for neural-network-based PI elicitation.

5.1.2 Benchmark Datasets

We compare various methods on 11 datasets from the UCI repository (http://archive.ics.uci.edu/ml/index.php). Statistics of these datasets are presented in Table 1. Each dataset is split into train and test sets according to an 80%-20% scheme, and we report the average performance over 10 random data splits. The hyper-parameters of all tested methods were tuned via 5-fold cross-validation on the training set.

5.1.3 Evaluation Metrics

Following previous works Khosravi et al. (2011b); Pearce et al. (2018); Rosenfeld et al. (2018), PICP and MPIW mentioned in Section 2 were adopted as the evaluation metrics.
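The two metrics can be computed from a set of test responses and their predicted intervals as sketched below. The definitions are the usual ones (fraction of covered points and average width); the function names and the toy numbers are hypothetical, and any rescaling of the responses applied before computing the values in Table 2 is not shown.

```python
def picp(y_true, lower, upper):
    """PI coverage probability: fraction of responses inside their interval."""
    hits = sum(1 for y, lo, hi in zip(y_true, lower, upper) if lo <= y <= hi)
    return hits / len(y_true)


def mpiw(lower, upper):
    """Mean PI width: average of (upper - lower) over all intervals."""
    return sum(hi - lo for lo, hi in zip(lower, upper)) / len(lower)


# Hypothetical test responses and predicted intervals.
y_test = [1.0, 2.5, 4.0, 7.0]
lower = [0.0, 2.0, 4.5, 6.0]
upper = [2.0, 3.0, 5.5, 8.0]

coverage_rate = picp(y_test, lower, upper)
mean_width = mpiw(lower, upper)
```

In this toy case three of the four responses are covered, giving a PICP of 0.75 with an MPIW of 1.5.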

5.2 Performance Comparison

As mentioned earlier, there exists a trade-off between the coverage probability (measured by the PICP metric) and width (measured by the MPIW metric) of extracted PIs. To facilitate the comparison of various methods, we first evaluate their performance when they achieve roughly the same level of PICP. The results are presented in Table 2. As can be seen, HDI-Forest significantly outperforms all other baselines for all but one dataset. Compared with the best-performing baseline method (QD-Ens), HDI-Forest lowers the average MPIW from 1.22 to 0.91, which means that QD-Ens intervals are about 34% wider on average, while HDI-Forest also achieves slightly better coverage probability (average PICP of 0.95 versus 0.94).
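The average MPIW values in the last row of Table 2 can be compared in two ways, depending on which method is used as the reference. The short computation below uses only the two averages reported in Table 2; the variable names are illustrative.

```python
hdi_forest_mpiw = 0.91  # Table 2, average MPIW of HDI-Forest
qd_ens_mpiw = 1.22      # Table 2, average MPIW of QD-Ens

# Width saved by HDI-Forest, relative to the QD-Ens width.
reduction_vs_qd_ens = (qd_ens_mpiw - hdi_forest_mpiw) / qd_ens_mpiw

# Extra width of QD-Ens, relative to the HDI-Forest width.
excess_vs_hdi_forest = (qd_ens_mpiw - hdi_forest_mpiw) / hdi_forest_mpiw

print(f"{reduction_vs_qd_ens:.1%}  {excess_vs_hdi_forest:.1%}")
```

Measured against QD-Ens, the reduction is roughly a quarter, in line with the "over 20%" stated in the abstract; measured against HDI-Forest, the QD-Ens intervals are roughly a third wider.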

To further compare the performance of HDI-Forest against QD-Ens and QRF, the two top-performing baselines, we examine the MPIW scores of three methods for a range of PICP values. As shown in Fig.2, HDI-Forest still achieves the best performance among all models.

6 Conclusion

In this paper, we propose HDI-Forest, a novel algorithm for quality-based PI estimation. Extensive experiments on benchmark datasets show that HDI-Forest significantly outperforms previous approaches.

For future work, we plan to extend HDI-Forest to the batch learning setting, where the overall performance on a group of test instances can be further improved by adjusting the per-instance coverage probability constraints Rosenfeld et al. (2018). In addition, HDI-Forest is based on the original Random Forest model, which is mainly suitable for standard regression/classification tasks, whereas a large number of Random-Forest-based approaches have been proposed in the literature to handle other types of problems Sathe and Aggarwal (2017); Barbieri et al. (2016). It would also be interesting to study quality-based PI estimation for these models.

Bibliography

1. Barbieri et al. [2016] Nicola Barbieri, Fabrizio Silvestri, and Mounia Lalmas. Improving post-click user engagement on native ads via survival analysis. In WWW, pages 761–770, 2016.
2. Box and Tiao [1973] George EP Box and George C Tiao. Bayesian inference in statistical analysis. Wiley, 1973.
3. Breiman [2001] Leo Breiman. Random forests. Machine learning, 45(1):5–32, 2001.
4. Feng and Zhou [2018] Ji Feng and Zhi-Hua Zhou. Autoencoder by forest. In AAAI, 2018.
5. Feng et al. [2018] Ji Feng, Yang Yu, and Zhi-Hua Zhou. Multi-layered gradient boosting decision trees. In NIPS, pages 3555–3565, 2018.
6. Friedman [2001] Jerome H Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, pages 1189–1232, 2001.
7. Gal and Ghahramani [2016] Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In ICML, pages 1050–1059, 2016.
8. Galván et al. [2017] Inés M Galván, José M Valls, Alejandro Cervantes, and Ricardo Aler. Multi-objective evolutionary optimization of prediction intervals for solar energy forecasting with neural networks. Information Sciences, 418:363–382, 2017.
