Tight Mixed-Integer Optimization Formulations for Prescriptive Trees

Max Biggs; Georgia Perakis

arXiv:2302.14744 · math.OC · May 20, 2025


TL;DR

This paper develops tighter mixed-integer optimization formulations for modeling decision trees and ensembles, improving computational efficiency and solution quality in prescriptive analytics.

Contribution

It introduces a novel polyhedral formulation for single trees and enhances existing binary vector formulations for ensembles, reducing fractional solutions and solving times.

Findings

01. Tighter formulations remove fractional extreme points.
02. Formulations based on binary vectors perform well computationally.
03. Significant improvements in solving time to optimality.

Abstract

We focus on modeling the relationship between an input feature vector and the predicted outcome of a trained decision tree using mixed-integer optimization. This can be used in many practical applications where a decision tree or tree ensemble is incorporated into an optimization problem to model the predicted outcomes of a decision. We propose tighter mixed-integer optimization formulations than those previously introduced. Existing formulations can be shown to have linear relaxations that have fractional extreme points, even for the simple case of modeling a single decision tree. A formulation we propose, based on a projected union of polyhedra approach, is ideal for a single decision tree. While the formulation is generally not ideal for tree ensembles or if additional constraints are added, it generally has fewer extreme points, leading to a faster time to solve, particularly if the…

Tables

Table 1: Problem sizes for instances with 5 features

# trees	method	constraints	binary variables	nonzeros
1	projected	11	2766	27709
	misic	8276	5521	54917
	bigM	16560	8287	41398
	elbow	8865	5521	61927
2	projected	22	5627	56857
	misic	16873	11242	112953
	bigM	33720	16869	84296
	elbow	18038	11242	128593
4	projected	44	11312	114060
	misic	34003	22610	225842
	bigM	67818	33922	169537
	elbow	36404	22610	254902
8	projected	88	22832	227507
	misic	68964	45646	453909
	bigM	136914	68478	342269
	elbow	73692	45646	520911
16	projected	176	45206	455015
	misic	137322	90386	911007
	bigM	271110	135592	677743
	elbow	146789	90386	1032816
32	projected	352	91990	924083
	misic	282939	183938	1847111
	bigM	551718	275928	1379231
	elbow	302335	183938	2097640

Table 2: Time taken to solve to optimality

	truncated mean (s)				% greater than 1800s
trees	projected	misic	bigM	elbow	projected	misic	bigM	elbow
1	0.47	0.98	1.00	0.75	0	0	0	0
2	0.92	2.09	1.96	1.67	0	0	0	0
4	2.16	6.83	6.15	5.82	0	0	0	0
8	8.50	49.14	56.16	36.82	0	0	0	0
16	103.30	1111.25	628.49	914.28	0	0.42	0.14	0.38
32	983.29	1552.09	1477.53	1363.65	0.32	0.76	0.66	0.7
geometric mean	9.67	32.52	29.27	26.35

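Equations

Leaf of a decision tree:

$$\mathcal{L}_{l}=\{\bm{w},y \mid w_{i}\leq u_{li}~\forall i\in[d],\quad w_{i}\geq b_{li}~\forall i\in[d],\quad y=s_{l}\}$$

Valid formulation (Definition 2.1):

$$gr(f;D)=Proj_{\bm{w},y}\left(Q\cap\mathbb{R}^{d+1+n}\times\{0,1\}^{m}\right)$$

Ideal formulation (Definition 2.2):

$$\text{ext}(Q)\subseteq\mathbb{R}^{d+1+n}\times\{0,1\}^{m}$$

Binary split variables of Mišić (2020):

$$x_{ij}=\begin{cases}1 & \text{if } w_{i}\leq\theta_{ij}\\ 0 & \text{if } w_{i}>\theta_{ij}\end{cases}$$

Formulation of Mišić (2020):

$$
\begin{aligned}
Q^{misic}=\Big\{\bm{x},y,\bm{z} \;\Big|\; & \sum_{l\in\textbf{left}(s)}z_{l}\leq x_{V(s)C(s)} && \forall s\in\textbf{splits}(t) && \text{(2a)}\\
& \sum_{l\in\textbf{right}(s)}z_{l}\leq 1-x_{V(s)C(s)} && \forall s\in\textbf{splits}(t) && \text{(2b)}\\
& x_{ij}\leq x_{ij+1} && \forall i\in[d],~\forall j\in[K_{i}] && \text{(2c)}\\
& \sum_{l=1}^{p}z_{l}=1,\quad y=\sum_{l=1}^{p}s_{l}z_{l}\\
& \bm{x}\in[0,1]^{K_{i}}~\forall i\in[d],\quad \bm{z}\geq 0\Big\}
\end{aligned}
$$

Linear relaxation in Example 3.1:

$$
\begin{aligned}
\Big\{\bm{x},\bm{z} \;\Big|\; & z_{2}\leq 1-x_{2},\quad z_{1}\leq x_{2},\\
& z_{1}+z_{2}\leq x_{1},\quad z_{3}\leq 1-x_{1},\\
& x_{2}\leq x_{1},\quad z_{1}+z_{2}+z_{3}=1,\\
& \bm{x}\in[0,1]^{2},\quad \bm{z}\geq 0\Big\}
\end{aligned}
$$

Big-M relaxation in Example 3.2 (fragment):

$$\{\bm{x},w \mid \ldots,\quad w+15(1-x_{12})\geq 5,\quad \ldots\}$$

Extended union of polyhedra formulation:

$$
\begin{aligned}
Q^{ext}=\Big\{\bm{w},y,\bm{w}^{l},y^{l},\bm{z} \;\Big|\; & u_{li}z_{l}\geq w_{i}^{l} && \forall i\in[d],~\forall l\in[p] && \text{(3a)}\\
& b_{li}z_{l}\leq w_{i}^{l} && \forall i\in[d],~\forall l\in[p] && \text{(3b)}\\
& y^{l}=s_{l}z_{l} && \forall l\in[p] && \text{(3c)}\\
& \sum_{l=1}^{p}z_{l}=1 && && \text{(3d)}\\
& w_{i}=\sum_{l=1}^{p}w_{i}^{l} && \forall i\in[d] && \text{(3e)}\\
& y=\sum_{l=1}^{p}y^{l} && && \text{(3f)}\\
& z_{l}\in[0,1] && \forall l\in[p]\Big\}
\end{aligned}
$$

Projected formulation (4):

$$
\begin{aligned}
Q^{proj}=\Big\{\bm{w},y,\bm{z} \;\Big|\; & \sum_{l=1}^{p}u_{li}z_{l}\geq w_{i} && \forall i\in[d]\\
& \sum_{l=1}^{p}b_{li}z_{l}\leq w_{i} && \forall i\in[d]\\
& y=\sum_{l=1}^{p}z_{l}s_{l}\\
& \sum_{l=1}^{p}z_{l}=1\\
& z_{l}\in[0,1] && \forall l\in[p]\Big\}
\end{aligned}
$$

Reformulation (5) with $z_{p}$ eliminated:

$$
\begin{aligned}
Q^{facet}=\Big\{\bm{w},y,\bm{z} \;\Big|\; & u_{pi}+\sum_{l=1}^{p-1}(u_{li}-u_{pi})z_{l}\geq w_{i} && \forall i\in[d] && \text{(5a)}\\
& b_{pi}+\sum_{l=1}^{p-1}(b_{li}-b_{pi})z_{l}\leq w_{i} && \forall i\in[d] && \text{(5b)}\\
& y=s_{p}+\sum_{l=1}^{p-1}z_{l}(s_{l}-s_{p})\\
& z_{l}\in[0,1] && \forall l\in[p-1]\Big\}
\end{aligned}
$$

Sharp formulation (Definition 4.4):

$$\text{conv}(gr(f;D))=Proj_{\bm{w},y}(Q)$$

Trees in Example 4.5:

$$f^{(1)}(w)=\begin{cases}1 & 0\leq w\leq 1\\ 4 & 1<w\leq 3\end{cases}\qquad f^{(2)}(w)=\begin{cases}2 & 0\leq w\leq 2\\ 3 & 2<w\leq 3\end{cases}$$

Ensemble in Example 4.5:

$$0.5\left(f^{(1)}(w)+f^{(2)}(w)\right)=\begin{cases}1.5 & 0\leq w\leq 1\\ 3 & 1<w\leq 2\\ 3.5 & 2<w\leq 3\end{cases}$$

Union of polyhedra relaxation in Example 4.5:

$$
\begin{aligned}
\Big\{w,y,\bm{z} \;\Big|\; & z_{2}^{(1)}\leq w,\quad z_{1}^{(1)}+3z_{2}^{(1)}\geq w,\\
& 2z_{2}^{(2)}\leq w,\quad 2z_{1}^{(2)}+3z_{2}^{(2)}\geq w,\\
& y=0.5\left(z_{1}^{(1)}+4z_{2}^{(1)}+2z_{1}^{(2)}+3z_{2}^{(2)}\right),\\
& z_{1}^{(1)}+z_{2}^{(1)}=1,\quad z_{1}^{(2)}+z_{2}^{(2)}=1,\quad \bm{z}\in[0,1]^{4}\Big\}
\end{aligned}
$$

Two-feature example (fragment):

$$\{w_{1},w_{2},\bm{z} \mid \ldots,\quad 2z_{3}\leq w_{1},\quad \ldots\}$$

Expset formulation:

$$
\begin{aligned}
Q^{expset}=\Big\{\bm{x},y,\bm{z} \;\Big|\; & \ldots\\
& \sum_{l\in\textbf{above}(s)}z_{l}\leq 1-x_{V(s)C(s)} && \forall s\in\textbf{splits}(t)\\
& x_{ij}\leq x_{ij+1} && \forall i\in[p],~\forall j\in[K_{i}]\\
& \sum_{l=1}^{p}z_{l}=1,\quad y=\sum_{l=1}^{p}s_{l}z_{l}\\
& \bm{x}\in[0,1]^{K_{i}}~\forall i\in[d],\quad \bm{z}\geq 0\Big\}
\end{aligned}
$$

$$Q^{expset}\cap(\{0,1\}^{p}\times\mathbb{R}^{1+p})=Q^{misic}\cap(\{0,1\}^{p}\times\mathbb{R}^{1+p}),\quad\text{but}\quad Q^{expset}\subseteq Q^{misic}$$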


Taxonomy

Topics: Machine Learning and Data Classification · Statistical Methods and Inference · Advanced Multi-Objective Optimization Algorithms

Full text

Tightness of prescriptive tree-based mixed-integer optimization formulations

Max Biggs, Darden School of Business, University of Virginia
Georgia Perakis, Sloan School of Management, Massachusetts Institute of Technology

Abstract

We focus on modeling the relationship between an input feature vector and the predicted outcome of a trained decision tree using mixed-integer optimization. This can be used in many practical applications where a decision tree or tree ensemble is incorporated into an optimization problem to model the predicted outcomes of a decision. We propose tighter mixed-integer optimization formulations than those previously introduced. Existing formulations can be shown to have linear relaxations that have fractional extreme points, even for the simple case of modeling a single decision tree. A formulation we propose, based on a projected union of polyhedra approach, is ideal for a single decision tree. While the formulation is generally not ideal for tree ensembles or if additional constraints are added, it generally has fewer extreme points, leading to a faster time to solve, particularly if the formulation has relatively few trees. However, previous work has shown that formulations based on a binary representation of the feature vector perform well computationally and hence are attractive for use in practical applications. We present multiple approaches to tighten existing formulations with binary vectors, and show that fractional extreme points are removed when there are multiple splits on the same feature. At an extreme, we prove that this results in ideal formulations for tree ensembles modeling a one-dimensional feature vector. Building on this result, we also show via numerical simulations that these additional constraints result in significantly tighter linear relaxations when the feature vector is low dimensional. We also present instances where the time to solve to optimality is significantly improved using these formulations.

Keywords

Tree ensembles, Prescriptive analytics, Mixed-integer optimization

1 Introduction

A fundamental problem in operations research and management science is decision-making under uncertainty. Recently, attention has been given to modeling uncertain outcomes using machine learning functions, trained from previous decisions made under a variety of circumstances (Bertsimas et al. 2016, Cheng et al. 2017, Tjeng et al. 2017, Boob et al. 2022, Anderson et al. 2018, Bunel et al. 2018, Fischetti and Jo 2018, Kumar et al. 2019, Mišić 2020, Biggs et al. 2022, Bergman et al. 2022). Due to the complex nature of real-world decision-making, often the model that best represents the outcomes observed is nonlinear, such as a neural network or a tree ensemble. This leads to a potentially complex optimization problem for the decision-maker to find the best decision, as predicted by the machine learning function.

An example of this occurs in reinforcement learning, where the future reward resulting from a decision is uncertain but can be approximated using machine learning models, such as decision trees or tree ensembles. In some applications, such as playing Atari video games (Mnih et al. 2015), the decision set is small so all the decisions can be enumerated and evaluated. In comparison, in many real-world operational problems – for example, dynamic vehicle routing problems (Bent and Van Hentenryck 2007, Pillac et al. 2011) or kidney transplantation (Sönmez and Ünver 2017, Ashlagi et al. 2018)– complex decisions whose outcomes are uncertain need to be made at every stage of an online process. These decisions are often high dimensional or combinatorial in nature and subject to constraints on what is feasible. This can result in a very large action space. As a result, enumeration is no longer a tractable option, and a more disciplined optimization approach must be taken. Furthermore, the selection of the best action is further complicated by the nonlinear value function approximation.

One approach to finding optimal decisions when the outcome is estimated using a complex machine learning method is to use mixed-integer optimization (MIO) to model this relationship. In particular, there has recently been significant interest in modeling trained neural networks, by encoding these relationships using auxiliary binary variables and constraints (Cheng et al. 2017, Tjeng et al. 2017, Anderson et al. 2018, Bunel et al. 2018, Fischetti and Jo 2018, Kumar et al. 2019, Wang et al. 2021). Another popular and powerful approach for supervised learning, yet one that is less studied in the prescriptive setting, is tree ensemble methods. Mišić (2020) provides unconstrained optimization examples in drug discovery, where a tree ensemble predicts a measure of the activity of a proposed compound, and customized price optimization, where a tree ensemble predicts the profit as a function of prices and store-level attributes. Biggs et al. (2022) provide examples in real estate development of maximizing the sale price of a new house that is predicted as a function of construction decisions and location features, and a method for creating fair juries based on jurors’ predicted a priori propensities to vote guilty or not due to their demographics and beliefs. These applications have nontrivial constraints, but can be represented as polyhedra with integer variables. Additional applications of trained decision trees or tree ensembles embedded in an optimization problem include retail pricing (Ferreira et al. 2015), assortment optimization (Chen et al. 2019, Chen and Mišić 2022), last-mile delivery (Liu et al. 2021), optimal power flow (Halilbašić et al. 2018), auction design (Verwer et al. 2017), constraint learning (Maragno et al. 2021) and Bayesian optimization (Thebelt et al. 2021).

The goal in these works is often to propose tractable optimization formulations, which allow large problem instances to be solved in a reasonable amount of time. An important consideration when formulating these mixed-integer optimization formulations is how tight, or strong, the formulation is. Most methods for optimizing mixed-integer formulations involve relaxing the integrality requirements on variables and solving a continuous optimization problem. In the popular branch and bound algorithm, if the optimal solution is fractional for integer variables, then multiple subproblems are created with added constraints to exclude the fractional solution. If there are fewer fractional solutions for the relaxed problem, corresponding to a tighter formulation, this can result in a significantly faster time to solve. Furthermore, some problems can be formulated in such a way that the linear relaxation doesn’t have any fractional extreme points, known as an ideal formulation. Oftentimes these ideal formulations can be solved extremely quickly.

Another benefit of stronger formulations is that the linear programming (LP) relaxations provide tighter upper bounds, which are also useful in many applications. An example of this is evaluating the robustness of a machine learning model (Carlini and Wagner 2017, Dvijotham et al. 2018). If an input can be perturbed by a practically insignificant amount and result in a significantly different prediction, this suggests that the model is not robust. Evaluating robustness can be formulated as a constrained optimization problem over local inputs to find the maximally different output. As finding the exact optimal bound can be time-consuming, often an upper bound on how much the solution could change is sufficient.

1.1 Contributions

We model the relationship between the input feature vector and the predicted output for a trained decision tree. This can be used in a range of optimization applications involving decision trees or tree ensembles. We present a novel mixed-integer optimization formulation based on a projected union of polyhedra approach, which we prove is ideal for a single tree. We show that existing mixed-integer optimization formulations for modeling trees, such as Biggs et al. (2022) or Mišić (2020) do not have this property. We also show that the constraints in our model are facet-defining. While this formulation is generally not ideal when we impose polyhedral constraints on the decision, or when multiple trees are used in an ensemble model, the formulation generally excludes fractional extreme points present in Biggs et al. (2022) and Mišić (2020), leading to tighter formulations.

We also present new formulations that use a binary representation of the feature vector as proposed in Mišić (2020). While these variables are more difficult to incorporate into a constrained optimization formulation, they do have some advantages when it comes to the branching behavior in the MIO solver, leading to a faster time to solve in some instances. We propose different constraints that can be added to tighten the formulation from Mišić (2020). The expset formulation is based on exploiting the greater than or equal to representation of the feature vector from Mišić (2020), leading to larger groups of leaf variables being turned off when a split is made. The elbow formulation removes specific fractional solutions that arise when there are nested branches on the same feature in a tree. We characterize the conditions in which each of these constraints removes fractional solutions, which generally occurs in scenarios where there are multiple splits on the same feature. Extending this, we show that the expset formulation leads to an ideal formulation when all the splits are on the same feature, which occurs for tree ensembles when the feature vector is one-dimensional. This property doesn’t hold for the formulation in Mišić (2020). In conjunction with the union of polyhedra formulation being ideal for a single tree with multiple features, this result provides insights for the practitioner on when different formulations might be tighter. While not directly comparable due to the use of different variables, when there are many trees in the ensemble but relatively few variables, the expset formulation is likely to be tighter. When there are few trees but many variables, the union of polyhedra formulation is likely to be tighter.

We explore the performance of these approaches through extensive simulations. In partial agreement with our theoretical findings, we show that in some instances, the union of polyhedra formulation appears to have significant solve time improvements for tree ensembles with few trees. Similarly, the elbow offers improvements for problems with few features. While the expset formulation generally doesn’t offer faster solve times, we show that the linear relaxations it provides can be significantly stronger which is useful in many applications where a bound on the optimal solution is desired, particularly for trees with few features.

2 Preliminaries

Given a feature vector $\bm{w}\in D\subseteq\mathbb{R}^{d}$ , our goal is to model the output of a decision tree $f^{(t)}(\bm{w})$ using a mixed-integer optimization formulation. More formally, we model the graph, $gr(f^{(t)};D)=\{\bm{w},y_{t}|\bm{w}\in D,y_{t}=f^{(t)}(\bm{w})\}$ . With such a formulation, we can easily model a range of practical applications, such as finding the optimal feature vector to maximize the predicted outcome of a tree ensemble $\sum_{t=1}^{T}y_{t}$ , or solving a reinforcement learning subproblem with complex constraints where the value function is given by a decision tree.

2.1 Decision trees

A decision tree $f^{(t)}(\bm{w})$ with $p$ leaves is a piecewise constant function, where a constant outcome $s_{l}$ is predicted if the feature vector $\bm{w}$ falls within a particular leaf $\mathcal{L}_{l}$, $l\in[p]$, so that $f^{(t)}(\bm{w})=s_{l}$ if $\bm{w}\in\mathcal{L}_{l}$. Each leaf $\mathcal{L}_{l}$ is a hyperrectangular set defined by an upper bound $u_{li}$ and a lower (bottom) bound $b_{li}$ for each feature dimension $w_{i}$, $i\in[d]$. Throughout, we assume $w_{i}$ is bounded. A leaf is defined as:

$$\mathcal{L}_{l}=\{\bm{w},y \mid w_{i}\leq u_{li}~\forall i\in[d],\quad w_{i}\geq b_{li}~\forall i\in[d],\quad y=s_{l}\}$$

The upper bounds and lower bounds associated with each leaf are defined by a hierarchy of axis-aligned splits. We use the common convention that the splits in the tree are of the form $w_{i}\leq\theta$ (Pedregosa et al. 2011). These splits define the tree and partition the feature space into leaves. We denote $\textbf{splits}(t)$ as the set of splits corresponding to tree $t\in T$, $\textbf{left}(s)$ as the set of leaves to the left of split $s$ in the tree (i.e., those that satisfy the split condition $w_{i}\leq\theta$), and $\textbf{right}(s)$ as the set of leaves to the right, for which $w_{i}>\theta$. The upper bounds $u_{li}$ are defined by the thresholds of the left splits that lead to the leaf, while the lower bounds $b_{li}$ are defined by the thresholds of the right splits. In the case where there are multiple axis-aligned splits along a dimension leading to a leaf (e.g., $w_{1}\leq 5$ then $w_{1}\leq 2$), the upper bound will be the minimum of the thresholds of all less-than splits, while the lower bound will be the maximum of the thresholds of all greater-than splits. When there are no splits on a feature, the upper and lower bounds on the leaf are the upper and lower bounds on the feature vector.
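The rule for turning a hierarchy of splits into leaf bounds can be sketched in a few lines of Python. This is only an illustration: the nested-tuple tree representation, the function name `leaf_bounds`, the leaf scores, and the feature domain $[0,10]$ are assumptions made for the example, not part of the paper.

```python
def leaf_bounds(node, lower, upper):
    """Return one (score, lower_bounds, upper_bounds) triple per leaf.

    `node` is either a number (the leaf score s_l) or a tuple
    (feature, threshold, left, right) for the split w[feature] <= threshold.
    `lower` and `upper` are the bounds inherited from the ancestor splits.
    """
    if not isinstance(node, tuple):
        return [(node, list(lower), list(upper))]
    i, theta, left, right = node
    # Left branch (w_i <= theta): upper bound is the minimum threshold so far.
    left_upper = list(upper)
    left_upper[i] = min(upper[i], theta)
    # Right branch (w_i > theta): lower bound is the maximum threshold so far.
    right_lower = list(lower)
    right_lower[i] = max(lower[i], theta)
    return (leaf_bounds(left, lower, left_upper)
            + leaf_bounds(right, right_lower, upper))


# Hypothetical tree: split on w_1 <= 5, then on w_1 <= 2 in the left branch.
# The scores 10, 20, 30 and the domain [0, 10] are made up for illustration.
tree = (0, 5.0, (0, 2.0, 10.0, 20.0), 30.0)
leaves = leaf_bounds(tree, lower=[0.0], upper=[10.0])
for score, b, u in leaves:
    print(score, b, u)
```

For the nested splits $w_{1}\leq 5$ then $w_{1}\leq 2$, the leftmost leaf inherits the upper bound $\min(5,2)=2$, as described above.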

2.2 Mixed-integer optimization

Our goal is to model the graph $gr(f;D)$ using mixed-integer optimization. To facilitate this, auxiliary continuous variables $\bm{q}\in\mathbb{R}^{n}$ and integer variables are often introduced to help model the complex relationships between variables, although the formulations we study require only binary variables $\bm{z}\in\{0,1\}^{m}$. A mixed-integer optimization formulation consists of linear constraints on $(\bm{w},y,\bm{q},\bm{z})\in\mathbb{R}^{d+1+n+m}$, which define a polyhedron $Q$, combined with binary constraints $\bm{z}\in\{0,1\}^{m}$. For a valid formulation, the set of $(\bm{w},y)$ associated with feasible solutions $(\bm{w},y,\bm{q},\bm{z})\in Q\cap\mathbb{R}^{d+1+n}\times\{0,1\}^{m}$ must be the same as the graph we desire to model, $gr(f;D)$. More formally, the auxiliary variables $(\bm{q},\bm{z})$ are removed via an orthogonal projection $Proj_{\bm{w},y}(Q)=\{\bm{w},y \mid \exists~\bm{q},\bm{z}~\text{s.t.}~(\bm{w},y,\bm{q},\bm{z})\in Q\}$, to leave a set of feasible $(\bm{w},y)$. Therefore, a valid mixed-integer optimization formulation may be defined as:

Definition 2.1 (Valid mixed-integer optimization formulation)

$$gr(f;D)=Proj_{\bm{w},y}\left(Q\cap\mathbb{R}^{d+1+n}\times\{0,1\}^{m}\right)$$

We will refer to $Q$ as the linear relaxation of the formulation, which is the MIO formulation with the integrality requirements removed. An MIO formulation is ideal if the extreme points of the polyhedron are binary for those variables that are required to be:

Definition 2.2 (Ideal formulation)

$$\text{ext}(Q)\subseteq\mathbb{R}^{d+1+n}\times\{0,1\}^{m}$$

where $\text{ext}(Q)$ denotes the set of extreme points of the polyhedron $Q$.

3 Further relevant literature

Modeling trained tree ensembles using mixed-integer optimization is studied in Biggs et al. (2022) and Mišić (2020). Mišić (2020) proved this problem is NP-hard and proposed formulations for unconstrained optimization problems or problems with simple box constraints on each variable. Mistry et al. (2021) provide a customized branch and bound algorithm for optimizing gradient-boosted tree ensembles based on the MIO formulation in Mišić (2020), while Perakis and Thayaparan (2021) also propose a customized branching procedure. Biggs et al. (2022) propose formulations that include polyhedral constraints, using a big-M approach to linearize the nonlinear behavior of the trees. To optimize large tree ensembles in a reasonable amount of time, both Mišić (2020) and Biggs et al. (2022) offer ways to decompose a large tree ensemble and propose heuristic approaches that involve truncating trees to a limited depth (Mišić 2020) or sampling a subset of the trees (Biggs et al. 2022). All of these approaches involve solving a mixed-integer optimization formulation of an ensemble of trees.

We follow a “Predict then Optimize” approach, where we study formulations based on an already trained decision tree or tree ensemble, but there has also been significant recent interest in the joint estimation and optimization problem using trees to prescribe actions directly from data (Kallus 2017, Zhou et al. 2018, Bertsimas et al. 2019, Elmachtoub et al. 2020, Biggs et al. 2021, Jo et al. 2021, Amram et al. 2022).

3.1 Formulation from Mišić (2020)

We review the formulation from Mišić (2020) both as a benchmark, and to motivate the formulations we propose. Rather than linking the feature vector $\bm{w}$ directly to the output $f(\bm{w})$ , Mišić (2020) uses a binary representation of the feature vector $\bm{w}$ , which represents whether the feature falls below each split in the tree. Specifically, binary variables are introduced with

$$x_{ij}=\begin{cases}1 & \text{if } w_{i}\leq\theta_{ij}\\ 0 & \text{if } w_{i}>\theta_{ij}\end{cases}$$

where $\theta_{ij}$ is the $j^{th}$ smallest split threshold associated with dimension $i$. As a result, the $\bm{x}_{i}$ vector has the structure of consecutive 0’s, followed by consecutive 1’s. For example, $\bm{x}_{i}=\{0,1,1\}$ would correspond to a solution that falls between the first and second thresholds. A drawback of this approach is that additional constraints are needed to incorporate the binary split representation $\bm{x}$ into a constrained optimization problem for $\bm{w}$.
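A minimal sketch of this encoding is given below. It assumes the thresholds of a feature are indexed in increasing order, which is what the "zeros followed by ones" structure requires; the function name `encode_feature` and the threshold values are hypothetical.

```python
def encode_feature(w_i, thresholds):
    """Binary split representation of one feature value.

    Entry j is 1 if w_i <= theta_ij and 0 otherwise, with the thresholds
    taken in increasing order (an assumption of this sketch).
    """
    return [1 if w_i <= theta else 0 for theta in sorted(thresholds)]


# Hypothetical thresholds 2, 5, 8 for a single feature.
thresholds = [2.0, 5.0, 8.0]
x = encode_feature(3.0, thresholds)   # value between the first two thresholds
print(x)

# The encoding is monotone: x_ij <= x_i(j+1) for every consecutive pair.
monotone = all(a <= b for a, b in zip(x, x[1:]))
print(monotone)
```

A value of 3 lies between the first and second thresholds, so the vector starts with a single 0 followed by 1's, matching the example in the text.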

To introduce the formulation from Mišić (2020), we need some additional notation. $C(s)$ corresponds to the ranking of the threshold of split $s$ relative to the other thresholds for that feature, and $V(s)$ corresponds to the feature involved in the split. For example, if $\theta_{ij}$ is the $j^{th}$ smallest threshold for feature $i$ associated with split $s$, then $C(s)=j$ and $V(s)=i$. $K_{i}$ denotes the number of thresholds for feature $i$. Auxiliary variables $\bm{z}$ are introduced, where $z_{l}=1$ if the feature vector falls in leaf $l$. The polyhedron $Q^{misic}$, which links the binary representation $\bm{x}$ to the predicted outcome $y$, is:

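$$
\begin{aligned}
Q^{misic}=\Big\{\bm{x},y,\bm{z} \;\Big|\; & \sum_{l\in\textbf{left}(s)}z_{l}\leq x_{V(s)C(s)} && \forall s\in\textbf{splits}(t) && \text{(2a)}\\
& \sum_{l\in\textbf{right}(s)}z_{l}\leq 1-x_{V(s)C(s)} && \forall s\in\textbf{splits}(t) && \text{(2b)}\\
& x_{ij}\leq x_{ij+1} && \forall i\in[d],~\forall j\in[K_{i}] && \text{(2c)}\\
& \sum_{l=1}^{p}z_{l}=1,\quad y=\sum_{l=1}^{p}s_{l}z_{l}\\
& \bm{x}\in[0,1]^{K_{i}}~\forall i\in[d],\quad \bm{z}\geq 0\Big\}
\end{aligned}
$$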

The corresponding MIO formulation imposes binary constraints on $\bm{x}\in\{0,1\}^{K_{i}}~\forall i\in[d]$, but they are not necessary for $\bm{z}$. Constraint (2a) enforces that if the condition at a split is not satisfied, $x_{V(s)C(s)}=0$, then the solution does not fall within a leaf to the left of that split in the tree, so $z_{l}=0~\forall l\in\textbf{left}(s)$. Conversely, in constraint (2b), if the split is satisfied, $x_{V(s)C(s)}=1$, then all leaves to the right are set to 0. Constraint (2c) links the solution to the feature vector across trees. If the solution is less than the $j^{th}$ split, $x_{ij}=1$, then the solution must also be less than all splits greater than this. As such, $x_{ik}=1~\forall j<k\leq K_{i}$, and the vector has the structure of consecutive zeros followed by consecutive ones.

An issue with the formulations presented in both Mišić (2020) and Biggs et al. (2022) is that the linear relaxation can have many fractional solutions. This can make the MIO slow to solve. In fact, neither formulation is ideal even for the simple case of modeling a single decision tree without any additional constraints on a feasible decision, as we show in the following example.

Example 3.1 (Mišić (2020) not ideal for a single tree)

Suppose there is a tree that first branches on the condition $w\leq 5$ and then on $w\leq 2$ , as shown in Figure 2(a). In this example, $x_{1}=1$ if $w\leq 5$ , and 0 otherwise, while $x_{2}=1$ if $w\leq 2$ . The variables $z_{l}=1$ if the solution is in leaf $l$ . The resulting linear relaxation from Mišić (2020) is:

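$$
\begin{aligned}
\Big\{\bm{x},\bm{z} \;\Big|\; & z_{2}\leq 1-x_{2},\quad z_{1}\leq x_{2},\\
& z_{1}+z_{2}\leq x_{1},\quad z_{3}\leq 1-x_{1},\\
& x_{2}\leq x_{1},\quad z_{1}+z_{2}+z_{3}=1,\\
& \bm{x}\in[0,1]^{2},\quad \bm{z}\geq 0\Big\}
\end{aligned}
$$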

This has an extreme point at $z_{1}=0,~z_{2}=0.5,~z_{3}=0.5,~x_{1}=0.5,~x_{2}=0.5$, when constraints $z_{2}\leq 1-x_{2},~z_{3}\leq 1-x_{1},~x_{2}\leq x_{1},~z_{1}+z_{2}+z_{3}=1,~z_{1}\geq 0$ are active. $\square$
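The claim in Example 3.1 can be checked numerically. The sketch below is not from the paper: it treats the five active constraints listed above as equalities, solves the resulting $5\times 5$ system exactly with rational arithmetic, and then checks the remaining inequalities, written out here under the assumption that the relaxation consists of constraints (2a)-(2c), the sum-to-one constraint, and the bounds. A unique solution that is also feasible is an extreme point.

```python
from fractions import Fraction as F


def solve(A, b):
    """Solve the square system A v = b exactly by Gauss-Jordan elimination.

    Returns None if A is singular.
    """
    n = len(A)
    M = [[F(a) for a in row] + [F(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if M[r][col] != 0), None)
        if pivot is None:
            return None
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [row[n] for row in M]


# Variable order: x1, x2, z1, z2, z3. Active constraints as equalities.
active = [
    ([0, 1, 0, 1, 0], 1),    # z2 = 1 - x2
    ([1, 0, 0, 0, 1], 1),    # z3 = 1 - x1
    ([-1, 1, 0, 0, 0], 0),   # x2 = x1
    ([0, 0, 1, 1, 1], 1),    # z1 + z2 + z3 = 1
    ([0, 0, 1, 0, 0], 0),    # z1 = 0
]
point = solve([row for row, _ in active], [rhs for _, rhs in active])
x1, x2, z1, z2, z3 = point

# Check every constraint of the relaxation at the computed point.
feasible = (z1 + z2 <= x1 and z1 <= x2 and z2 <= 1 - x2 and z3 <= 1 - x1
            and x2 <= x1 and z1 + z2 + z3 == 1
            and all(0 <= v <= 1 for v in point))
print(point, feasible)
```

Because the five active constraints are linearly independent (the system has a unique solution) and the point is feasible, it is a vertex of the relaxation, and it is fractional in both $x_{1}$ and $x_{2}$.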

Example 3.2 (Biggs et al. (2022) not ideal for a single tree)

Again, suppose there is a tree that first branches on the condition $w\leq 5$ and then on $w\leq 2$, as shown in Figure 2(b). This formulation uses a slightly different notation, where $x_{ij}=1$ if the arc is on the path to the active leaf, $i$ corresponds to the parent node, $j=1$ refers to the left branch, and $j=2$ refers to the right branch. For example, if $w\leq 2$, then $x_{11},x_{21}=1$, while $x_{12},x_{22}=0$. We also assume $w$ is bounded, $0\leq w\leq 10$, and following guidance in Biggs et al. (2022) for choosing the big-M value, we set $M=15$. The resulting formulation in Biggs et al. (2022) is:

[TABLE]

This has an extreme point at $x_{11}=1/3,~x_{12}=2/3,~x_{21}=1/3,~x_{22}=0,~w=0$, when constraints $w+15(1-x_{12})\geq 5,~x_{21}+x_{22}=x_{11},~x_{11}+x_{12}+x_{21}+x_{22}=1,~w\geq 0,~x_{22}\geq 0$ are active. Furthermore, this is not just a consequence of the choice of $M$ but is still an issue regardless of this choice. $\square$

4 Union of polyhedra formulation

We propose an alternative MIO formulation for decision trees, which is tighter in the sense that it is ideal for modeling a single tree, unlike those presented in Examples 3.1 and 3.2. In contrast with the formulation in Mišić (2020), our proposed formulation directly relates the feature vector $\bm{w}$ to the output $f^{(t)}(\bm{w})$, instead of using a binary representation of the feature vector. This has the advantage that constraints can be placed directly on the feature vector $\bm{w}$ for problems with additional constraints that need to be modeled.

We can formulate a tree as a union of polyhedra since the solution will always fall into one of the leaves (hyperrectangles) that partition the feature space. This can be achieved using the classical extended formulation from Jeroslow (1987), which introduces many auxiliary variables to model the set. This is also known as a “multiple choice” formulation (Vielma and Nemhauser 2011):

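$$
\begin{aligned}
Q^{ext}=\Big\{\bm{w},y,\bm{w}^{l},y^{l},\bm{z} \;\Big|\; & u_{li}z_{l}\geq w_{i}^{l} && \forall i\in[d],~\forall l\in[p] && \text{(3a)}\\
& b_{li}z_{l}\leq w_{i}^{l} && \forall i\in[d],~\forall l\in[p] && \text{(3b)}\\
& y^{l}=s_{l}z_{l} && \forall l\in[p] && \text{(3c)}\\
& \sum_{l=1}^{p}z_{l}=1 && && \text{(3d)}\\
& w_{i}=\sum_{l=1}^{p}w_{i}^{l} && \forall i\in[d] && \text{(3e)}\\
& y=\sum_{l=1}^{p}y^{l} && && \text{(3f)}\\
& z_{l}\in[0,1] && \forall l\in[p]\Big\}
\end{aligned}
$$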

The formulation works by creating $p$ auxiliary copies of each variable, $\bm{w}^{l}\in\mathbb{R}^{d},y^{l}\in\mathbb{R}$, one for each leaf. Auxiliary binary variables $\bm{z}\in\{0,1\}^{p}$ are also introduced, which indicate which leaf the solution falls into. When $z_{l}=1$, constraints (3a), (3b), and (3c) define the feasible region and score for that leaf. When $z_{l}=0$, these constraints force $\bm{w}^{l}$ to be a vector of zeros. Constraint (3d) ensures only one leaf is chosen. Constraints (3e) and (3f) in turn define $\bm{w}$ and $y$ according to which leaf is active.

This formulation is ideal, as proved in Jeroslow and Lowe (1984) and Balas (1985), so the linear relaxation is guaranteed to have integer extreme points. However, these formulations often have computational issues when solved in practice (Vielma 2019). This formulation introduces a large number of auxiliary variables ($(p+1)(d+2)$ variables in total), as well as many constraints ($2pd+3p+d+1$). It is well known that these formulations suffer from degeneracy, as many of the auxiliary variables are set to 0, often resulting in poor performance in practice (Vielma 2019).

We can improve upon this formulation by projecting onto $\bm{w}$ . This eliminates the variables $\bm{w}^{l}$ and thus results in a significantly smaller formulation.

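$$
\begin{aligned}
Q^{proj}=\Big\{\bm{w},y,\bm{z} \;\Big|\; & \sum_{l=1}^{p}u_{li}z_{l}\geq w_{i} && \forall i\in[d]\\
& \sum_{l=1}^{p}b_{li}z_{l}\leq w_{i} && \forall i\in[d]\\
& y=\sum_{l=1}^{p}z_{l}s_{l}\\
& \sum_{l=1}^{p}z_{l}=1\\
& z_{l}\in[0,1] && \forall l\in[p]\Big\}
\end{aligned}
$$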

We can prove this formulation is still ideal for a single tree after this projection.

Theorem 4.1 (Ideal formulation for a tree)

The polyhedron $Q^{\text{proj}}$ is ideal.

This is proved in Appendix 8.1. The main idea behind this proof is that the union of polyhedra formulation (3) is ideal, and therefore the projection onto variables $\bm{w}$ is also ideal. These ideal projected formulations always exist, but in general, the projection is not a tractable operation and can result in a formulation with exponentially many constraints. In this special case, the resulting formulation (4) has only $2d+1$ constraints (in addition to binary constraints) and $p+d+1$ variables. Compared to formulation (3), this has significantly fewer variables and therefore does not suffer from degeneracy to the same extent. We also note that this formulation has considerably fewer constraints than in Mišić (2020), which has approximately $3p$ constraints and $2p$ variables, since typically $d\ll p$.
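As an illustration that is not taken from the paper, consider the tree of Example 3.1 and assume, as in Example 3.2, that $0\leq w\leq 10$, so that the three leaves are $[0,2]$, $[2,5]$ and $[5,10]$ with generic scores $s_{1},s_{2},s_{3}$. Reading formulation (4) as the bound constraints $\sum_{l}b_{li}z_{l}\leq w_{i}\leq\sum_{l}u_{li}z_{l}$ together with $\sum_{l}z_{l}=1$ and $y=\sum_{l}s_{l}z_{l}$, its linear relaxation for this tree would be:

```latex
\begin{aligned}
2z_{1}+5z_{2}+10z_{3} &\geq w, \\
0z_{1}+2z_{2}+5z_{3} &\leq w, \\
y &= s_{1}z_{1}+s_{2}z_{2}+s_{3}z_{3}, \\
z_{1}+z_{2}+z_{3} &= 1, \qquad \bm{z}\in[0,1]^{3}.
\end{aligned}
```

Setting $\bm{z}=(0,\tfrac{1}{2},\tfrac{1}{2})$, the fractional leaf assignment from Example 3.1, leaves $3.5\leq w\leq 7.5$. The endpoint $w=3.5$ is the midpoint of the integral points $(w,\bm{z})=(2,(0,1,0))$ and $(5,(0,0,1))$, and $w=7.5$ is the midpoint of $(5,(0,1,0))$ and $(10,(0,0,1))$, so neither is an extreme point, which is consistent with Theorem 4.1.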

The significance of this result is that it suggests that tree-based optimization approaches that use formulation (4) will be tighter than those used in Biggs et al. (2022) or Mišić (2020). Specifically, there are fractional solutions for each tree, as shown in Examples 3.1 and 3.2, which do not exist in formulation (4). In general, however, the intersection of different tree polytopes, as occurs in tree ensemble optimization, introduces additional fractional solutions. This also occurs for the intersection of a tree polytope and additional polyhedral constraints. Nevertheless, in practice, this formulation often results in a faster time to solve, particularly for forests with relatively few trees.

If formulation (4) is reformulated slightly, we can prove some additional favorable properties, including, in particular, that the constraints are facet-defining.

Definition 4.2 (Facet)

A face $\mathcal{F}$ of a polyhedron $\mathcal{P}$ , represented by the inequality $\bm{a}^{\prime}\bm{x}\geq b$ , is called a facet of $\mathcal{P}$ if $dim(\mathcal{F})=dim(\mathcal{P})-1$ .

One of the variables $z_{p}$ can be eliminated through the substitution $z_{p}=1-\sum_{l=1}^{p-1}z_{l}$ . Consequently, $\bm{z}\in\{0,1\}^{p-1}$ and as a result, $\bm{z}=0$ implies $\bm{w}\in\mathcal{L}_{p}$ . This leads to the following formulation:

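$$
\begin{aligned}
Q^{facet}=\Big\{\bm{w},y,\bm{z} \;\Big|\; & u_{pi}+\sum_{l=1}^{p-1}(u_{li}-u_{pi})z_{l}\geq w_{i} && \forall i\in[d] && \text{(5a)}\\
& b_{pi}+\sum_{l=1}^{p-1}(b_{li}-b_{pi})z_{l}\leq w_{i} && \forall i\in[d] && \text{(5b)}\\
& y=s_{p}+\sum_{l=1}^{p-1}z_{l}(s_{l}-s_{p})\\
& z_{l}\in[0,1] && \forall l\in[p-1]\Big\}
\end{aligned}
$$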
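The reformulated constraints follow from the substitution $z_{p}=1-\sum_{l=1}^{p-1}z_{l}$ by a one-line calculation. The step is spelled out here for the lower-bound term as a sketch; the upper-bound term and the score $y$ are handled in the same way, with $u_{li}$ or $s_{l}$ in place of $b_{li}$:

```latex
\sum_{l=1}^{p} b_{li} z_{l}
  = \sum_{l=1}^{p-1} b_{li} z_{l} + b_{pi}\Bigl(1 - \sum_{l=1}^{p-1} z_{l}\Bigr)
  = b_{pi} + \sum_{l=1}^{p-1} (b_{li} - b_{pi})\, z_{l}.
```

With $\bm{z}=0$ the right-hand side reduces to $b_{pi}$, so the bounds (and likewise the score) of leaf $p$ are recovered, which matches the statement that $\bm{z}=0$ implies $\bm{w}\in\mathcal{L}_{p}$.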

We can show that under mild assumptions, (5a) and (5b) are facet-defining.

Lemma 4.3

For all $l\in[p]$ , assume $\mathcal{L}_{l}$ is non-empty. Furthermore, assume that for some $k\in[p]$ , $\mathcal{L}_{k}$ is full dimensional, i.e., $dim(\mathcal{L}_{k})=d$ . Then constraints (5a) and (5b) are facet-defining for leaf $k$ .

This is proved in Appendix 8.2 with a proof technique similar to that in Anderson et al. (2018). This result is significant because it suggests there is no redundancy in formulation (5). MIO formulations generally take longer to solve when there are redundant variables and constraints.

4.1 Extensions to tree ensembles and additional constraints

The formulation can be applied to tree ensembles such as random forests or gradient-boosted tree ensembles. While the polyhedron modeling an individual tree is ideal, this formulation is not ideal in general as shown in this section. An alternative, but weaker, notion of tightness is whether a formulation is sharp. For a sharp formulation, the projection of the polyhedron $Q$ onto the original variables $\bm{w},y$ is equal to the convex hull $(\text{conv}(\cdot))$ of the graph $gr(f;D)$ . This is formalized as follows:

Definition 4.4 (Sharp formulation)

[TABLE]

An ideal formulation is also sharp, but a sharp formulation isn’t necessarily ideal. In Example 4.5 we give a simple tree ensemble that illustrates that the union of polyhedra formulation is neither ideal nor sharp.

Example 4.5 (Intersection of trees is not ideal or sharp)

Suppose we have the following two trees in an ensemble:

[TABLE]

This leads to a tree ensemble:

[TABLE]

This is visualized in Figure 4.6, where $f^{(1)}(w)$ is the blue line, $f^{(2)}(w)$ is the red line and the ensemble $0.5(f^{(1)}(w)+f^{(2)}(w))$ is the purple dashed line. The union of polyhedra formulation for this is as follows:

[TABLE]

A basic feasible solution for this formulation is $w=1,~z_{1}^{(1)}=0,~z_{2}^{(1)}=1,~z_{1}^{(2)}=0.5,~z_{2}^{(2)}=0.5,~y=3.25$, which is not integral, so the formulation is not ideal. Furthermore, the projected solution, $w=1,~y=3.25$, is not in the convex hull of the graph of $0.5(f^{(1)}(w)+f^{(2)}(w))$, so the formulation is not sharp. This can be observed in Figure 3(c), where the convex hull of the graph of the tree ensemble is shown in shaded purple. The extreme points of $Q^{proj}$ projected into $w,y$ space are shown with hollow circles. As can be observed, there are two extreme points of $Q^{proj}$ that lie outside the convex hull of the graph. \Halmos

We also provide an example illustrating that the formulation is no longer ideal once additional constraints are placed on the feature vector, as is useful in many practical applications.

Example 4.6 (Adding additional constraints to a tree is not ideal)

Take the tree from Figure 1. Suppose that we add a simple constraint that $w_{1}+w_{2}\leq 3$ . Suppose additionally that there are upper and lower bounds on each feature, such that $0\leq w_{1},w_{2}\leq 3$ . The union of polyhedra formulation is:

[TABLE]

This has a fractional solution $w_{1}=2/3,~w_{2}=7/3,~z_{1}=2/3,~z_{2}=0,~z_{3}=1/3$, so it is not ideal. \Halmos

While the intersection of trees is not ideal or sharp, it still removes a significant number of fractional solutions from the linear relaxation compared to the formulations from Mišić (2020) or Biggs et al. (2022), leading to faster solve times, as explored empirically in Section 6.

5 Strengthening formulations with binary split variables

We next present formulations that build upon the formulation from Mišić (2020). In particular, these formulations use the binary variables from Mišić (2020), which denote whether the feature vector is below each threshold in the tree. An advantage of this approach is its favorable branching behavior – setting a variable $x_{ij}=1$ will force all variables with a split threshold above this to also be 1, due to the ordering constraints $x_{ij}\leq x_{ij+1}$ (2c). In some cases, this results in a faster time to solve than the formulation in the previous section. We propose two ways to tighten this formulation to remove some of the fractional solutions, resulting in tighter linear relaxations and a faster time to solve in certain situations.

5.1 Tighter formulation from variable structure

To tighten the formulation from Mišić (2020), we exploit the greater than or equal to representation of $\bm{x}$ , which leads to larger groups of leaf variables being turned off when a split is made. In Mišić (2020), the $\bm{x}$ variables have consecutive 0’s followed by consecutive 1’s. In Mišić (2020), if $x_{ij}=0$ , this implies that all variables $z_{l}$ to the left of the split are equal to 0 (constraint 2b). However, a stronger statement can be made. Due to the structure of $\bm{x}$ , all variables with lower thresholds are also equal to 0, i.e., $x_{ik}=0~{}\forall k<j$ . This implies that variables $z_{l}$ to the left of splits with lower thresholds also must be equal to 0.
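The ordering structure of $\bm{x}$ can be illustrated with a short sketch. The helper names `encode_feature` and `is_ordered` are hypothetical, and the thresholds are illustrative values. The sketch assumes the convention described above, that $x_{ij}=1$ exactly when $w_{i}$ is at most the $j$-th smallest threshold on feature $i$:

```python
def encode_feature(value, thresholds):
    """Split indicators for one feature: x_j = 1 iff value <= j-th smallest threshold."""
    return [1 if value <= t else 0 for t in sorted(thresholds)]


def is_ordered(x):
    """Check the ordering constraints x_j <= x_{j+1} (zeros followed by ones)."""
    return all(a <= b for a, b in zip(x, x[1:]))


# Illustrative thresholds 2 and 5 on a single feature (assumed values).
x = encode_feature(3.0, [2.0, 5.0])
print(x, is_ordered(x))
```

Any vector produced this way consists of consecutive 0's followed by consecutive 1's, which is the structure the tightened constraints rely on.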

As an illustrative example, we examine the tree in Figure 4(a). If $w_{2}>5$ ($x_{22}=0$), then not only is the variable to the left of this split equal to 0, $z_{3}=0$, but also $z_{1}=0$ due to the constraint $x_{21}\leq x_{22}$ (constraint (2c) from Mišić (2020)). Rather than enforcing the relatively weak constraint from Mišić (2020) that $z_{3}\leq x_{22}$, it is tighter to directly enforce $z_{1}+z_{3}\leq x_{22}$. Similarly, if $x_{ij}=1$, this implies that the variables $z_{l}$ to the right of any splits with thresholds greater than that of the $j^{th}$ split are also set to 0. For example, in Figure 4(a), if $w_{2}\leq 2$ ($x_{21}=1$), then not only is the variable to the right of this split equal to 0 ($z_{2}=0$), but also $z_{4}=0$, since the structure of $\bm{x}$ implies that $w_{2}\leq 5$ ($x_{22}=1$).

To formalize this logic, we introduce new sets $\textbf{below}(s)$ and $\textbf{above}(s)$. The set $\textbf{below}(s)$ contains all leaves to the left of splits with thresholds less than or equal to the threshold at split $s$ for a given tree. The set $\textbf{above}(s)$ contains all leaves to the right of splits with a threshold greater than or equal to the threshold at split $s$. As such, for adjacent splits on the same feature, $s_{ij}$ and $s_{ij+1}$, we can define $\textbf{below}(s_{ij+1})=\textbf{below}(s_{ij})\cup\textbf{left}(s_{ij+1})$ and $\textbf{above}(s_{ij})=\textbf{above}(s_{ij+1})\cup\textbf{right}(s_{ij})$. For the smallest and largest splits, we have initial conditions that $\textbf{below}(s_{i1})=\textbf{left}(s_{i1})$, and $\textbf{above}(s_{iK_{i}})=\textbf{right}(s_{iK_{i}})$. An equivalent pair of definitions is $\textbf{below}(s_{ij})=\bigcup_{k\leq j}\textbf{left}(s_{ik})$ and $\textbf{above}(s_{ij})=\bigcup_{k\geq j}\textbf{right}(s_{ik})$. An example of these sets is illustrated in Figure 4(a). As a result, we can introduce a new formulation $Q^{expset}$, named after the notion of expanded sets, by replacing (2a) and (2b) with the following constraints:

[TABLE]

Constraints (8a) and (8b) are the counterparts of (2a) and (2b). Constraint (8a) enforces that when the condition at the split is not satisfied ($x_{V(s)C(s)}=0$), the solution does not fall within a leaf to the left of any split in the tree with a lower threshold for the same feature, while constraint (8b) enforces that all leaves to the right of greater splits are set to 0 if $x_{V(s)C(s)}=1$, as discussed previously. It can be shown that when intersected with a binary lattice on $\bm{x}\in\{0,1\}^{p}$, the feasible set of the MIO formulations (2) and (8) is the same. However, the linear relaxation $Q^{expset}$ is generally a subset of $Q^{misic}$. This is shown in Proposition 5.1, which formalizes the rationale given above.
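As a rough illustration of how the expanded sets might be computed, the sketch below accumulates the $\textbf{left}(\cdot)$ and $\textbf{right}(\cdot)$ sets over the splits on one feature, sorted by threshold. The function name `expanded_sets`, the tuple layout, and the leaf labels are assumptions made for illustration. The two splits mimic the splits on $w_{2}$ described for Figure 4(a), and distinct thresholds are assumed.

```python
def expanded_sets(splits):
    """Compute below(s) and above(s) for all splits on one feature.

    splits: list of (threshold, left_leaves, right_leaves) tuples.
    Returns (ordered_splits, below, above), aligned by increasing threshold.
    """
    ordered = sorted(splits, key=lambda s: s[0])
    below, acc = [], set()
    for _, left, _ in ordered:  # below(s_j) = union of left(s_k) for k <= j
        acc = acc | set(left)
        below.append(set(acc))
    above, acc = [], set()
    for _, _, right in reversed(ordered):  # above(s_j) = union of right(s_k) for k >= j
        acc = acc | set(right)
        above.append(set(acc))
    above.reverse()
    return ordered, below, above


# Splits on w_2 with thresholds 2 and 5 (leaf labels are assumed).
w2_splits = [(5.0, {3}, {4}), (2.0, {1}, {2})]
ordered, below, above = expanded_sets(w2_splits)
print(below)
print(above)
```

Under this labeling, the expanded set below the larger split is $\{1,3\}$, which corresponds to the constraint $z_{1}+z_{3}\leq x_{22}$ discussed earlier.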

Proposition 5.1

The feasible sets associated with MIO formulations of $Q^{expset}$ and $Q^{misic}$ are equivalent, but the linear relaxation $Q^{expset}$ is a subset of $Q^{misic}$ . Formally,

[TABLE]

We provide a formal proof in Appendix 9. It can be shown that this formulation removes some fractional solutions from the LP relaxation of (2). In particular, this will occur when there are multiple splits on the same feature within the tree. To illustrate this, suppose we have two splits on the same variable, $s$ and $s^{\prime}$ , where without loss of generality split $s^{\prime}$ has the larger threshold. Define a reduced polyhedron that only includes the constraints related to these splits as follows:

[TABLE]

If we examine these polyhedra, we see that $\tilde{Q}^{expset}(s,s^{\prime})$ is a strict subset of $\tilde{Q}^{misic}(s,s^{\prime})$ when there are multiple splits on the same variable.

Proposition 5.2

Suppose we have two splits on the same variable, $s$ and $s^{\prime}$ , where $s^{\prime}$ corresponds to the split with the larger threshold. Then

[TABLE]

This is proved in Appendix 10. This proof involves exploring the potential relationships between splits $s$ and $s^{\prime}$ (where split $s$ is a child of $s^{\prime}$ in the tree, where $s^{\prime}$ is a child of $s$ , and where neither is a child of the other) and finding solutions $(\bm{x,z})$ that are in $\tilde{Q}^{misic}(s,s^{\prime})$ but not in $\tilde{Q}^{expset}(s,s^{\prime})$ . An example that illustrates the strict subset is given in Example 5.6 from Section 5.3. In this example, we see that formulation (2) has fractional solutions, while formulation (8) has only integer solutions.

Generally, the more splits there are on the same feature in the tree, the more these constraints will tighten the formulation. At an extreme, we have the scenario where all splits in the tree are on the same feature. In the one-dimensional setting, it can be shown that the above formulation is ideal even for tree ensembles.

Theorem 5.3 (Ideal formulation for one-dimensional tree ensembles)

The polyhedron defining a tree ensemble $\cap_{i=1}^{T}Q_{i}^{\text{expset}}$ is ideal if the feature is one-dimensional ( $d=1$ ).

This result is proved in Appendix 12. It follows by proving that the matrix representation of the polyhedron is totally unimodular. In particular, the matrix has a special structure whereby it is possible to provide a bi-coloring of the columns, such that the difference in row sums between the two groups is in $\{-1,0,1\}$. A result from Ghouila-Houri (1962) proves that such a matrix is totally unimodular. A linear program $\{\max\bm{c}^{\prime}\bm{x}|A\bm{x}\leq\bm{b}\}$ has an integral optimal solution, whenever a finite optimum exists, if $\bm{b}$ is integer and $A$ is a totally unimodular matrix (Schrijver 1998).
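For intuition, the bi-coloring condition can be checked by brute force on small matrices. The sketch below does not use the matrix from the proof in Appendix 12. The two test matrices are standard illustrations chosen here as assumptions, and the function names are hypothetical. The code checks that every subset of columns admits a signing in $\{+1,-1\}$ under which each signed row sum lies in $\{-1,0,1\}$:

```python
from itertools import combinations, product


def has_bicoloring(matrix, cols):
    """True if the columns in `cols` can be signed +1/-1 so all row sums lie in {-1, 0, 1}."""
    for signs in product((1, -1), repeat=len(cols)):
        if all(abs(sum(s * row[c] for s, c in zip(signs, cols))) <= 1 for row in matrix):
            return True
    return False


def ghouila_houri_tu(matrix):
    """Brute-force Ghouila-Houri test over every non-empty subset of columns."""
    n_cols = len(matrix[0])
    return all(
        has_bicoloring(matrix, cols)
        for k in range(1, n_cols + 1)
        for cols in combinations(range(n_cols), k)
    )


# Rows with consecutive ones (an interval matrix) pass the test.
interval = [[1, 1, 0], [0, 1, 1], [0, 0, 1]]
# An odd cycle fails: its determinant is 2.
odd_cycle = [[1, 1, 0], [0, 1, 1], [1, 0, 1]]
print(ghouila_houri_tu(interval), ghouila_houri_tu(odd_cycle))
```

The enumeration is exponential in the number of columns, so this is only useful as a sanity check on toy instances, not as a substitute for the structural argument in the proof.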

The significance of this result is that it emphasizes the tightness of this formulation, relative to other formulations that are not ideal in this situation and have fractional solutions. In particular, in Example 3.1, we show that formulation (2) is not ideal even if the problem is one-dimensional with a single tree. Furthermore, although the formulation isn’t ideal when the input vector has multiple dimensions, we empirically show in Section 6.1.1 that the relaxation is tighter when the input vector is low dimensional.

It is interesting to contrast this result with Theorem 4.1. Theorem 4.1 states that the union of polyhedra formulation is ideal for a single tree even with many features. This contrasts with Theorem 5.3, which shows the expset formulation is ideal for many trees but only if the ensemble has a single feature. While it is difficult to directly compare the tightness of these formulations since they use different variables, this gives practitioners insight into the relative tightness of the different formulations. When there are many trees in the ensemble but relatively few variables, the expset formulation is likely to be tighter. When there are few trees but many variables, the union of polyhedra formulation is likely to be tighter.

5.2 Tighter formulation from nested branches

The relaxation of the formulation in the previous section still has some fractional extreme solutions, even in the case where a single tree is being modeled over multiple features. These fractional extreme solutions often arise when there are nested splits on the same feature, where one split follows another on the same branch. This is highlighted in the following example.

Example 5.4 (Nested branches that can be tightened)

Consider a path to a leaf, which has two splits on the same variable in opposing directions as shown in Figure 5(a). Suppose we model this using the formulation (2) from Mišić (2020):

[TABLE]

This has an extreme point $z=0.5,~{}x_{1}=0.5,x_{2}=0.5$ , as shown in Figure 5(b). Consider the following reformulation:

[TABLE]

This is shown in Figure 5(c). As can be observed, this has removed the fractional extreme point, leaving only integer extreme points. \Halmos

These fractional extreme points generally occur when a split to the left is followed by a split to the right for the same feature, or vice versa. More formally, we can characterize a valid set of constraints as follows: We define $\textbf{right\_parent}(s)$ as the set of splits that are above and to the right of split $s$ in the tree, with the additional requirement that these splits be on the same feature. That is, the split $s$ is a left child of another split on the same feature in the tree. For the splits in this set, the thresholds are necessarily larger. We can also define $\textbf{left\_parent}(s)$ as the set of splits that are above and to the left of split $s$ for the same feature, for which the threshold is smaller. To illustrate this notation, in Figure 4(b) the split $w_{2}\leq 2$ is the left_parent of the split $w_{2}\leq 4$ . We can generalize the constraints from Example 5.4 as follows:

[TABLE]

If we define $Q^{elbow}$ as the polyhedron created by adding constraints (9a) and (9b) to formulation (2) from Mišić (2020), we can show that the relaxation of this formulation is tighter, while still having the same feasible region when $\bm{x}$ is restricted to a binary lattice, as shown in Proposition 5.5.

Proposition 5.5

The feasible sets associated with MIO formulations $Q^{elbow}$ and $Q^{misic}$ are equivalent, but the linear relaxation $Q^{elbow}$ is a subset of $Q^{misic}$. Formally,

[TABLE]

This is proved formally in Appendix 11. As illustrated in Example 5.4, the feasible region is often a strict subset when there are nested splits on the same feature ( $Q^{elbow}\subset Q^{misic}$ ). This suggests that when there are more splits on the same features in the tree, there will be more of an improvement using the elbow formulation over Mišić (2020). This also often occurs if the tree has fewer features. This is explored empirically in Section 6. However, simulation results suggest that the formulation is not ideal for tree ensembles with a single feature, unlike the expset formulation.

5.3 Comparison of tightening constraints

In this section, we compare the relative tightness of the expset and elbow formulations (8 and 9, respectively). We will show that when these constraints are added separately to formulation (2) from Mišić (2020), neither formulation is strictly tighter than the other. Rather, there are certain situations where one formulation is tighter than the other and vice versa, which we illustrate with examples.

A simple example where formulation (8) is tighter than formulation (9) is when there are multiple splits on the same variable, but they do not have a nested structure. For example, in the tree in Figure 4(a), there are two splits on $w_{2}$ , but these occur in different branches of the tree. In this situation, formulations (2) and (9) are the same since the constraints are added only for nested pairs of the same feature. Furthermore, formulation (9) is not tight, but the formulation (8) is tight.

Example 5.6 (Expset Formulation is tighter than elbow formulation)

For the tree given in Figure 4(a), formulation (9) (and formulation (2)) is:

[TABLE]

On the other hand, formulation (8) is:

[TABLE]

For convenience, the difference in the formulations has been highlighted. Formulation (9) has fractional solutions $x_{11}=0.5,x_{21}=0.5,x_{22}=0.5,z_{1}=0,z_{2}=0.5,z_{3}=0,z_{4}=0.5$ , and $x_{11}=0.5,x_{21}=0.5,x_{22}=0.5,z_{1}=0.5,z_{2}=0,z_{3}=0.5,z_{4}=0$ , while formulation (8) has only integer solutions since the above fractional solutions violate the added constraints. \Halmos
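The violation can be checked numerically. The sketch below assumes, following the discussion of Figure 4(a) in Section 5.1, that $\textbf{below}(s_{22})=\{1,3\}$ and $\textbf{above}(s_{21})=\{2,4\}$. It also assumes that the expanded-set constraints take the form $\sum_{l\in\textbf{below}(s)}z_{l}\leq x$ and $\sum_{l\in\textbf{above}(s)}z_{l}\leq 1-x$. The function name is hypothetical.

```python
# Fractional points reported in Example 5.6 (leaf index -> z value).
points = [
    {"x21": 0.5, "x22": 0.5, "z": {1: 0.0, 2: 0.5, 3: 0.0, 4: 0.5}},
    {"x21": 0.5, "x22": 0.5, "z": {1: 0.5, 2: 0.0, 3: 0.5, 4: 0.0}},
]
BELOW_S22 = {1, 3}  # assumed: leaves left of the w_2 splits with threshold <= 5
ABOVE_S21 = {2, 4}  # assumed: leaves right of the w_2 splits with threshold >= 2


def violated_expset(pt):
    """Names of the assumed expanded-set constraints that the point violates."""
    names = []
    if sum(pt["z"][leaf] for leaf in BELOW_S22) > pt["x22"]:
        names.append("below")
    if sum(pt["z"][leaf] for leaf in ABOVE_S21) > 1 - pt["x21"]:
        names.append("above")
    return names


results = [violated_expset(pt) for pt in points]
print(results)
```

Under these assumptions, each point satisfies the single-split bounds $z_{3}\leq x_{22}$ and $z_{2}\leq 1-x_{21}$ but violates one of the aggregated constraints.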

To further understand the difference between the constraints from formulations (9) and (8), it is useful to examine situations in which they are the same. In particular, suppose we have two nested splits on the same feature, such that $s^{\prime}\in\textbf{right\_parent}(s)$, as in the tree in Figure 5(a). We will examine constraints (8a) and (8b) and see when they imply the alternative constraint (9a). Specifically, we require that $\textbf{above}(s)$ and $\textbf{below}(s^{\prime})$ cover the whole set of leaves, that is, $\textbf{below}(s^{\prime})\cup\textbf{above}(s)=p$. This is formally stated in Lemma 5.7.

Lemma 5.7

Suppose $s^{\prime}\in\textup{{right\_parent}}(s)$ . If $\textup{{below}}(s^{\prime})\cup\textup{{above}}(s)=p$ ,

[TABLE]

Similarly, suppose $s^{\prime}\in\textup{{left\_parent}}(s)$ . If $\textup{{above}}(s^{\prime})\cup\textup{{below}}(s)=p$ ,

[TABLE]

This is proved in Appendix 13. The condition $\textbf{below}(s^{\prime})\cup\textbf{above}(s)=p$ is satisfied when all splits above $s$ are on the same feature, or as an extreme case when the tree contains only one feature (the same condition as Theorem 5.3). When these conditions are not met, including constraint (9a) will tighten the formulation. An example where this condition is not met and formulation (9) is tighter than formulation (8) occurs in Figure 4(b).

Example 5.8 (Elbow formulation is tighter than expset formulation)

For the tree from Figure 4(b), formulation (8) is:

[TABLE]

Formulation (9) is:

[TABLE]

For convenience, the difference in the formulations has been highlighted again. Formulation (8) has a fractional solution $x_{1}=0.5,x_{2}=0.5,x_{3}=0.5,z_{1}=0.5,z_{2}=0,z_{3}=0.5,z_{4}=0$ , while formulation (9) has only integer solutions. \Halmos

Since each formulation has the advantage of removing different fractional solutions, including both sets of constraints can tighten the formulation further. We empirically explore how much these additional constraints tighten the LP relaxation for various datasets in Section 6.1.1.

6 Numerical Experiments

In this section, we study the numerical performance of the formulations on both simulated and real-world data. We study two scenarios of practical interest. The first involves the time taken to solve to optimality for an objective estimated by a tree ensemble. We then focus on finding tight upper bounds to this problem, obtained by solving the linear relaxation.

6.1 Experiments with tree ensembles

In this section, we examine the time taken to solve to optimality for a problem where the objective function is estimated using a random forest. We compare formulation (4) denoted projected and formulation (9) denoted elbow, to formulation (2) from Mišić (2020), denoted misic, and a formulation that uses the big-M method from Biggs et al. (2022), denoted bigM.

The random forest is trained on previous decisions where the reward is generated from a simple triangle-shaped function, where observed samples have added noise:

[TABLE]

For this problem, $r_{i}$ is a sampled reward, $w_{i}\sim U(-1,1)^{d}$ is a random decision vector with $d$ features, and $\epsilon_{i}\sim U(0,1)$ is added noise. There are no additional constraints placed on the variables other than those used to model the tree. We train a random forest from this data using scikit-learn (Pedregosa et al. 2011). We calculate the solve time to optimality with an increasing number of trees in the forest and an increasing number of features. We increase the number of trees according to $\{1,2,4,8,16,32\}$, and the number of features from 1 to 5. We repeat the experiment for 10 randomly generated datasets for each forest size and number of features. We use default parameters and a maximum depth for each tree of 20. For these parameters, each tree has an average of $2893$ leaves. Table 1 shows example problem sizes of the formulations when there are 5 features: the number of constraints, the number of binary variables, and the number of nonzero entries in the constraint matrix, which indicates its sparsity. As noted earlier, the number of constraints in the projected formulation is substantially smaller, and the number of binary variables is also smaller than in the other formulations. The MIO formulations were solved using Gurobi (Gurobi Optimization 2019), with a time limit of 30 minutes (1800s) for each trial but otherwise default parameters. The experiments were run on a MacBook Pro with an 8-core Intel processor and 32GB RAM.

In Table 2 we observe the time taken to solve optimally for different-sized trees. Each result is averaged over 50 trials: 10 trials for each input vector of 1 to 5 dimensions. We note that the average time taken includes instances that didn’t reach optimality, recorded as the maximum time allocated (1800s), so it is in fact a truncated mean. The percentage of instances that didn’t reach optimality is recorded in the last four columns. As can be seen, the projected formulation is on average three to four times faster, and it finds an optimal solution more often within the given time.
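The aggregation described above can be sketched as follows. The function name `summarize` and the example solve times are hypothetical and are not results from the experiments. Trials that hit the time limit are represented by `None` and counted at the 1800s cap:

```python
TIME_LIMIT = 1800.0  # seconds, the per-trial limit used in the experiments


def summarize(solve_times):
    """Return (truncated mean in seconds, percentage of trials not solved to optimality).

    solve_times: one entry per trial, a float in seconds or None if the limit was hit.
    """
    capped = [TIME_LIMIT if t is None else t for t in solve_times]
    mean_time = sum(capped) / len(capped)
    pct_unsolved = 100.0 * sum(t is None for t in solve_times) / len(solve_times)
    return mean_time, pct_unsolved


# Made-up trial times for illustration only.
mean_time, pct_unsolved = summarize([100.0, 500.0, None, 1000.0])
print(mean_time, pct_unsolved)
```

Because unsolved trials are recorded at the cap, the reported mean understates the true average solve time whenever the unsolved percentage is positive.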

Figure 6 shows the results further broken down by the number of features, plotted on a log-log axis for clarity. We observe that the elbow formulation is often faster for tree ensembles with few trees. This might be useful in applications where many MIO problems need to be solved rapidly, such as policy iteration in reinforcement learning with tree-based value function approximations. We also observe a substantial solve time improvement using the elbow formulation when there is one feature, which agrees with the results presented in Section 5.2.

We omitted the expset formulation (8) from these results because, despite having a tighter linear relaxation (which is studied further in the following section), its solve time in practice was significantly slower. We conjecture that this is due to the increased density of the constraints, which contain many more variables, although it could also be due to other idiosyncrasies of MIO solvers.

6.1.1 Tighter linear relaxations

A problem of practical interest is finding tight upper bounds for maximization problems over an objective estimated by a tree ensemble. For large problem instances, finding the optimal solution can be prohibitively slow, considering that MIO formulations often exhibit exponential solve times. The relative quality of a fast heuristic solution can be assessed if an upper bound on the objective can be found. Another application of upper bounds is the verification of the robustness of a machine learning model (Carlini and Wagner 2017, Dvijotham et al. 2018) whereby an optimization problem is solved over local inputs to find maximally different output. Since finding the exact worst case can be prohibitively slow for large instances, a tight upper bound is often used instead (Carlini and Wagner 2017, Dvijotham et al. 2018).

We evaluate the formulations from Section 5 by analyzing the tightness of their linear relaxations. We compare formulations that use the same variables, specifically formulation (8, expset), formulation (2, misic), and formulation (9, elbow). Additionally, we test a formulation that has both of the tightening constraints (expset+elbow). We use the same data-generating process as in Section 6.1, except rather than solving to find the optimal integer solution, we solve only the linear relaxation. For these experiments, we use forests with $\{2,4,6,8,10\}$ trees, and increase the features according to $\{1,2,4,8,12\}$. Again, we repeat each experiment with 10 randomly generated datasets.

Figure 7 shows the optimality gap percentage, calculated from the difference between the objective of the linear relaxation and the optimal integer solution, as the number of features increases. We observe the effect of Theorem 5.3, whereby for tree ensembles with one feature, formulations based on expset are ideal. Moreover, for problems with relatively few features, the formulation is significantly tighter than formulation misic, whereas when the number of features is larger, the improvement is smaller. This is likely due to more features being associated with fewer splits per feature. We note that in isolation, the constraints introduced in expset have a greater effect in tightening the formulation than those introduced in elbow, although combining both results in the tightest formulations. We also observe empirically that the elbow formulation is not ideal even in the single feature case.

6.2 Real-world data

We also study some datasets used to benchmark tree ensemble solve times in Mišić (2020). In particular, we study the concrete dataset (Yeh 1998), with 1030 observations. The dependent variable is the compressive strength of concrete, with independent variables being the characteristics of the concrete mix (Cement, BlastFurnaceSlag, FlyAsh, Water, Superplasticizer, CoarseAggregate, FineAggregate, Age). The optimization aims to find the concrete with the highest compressive strength. We also study the winequalityred dataset (Cortez et al. 2009), with 1599 observations. The dependent variable is the quality of the wine, while the independent variables are characteristics of the wine (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulfates, alcohol). As such, the optimization problem is to choose characteristics of the wine such that the quality is maximized.

6.2.1 Solve time

We explore the solve time for different formulations of different size random forest tree ensembles $\{10,20,40,80,160\}$ and varying feature vector dimension $\{1,3,5,7\}$ for concrete and $\{1,5,10\}$ for winequalityred. To test the effect of dimension, we use the first $k$ features to predict the output. As in the previous section, we set the maximum solve time to be 30 minutes (1800s).

The results for concrete and winequalityred are in Figures 8 and 10, respectively. We observe that for both datasets, the projected formulation performs relatively better than the formulation from Mišić (2020) for instances where the feature vector has a lower dimension (fewer features). On the other hand, for instances with a larger number of features, the formulation from Mišić (2020) can be faster to solve. Furthermore, the projected formulation (4) appears to be relatively faster for ensembles with a small number of trees, which is particularly pronounced in Figures 8(c) and 10(c). This is potentially an extension of Theorem 4.1; if (4) is ideal for a single tree, it is also potentially relatively tighter for a small number of trees. Again, this might have applications where many smaller problems need to be solved quickly, such as in reinforcement learning. For these datasets, the performance of the elbow formulation is generally comparable to Mišić (2020), although there are improvements in the concrete dataset when there are few features.

6.2.2 Tightness of linear relaxation

We also compare the tightness of the linear relaxations for the concrete and winequalityred datasets in Figures 11 and 9. Across both datasets, we observe a similar outcome to the synthetic data experiments, whereby elbow+expset is generally the tightest, followed by expset, and finally the original misic formulation. We also observe that generally, the difference diminishes when there are more features in the data, potentially because there are fewer splits per feature, which is typically where the new formulations remove fractional points.

7 Conclusions and future work

In this paper, we have proposed a variety of new mixed-integer optimization formulations for modeling the relationship between an input feature vector and the predicted output of a trained decision tree. We have introduced formulations that build on the variable structure from Mišić (2020) and formulations that use the input feature directly. We have shown these formulations are provably tighter than existing formulations in some scenarios and have also characterized when some are tighter than others. We have shown conditions where these formulations are ideal, which gives further practical insight into when different formulations might be advantageous depending on the number of trees in the ensemble and the number of features the problem has. In addition to these theoretical insights, we have given experimental conditions where the different formulations succeed both in terms of the time taken to solve to optimality and the tightness of the corresponding linear relaxations. While the experimental results do not always fully agree with the theoretical findings or intuition due to the complex operations of commercial MIO solvers, we have identified situations where each different formulation has advantages and laid the groundwork for future computational studies.

For future work, an interesting avenue is exploring the relationship between the formulations we provide and different polyhedral constraints. While in general, the formulations we provide are not ideal when combined with additional constraints, there may be special cases when they are or at least cuts that can be introduced to remove some of the fractional solutions.

8 Proofs from paper

8.1 Proof Theorem 4.1

Proof 8.1

We prove this by applying Fourier-Motzkin elimination to formulation (3) to eliminate all $\bm{w}^{l}$ , and showing we arrive at formulation (4). An overview of the technique can be found in Hooker (2011). For convenience, recall $Q^{ext}$ :

[TABLE]

To eliminate $\bm{w}^{l}$, we will use induction. To be more precise, we will show how to eliminate $w_{i}^{1},...,w_{i}^{p}$ for a single feature $i$; the same procedure applies identically to the other features. For notational brevity, let us define $Q^{const}$ as the set of constraints that do not feature $w_{i}^{1},...,w_{i}^{p}$ and do not change with elimination.

[TABLE]

Define $Q^{proj}_{k}$ as the polyhedron resulting from applying Fourier-Motzkin elimination $k$ times on $Q^{ext}$ to eliminate $w_{i}^{1},...,w_{i}^{k}$ . We propose $Q^{proj}_{k}$ is

[TABLE]

As the inductive step, if we apply Fourier-Motzkin elimination to $Q^{proj}_{k}$ to eliminate $w_{i}^{k+1}$ , we will show that $Q^{proj}_{k+1}$ is the resulting polyhedron. First, we establish the base case, that applying Fourier-Motzkin elimination on $Q^{ext}$ to eliminate $w_{i}^{1}$ results in $Q^{proj}_{1}$ .

To apply Fourier-Motzkin elimination, we rearrange all constraints involving $w_{i}^{1}$ into greater than constraints $w_{i}^{1}\geq G_{j}(\bm{w},y,\bm{w}^{l},y^{l},\bm{z})$ or less than constraints $w_{i}^{1}\leq L_{j^{\prime}}(\bm{w},y,\bm{w}^{l},y^{l},\bm{z})$ . We eliminate these constraints and replace them with $L_{j^{\prime}}(\bm{w},y,\bm{w}^{l},y^{l},\bm{z})\geq G_{j}(\bm{w},y,\bm{w}^{l},y^{l},\bm{z})$ for all combinations $j$ and $j^{\prime}$ . As a result, the new constraints formed are

[TABLE]

Here, constraint (13a) is formed by combining (10a) and (10b), constraint (13b) is from (10a) and (10e), (13c) is from (10b) and (10e), and (13d) is from (10e). By definition, constraint (13a) is redundant and can be eliminated, since $b_{1i}\leq u_{1i}$, as can (13d). As a result, the polyhedron is:

[TABLE]
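A generic version of this elimination step can be sketched in a few lines. The code below is a minimal illustration of Fourier-Motzkin elimination for a system written as $\bm{a}^{\prime}\bm{x}\leq b$, not the specific derivation above. The function name and the toy system are assumptions. Each constraint with a positive coefficient on the eliminated variable (an upper bound $L_{j^{\prime}}$) is paired with each constraint with a negative coefficient (a lower bound $G_{j}$), and combinations that reduce to $0\leq c$ with $c\geq 0$ are dropped as redundant.

```python
from fractions import Fraction


def fourier_motzkin(constraints, k):
    """Eliminate variable k from rows (coeffs, rhs), each meaning coeffs . x <= rhs."""
    upper, lower, rest = [], [], []
    for coeffs, rhs in constraints:
        row = ([Fraction(c) for c in coeffs], Fraction(rhs))
        if row[0][k] > 0:
            upper.append(row)  # gives an upper bound on x_k
        elif row[0][k] < 0:
            lower.append(row)  # gives a lower bound on x_k
        else:
            rest.append(row)   # does not involve x_k
    for up_c, up_b in upper:
        for lo_c, lo_b in lower:
            # Scale so the coefficients of x_k are +1 and -1, then add the rows.
            a, b = up_c[k], -lo_c[k]
            coeffs = [u / a + l / b for u, l in zip(up_c, lo_c)]
            rhs = up_b / a + lo_b / b
            # Keep the row unless it is the trivially true statement 0 <= rhs.
            if any(c != 0 for c in coeffs) or rhs < 0:
                rest.append((coeffs, rhs))
    return rest


# Toy system in variables (x, y): x <= y, y <= 3, y >= 1.
system = [([1, -1], 0), ([0, 1], 3), ([0, -1], -1)]
projected = fourier_motzkin(system, k=1)  # eliminate y
print(projected)
```

In the toy system, pairing $y\leq 3$ with $y\geq 1$ yields the redundant statement $0\leq 2$, which plays the same role as the redundant constraints (13a) and (13d) in the derivation.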

We can apply the same logic to prove the inductive step. If we apply Fourier-Motzkin elimination to $Q^{proj}_{k}$ to eliminate $w_{i}^{k+1}$ we get

[TABLE]

Again, constraints (15a) and (15d) are redundant and can be eliminated, since $u_{ki}\geq b_{ki}$ for all $k\in[p]$ . Through some minor rearranging, the resulting polyhedron is

[TABLE]

This proves the inductive step. After eliminating $w_{i}^{p}$ from $Q^{proj}_{p-1}$, it should be clear that this results in

[TABLE]

We can repeat the inductive procedure for the other features in the same manner. Finally $y^{l}$ is eliminated by simple substitution of (10c) into (10f), and we arrive at formulation (4). The proof follows since the ideal property is preserved by projection. \Halmos

8.2 Proof Lemma 4.3

Proof 8.2

We show that (5b) is facet defining. It can be proved that (5a) is facet defining using the same argument. The dimension of this polyhedron is $p+d$ . To show constraint (5b) is facet defining, we need to find $p+d$ affinely independent points that satisfy $\sum_{l=1}^{p}u_{ni}+(u_{li}-u_{ni})z_{l}=w_{i}$ .

Constraint (5b) places bounds on dimension $i$ of $\bm{w}$ . Without loss of generality, consider $k=1$ and $i=d$ . Define $\bm{\hat{w}}$ as a point on the interior of the leaf $\mathcal{L}_{1}$ with respect to dimensions $1,...,d-1$ , but at the lower bound for dimension $d$ , so that $\hat{w}_{d}=b_{1d}$ . Define the point $\bm{q}^{0}=(\bm{\hat{w}},\bm{e}^{1})$ . This point satisfies (5b) with equality.

Consider $d-1$ points $\bm{q}^{i}=(\bm{\hat{w}}+\epsilon\bm{e}^{i},\bm{e}^{1})$ for $i\in\{1,...,d-1\}$ , where $\epsilon>0$ is chosen to be sufficiently small that $\bm{q}^{i}\in\mathcal{L}_{1}$ . A sufficiently small $\epsilon$ exists because $\bm{\hat{w}}$ is on the interior with respect to dimensions $1,...,d-1$ . These points still satisfy (5b) with equality, as $q^{i}_{d}$ remains equal to $b_{1d}$ .

Consider $p-1$ points $\bm{\tilde{q}}^{i}=(\bm{\tilde{w}}^{i},\bm{e}^{i})$ for $i\in\{2,...,p-1\}$ and $\bm{\tilde{q}}^{p}=(\bm{\tilde{w}}^{p},0)$ , where $\bm{\tilde{w}}^{i}\in\mathcal{L}_{i}$ and $\tilde{w}^{i}_{i}=l_{id}$ . Such points exist because each $\mathcal{L}_{i}$ is non-empty.

We now need to show that these points are linearly independent. This can be proven by showing the following matrix is full rank:

[TABLE]

If we move the last $p-2$ columns to be the first $p-2$ columns, and move the column that is $(p-1)$-th from the end to be $(p-1)$-th from the start, we obtain an upper triangular matrix with nonzero entries on the diagonal, which has full row rank. Since we applied only elementary operations, the original matrix also has full row rank.

[TABLE]

This proves that the points were linearly independent and (5b) is facet defining.\Halmos
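An independence check of this kind can also be run numerically. The sketch below tests affine independence by computing the exact rank of the difference vectors. The points are hypothetical values for a toy case with $d=2$ and $p=2$ (not taken from the paper), laid out as $(w_{1},w_{2},z_{1},z_{2})$, and the helper names are illustrative.

```python
from fractions import Fraction

def rank(rows):
    """Exact rank via Gaussian elimination over the rationals."""
    m = [[Fraction(v) for v in r] for r in rows]
    n_cols = len(m[0]) if m else 0
    r = 0
    for c in range(n_cols):
        piv = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if piv is None:
            continue
        m[r], m[piv] = m[piv], m[r]
        for i in range(r + 1, len(m)):
            f = m[i][c] / m[r][c]
            m[i] = [a - f * b for a, b in zip(m[i], m[r])]
        r += 1
    return r

def affinely_independent(points):
    """Points are affinely independent iff their differences from the
    first point are linearly independent."""
    base = points[0]
    diffs = [[a - b for a, b in zip(p, base)] for p in points[1:]]
    return rank(diffs) == len(points) - 1

half, eps = Fraction(1, 2), Fraction(1, 4)
# Hypothetical points: leaf 1 has lower bound 0 in dimension 2.
q0 = [half, 0, 1, 0]          # (w_hat, e^1), on the lower bound
q1 = [half + eps, 0, 1, 0]    # perturbed in dimension 1 only
q2 = [half, 2, 0, 1]          # a point assigned to leaf 2
independent = affinely_independent([q0, q1, q2])
collinear = affinely_independent([[0, 0], [1, 1], [2, 2]])
```

Exact rational arithmetic is used so that the rank test is not affected by floating-point tolerance.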

9 Proof of Proposition 5.1

Proof 9.1

Let $\bm{z},\bm{x}$ be any feasible solution to $Q^{misic}\cap(\{0,1\}^{p}\times\mathbb{R}^{1+p})$ , where $\bm{x}$ is restricted to a binary lattice. We will show that $\bm{z},\bm{x}$ is feasible for $Q^{expset}\cap(\{0,1\}^{p}\times\mathbb{R}^{1+p})$ .

For a given split $s$ , suppose $x_{V(s)C(s)}=0$ . Then $x_{V(s^{\prime})C(s^{\prime})}=0$ for all $s^{\prime}$ that have a lower threshold on the same variable, $C(s^{\prime})\leq C(s),V(s^{\prime})=V(s)$ . This is due to constraint 2c, $x_{ij}\leq x_{ij+1}$ , which enforces that $\bm{x}$ is a vector of 0’s, followed by 1’s. Therefore, combined with constraint 2a, all leaf variables $z_{l}$ are set to 0 for all leaves with thresholds less than $s$ :

[TABLE]

We analyze constraint 8a from $Q^{expset}$ to check if it is satisfied. We begin by expanding the constraint out:

[TABLE]

where $s_{-1}$ is the next threshold below $s$ , such that $C(s)=C(s_{-1})+1$ , and $s_{-2}$ is the next below that. This follows from the definition of the set $\textbf{below}(s_{ij+1})=\textbf{below}(s_{ij})\cup\textbf{left}(s_{ij+1})$ . From equation 20, all leaves with thresholds less than $s$ are set to 0, so:

[TABLE]

Therefore, constraint 8a is satisfied. Furthermore, in this case 8b is trivially satisfied, since

[TABLE]

For the case where $x_{V(s)C(s)}=1$ , the argument is very similar. In particular, since $x_{V(s^{\prime})C(s^{\prime})}=1$ , for all thresholds higher than $s$ , it follows that

[TABLE]

Analyzing constraint 8b:

[TABLE]

where $s_{+1},s_{+2}...$ are thresholds immediately above $s$ . It follows that

[TABLE]

Again, 8a is trivially satisfied. We also have $Q^{expset}\subseteq Q^{misic}$ since:

[TABLE]

This occurs because $\textbf{left}\subseteq\textbf{below}$ and $\textbf{right}\subseteq\textbf{above}$ . \Halmos
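The claim that the two formulations admit the same binary points can be checked by enumeration on a small instance. The sketch below uses a hypothetical tree on one feature, with root split $s^{\prime}$ and left child $s$, and three leaves. It assumes that 2a and 2b bound the $\textbf{left}$ and $\textbf{right}$ leaf sums by $x$ and $1-x$, and that 8a and 8b do the same with $\textbf{below}$ and $\textbf{above}$. The constraint forms and all names are assumptions for illustration, and $\bm{x}$ is enumerated over binary values only.

```python
from itertools import product

# Hypothetical toy tree: root s' with left child s, both on one feature.
# Leaves: 0 (left of s), 1 (right of s), 2 (right of s').
SPLITS = ("s", "sp")                       # ordered by threshold, s < s'
LEFT = {"s": {0}, "sp": {0, 1}}
RIGHT = {"s": {1}, "sp": {2}}
BELOW = {"s": {0}, "sp": {0, 1}}           # below(s') = below(s) | left(s')
ABOVE = {"s": {1, 2}, "sp": {2}}           # above(s) = above(s') | right(s)

def feasible(x, z, lower_sets, upper_sets):
    """Check sum(z) = 1, x_s <= x_s', and the two families of leaf-sum bounds."""
    if sum(z) != 1 or x["s"] > x["sp"]:
        return False
    for s in SPLITS:
        if sum(z[l] for l in lower_sets[s]) > x[s]:
            return False
        if sum(z[l] for l in upper_sets[s]) > 1 - x[s]:
            return False
    return True

def binary_points(lower_sets, upper_sets):
    """Enumerate all binary (x_s, x_s', z_0, z_1, z_2) that are feasible."""
    pts = set()
    for xs, xsp, *z in product((0, 1), repeat=5):
        if feasible({"s": xs, "sp": xsp}, z, lower_sets, upper_sets):
            pts.add((xs, xsp, *z))
    return pts

misic_pts = binary_points(LEFT, RIGHT)     # constraints of the 2a/2b type
expset_pts = binary_points(BELOW, ABOVE)   # constraints of the 8a/8b type
```

On this instance both sets contain exactly one point per leaf, so the tighter constraints remove no integer solution.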

10 Proof of Proposition 5.2

Proof 10.1

For convenience, we recall the definitions of the polyhedra $\tilde{Q}^{misic}(s,s^{\prime})$ and $\tilde{Q}^{expset}(s,s^{\prime}):$

[TABLE]

There are three cases that need to be examined: where split $s$ is a child of split $s^{\prime}$ , where $s^{\prime}$ is a child of split $s$ , and where neither is a child of the other because they are on different branches of the tree. Recall that in all cases, $s^{\prime}$ is the split with the larger threshold.

We start with the case where $s$ is a (left) child of split $s^{\prime}$ . An example of this occurs in Figure 2(b). Take the solution $\bm{x}^{(1)},\bm{z}^{(1)}$ such that $\sum_{l\in\textbf{left}(s)}z_{l}^{(1)}=0,~\sum_{l\in\textbf{left}(s^{\prime})}z_{l}^{(1)}=0.5,~\sum_{l\in\textbf{right}(s)}z_{l}^{(1)}=0.5,~\sum_{l\in\textbf{right}(s^{\prime})}z_{l}^{(1)}=0.5,~x_{V(s)C(s)}^{(1)}=0.5,~x_{V(s^{\prime})C(s^{\prime})}^{(1)}=0.5$ . By inspection, $\bm{x}^{(1)},\bm{z}^{(1)}\in\tilde{Q}^{misic}(s,s^{\prime})$ . This does not necessarily violate $\sum_{l=1}^{p}z_{l}^{(1)}=1$ , since $\textbf{left}(s^{\prime})\supseteq\textbf{left}(s)\cup\textbf{right}(s)$ .

Since $s^{\prime}$ is the greater split, we have that $\textbf{above}(s)\supseteq\textbf{right}(s)\cup\textbf{right}(s^{\prime})$ . Furthermore, $\textbf{right}(s)\cap\textbf{right}(s^{\prime})=\emptyset$ , since $s$ is the left child of $s^{\prime}$ . It follows that the solution $\bm{x}^{(1)},\bm{z}^{(1)}\notin\tilde{Q}^{expset}(s,s^{\prime})$ since

[TABLE]

We next examine the case where $s^{\prime}$ is a (right) child of split $s$ , which is very similar but included for completeness. Take the solution $\bm{x}^{(2)},\bm{z}^{(2)}$ such that $\sum_{l\in\textbf{left}(s)}z_{l}^{(2)}=0.5,~\sum_{l\in\textbf{left}(s^{\prime})}z_{l}^{(2)}=0.5,~\sum_{l\in\textbf{right}(s)}z_{l}^{(2)}=0.5,~\sum_{l\in\textbf{right}(s^{\prime})}z_{l}^{(2)}=0,~x_{V(s)C(s)}^{(2)}=0.5,~x_{V(s^{\prime})C(s^{\prime})}^{(2)}=0.5$ . By inspection, $\bm{x}^{(2)},\bm{z}^{(2)}\in\tilde{Q}^{misic}(s,s^{\prime})$ .

Since $s^{\prime}$ is the greater split, we have that $\textbf{below}(s^{\prime})\supseteq\textbf{left}(s)\cup\textbf{left}(s^{\prime})$ . Furthermore, $\textbf{left}(s)\cap\textbf{left}(s^{\prime})=\emptyset$ , since $s^{\prime}$ is the right child of $s$ . It follows that the solution $\bm{x}^{(2)},\bm{z}^{(2)}\notin\tilde{Q}^{expset}(s,s^{\prime})$ since

[TABLE]

Finally, we examine the case where neither split is a child of the other, which is also very similar to the case above. An example of this occurs in Figure 4(a). Take the solution $\bm{x}^{(3)},\bm{z}^{(3)}$ such that $\sum_{l\in\textbf{left}(s)}z_{l}^{(3)}=0.5,~\sum_{l\in\textbf{left}(s^{\prime})}z_{l}^{(3)}=0.5,~\sum_{l\in\textbf{right}(s)}z_{l}^{(3)}=0,~\sum_{l\in\textbf{right}(s^{\prime})}z_{l}^{(3)}=0,~x_{V(s)C(s)}^{(3)}=0.5,~x_{V(s^{\prime})C(s^{\prime})}^{(3)}=0.5$ . By inspection, $\bm{x}^{(3)},\bm{z}^{(3)}\in\tilde{Q}^{misic}(s,s^{\prime})$ .

Since $s^{\prime}$ is the greater split, we have that $\textbf{below}(s^{\prime})\supseteq\textbf{left}(s)\cup\textbf{left}(s^{\prime})$ . Furthermore, $\textbf{left}(s)\cap\textbf{left}(s^{\prime})=\emptyset$ , since neither node is a child of the other. It follows that the solution $\bm{x}^{(3)},\bm{z}^{(3)}\notin\tilde{Q}^{expset}(s,s^{\prime})$ since

[TABLE]

For this case, there is also another fractional solution $\bm{x}^{(4)},\bm{z}^{(4)}$ such that $\sum_{l\in\textbf{left}(s)}z_{l}^{(4)}=0,~\sum_{l\in\textbf{left}(s^{\prime})}z_{l}^{(4)}=0,~\sum_{l\in\textbf{right}(s)}z_{l}^{(4)}=0.5,~\sum_{l\in\textbf{right}(s^{\prime})}z_{l}^{(4)}=0.5,~x_{V(s)C(s)}^{(4)}=0.5,~x_{V(s^{\prime})C(s^{\prime})}^{(4)}=0.5$ , which can be treated with a very similar argument. \Halmos
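The first case above can be checked numerically on a small instance. The sketch below uses a hypothetical three-leaf tree in which $s$ is the left child of $s^{\prime}$ and both split on the same feature. It evaluates the fractional point $\bm{z}=(0,0.5,0.5)$, $x_{V(s)C(s)}=x_{V(s^{\prime})C(s^{\prime})}=0.5$ against constraints of the assumed form "leaf sum $\leq x$" and "leaf sum $\leq 1-x$". The tree, the constraint forms, and the function name are illustrative assumptions.

```python
from fractions import Fraction

half = Fraction(1, 2)

# Hypothetical toy tree: leaves 0 and 1 under s, leaf 2 to the right of s'.
LEFT = {"s": {0}, "sp": {0, 1}}
RIGHT = {"s": {1}, "sp": {2}}
BELOW = {"s": {0}, "sp": {0, 1}}
ABOVE = {"s": {1, 2}, "sp": {2}}

def max_violation(x, z, lower_sets, upper_sets):
    """Largest amount by which any constraint lhs exceeds its rhs
    (a value <= 0 means every constraint holds)."""
    gaps = [x["s"] - x["sp"]]                                   # x_s <= x_s'
    for s in ("s", "sp"):
        gaps.append(sum(z[l] for l in lower_sets[s]) - x[s])
        gaps.append(sum(z[l] for l in upper_sets[s]) - (1 - x[s]))
    return max(gaps)

x1 = {"s": half, "sp": half}
z1 = [Fraction(0), half, half]

sums_to_one = sum(z1) == 1
misic_gap = max_violation(x1, z1, LEFT, RIGHT)
expset_gap = max_violation(x1, z1, BELOW, ABOVE)
```

Here `misic_gap` is zero while `expset_gap` is positive: the violated row is the one for $\textbf{above}(s)$, whose leaf sum is $1$ against a right-hand side of $0.5$.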

11 Proof of Proposition 5.5

Proof 11.1

Let $\bm{z},\bm{x}$ be any feasible solution to $Q^{misic}\cap(\{0,1\}^{p}\times\mathbb{R}^{1+p})$ , where $\bm{x}$ is restricted to a binary lattice. We will show that $\bm{z},\bm{x}$ is feasible for $Q^{elbow}\cap(\{0,1\}^{p}\times\mathbb{R}^{1+p})$ .

In particular, we will show that $\bm{z},\bm{x}$ satisfies 9a. Consider two splits $s$ and $s^{\prime}$ covered by the constraint 9a, where $s^{\prime}\in\textbf{right\_parent}(s)$ , that is, $s^{\prime}$ is above and to the right of $s$ in the tree. We will investigate the different feasible values for $x_{V(s)C(s)},x_{V(s^{\prime})C(s^{\prime})}$ , specifically $x_{V(s)C(s)},x_{V(s^{\prime})C(s^{\prime})}\in\{(0,0),(0,1),(1,1)\}$ . Note that $x_{V(s)C(s)}=1,x_{V(s^{\prime})C(s^{\prime})}=0$ is not a feasible solution since it violates the constraint 2c, $x_{ij}\leq x_{ij+1}$ .

Suppose $x_{V(s)C(s)}=0$ and $x_{V(s^{\prime})C(s^{\prime})}=0$ . From constraint 2a, $\sum_{l\in\textbf{left}(s^{\prime})}z_{l}\leq 0$ . Since $s^{\prime}\in\textbf{right\_parent}(s)$ , we have $\textbf{right}(s)\subset\textbf{left}(s^{\prime})$ . Therefore, $\sum_{l\in\textbf{right}(s)}z_{l}\leq 0$ . As a result, 9a is satisfied since:

[TABLE]

Suppose $x_{V(s)C(s)}=0,x_{V(s^{\prime})C(s^{\prime})}=1$ , then $\sum_{l\in\textbf{right}(s)}z_{l}\leq x_{V(s^{\prime})C(s^{\prime})}-x_{V(s)C(s)}=1$ is immediately satisfied. Suppose $x_{V(s)C(s)}=1,x_{V(s^{\prime})C(s^{\prime})}=1$ , then from 2b, $\sum_{l\in\textbf{right}(s)}z_{l}\leq 1-x_{V(s)C(s)}=0$ , therefore 9a is also satisfied.

It follows that $Q^{elbow}\subseteq Q^{misic}$ , since $Q^{elbow}$ only has constraints added in addition to the constraints in $Q^{misic}$ . \Halmos
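The three-case argument above can be mirrored by a short enumeration. The sketch below uses a hypothetical three-leaf tree in which $s$ is the left child of $s^{\prime}$ (so $s^{\prime}\in\textbf{right\_parent}(s)$ and $\textbf{right}(s)\subset\textbf{left}(s^{\prime})$), and takes 9a in the form $\sum_{l\in\textbf{right}(s)}z_{l}\leq x_{V(s^{\prime})C(s^{\prime})}-x_{V(s)C(s)}$ that appears in the case analysis. The tree and the function names are assumptions for illustration.

```python
from itertools import product

# Hypothetical toy tree: leaf 0 = left(s), leaf 1 = right(s), leaf 2 = right(s').
RIGHT_S = (1,)

def misic_feasible(xs, xsp, z):
    return (
        sum(z) == 1
        and xs <= xsp                 # 2c: ordering of split variables
        and z[0] <= xs                # 2a for s:  left(s)  = {0}
        and z[1] <= 1 - xs            # 2b for s:  right(s) = {1}
        and z[0] + z[1] <= xsp        # 2a for s': left(s') = {0, 1}
        and z[2] <= 1 - xsp           # 2b for s': right(s') = {2}
    )

def elbow_9a(xs, xsp, z):
    return sum(z[l] for l in RIGHT_S) <= xsp - xs

# Map each feasible (x_s, x_s') pair to whether 9a holds there.
checks = {}
for xs, xsp, *z in product((0, 1), repeat=5):
    if misic_feasible(xs, xsp, z):
        checks[(xs, xsp)] = elbow_9a(xs, xsp, z)
```

The keys of `checks` are exactly the pairs $(0,0)$, $(0,1)$ and $(1,1)$ considered in the proof, and 9a holds for each of them on this instance.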

12 Proof Theorem 5.3

Proof 12.1

For the case when $d=1$ , constraint (8b) can be rearranged:

[TABLE]

This is because in the single feature case, the sets $\textbf{above}(s)$ and $\textbf{below}(s)$ are complementary. Note that this isn’t the case with multiple features, as generally there will be leaves in the tree that do not split on a feature. This implies $\sum_{l\in\textbf{below}(s)}z_{l}=x_{V(s)C(s)}~{}\forall s~{}\in~{}\textbf{splits}(t)$ for $d=1$ .
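The rearrangement can be spelled out as follows. This is a sketch that assumes (8a) and (8b) read $\sum_{l\in\textbf{below}(s)}z_{l}\leq x_{V(s)C(s)}$ and $\sum_{l\in\textbf{above}(s)}z_{l}\leq 1-x_{V(s)C(s)}$ , and that uses $\sum_{l=1}^{p}z_{l}=1$ together with the complementarity of $\textbf{above}(s)$ and $\textbf{below}(s)$ :

$$
\begin{aligned}
\sum_{l\in\textbf{above}(s)}z_{l}\leq 1-x_{V(s)C(s)}
&\iff 1-\sum_{l\in\textbf{below}(s)}z_{l}\leq 1-x_{V(s)C(s)}\\
&\iff \sum_{l\in\textbf{below}(s)}z_{l}\geq x_{V(s)C(s)}.
\end{aligned}
$$

Combining the last inequality with (8a) gives the equality stated above.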

We now define the matrix $A$ by ordering the constraints in a specific way. We order the rows corresponding to variables $x_{j}$ from smallest to largest according to the size of the threshold to which they correspond. Furthermore, suppose leaves $z_{lt}$ are labelled in increasing order, so that $z_{1t}$ is the leaf corresponding to the smallest threshold for tree $t$ , while $z_{2t}$ is the next smallest. In this case, $A$ will take the following form:

[TABLE]

In the top right, we have the negative identity matrix, corresponding to each $x$ . In the bottom right, we have the constraints $x_{ij}\leq x_{ij+1}$ . In the top left, we have blocks of rows, each corresponding to leaves in a tree. In the example given, columns 1-3 correspond to leaves from tree 1 and columns 4-6 from tree 2. Due to the construction of the set below where $\textbf{below}(s_{ij+1})=\textbf{below}(s_{ij})\cup\textbf{left}(s_{ij+1})$ for subsequent ordered splits in the tree, these rows have a lower triangular structure.

To prove any subset of this matrix is totally unimodular, we use the following lemma, originally from Ghouila-Houri (1962), presented as Theorem 19.3 in Schrijver (1998):

Lemma 12.2

(Ghouila-Houri) A matrix $A$ with entries $0$, $+1$, or $-1$ is totally unimodular if and only if each collection of columns of $A$ can be split into two parts so that the sum of the columns in one part minus the sum of the columns in the other part is a vector with entries only $0$, $+1$, and $-1$.

To construct these sets, we allocate the first column (available in the subset of columns) of each tree to group $1$. We then alternate through the remaining columns, assigning every second (available) column in the tree to group $-1$. This ensures that the sum of each row is $1$ or $0$ for the left columns corresponding to the $z_{l}$ leaf variables available in the subset, due to the consecutive ones property of the lower triangular matrix. We assign the remaining available columns (corresponding to $x$ variables) to group $1$. The $-1$ from the identity matrix, if present in the subset, will reduce the sum to $-1$ or $0$. For the lower half of the matrix corresponding to $x_{ij}\leq x_{ij+1}$ , there is at most one $1$ and one $-1$. Since these are assigned to the same group, the sum of the columns for these rows is either $0$, $+1$, or $-1$. As a result, the total sum of all columns for all subsets is either $0$, $+1$, or $-1$. This assignment is illustrated below for a sample matrix where $\sigma$ corresponds to the group assignment.

[TABLE]
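The column-partition condition of Lemma 12.2 can be tested by brute force on small matrices. The sketch below does so for a toy matrix with the block structure described above, for one tree with three leaves and two splits on a single feature. The matrix entries, the column ordering $[z_{1},z_{2},z_{3},x_{1},x_{2}]$, and the function name are assumptions for illustration, not the paper's sample matrix.

```python
from itertools import combinations, product

def ghouila_houri(A):
    """Return True if every subset of columns admits a +1/-1 assignment
    whose signed column sum has all entries in {-1, 0, 1}."""
    n_cols = len(A[0])
    for size in range(1, n_cols + 1):
        for cols in combinations(range(n_cols), size):
            ok = any(
                all(
                    abs(sum(sg * row[c] for sg, c in zip(signs, cols))) <= 1
                    for row in A
                )
                for signs in product((1, -1), repeat=size)
            )
            if not ok:
                return False
    return True

# Toy matrix (assumed): columns are [z1, z2, z3, x1, x2].
A = [
    [1, 0, 0, -1, 0],   # z1 - x1       (below(s1) = {1})
    [1, 1, 0, 0, -1],   # z1 + z2 - x2  (below(s2) = {1, 2})
    [0, 0, 0, 1, -1],   # x1 - x2       (ordering constraint)
]

# Alternating assignment on the z columns, group 1 for the x columns.
sigma = [1, -1, 1, 1, 1]
signed_sum = [sum(s * a for s, a in zip(sigma, row)) for row in A]

toy_ok = ghouila_houri(A)
counterexample_ok = ghouila_houri([[1, 1], [-1, 1]])
```

The enumeration is exponential in the number of columns, so it is only meant as a sanity check on small instances. The second matrix has determinant $2$ and fails the condition, as expected.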

13 Proof of Lemma 5.7

Proof 13.1

We will prove the first statement; the proof of the second statement is almost identical. To reduce the notation, assume the sets below are intersected with $Q^{misic}$ .

[TABLE]

The second-to-last implication follows because $\textbf{right}(s)\subset\textbf{below}(s^{\prime})$ and $\textbf{right}(s)\subset\textbf{above}(s)$ . The last implication holds because $\textbf{below}(s^{\prime})\cup\textbf{above}(s)=[p]$ , which, combined with $\sum_{l=1}^{p}z_{l}=1$ , gives $\sum_{l\in\textbf{below}(s^{\prime})\setminus\textbf{right}(s)}z_{l}+\sum_{l\in\textbf{above}(s)\setminus\textbf{right}(s)}z_{l}+\sum_{l\in\textbf{right}(s)}z_{l}=1$ . \Halmos

Bibliography


Amram M, Dunn J, Zhuo YD (2022) Optimal policy trees. Machine Learning 1–28.
Anderson R, Huchette J, Tjandraatmadja C, Vielma JP (2018) Strong convex relaxations and mixed-integer programming formulations for trained neural networks. arXiv preprint arXiv:1811.01988.
Ashlagi I, Bingaman A, Burq M, Manshadi V, Gamarnik D, Murphey C, Roth AE, Melcher ML, Rees MA (2018) Effect of match-run frequencies on the number of transplants and waiting times in kidney exchange. American Journal of Transplantation 18(5):1177–1186.
Balas E (1985) Disjunctive programming and a hierarchy of relaxations for discrete optimization problems. SIAM Journal on Algebraic Discrete Methods 6(3):466–486.
Bent R, Van Hentenryck P (2007) Waiting and relocation strategies in online stochastic vehicle routing. IJCAI, 1816–1821.
Bergman D, Huang T, Brooks P, Lodi A, Raghunathan AU (2022) Janos: an integrated predictive and prescriptive modeling framework. INFORMS Journal on Computing 34(2):807–816.
Bertsimas D, Dunn J, Mundru N (2019) Optimal prescriptive trees. INFORMS Journal on Optimization 1(2):164–183.
Bertsimas D, O’Hair A, Relyea S, Silberholz J (2016) An analytics approach to designing combination chemotherapy regimens for cancer. Management Science 62(5):1511–1531.


