Stochastic Lipschitz Dynamic Programming

Shabbir Ahmed; Filipe Goulart Cabral; Bernardo Freitas Paulo da Costa

arXiv:1905.02290·math.OC·May 24, 2019·Math. Program.

TL;DR

This paper introduces a novel algorithm for multistage stochastic MILPs that uses Lipschitz cuts to improve lower bounds, demonstrated through case studies comparing with existing methods like SDDP and SDDiP.

Contribution

It develops a new Lipschitz cut-based approach for stochastic MILPs that maintains problem class integrity and enhances solution quality.

Findings

1. The proposed algorithm effectively approximates non-convex cost functions.

2. Application to case studies shows competitive performance with existing methods.

3. Lipschitz cuts derived from Augmented Lagrangian Duality are MILP representable.

Abstract

We propose a new algorithm for solving multistage stochastic mixed integer linear programming (MILP) problems with complete continuous recourse. In a similar way to cutting plane methods, we construct nonlinear Lipschitz cuts to build lower approximations for the non-convex cost-to-go functions. An example of such a class of cuts are those derived using Augmented Lagrangian Duality for MILPs. The family of Lipschitz cuts we use is MILP representable, so that the introduction of these cuts does not change the class of the original stochastic optimization problem. We illustrate the application of this algorithm on two simple case studies, comparing our approach with the convex relaxation of the problems, for which we can apply SDDP, and for a discretized approximation, applying SDDiP.

Tables

Table 1: Results for an 8-stage non-convex problem

	SB	SLDP tents	SLDP ALD	SDDiP 0.1	SDDiP 0.01
LB	1.167	3.073	3.085	3.420	2.370
UB	3.453	3.320	3.313	3.823	3.490
time (s)	12	558	605	1994	3317

Table 2: Results for discrete first stage

N	2	3	6
Objective	-57.000	-59.333	-61.222
SB LB	-58.096	-61.961	-65.468
ALD LB	-57.000	-59.333	-61.274
remaining (%)	0	0	1.27
SB time (s)	0.98	1.60	5.57
ALD time (s)	19.4	77.8	125.0
ALD/SB time	19.8	48.6	22.4

Table 3: Results for continuous first stage

N	2	3	6
Objective	-57.000	-59.333	-61.222
SB LB	-58.095	-61.961	-65.541
ALD LB	-57.000	-59.495	-61.579
remaining (%)	0	6.27	8.31
SB time (s)	0.63	1.30	5.22
ALD time (s)	260	532	758
ALD/SB time	413	409	145

Equations

$$f(x)=\begin{cases}x & :\ 0\leq x\leq 1,\\ 1 & :\ 1\leq x\leq 2,\\ 3-x & :\ 2\leq x\leq 3,\end{cases}$$

$$C_{\rho,\overline{x}}(x)=f(\overline{x})-\rho\cdot\|x-\overline{x}\|.$$

$$C_{\rho,\overline{x}}(x)=f(\overline{x})-\rho\|x-\overline{x}\|\leq f(x)+L\cdot\|x-\overline{x}\|-\rho\|x-\overline{x}\|\leq f(x),$$

$$g(x)=x-\lfloor x\rfloor=\min_{y\in\mathbb{Z},\ y\leq x}\ x-y,$$

$$\begin{array}{rl}\nu=\underset{x}{\min}&f(x)+g(x)\\ \text{s.t.}&x\in X,\end{array}$$

$$\begin{array}{rl}\nu^{k}=\underset{x}{\min}&f(x)+g^{k}(x)\\ \text{s.t.}&x\in X,\end{array}$$

$$g^{k+1}(x)=\max\left\{g^{k}(x),\ g(x^{k})-\rho\cdot\|x-x^{k}\|\right\}.$$

$$\lim_{k\in\mathcal{K}}g^{k}(x^{k})=g(x^{*}).$$

$$|g(x^{*})-g^{k}(x^{k})|\leq|g(x^{*})-g(x^{k})|+|g(x^{k})-g^{k}(x^{k})|\leq L\cdot\|x^{*}-x^{k}\|+g(x^{k})-g^{k}(x^{k}).$$

$$g^{k}(x^{k})\geq g^{k}(x^{j})-\rho\cdot\|x^{k}-x^{j}\|.$$

$$g(x^{k})-g^{k}(x^{k})\leq g(x^{k})-g(x^{j})+\rho\cdot\|x^{k}-x^{j}\|\leq L\cdot\|x^{k}-x^{j}\|+\rho\cdot\|x^{k}-x^{j}\|.$$

$$\nu^{k}\leq\nu\leq f(x^{k})+g(x^{k})\leq f(x^{k})+g^{k}(x^{k})+[g(x^{k})-g^{k}(x^{k})]\leq\nu^{k}+\varepsilon,$$

$$f(x^{k})+g(x^{k})\geq\nu\geq\nu^{k}=f(x^{k})+g^{k}(x^{k}).$$

$$f(x^{*})+g(x^{*})\geq\nu\geq\lim_{k\in\mathcal{K}}\nu^{k}=f(x^{*})+g(x^{*}),$$

$$\begin{array}{rl}g(b)=\min_{x\in X}&c^{\top}x\\ \text{s.t.}&Ax=b\end{array}$$

$$g^{AL}(b;\lambda,\rho)=\min_{x\in X}\ c^{\top}x-\lambda^{\top}(Ax-b)+\rho\cdot\psi(Ax-b).$$

$$g^{AL}(b;\lambda,\rho)\leq g(b).$$

$$\sup_{\lambda}\ g^{AL}(b_{0};\lambda,\rho^{*})=g(b_{0}).$$

$$\begin{aligned}g(b)&\geq g^{AL}(b;\lambda,\rho)\\ &=\min_{x\in X}\ c^{\top}x-\lambda^{\top}(Ax-b)+\rho\|Ax-b\|\\ &=\min_{x\in X}\ c^{\top}x-\lambda^{\top}(Ax-b_{0}+b_{0}-b)+\rho\|Ax-b_{0}+b_{0}-b\|.\end{aligned}$$

$$\|Ax-b\|\geq\|Ax-b_{0}\|-\|b-b_{0}\|,$$

$$\begin{aligned}g(b)&\geq\min_{x\in X}\left[c^{\top}x-\lambda^{\top}(Ax-b_{0})+\rho\|Ax-b_{0}\|\right]-\lambda^{\top}(b_{0}-b)-\rho\|b_{0}-b\|\\ &=g^{AL}(b_{0};\lambda,\rho)-\lambda^{\top}(b_{0}-b)-\rho\|b_{0}-b\|.\end{aligned}$$

$$ALC_{b_{0},\lambda,\rho}(b):=g^{AL}(b_{0};\lambda,\rho)-\lambda^{\top}(b_{0}-b)-\rho\|b_{0}-b\|.$$

$$\begin{array}{rl}\nu(b)=\min_{x}&f(x)\\ \text{s.t.}&(x,b)\in P.\end{array}$$

$$S(b)\subseteq S(b^{*})+r\cdot\|b-b^{*}\|\cdot\Delta,$$

$$\operatorname{dom}(\nu)=\operatorname{proj}_{b}(P).$$

Full text

1 Introduction

Non-convex stochastic programming problems arise naturally in models that consider binary or integer variables, since such variables allow the representation of a wide variety of constraints at the cost of inducing non-convex feasible sets. Recent advances in commercial solvers have made the solution of many mixed integer linear programming problems both possible and more robust, broadening the interest of the academic community in studying and developing algorithms for mixed integer stochastic programming. Applications such as unit commitment [Tahanan et al., 2015], [Costley et al., 2017], [Knueven et al., 2018], [Ackooij et al., 2018], optimal investment decisions [Singh et al., 2009], [Conejo et al., 2016], and power system operational planning [Cerisola et al., 2012], [Thome et al., 2013], [Hjelmeland et al., 2019] have driven the development of new algorithms for problems that do not fit into the convex optimization framework.

Before the development of the MIDAS [Philpott et al., 2016] and the SDDiP [Zou et al., 2018] algorithms, the solution of large-scale multistage stochastic programming problems with theoretical guarantees was restricted to convex problems, using algorithms such as Nested Cutting Plane [Glassey, 1973], Progressive Hedging [Rockafellar and Wets, 1991] and SDDP [Pereira and Pinto, 1991]. For a recent reference, see [Birge and Louveaux, 2011] and [Shapiro et al., 2014]. Most of those algorithms build convex underapproximations of the cost-to-go function at each node of the scenario tree, or at each stage in the stagewise independent setting, to solve the stochastic convex program. Even the SDDiP algorithm relies on convex underapproximations, which are shown to be sufficient for convergence thanks to the binary discretization of the state variables and the tightness property of the Lagrangian cuts at binary states. The MIDAS algorithm is based on a different idea, using step functions to approximate monotone non-convex cost-to-go functions.

The aim of this paper is to propose a new algorithm for mixed integer multistage stochastic programming problems, which does not discretize the state variable and uses non-linear cuts to capture non-convexities in the future cost function. Inspired by the exactness results from [Feizollahi et al., 2017], we build non-convex underapproximations of the expected cost-to-go function, whose basic pieces are augmented Lagrangian cuts. They can be calculated from augmented Lagrangian duality, which [Feizollahi et al., 2017] have shown to be exact for mixed-integer linear problems when the augmentation function is, for example, the $L^{1}$ -norm. If the original problem already had pure binary state variables, then, as it was shown in [Zou et al., 2018], Lagrangian cuts are already tight, and we can use them in our algorithm by setting the non-linear term of the augmented Lagrangian to zero.

As we will see, even a countable number of these cuts cannot describe exactly the value function of a mixed-integer linear program, even when that function is continuous. However, the $L^{1}$ cuts have sufficient structure for our purposes, in two very important ways: first, they are Lipschitz functions, which still yield global estimates from local behaviour, even if those are weaker than what’s typical with convexity. The Lipschitz estimates will be crucial for our convergence arguments, and the resulting algorithm is therefore called Stochastic Lipschitz Dynamic Programming (SLDP).

Second, it is possible to represent such $L^{1}$ cuts using binary variables and a system of linear equalities and inequalities. Therefore, we do not leave the class of mixed-integer linear problems when incorporating $L^{1}$ -augmented Lagrangian cuts in each node/stage of the stochastic problem. Under a continuous recourse hypothesis for each stage, it is possible to prove that the expected cost-to-go functions of stochastic mixed-integer linear problems are Lipschitz, and therefore our algorithm can be applied directly.

We have organized this paper as follows: the next section motivates the SLDP algorithm with a study of Lipschitz optimal value functions and a decomposition algorithm for deterministic Lipschitz optimization. Then, we present the SLDP algorithm for both the full-tree and sampled scenario cases, and prove their convergence. Finally, we illustrate our results with a case study.

2 Lipschitz value functions

Our SLDP algorithm uses Lipschitz cuts to approximate the cost-to-go function of a stochastic MILP in a similar fashion to the nested cutting planes algorithm for stochastic linear programs [Ruszczynski and Shapiro, 2003]. That is, we also compute lower approximations that iteratively improve the cost-to-go approximation in a neighborhood of the optimal solution.

The purpose of this section is to motivate the use of Lipschitz cuts for nonconvex functions, especially for optimal value functions of mixed integer problems. We start with the definition and basic properties of Lipschitz functions as well as the convergence proof of an algorithm for Lipschitz optimization that employs special Lipschitz cuts called reverse norm cuts. The idea of reverse norm cuts also appears in Global Lipschitz Optimization (GLO), see [Mayne and Polak, 1984], [Meewella and Mayne, 1988] and more recently in [Malherbe and Vayatis, 2017]. However, our aim is not to develop a new algorithm for GLO but to explain the reverse norm cut algorithm and establish some results to extend it for stochastic multistage MILP programs.

Then, we recall the definition of Augmented Lagrangian duality and the exactness results in [Feizollahi et al., 2017] and argue how they can be used to construct augmented Lagrangian cuts in place of the reverse norm cuts. This provides a unified framework, generalizing both nonlinear reverse norm cuts and the linear Lagrangian cutting algorithms in the continuous setting of [Thome et al., 2013] or the binary setting of SDDiP [Zou et al., 2018].

Finally, we show how this theory applies to MILP value functions.

2.1 Lipschitz functions and reverse norm cuts

Let us recall the definition and some results for Lipschitz functions. We say that $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ is a Lipschitz function with constant $L>0$ if for all $x,y\in\mathbb{R}^{d}$ we have $|f(x)-f(y)|\leq L\|x-y\|$ . Note that the linear function $f(x)=a^{\top}x$ is Lipschitz with constant $L=\|a\|_{*}$ , where $\|\cdot\|_{*}$ indicates the dual norm. Let $f_{1}$ and $f_{2}$ be Lipschitz functions with constants $L_{1}$ and $L_{2}$ , respectively. It can be shown that

• the maximum and minimum of $f_{1}$ and $f_{2}$ are Lipschitz functions with constant $\max\{L_{1},L_{2}\}$ ; and

• the sum of $f_{1}$ and $f_{2}$ is a Lipschitz function with constant $L_{1}+L_{2}$ .

Consider now an example of a non-convex Lipschitz function. Let $f$ be the piecewise linear function defined on $[0,3]$ by

$$f(x)=\begin{cases}x & :\ 0\leq x\leq 1,\\ 1 & :\ 1\leq x\leq 2,\\ 3-x & :\ 2\leq x\leq 3,\end{cases}$$

see figure 1 for an illustration. Note that $f$ can be written as the minimum of linear functions, $f(x)=\min\{x,1,3-x\}$ , so $f$ is a Lipschitz function with constant $L=1$ .

It can also be seen from figure 1 that the tightest lower convex approximation $g$ for $f$ over $[0,3]$ is the zero function. This implies that there will always be a gap between linear cuts and the original function $f$ , since the best they could do is reproduce the Lagrangian relaxation $g$ . In other words, there is no way to use lower linear cuts to close the gap with $f$ , which motivates the introduction of reverse norm cuts.

Definition 1.

A reverse norm cut for a function $f$ , centered at $\overline{x}$ and with parameter $\rho$ , is the function

$$C_{\rho,\overline{x}}(x)=f(\overline{x})-\rho\cdot\|x-\overline{x}\|.$$

Note that $C_{\rho,\overline{x}}(x)$ is a Lipschitz function with constant $\rho$ , and if $C_{1}(x)$ and $C_{2}(x)$ are reverse norm cuts with parameters $\rho_{1}$ and $\rho_{2}$ then $\max\{C_{1},C_{2}\}(x)$ is a Lipschitz function with constant $\max\{\rho_{1},\rho_{2}\}$ .

Suppose that $f$ is a Lipschitz function with constant $L$ . If $\rho$ is greater than or equal to $L$ , then all functions $C_{\rho,\overline{x}}(x)$ are valid reverse norm cuts, for any center point $\overline{x}$ . Indeed,

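$$C_{\rho,\overline{x}}(x)=f(\overline{x})-\rho\|x-\overline{x}\|\leq f(x)+L\cdot\|x-\overline{x}\|-\rho\|x-\overline{x}\|\leq f(x), \tag{1}$$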

for all $x\in\mathbb{R}^{d}$ .

A fundamental difference between reverse norm cuts and cutting planes is that, in general, one cannot guarantee that a piecewise linear Lipschitz function $f$ can always be represented as the maximum of a finite number of reverse norm cuts. Indeed, as can be seen in figure 2a, one would need all reverse norm cuts centered at every $\overline{x}\in[1,2]$ to represent the non-convex function $f(x)$ .

Sometimes, it is possible to obtain a reverse norm cut centered at a point $\overline{x}$ with Lipschitz constant $\rho$ less than $L$ , but such a reverse norm cut may not be a lower approximation elsewhere if translated in the domain. In figure 2b, we show a tighter cut with $\rho=2/3$ centered at $\overline{x}=1.5$ , but this lower $\rho$ cannot be used for any other point in the domain of $f$ . If the constant $\rho$ is at least $L$ , the corresponding reverse norm cut is a lower approximation of $f$ when centered at any point $\overline{x}$ , by the same deduction made in (1).
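The behaviour of reverse norm cuts on the example $f(x)=\min\{x,1,3-x\}$ can be checked numerically. The following Python sketch is only an illustration: the helper names `reverse_norm_cut` and `is_lower_bound` and the grid resolution are assumptions introduced here, and validity is tested on a finite grid rather than proved.

```python
def f(x):
    """Piecewise linear example f(x) = min{x, 1, 3 - x} on [0, 3]."""
    return min(x, 1.0, 3.0 - x)


def reverse_norm_cut(center, rho):
    """Return the cut C(x) = f(center) - rho * |x - center| as a function."""
    value = f(center)
    return lambda x: value - rho * abs(x - center)


def is_lower_bound(cut, grid, tol=1e-12):
    """Check cut(x) <= f(x) at every grid point (up to rounding)."""
    return all(cut(x) <= f(x) + tol for x in grid)


# Uniform grid on [0, 3]; the step size is an arbitrary choice.
grid = [3.0 * i / 300 for i in range(301)]

# With rho equal to the Lipschitz constant L = 1, every center gives a valid cut.
all_valid = all(is_lower_bound(reverse_norm_cut(c, 1.0), grid) for c in grid)

# With rho = 2/3 the cut is valid when centered at 1.5 ...
tight_ok = is_lower_bound(reverse_norm_cut(1.5, 2.0 / 3.0), grid)
# ... but the same rho fails when the center is moved to 0.5.
moved_ok = is_lower_bound(reverse_norm_cut(0.5, 2.0 / 3.0), grid)

print(all_valid, tight_ok, moved_ok)
```

The last check fails because the cut centered at $0.5$ with $\rho=2/3$ is strictly positive at $x=0$ , where $f$ vanishes.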

Finally, observe that if the function $f$ is not Lipschitz then reverse norm cuts may need arbitrarily large parameters, depending on their center point. Indeed, let $g(x)$ be the fractional-part function:

$$g(x)=x-\lfloor x\rfloor=\min_{y\in\mathbb{Z},\ y\leq x}\ x-y,$$

which is both piecewise linear and the optimal value function of a MILP, see figures 3a and 3b.

As the point $\overline{x}$ approaches $1$ from the left, the opening of the reverse norm cuts centered at $\overline{x}$ goes to zero, so their parameter $\rho$ must go to infinity.
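A short numerical sketch makes this blow-up concrete. It estimates, on a grid over $[0,2]$ , the smallest $\rho$ for which the reverse norm cut of the fractional-part function centered at $\overline{x}=1-\delta$ stays below the function. The helper `required_rho` and the grid are assumptions made for this illustration; the binding point is $x=1$ , which gives roughly $(1-\delta)/\delta$ .

```python
import math


def frac(x):
    """Fractional-part function g(x) = x - floor(x)."""
    return x - math.floor(x)


def required_rho(center, grid):
    """Smallest rho such that frac(center) - rho*|x - center| <= frac(x) on the grid."""
    return max(
        (frac(center) - frac(x)) / abs(x - center)
        for x in grid
        if x != center
    )


# Uniform grid on [0, 2] with step 0.001; it contains the jump point x = 1.
grid = [2.0 * i / 2000 for i in range(2001)]

# Centers approaching 1 from the left.
deltas = (0.1, 0.01, 0.001)
rhos = [required_rho(1.0 - d, grid) for d in deltas]

for d, r in zip(deltas, rhos):
    print(f"delta = {d}: required rho is about {r:.1f}")
```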

2.2 Optimization with reverse norm cuts

To develop the intuition for the SLDP method that will be presented in section 3, we introduce the reverse cut method for a deterministic problem and its convergence proof. We hope that this simpler setting helps the reader understand some of the basic mechanisms of the proof without the notational burden of the general stochastic multistage setting.

The deterministic optimization problem that will be investigated is

$$\begin{array}{rl}\nu=\underset{x}{\min}&f(x)+g(x)\\ \text{s.t.}&x\in X,\end{array} \tag{2}$$

where $f(x)$ is a ‘simple’ function and $g(x)$ is a ‘complex’ one, so while it is relatively cheap to optimize $f$ , we only assume that it is possible to evaluate $g$ at given points. For example, this can be the stage problem in the nested decomposition of a multistage problem, where $f$ is the immediate cost and $g$ the average cost-to-go.

Assume that $g(x)$ is a Lipschitz function with constant $L$ and $X$ is a compact subset of $\mathbb{R}^{d}$ . The reverse cut method, presented in algorithm 1, similarly to decomposition methods, iteratively approaches $g(x)$ by reverse norm cuts centered at points obtained along the iterations of the algorithm. Those points are solutions of an approximated optimization problem obtained from (2), where $g(x)$ is replaced by the current Lipschitz approximation $g^{k}(x)$ :

$$\begin{array}{rl}\nu^{k}=\underset{x}{\min}&f(x)+g^{k}(x)\\ \text{s.t.}&x\in X.\end{array} \tag{3}$$

The Lipschitz approximation $g^{k+1}(\cdot)$ is the maximum between the reverse norm cut centered at $x^{k}$ and the previous approximation $g^{k}(\cdot)$ :

$$g^{k+1}(x)=\max\left\{g^{k}(x),\ g(x^{k})-\rho\cdot\|x-x^{k}\|\right\}.$$

By induction, all $g^{k}(\cdot)$ are Lipschitz functions with constant $\rho$ , since each is the maximum of reverse norm cuts with that constant.

With this observation in mind, we will prove the convergence of algorithm 1. The compactness assumption of $X$ is used only to ensure the existence of a cluster point for the sequence of trial points $\{x^{k}\}_{k\in\mathbb{N}}$ .
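The reverse cut method described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the feasible set $X$ is replaced by a finite grid so that the approximate problem can be solved by enumeration, the initial approximation is a constant lower bound `g_lb` (a hypothetical parameter), and the stopping test $g(x^{k})-g^{k}(x^{k})\leq\varepsilon$ is the one used in the convergence analysis.

```python
def reverse_cut_method(f, g, X, rho, g_lb, eps=1e-3, max_iter=1000):
    """Minimize f + g over the finite set X using reverse norm cuts on g.

    Returns (x_k, lower_bound, upper_bound, iterations).
    """
    cuts = []  # pairs (center x_j, value g(x_j))

    def g_k(x):
        # Current approximation: maximum of g_lb and all cuts collected so far.
        best = g_lb
        for center, value in cuts:
            best = max(best, value - rho * abs(x - center))
        return best

    for k in range(max_iter):
        # Solve the approximate problem by enumeration over the grid.
        x_k = min(X, key=lambda x: f(x) + g_k(x))
        lower = f(x_k) + g_k(x_k)
        gap = g(x_k) - g_k(x_k)
        if gap <= eps:
            return x_k, lower, f(x_k) + g(x_k), k
        # Add the reverse norm cut centered at the trial point.
        cuts.append((x_k, g(x_k)))
    raise RuntimeError("no convergence within max_iter iterations")


# Toy instance (an assumption for illustration): f is convex and cheap,
# g is the non-convex Lipschitz example min{x, 1, 3 - x} with L = 1.
def f(x):
    return 0.5 * abs(x - 2.0)


def g(x):
    return min(x, 1.0, 3.0 - x)


X = [3.0 * i / 150 for i in range(151)]
x_star, lower, upper, iters = reverse_cut_method(f, g, X, rho=1.0, g_lb=0.0)
print(x_star, lower, upper, iters)
```

On a finite grid each trial point with a gap above `eps` receives a tight cut and cannot be selected again with the same gap, so the loop stops after at most `len(X)` iterations.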

Lemma 1.

Suppose Algorithm 1 does not meet the stopping criteria for problem (2). Let $x^{*}\in X$ be any cluster point of the sequence of trial points $\{x^{k}\}_{k\in\mathbb{N}}$ and let $\mathcal{K}$ be the indices of a subsequence that converges to $x^{*}$ . Then $\{g^{k}(x^{k})\}_{k\in\mathcal{K}}$ converges to $g(x^{*})$ :

$$\lim_{k\in\mathcal{K}}g^{k}(x^{k})=g(x^{*}).$$

Proof.

We will bound $|g(x^{*})-g^{k}(x^{k})|$ by the distance between the subsequence $\{x^{k}\}_{k\in\mathcal{K}}$ and the cluster point $x^{*}$ . Using the triangular inequality and the Lipschitz definition, we get

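$$|g(x^{*})-g^{k}(x^{k})|\leq|g(x^{*})-g(x^{k})|+|g(x^{k})-g^{k}(x^{k})|\leq L\cdot\|x^{*}-x^{k}\|+g(x^{k})-g^{k}(x^{k}). \tag{4}$$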

Since $x^{k}$ converges to $x^{*}$ along $k\in\mathcal{K}$ , we only need to prove that the lower bounds $g^{k}$ will get arbitrarily close to $g$ at the trial points, before it is updated. By construction of the reverse cut method, $g^{k}(x^{k})$ is less than or equal to $g(x^{k})$ , so we only need upper bounds. Let $j$ be the index just before $k$ in the subsequence $\mathcal{K}$ . Since $g^{k}$ is $\rho$ -Lipschitz,

$$g^{k}(x^{k})\geq g^{k}(x^{j})-\rho\cdot\|x^{k}-x^{j}\|.$$

But also by construction, $g^{k}(x^{j})=g(x^{j})$ , so subtracting the inequality above from $g(x^{k})$ and applying once again the Lipschitz hypothesis on $g$ we obtain:

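$$g(x^{k})-g^{k}(x^{k})\leq g(x^{k})-g(x^{j})+\rho\cdot\|x^{k}-x^{j}\|\leq L\cdot\|x^{k}-x^{j}\|+\rho\cdot\|x^{k}-x^{j}\|. \tag{5}$$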

By replacing (5) in (4) and taking the limit over $\mathcal{K}$ , we conclude that the sequence $\{g^{k}(x^{k})\}_{k\in\mathcal{K}}$ converges to $g(x^{*})$ . ∎

From the convergence result above, we now separate the analysis of the cases where $\varepsilon$ is strictly positive and zero respectively in Corollary 2 and Theorem 3 below.

Corollary 2.

For any stopping tolerance $\varepsilon>0$ , Algorithm 1 stops in a finite number of iterations with an $\varepsilon$ -optimal solution for (2).

Proof.

From inequality (5) in the proof of Lemma 1 and compactness of $X$ , Algorithm 1 will stop in a finite number of iterations if the stopping tolerance $\varepsilon$ is strictly positive. Using the fact that $x^{k}\in X$ is a feasible solution to (2) and an optimal solution to (3) such that $g(x^{k})-g^{k}(x^{k})\leq\varepsilon$ , where $g^{k}$ is a lower estimate for $g$ , we have the following inequalities:

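$$\nu^{k}\leq\nu\leq f(x^{k})+g(x^{k})\leq f(x^{k})+g^{k}(x^{k})+[g(x^{k})-g^{k}(x^{k})]\leq\nu^{k}+\varepsilon,$$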

which means that $f(x^{k})+g(x^{k})$ is bounded between $\nu$ and $\nu+\varepsilon$ . ∎

Theorem 3.

Consider the stopping tolerance $\varepsilon$ equal to zero. Then, Algorithm 1 stops with an optimal solution in a finite number of iterations or it generates a sequence of optimal value approximations $\{\nu^{k}\}_{k\in\mathbb{N}}$ that converges to the optimal value $\nu$ of problem (2) and every cluster point $x^{*}$ of the sequence $\{x^{k}\}_{k\in\mathbb{N}}$ is a minimizer of (2).

Proof.

Suppose that Algorithm 1 stops after a finite number of iterations. Using the same argument of Corollary 2, we obtain that the last trial point $x^{k}$ is the optimal solution to (2).

Now, suppose that Algorithm 1 never reaches the stopping condition, so we know from Lemma 1 that the sequence $\{g^{k}(x^{k})\}_{k\in\mathcal{K}}$ converges to $g(x^{*})$ . Moreover, $x^{k}$ is a feasible solution to the main problem (2) and optimal solution to the approximate problem (3). Since $g^{k}$ is a lower estimate for $g$ , we obtain the following relationships:

$$f(x^{k})+g(x^{k})\geq\nu\geq\nu^{k}=f(x^{k})+g^{k}(x^{k}). \tag{6}$$

Taking the limit over $\mathcal{K}$ on both sides of (6), and by continuity of $f$ and $g$ , we obtain:

$$f(x^{*})+g(x^{*})\geq\nu\geq\lim_{k\in\mathcal{K}}\nu^{k}=f(x^{*})+g(x^{*}),$$

which shows that all inequalities above are equalities, and therefore $x^{*}$ is an optimal solution to (2).

Going back to the full sequence, we recall that the sequence of objective functions $\{f(x)+g^{k}(x)\}_{k\in\mathbb{N}}$ is monotone nondecreasing, so the sequence of optimal values $\{\nu^{k}\}_{k\in\mathbb{N}}$ is also monotone nondecreasing, which implies that $\{\nu^{k}\}_{k\in\mathbb{N}}$ also converges to $\nu$ . ∎

Observe that the proofs of Lemma 1, Corollary 2 and Theorem 3 use only two properties of the reverse-norm cuts. Besides their Lipschitz character, they are exact at trial points, that is, $C_{\overline{x}}(\overline{x})=g(\overline{x})$ . Therefore, any other way of producing uniformly Lipschitz tight cuts for $g$ yields a convergent Lipschitz cut method for optimizing $f+g$ .

So, just before presenting a class of MILPs with Lipschitz value functions, we show how augmented Lagrangian duality can be used to produce tight Lipschitz cuts.

2.3 Augmented Lagrangian cuts

Recall the definition of augmented Lagrangian duality for mixed-integer optimization problems. Let

$$\begin{array}{rl}g(b)=\min_{x\in X}&c^{\top}x\\ \text{s.t.}&Ax=b\end{array} \tag{7}$$

be a parameterized linear problem, where $X$ describes the mixed-integer constraints.

Definition 2.

Given an augmenting function $\psi$ , which is non-negative and satisfies $\psi(0)=0$ , the augmented Lagrangian for problem (7) is given by

$$g^{AL}(b;\lambda,\rho)=\min_{x\in X}\ c^{\top}x-\lambda^{\top}(Ax-b)+\rho\cdot\psi(Ax-b). \tag{8}$$

Since any feasible solution to the original optimization problem (7) remains feasible with the same objective value for the augmented Lagrangian (8), we see that, for any $b$ , $\lambda$ and $\rho\geq 0$ ,

$$g^{AL}(b;\lambda,\rho)\leq g(b). \tag{9}$$

Moreover, in the MILP setting, we have exact duality [Feizollahi et al., 2017, Theorem 4, p. 381] if the augmenting function $\psi$ is a norm. More precisely:

Theorem 4.

If the set $X$ is a rational polyhedron with integer constraints, if the problem data $A$ , $b_{0}$ and $c$ are rational and if for the given $b_{0}$ the problem is feasible with bounded value $g(b_{0})$ , then there is a finite $\rho^{*}$ such that

$$\sup_{\lambda}\ g^{AL}(b_{0};\lambda,\rho^{*})=g(b_{0}).$$

In addition, one can choose a finite Lagrange multiplier $\lambda^{*}$ that attains the supremum.

This motivates the introduction of augmented Lagrangian cuts using norms as the augmentation function. Indeed, expanding the definition of $g^{AL}$ in the weak duality equation (9), we get for any $b$ :

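$$\begin{aligned}g(b)&\geq g^{AL}(b;\lambda,\rho)\\ &=\min_{x\in X}\ c^{\top}x-\lambda^{\top}(Ax-b)+\rho\|Ax-b\|\\ &=\min_{x\in X}\ c^{\top}x-\lambda^{\top}(Ax-b_{0}+b_{0}-b)+\rho\|Ax-b_{0}+b_{0}-b\|.\end{aligned}$$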

By the triangular inequality,

$$\|Ax-b\|\geq\|Ax-b_{0}\|-\|b-b_{0}\|,$$

so that

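$$\begin{aligned}g(b)&\geq\min_{x\in X}\left[c^{\top}x-\lambda^{\top}(Ax-b_{0})+\rho\|Ax-b_{0}\|\right]-\lambda^{\top}(b_{0}-b)-\rho\|b_{0}-b\|\\ &=g^{AL}(b_{0};\lambda,\rho)-\lambda^{\top}(b_{0}-b)-\rho\|b_{0}-b\|.\end{aligned}$$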

Therefore, calculating the augmented Lagrangian at $b_{0}$ provides a lower Lipschitz estimate for $g(b)$ :

Definition 3.

If $g$ is the optimal value function for a MILP, and given $b_{0}$ , $\lambda$ and $\rho>0$ , the augmented Lagrangian cut centered at $b_{0}$ is the function

$$ALC_{b_{0},\lambda,\rho}(b):=g^{AL}(b_{0};\lambda,\rho)-\lambda^{\top}(b_{0}-b)-\rho\|b_{0}-b\|.$$

Moreover, the exactness result above shows that there exists a sufficiently large $\rho$ and appropriate $\lambda$ for which $g^{AL}(b_{0};\lambda,\rho)=g(b_{0})$ , so the cut is tight at $b_{0}$ . However, this proof still leaves open the question of whether the family of such cuts is uniformly Lipschitz, since as seen in the fractional-part example in figure 3 one might need arbitrarily large $\rho$ near discontinuities of the value function.

The Lipschitz setting that we assume for the value function allows us to bypass this difficulty: choosing $\rho=Lip(g)$ and $\lambda=0$ produces a valid and tight cut, so this provides an absolute upper bound on the needed $\rho$ for exact augmented Lagrangian duality. This is why we proceed to characterize some MILPs with Lipschitz value functions, before moving to the multistage case.
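To make the construction concrete, the sketch below evaluates an augmented Lagrangian cut on a toy MILP chosen for this illustration (it is not one of the paper's case studies): $g(b)=\min\{u+v\mid z+u-v=b,\ z\in\{0,1,2,3\},\ u,v\geq 0\}$ , whose value function is the distance from $b$ to the nearest admissible integer and is Lipschitz with constant $1$ . Writing $w=u-v$ , the inner objective of definition (8) with the absolute value as augmenting function is convex piecewise linear in $w$ , so for $|\lambda|\leq 1+\rho$ its minimum is attained at a breakpoint; the code relies on that assumption. The cut is obtained by expanding definition (8) around $b_{0}$ .

```python
Z = (0, 1, 2, 3)  # admissible integer values (toy data, an assumption)


def g(b):
    """Value function of the toy MILP: distance from b to the set Z."""
    return min(abs(b - z) for z in Z)


def g_al(b0, lam, rho):
    """Augmented Lagrangian value at b0 with psi = absolute value."""
    if abs(lam) > 1.0 + rho:
        raise ValueError("inner problem is unbounded when |lam| > 1 + rho")
    best = float("inf")
    for z in Z:
        # Breakpoints of the convex piecewise linear function of w = u - v.
        for w in (0.0, b0 - z):
            residual = z + w - b0  # plays the role of A x - b0
            best = min(best, abs(w) - lam * residual + rho * abs(residual))
    return best


def al_cut(b0, lam, rho):
    """Cut b -> g_al(b0) - lam*(b0 - b) - rho*|b0 - b|, from expanding (8)."""
    base = g_al(b0, lam, rho)
    return lambda b: base - lam * (b0 - b) - rho * abs(b0 - b)


grid = [-1.0 + 5.0 * i / 500 for i in range(501)]
cut = al_cut(0.2, 0.5, 1.0)

# The cut is a lower estimate of g on the grid and is tight at its center.
valid = all(cut(b) <= g(b) + 1e-12 for b in grid)
tight = abs(cut(0.2) - g(0.2)) < 1e-12
# With lam = 0 and rho equal to the Lipschitz constant the dual value is exact.
exact = abs(g_al(0.2, 0.0, 1.0) - g(0.2)) < 1e-12
print(valid, tight, exact)
```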

2.4 MILPs with Lipschitz value functions

As we have seen, the Lipschitz property of the value functions was an important piece in the proof of convergence of the reverse-norm method and also in the uniform bounds on $\rho$ for the augmented Lagrangian cuts. It will also be fundamental in the analysis of the SLDP algorithm, and therefore we present in this section a sufficient condition that guarantees Lipschitz continuity for the optimal value function of MILPs.

First, we show that the optimal value function for a Lipschitz objective over a family of polyhedra is still Lipschitz. Then, by enumerating the polyhedra over the realizations of integer variables, we will arrive at the complete continuous recourse (CCR) condition that ensures the optimal value function of a MILP is Lipschitz.

Consider the parameterized optimization problem

$$\begin{array}{rl}\nu(b)=\min_{x}&f(x)\\ \text{s.t.}&(x,b)\in P\end{array}$$

for a Lipschitz function $f$ and a polyhedron $P$ . In order to analyze the function $\nu(b)$ , we have to understand the effect of the variations on the feasible set $P(b)$ with respect to the parameter $b$ , and compound this effect with the Lipschitz function $f$ . For the first part, the Hoffman Lemma below provides the answer, since it states that the symmetric difference between any two sets $P(b)$ and $P(b^{*})$ is bounded by a ball centered at the origin with a radius proportional to the norm $\|b-b^{*}\|$ . This result resembles a version for sets of the Lipschitz continuity definition.

Lemma 5 (Hoffman lemma [Shapiro et al., 2014]).

Let $S(b)$ be a polyhedron parameterized by the right-hand side vector $b\in\mathbb{R}^{m}$ of a given linear system, that is, $S(b)=\{x\in\mathbb{R}^{d}\mid Ax\leq b\}$ . Let $b^{*}\in\mathbb{R}^{m}$ be a vector such that $S(b^{*})$ is nonempty. Then, there exists $r>0$ depending only on $A$ such that

$$S(b)\subseteq S(b^{*})+r\cdot\|b-b^{*}\|\cdot\Delta,$$

where $\Delta$ is the unit ball, i.e., $\Delta=\{\varepsilon\in\mathbb{R}^{d}\mid\|\varepsilon\|\leq 1\}$ .

With this, we can guarantee that the optimal value function of a minimization problem of a Lipschitz function over a given polyhedron is also Lipschitz in its essential domain. The same holds for a maximization problem: if $f$ is a Lipschitz function then so is $-f$ , so, switching from maximization to minimization, Theorem 6 yields the same result for the corresponding optimal value function $-\nu(\cdot)$ .

Theorem 6 (Lipschitz cost-to-go functions).

Let $\nu(\cdot)$ be the optimal value function defined by

$$\begin{array}{rl}\nu(b)=\min&f(x)\\ \textrm{s.t.}&(x,b)\in P,\end{array} \tag{11}$$

where the set $P$ is a polyhedron in $\mathbb{R}^{d+m}$ , the function $f(\cdot)$ is Lipschitz continuous with constant $L$ , and we assume that there is one value of $b$ such that $\nu(b)$ is finite. Then, the essential domain of $\nu(\cdot)$ , $\operatorname{dom}(\nu)$ , is a polyhedron, and the function $\nu(\cdot)$ restricted to $\operatorname{dom}(\nu)$ is Lipschitz continuous with constant $L\cdot\widetilde{r}$ , where $\widetilde{r}$ is a constant that depends only on $P$ .

Proof.

First, we prove that $\operatorname{dom}(\nu)$ is a polyhedron. Recall that $\operatorname{dom}(\nu)$ is the set of vectors $b\in\mathbb{R}^{m}$ for which the problem (11) is feasible, that is, $\operatorname{dom}(\nu)=\left\{b\in\mathbb{R}^{m}\ \middle|\ \exists x\in\mathbb{R}^{d};(x,b)\in P,\ f(x)<+\infty\right\}$ . Since $f$ is continuous, $f$ does not assume $+\infty$ anywhere, so $\operatorname{dom}(\nu)$ is the projection of $P$ over the component $b$ :

$$\operatorname{dom}(\nu)=\operatorname{proj}_{b}(P).$$

As the image of a polyhedron by a linear map is also a polyhedron, we conclude that $\operatorname{dom}(\nu)$ is a polyhedron in $\mathbb{R}^{m}$ .

Now, we need to prove that $\nu(\cdot)$ is Lipschitz continuous over $\operatorname{dom}(\nu)$ . Denote by $Wx+Tb\leq h$ the linear constraint that defines $P$ , that is, $P=\{(x,b)\in\mathbb{R}^{d+m}\mid Wx+Tb\leq h\}$ , and let $S(u)$ be the set given by the linear system $Wx\leq u$ . Now, let $b_{1}$ and $b_{2}$ be two points in the domain of $\nu$ . Taking a feasible point $(x_{1},b_{1})\in P$ for the problem defined by $b_{1}$ , and applying the Hoffman Lemma for $u:=h-Tb_{2}$ , we get that there is a feasible point $(x_{2},b_{2})\in P$ such that

$$\|x_{2}-x_{1}\|\leq r\cdot\|T(b_{2}-b_{1})\|\leq r\,\|T\|\cdot\|b_{2}-b_{1}\|.$$

Therefore, by the Lipschitz hypothesis on $f$ ,

$$\nu(b_{2})\leq f(x_{2})\leq f(x_{1})+L\cdot\|x_{2}-x_{1}\|\leq f(x_{1})+Lr\|T\|\cdot\|b_{2}-b_{1}\|.$$

Taking the infimum over $x_{1}\in P(b_{1})$ , we see that $\nu(b_{2})\leq\nu(b_{1})+Lr{\left\lVert T\right\rVert}\cdot{\left\lVert b_{2}-b_{1}\right\rVert}$ . If $\nu(b_{2})$ is not $-\infty$ , this shows that $\nu(b_{1})>-\infty$ as well, so $\nu$ never assumes the value $-\infty$ .

Finally, by symmetry, we obtain ${\left\lvert\nu(b_{2})-\nu(b_{1})\right\rvert}\leq Lr{\left\lVert T\right\rVert}\cdot{\left\lVert b_{2}-b_{1}\right\rVert}$ , which is the Lipschitz condition for $\nu$ . ∎

To handle the Stochastic MILP case, it would be convenient if the minimum of Lipschitz functions over a union of polyhedra was Lipschitz as well. Indeed, one could split the optimization variable $x=(y,z)$ over the integer $z$ and continuous variables $y$ and obtain

$$\nu(b)=\min_{z}\ \min_{y\in P(b,z)}f(y,z).$$

Since each function $\nu_{z}(b)=\min_{y\in P(b,z)}f(y,z)$ is Lipschitz continuous by Theorem 6 above, we’d be done. However, this is not true in general, because the domains of each $\nu_{z}$ may be different.

Indeed, we illustrate in figures 4a and 4b an example of a discontinuous optimal value function induced by the problem of minimizing a linear objective function over the union of two polyhedra. In this particular example, the feasible set is the intersection between the blue dashed line and the union of both vertical and horizontal rectangles, while the objective function is a linear function that decreases as the solution candidate moves to the left. We show in figures 4a and 4b the optimal solution of this problem for two different right-hand side parameters, which control the height of the dashed line. As that parameter changes, the dashed line moves up or down, and the optimal solution changes abruptly as soon as a point in the horizontal rectangle becomes feasible, as occurs in figure 4b. This shows that the optimal value function is discontinuous, so it cannot be Lipschitz continuous.

In order to guarantee the Lipschitz condition for the optimal value function, we need to assume an additional hypothesis about the union of polyhedra that defines the feasible set of the optimization problem with Lipschitz objective. Our typical optimization problem is

[TABLE]

where $I$ is a finite index set, and $P_{i}$ is a polyhedron for each $i\in I$ . One sufficient condition for the Lipschitz continuity of $\nu(b)$ is to assume that $f$ is a Lipschitz continuous function in $\operatorname{proj}_{y}(\bigcup_{i\in I}P_{i})$ and that $\operatorname{proj}_{b}(P_{i})$ equals $\mathbb{R}^{m}$ for each $i\in I$ . This is called the Complete Continuous Recourse (CCR) condition, compare [Zou et al., 2018, Definition 1]. Indeed, under the CCR assumption, each optimal value function $\nu_{i}(b)$ defined by the optimization problem

[TABLE]

is Lipschitz continuous in $\mathbb{R}^{m}$ (Theorem 6) and since $\nu(b)$ equals $\min_{i\in I}\nu_{i}(b)$ we conclude that $\nu(b)$ is also Lipschitz continuous in $\mathbb{R}^{m}$ .
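The last step can be spelled out as follows. This is a sketch in which $L_{i}$ denotes a Lipschitz constant of $\nu_{i}$ and $\bar{L}=\max_{i\in I}L_{i}$ ; both names are introduced here for illustration, and each $\nu_{i}$ is assumed finite. Given $b_{1},b_{2}\in\mathbb{R}^{m}$ , let $i^{*}\in I$ attain the minimum defining $\nu(b_{1})$ . Then

$$\nu(b_{2})\leq\nu_{i^{*}}(b_{2})\leq\nu_{i^{*}}(b_{1})+L_{i^{*}}\|b_{2}-b_{1}\|\leq\nu(b_{1})+\bar{L}\|b_{2}-b_{1}\|,$$

and exchanging the roles of $b_{1}$ and $b_{2}$ gives $|\nu(b_{2})-\nu(b_{1})|\leq\bar{L}\|b_{2}-b_{1}\|$ . The finiteness of $I$ is what guarantees that the minimum is attained and that $\bar{L}$ is finite.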

3 The Stochastic Lipschitz Dynamic Programming algorithm

In this section, we present the Stochastic Lipschitz Dynamic Programming (SLDP) algorithm for the multistage case in two different approaches: the full scenario and the sampled scenario case. In the full scenario approach, all the nodes are visited in the forward and backward steps, and Lipschitz cuts are added for every expected cost-to-go function. In contrast, in the sampled scenario approach, just the sampled scenarios are visited in the forward and backward steps, and Lipschitz cuts are added only for the expected cost-to-go functions of the sampled nodes. We prove convergence to an optimal policy for the full scenario case, and we introduce an additional procedure in the sampled scenario case to ensure convergence towards an $\varepsilon$ -optimal policy.

3.1 Multistage setting and Lipschitz continuity of cost-to-go functions

To deal with the stochastic multistage setting, we fix some notation for describing the scenario tree. Let $\mathcal{N}$ be the set of nodes, where $1$ is the root node, and $a:\mathcal{N}\backslash\{1\}\rightarrow\mathcal{N}$ is the ancestor function that associates each node $n$ except the root to its ancestor node $a(n)\in\mathcal{N}$ . We also define the set $\mathcal{S}(n)$ of successor nodes as the set of nodes with ancestor $n$ , that is, $\mathcal{S}(n)=\{m\in\mathcal{N}\mid a(m)=n\}$ , and for each successor $m$ there is a transition probability denoted by $q_{nm}$ from node $n$ to $m$ . From $1$ and $\mathcal{S}$ , we define the stage $t(n)$ of a node $n\in\mathcal{N}$ : the root node belongs to stage $1$ , and inductively the nodes in $\mathcal{S}(n)$ belong to the stage $t(n)+1$ . The set of all nodes in stage $\tau$ is denoted by $\mathcal{N}_{\tau}$ .
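As an illustration of this notation, the sketch below recovers the successor sets and the stages from the ancestor function alone. The function name `build_tree` and the dictionary encoding of $a(\cdot)$ are assumptions made for this example, not part of the paper's implementation.

```python
def build_tree(ancestor, root=1):
    """Derive successor sets S(n) and stages t(n) from the ancestor map a(n).

    `ancestor` maps every non-root node to its ancestor; `root` is the root node.
    """
    successors = {}
    for node, parent in ancestor.items():
        successors.setdefault(parent, []).append(node)

    # The root is in stage 1; successors of a stage-t node are in stage t + 1.
    stage = {root: 1}
    frontier = [root]
    while frontier:
        n = frontier.pop()
        for m in successors.get(n, []):
            stage[m] = stage[n] + 1
            frontier.append(m)
    return successors, stage


# A binary tree with three stages: 1 -> {2, 3}, 2 -> {4, 5}, 3 -> {6, 7}.
ancestor = {2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3}
successors, stage = build_tree(ancestor)
nodes_in_stage_2 = sorted(n for n, t in stage.items() if t == 2)
print(successors[1], stage[7], nodes_in_stage_2)
```

Leaf nodes are exactly those that never appear as a key of `successors`.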

In the dynamic programming formulation of stochastic optimization, we have a state variable $x_{n}$ and a mixed integer control variable $y_{n}$ , ranging over a feasible set $X_{n}$ and incurring an immediate cost $f_{n}(x_{n},y_{n})$ . As is standard, we introduce a copy variable $z_{n}$ that carries the information from the previous state, so that the cost-to-go and expected cost-to-go functions $Q_{n}(\cdot)$ and $\overline{Q}_{n}(\cdot)$ of each node $n\in\mathcal{N}\backslash\{1\}$ satisfy the following recursive relationship:

[TABLE]

The nodes $n$ without successor are called leaf nodes of the tree, and they correspond to the last decision to be taken in the planning horizon. Also, observe that the root node $1\in\mathcal{N}$ does not have an ancestor, so we can still define ${\overline{Q}}_{1}(x_{1})$ by (18), but its stage problem (17) should be written as

[TABLE]

However, in order to avoid having to single out this special case, we slightly abuse notation by fixing $0=a(1)$ , $x_{0}=0$ , and extend $X_{1}$ to a further dimension $z_{1}$ .

From our discussion in the previous section, we assume that $f_{n}$ is Lipschitz continuous with constant $L_{n}$ and that $X_{n}$ satisfies the complete continuous recourse condition. Under those hypotheses, both the cost-to-go functions $Q_{n}(\cdot)$ , and the expected cost-to-go functions $\overline{Q}_{n}(\cdot)$ are Lipschitz continuous.

Proposition 7 (Stochastic multistage MILP programs).

Consider the stochastic multistage MILP program defined by (17) and suppose that for every node $n\in\mathcal{N}$ the cost-to-go function $Q_{n}$ is not equal to $-\infty$ in any point, i.e., $Q_{n}(\cdot)>-\infty$ , and the CCR condition holds for the feasible set $X_{n}$ . Then, the expected cost-to-go function $\overline{Q}_{n}(\cdot)$ is Lipschitz continuous in $\mathbb{R}^{d_{n}}$ with Lipschitz constant at most

[TABLE]

where $r_{m}$ is a constant that depends only on $X_{m}$ .

Proof.

We proceed by backward induction on the scenario tree. For the leaf nodes, the statement holds by definition, since $\overline{Q}_{n}(\cdot)$ is identically zero.

So, suppose this result holds for all successor nodes $m\in\mathcal{S}(n)$ , and let’s prove that it also holds for node $n$ . By the induction hypothesis, the expected cost-to-go functions $\overline{Q}_{m}(\cdot)$ are Lipschitz with constant $\widetilde{L}_{m}$ , and from Theorem 6 each $Q_{m}(\cdot)$ is Lipschitz with constant $(L_{m}+\widetilde{L}_{m})\cdot r_{m}$ , where $L_{m}$ is the Lipschitz constant of the objective function $f_{m}$ and $r_{m}$ is a constant from the Hoffman Lemma that only depends on $X_{m}$ . Since the expected value of Lipschitz functions is also Lipschitz, with constant equal to the expected value of the individual constants, the induction step is proved. ∎
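The induction in this proof translates into a backward recursion for the constants. The sketch below assumes the bound has the form $\widetilde{L}_{n}=\sum_{m\in\mathcal{S}(n)}q_{nm}(L_{m}+\widetilde{L}_{m})r_{m}$ , with $\widetilde{L}_{n}=0$ at the leaves, as read off from the argument above; the function name and the dictionary inputs are hypothetical.

```python
def expected_lipschitz_constants(successors, q, L, r, root=1):
    """Backward recursion for bounds on the Lipschitz constants.

    successors[n]: list of successor nodes of n (absent for leaves)
    q[(n, m)]:     transition probability from n to m
    L[m]:          Lipschitz constant of the stage cost f_m
    r[m]:          Hoffman-type constant depending only on X_m
    """
    L_tilde = {}

    def visit(n):
        total = 0.0
        for m in successors.get(n, []):
            visit(m)  # successors first, so L_tilde[m] is available
            total += q[(n, m)] * (L[m] + L_tilde[m]) * r[m]
        L_tilde[n] = total  # zero at leaf nodes

    visit(root)
    return L_tilde


# Illustrative data: 1 -> {2, 3}, 2 -> {4, 5}, uniform probabilities.
successors = {1: [2, 3], 2: [4, 5]}
q = {(1, 2): 0.5, (1, 3): 0.5, (2, 4): 0.5, (2, 5): 0.5}
L = {n: 1.0 for n in range(1, 6)}
r = {n: 2.0 for n in range(1, 6)}
L_tilde = expected_lipschitz_constants(successors, q, L, r)
print(L_tilde)
```

Note how the constants compound from the leaves to the root, which is why the bound at the root can be much larger than the stage-wise constants $L_{n}$ .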

Since problem (17) for each node admits the Lipschitz decomposable structure of (2), one could imagine using the reverse-norm method of Algorithm 1, or augmented Lagrangian cuts, to approximate its solution. However, in the multistage case we lack one fundamental property we used, namely that we can compute exactly the ‘complex’ function $g(x)$ , which in this case is $\overline{Q}_{n}(x_{n})$ . Indeed, we’re only able to produce lower approximations for it, and the next sections will deal with the necessary estimates to prove convergence under this weaker hypothesis.

3.2 Approximating the value functions

Before we present the SLDP algorithm, we need to introduce some notation for the approximations along the iterations of the algorithm. As usual, we denote by $\overline{\mathfrak{Q}}_{n}^{k}(x_{n})$ the expected cost-to-go approximation induced by the Lipschitz cuts at iteration $k$ . For the purpose of convergence analysis, we consider the approximations $Q_{n}^{k}(x_{a(n)})$ and $\overline{Q}_{n}^{k}(x_{a(n)})$ of the cost-to-go and expected cost-to-go functions at iteration $k$ defined below:

[TABLE]

We assume we are given, for the first iteration, a Lipschitz lower approximation ${\overline{\mathfrak{Q}}}_{n}^{1}(\cdot)$ of the expected cost-to-go function $\overline{Q}_{n}(\cdot)$ for each node $n$ in the tree. In practice, the first expected cost-to-go approximations are identically zero, since costs are usually non-negative.

Then, we update the expected cost-to-go approximation $\overline{\mathfrak{Q}}_{n}^{k}(x_{n})$ of iteration $k$ at a given point $x_{n}^{k}$ using the reverse-norm cut $\overline{Q}_{n}^{k+1}(x_{n}^{k})-\rho_{n}\cdot\|x_{n}-x_{n}^{k}\|$ :

[TABLE]

where $\rho_{n}>0$ is any constant greater than or equal to the Lipschitz constant $\widetilde{L}_{n}$ defined in (19). Note that we have used $\overline{Q}_{n}^{k+1}(x_{n}^{k})$ instead of $\overline{Q}_{n}^{k}(x_{n}^{k})$ in the cut update (25) because the Lipschitz cuts of the SLDP are updated from the last to the first stage, so all expected cost-to-go approximations $\overline{\mathfrak{Q}}_{m}^{k}(x_{m})$ of the successor nodes $m$ of $n$ are updated to $\overline{\mathfrak{Q}}_{m}^{k+1}(x_{m})$ before the computation of the optimal value (23) with state $x_{a(m)}=x_{n}^{k}$ . So, given each node $m$ and iteration $k$ we obtain in the backward step the cost-to-go approximation $Q_{m}^{k+1}(x_{a(m)})$ evaluated at $x_{n}^{k}$ , and since the expected cost-to-go approximation is the corresponding weighted average, we obtain $\overline{Q}_{n}^{k+1}(x_{n}^{k})$ for the Lipschitz cut (25).
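The cut update can be pictured as taking a pointwise maximum of "tents". The sketch below is a simplified illustration under stated assumptions: it uses the $L^{1}$ norm, a hypothetical class name, and an exactly computable non-convex function `g` in place of the approximate value $\overline{Q}_{n}^{k+1}(x_{n}^{k})$ that the algorithm actually uses.

```python
def l1_distance(x, y):
    """L1 distance between two points given as sequences of floats."""
    return sum(abs(a - b) for a, b in zip(x, y))


class ReverseNormApproximation:
    """Lower approximation built as the maximum of reverse-norm cuts."""

    def __init__(self, rho, initial_bound=0.0):
        self.rho = rho                      # must dominate the Lipschitz constant
        self.initial_bound = initial_bound  # first-iteration lower bound
        self.cuts = []                      # list of (center, value) pairs

    def add_cut(self, center, value):
        self.cuts.append((tuple(center), float(value)))

    def __call__(self, x):
        best = self.initial_bound
        for center, value in self.cuts:
            best = max(best, value - self.rho * l1_distance(x, center))
        return best


def g(x):
    # Illustrative non-convex, non-negative, 1-Lipschitz function.
    return min(abs(x[0] - 1.0), abs(x[0] + 1.0))


approx = ReverseNormApproximation(rho=1.0)
for center in [(1.5,), (-0.5,)]:
    approx.add_cut(center, g(center))
print(approx((1.5,)), approx((0.0,)), g((0.0,)))
```

Each cut is tight at its own center and stays below `g` everywhere because `rho` is at least the Lipschitz constant of `g`; away from the centers the approximation can be loose.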

There are some important comments about the concepts introduced so far. First, we note that the sequence $\{\overline{\mathfrak{Q}}_{n}^{k}\}_{k\in\mathbb{Z}_{+}}$ is a non-decreasing sequence of functions, and since it belongs to the objective function of (23), we conclude that the sequence of cost-to-go function approximations $\{Q_{n}^{k}\}_{k\in\mathbb{Z}_{+}}$ is also non-decreasing. Second, the expected cost-to-go approximation $\overline{Q}_{n}^{k}(x_{n})$ defined in (24) is a weighted average of non-decreasing functions ${Q}_{m}^{k}$ , so the corresponding sequence $\{\overline{Q}_{n}^{k}\}_{k\in\mathbb{Z}_{+}}$ is also non-decreasing. Third, the function $\overline{Q}_{n}^{k}$ plays an important role in the convergence analysis of the SLDP since the cuts of (25) are tight for $\overline{Q}_{n}^{k+1}(x_{n})$ at the forward solution $x_{n}^{k}$ . Last, the quality of the expected cost-to-go approximation $\overline{Q}_{n}^{k}$ of a given node $n\in\mathcal{N}$ depends on the quality of those approximations at the successor nodes $m\in\mathcal{S}(n)$ , which is why the Lipschitz cuts are computed from the last to the first stage.

We will prove convergence of the full scenario (resp. sampled scenario) SLDP algorithm by proving that the sequence of feasible policies $\big{(}x_{n}^{k}\big{)}_{n\in\mathcal{N}}$ produced by the algorithm converges to an optimal ( $\varepsilon$ -optimal) policy $\big{(}x_{n}^{*}\big{)}_{n\in\mathcal{N}}$ . As in Lemma 1, we show that $\overline{Q}_{n}^{k+1}(x_{n}^{k})$ converges to (an $\varepsilon$ -approximation of) $\overline{Q}_{n}(x_{n}^{*})$ where $\overline{Q}_{n}$ is the true expected cost-to-go function (18). In the examples of section 4, we will see that both expected cost-to-go approximations $\overline{Q}_{n}^{k}$ and $\overline{\mathfrak{Q}}_{n}^{k}$ approximate the true expected cost-to-go function $\overline{Q}_{n}$ in a neighborhood of the optimal ( $\varepsilon$ -optimal) policy solution $x_{n}^{*}$ ; however, those approximations are usually poor elsewhere.

To simplify our statements and make the logic in the proofs easier to follow, we will assume that the starting lower bounds ${\overline{\mathfrak{Q}}}_{n}^{1}$ are also valid lower bounds for ${\overline{Q}}_{n}^{1}$ , which imposes a compatibility constraint between ${\overline{\mathfrak{Q}}}_{n}^{1}$ and its successors ${\overline{\mathfrak{Q}}}_{m}^{1}$ for $m\in\mathcal{S}(n)$ . If one doesn’t have this property, then only the inequalities ${\overline{\mathfrak{Q}}}_{n}^{k}\leq{\overline{Q}}_{n}$ and ${\overline{Q}}_{n}^{k}\leq{\overline{Q}}_{n}$ are ensured in the following Lemma. Again, in common situations where costs are positive and all ${\overline{\mathfrak{Q}}}_{n}^{1}=0$ , this is immediately satisfied.

By the definition of cost-to-go and expected cost-to-go approximations, they form a monotone sequence of valid lower bounds for the true expected cost-to-go functions:

Lemma 8.

Consider the stochastic multistage MILP program (17) satisfying the CCR condition, and let $Q_{n}^{k}$ , $\overline{Q}_{n}^{k}$ and $\overline{\mathfrak{Q}}_{n}^{k}$ be the cost-to-go and expected cost-to-go approximations of (23), (24) and (25). Then

[TABLE]

for every node $n\in\mathcal{N}$ and iteration $k\in\mathbb{Z}_{+}$ .

Proof.

We proceed by backward induction on the tree. For leaf nodes, inequality (26) holds because $\overline{Q}_{n}^{k}$ and $\overline{Q}_{n}$ are identically zero, by definition. Suppose that inequality (26) holds for every successor node $m\in\mathcal{S}(n)$ at iteration $k$ . By the induction hypothesis, the function $\overline{\mathfrak{Q}}_{m}^{k}$ is less than or equal to $\overline{Q}_{m}$ , and by the optimization problems (17) and (23) we conclude that the cost-to-go approximation $Q_{m}^{k}(\cdot)$ is less than or equal to the true cost-to-go function $Q_{m}$ . Since we guarantee this property for every successor node $m$ , we get the same inequality for their respective weighted averages $\overline{Q}_{n}^{k}$ and $\overline{Q}_{n}$ .

Now, let’s prove that $\overline{\mathfrak{Q}}_{n}^{k}$ is less than or equal to $\overline{Q}_{n}^{k}$ by induction on the iteration $k$ . In the first iteration, the cost-to-go approximation $\overline{\mathfrak{Q}}_{n}^{1}$ is less than or equal to $\overline{Q}_{n}^{1}$ by hypothesis. Suppose that $\overline{\mathfrak{Q}}_{n}^{j}$ is less than or equal to $\overline{Q}_{n}^{j}$ for every iteration $j$ less than $k$ . We will prove that such inequality also holds for iteration $k$ . Indeed, by the induction hypothesis and the non-decreasing property of $\{\overline{Q}_{n}^{k}\}_{k\in\mathbb{Z}_{+}}$ , we have the following inequalities:

[TABLE]

Using the updating formula (25), we conclude that $\overline{\mathfrak{Q}}_{n}^{k}$ is less than or equal to $\overline{Q}_{n}^{k}$ because the reverse-norm cut is also a lower bound for $\overline{Q}_{n}^{k}$ . ∎

Throughout this paper we assume the CCR condition for the true stochastic multistage MILP program (17). Additionally, we also require the set of feasible policy’s states $\operatorname{proj}_{x}X_{n}$ to be compact, and we name the resulting assumption as the Compact State Complete Continuous Recourse (CS-CCR) condition.

3.3 Full scenario approach

The SLDP algorithm for the full scenario approach is analogous to the Nested Cutting Plane algorithm, but with Lipschitz cuts instead of linear cuts. As described in Algorithm 2, starting from a valid lower bound $M_{n}$ for all cost-to-go functions, and an upper bound $\rho_{n}$ for their Lipschitz constants, we improve the lower bounds near the candidate optimal solutions of each iteration. Thus, in the forward step, the full scenario SLDP algorithm solves the optimization problems (23) from the root to the leaves, that is, in ascending order of stages, and obtains feasible state and control variables $(x_{n},y_{n},z_{n})\in X_{n}$ for each node of the scenario tree. Then, in the backward step, it updates from the leaves to the root the expected cost-to-go approximation $\overline{\mathfrak{Q}}_{n}^{k}$ using formula (25) and the Lipschitz cuts centered at the states obtained in the forward step.

We have not provided a stopping criterion for Algorithm 2. Although in the full scenario case we could have chosen a criterion equivalent to the one in Algorithm 1, in the sampled case one would need to compute the optimal solution at every node of the scenario tree to have a deterministic upper bound for the optimal policy, which is unrealistic. So, we preferred to emphasize the similarities between the full and sampled scenario cases, and present their convergence results only in asymptotic form.

In order to simplify the notation and improve readability, we will assume that a variable $x$ or a vector $(x,y,z)$ that does not have the subscript $n$ is the vector composed of the corresponding variables or vectors for all nodes:

•

$x:=(x_{n})_{n\in\mathcal{N}}$ ;

•

$(x,y,z):=\Big{(}(x_{n},y_{n},z_{n})\Big{)}_{n\in\mathcal{N}}$ .

We refer to $(x,y,z)$ as a policy and $x$ as the policy’s states, and we denote by $\mathbb{X}$ the set of feasible policies and by $\operatorname{proj}_{x}\mathbb{X}$ the projection of $\mathbb{X}$ in the policy’s states.

By analogy with the proof of the (deterministic) reverse-norm method in Lemma 1 and Theorem 3, we start by proving that the expected cost-to-go approximation $\overline{\mathfrak{Q}}_{n}^{k}$ approximates the true expected cost-to-go function $\overline{Q}_{n}$ in a neighborhood of any cluster state induced by the forward step.

Lemma 9.

Let $x^{*}\in\operatorname{proj}_{x}{\mathbb{X}}$ be a cluster point of the sequence of policy states $\{x^{k}\}_{k\in\mathbb{N}}$ generated by the forward step of Algorithm 2, and let $\mathcal{K}$ be the indices of a subsequence that converges to $x^{*}$ . Then $\{\overline{\mathfrak{Q}}_{n}^{k}(x_{n}^{k})\}_{k\in\mathcal{K}}$ converges to $\overline{Q}_{n}(x_{n}^{*})$ ,

[TABLE]

for every node $n\in\mathcal{N}$ .

Proof.

Let $\{(x^{k},y^{k},z^{k})\}_{k\in\mathbb{N}}$ be the sequence of policies obtained in the forward step of Algorithm 2. By the compactness assumption of $\operatorname{proj}_{x}X_{n}$ , the set of feasible policy states $\operatorname{proj}_{x}\mathbb{X}$ is also compact, so there is a subsequence of $\{x^{k}\}_{k\in\mathbb{N}}$ that converges to a cluster point $x^{*}\in\operatorname{proj}_{x}\mathbb{X}$ . Denote by $\mathcal{K}$ the indices of this subsequence, that is, $\lim\limits_{k\in\mathcal{K}}x^{k}=x^{*}$ . We will show that equation (27) holds by backward induction on the tree. It trivially holds for the leaf nodes, since both functions $\overline{\mathfrak{Q}}_{n}$ and $\overline{Q}_{n}$ are identically zero, by hypothesis.

Now, suppose that equation (27) holds for every successor node $m\in\mathcal{S}(n)$ . From Lemma 8, we have an upper bound:

[TABLE]

which, by continuity of ${\overline{Q}}_{n}$ , yields:

[TABLE]

So we only need to prove that the lower approximations are large enough.

As in Lemma 1, we denote by $j$ the index in $\mathcal{K}$ immediately before $k$ . By monotonicity of the approximations, ${\overline{\mathfrak{Q}}}_{n}^{k}$ is larger than all of the Lipschitz cuts constructed, in particular the one at iteration $j$ . Therefore,

[TABLE]

Note that, unlike in Lemma 1, we don’t obtain the exact expected cost-to-go function ${\overline{Q}}_{n}$ , but only its approximation ${\overline{Q}}_{n}^{j+1}$ . That’s why our proof splits in two parts: one bounding the difference between ${\overline{Q}}_{n}^{j}$ and ${\overline{\mathfrak{Q}}}_{n}^{k}$ , and the other bounding the one between ${\overline{Q}}_{n}$ and ${\overline{Q}}_{n}^{j}$ . Let’s complete the first one, which we already started. Since the sequence ${\overline{Q}}_{n}^{j}$ is non-decreasing in $j$ , ${\overline{Q}}_{n}^{j}\leq{\overline{Q}}_{n}^{j+1}$ , and we obtain

[TABLE]

To show that ${\overline{Q}}_{n}$ and ${\overline{Q}}_{n}^{j}$ are close, we use their definitions in (24) and (18):

[TABLE]

where the inequality follows because $(x_{m}^{j},y_{m}^{j},z_{m}^{j})$ are optimal solutions to (23) and feasible solutions to (17) for each $m\in\mathcal{S}(n)$ . Taking (29) and (30) together and rearranging terms we get

[TABLE]

Now, take the limit as $k$ goes to $\infty$ , which also makes $j$ grow to $\infty$ , and both $x^{k}$ and $x^{j}$ converge to $x^{*}$ . Since the expected cost-to-go function ${\overline{Q}}_{n}$ is continuous, we obtain:

[TABLE]

because both residual terms vanish in the limit, the second one going to zero by our induction hypothesis. Together with the upper bound from equation (28), this shows that the limit exists and concludes our proof. ∎

As a consequence of Lemmas 8 and 9, the expected cost-to-go approximation $\overline{Q}_{n}^{k}$ also approximates the true expected cost-to-go function $\overline{Q}_{n}$ in a neighborhood of any cluster point of the sequence of feasible policy’s states induced by the forward step of the full scenario SLDP. Using the argument that leads to inequality (30) in the proof of Lemma 9 we get that the cost-to-go approximation $Q_{n}^{k}$ also approximates the true cost-to-go function $Q_{n}$ in a neighborhood of any cluster policy state. That is, the following limits also hold:

[TABLE]

for any convergent subsequence $\{x^{k}\}_{k\in\mathcal{K}}$ of policy states induced by the forward step of the SLDP algorithm, and $x^{*}\in\operatorname{proj}_{x}\mathbb{X}$ the corresponding limit point.

Theorem 10.

The sequence of lower bounds $\{Q_{1}^{k}\}_{k\in\mathbb{N}}$ induced by the SLDP algorithm 2 converges to the optimal value $Q_{1}$ of the true stochastic multistage MILP program (17), and every cluster point of the sequence of feasible policies $\{(x^{k},y^{k},z^{k})\}_{k\in\mathbb{N}}$ generated by the forward step of Algorithm 2 is an optimal policy.

Proof.

Let $\{(x^{k},y^{k},z^{k})\}_{k\in\mathbb{N}}$ be the sequence of feasible policies generated by the full scenario SLDP algorithm 2. Let $\mathcal{K}$ be the set of indices of a convergent subsequence of policy states $\{x^{k}\}_{k\in\mathbb{N}}$ , and let $x^{*}$ be the corresponding limit point, which exists by compactness of $\operatorname{proj}_{x}\mathbb{X}$ . As for equation (30), at the root node we have

[TABLE]

because $(x_{1}^{k},y_{1}^{k},z_{1}^{k})$ is a feasible solution to the optimization problem whose optimal value is $Q_{1}$ and optimal solution to that whose optimal value is $Q_{1}^{k}$ . Using Lemma 9, we conclude the convergence of the subsequence $\{Q_{1}^{k}\}_{k\in\mathcal{K}}$ to $Q_{1}$ . Since the whole sequence $\{Q_{1}^{k}\}_{k\in\mathbb{N}}$ is non-decreasing, we get convergence to $Q_{1}$ .

Now, suppose that there is a cluster point $(x,y,z)$ of the sequence of feasible policies $\{(x^{k},y^{k},z^{k})\}_{k\in\mathbb{N}}$ , and denote also by $\mathcal{K}$ the set of indices of the corresponding subsequence. In order to prove that $(x,y,z)$ is an optimal policy, we need to show that its components are optimal solutions to the optimization problem of each node $n\in\mathcal{N}$ whose optimal value is $Q_{n}(x_{a(n)})$ . We will proceed by forward induction on the tree. Indeed, we have just shown that $(x_{1},y_{1},z_{1})$ is an optimal solution at the root node. Now, assume that this result holds for the ancestor node $a(n)$ . Using the same argument as before, we have the following inequalities:

[TABLE]

So, taking the limit over $\mathcal{K}$ on both sides of the inequality and using Lemma 9, we conclude that $(x_{n},y_{n},x_{a(n)})$ is an optimal solution of the optimization problem whose optimal value is $Q_{n}(x_{a(n)})$ . ∎

Just as in the proofs of Lemma 1 and Theorem 3, we only use two properties of the reverse-norm cuts, namely that they are uniformly Lipschitz and that we are able to construct exact cuts at trial points, for the approximate future cost function ${\overline{Q}}_{n}^{k}$ . As before, this shows that any method of producing uniformly Lipschitz tight cuts in the nested form of stochastic optimization problems will result in a convergent algorithm on the full scenario approach. In particular, one can use the augmented Lagrangian cuts from section 2.3 provided one takes $\rho_{n}$ large enough that the resulting cut is exact.

3.4 Sampled tree approach

In multistage stochastic programming problems with a reasonable number of stages, it is computationally intractable to visit every node of the scenario tree. So, one needs to sample paths on the scenario tree and iteratively approximate the expected cost-to-go functions at each stage to obtain a “reasonable” solution. In this paper, we focus on the sampling scheme of one random path per forward iteration, but its conversion to more general schemes is straightforward.

We emphasize that a path on the scenario tree is chosen at random, so a node $n$ may not belong to the path of some forward step iterations. Let $\mathcal{J}_{n}$ be the set of iterations $k$ of the Sampled-SLDP for which the path of the forward step contains the node $n$ . Note that $\mathcal{J}_{n}$ is a random set, since it depends on each experiment $\omega\in\Omega$ , and the probability of node $n$ being drawn an infinite number of times equals one, i.e.,

[TABLE]

by the Borel-Cantelli Lemma. We will assume a realization of the sampling where this is the case, to avoid repeating “with probability one” in what follows.

Also, observe that the collection of sets $\{\mathcal{J}_{m}\mid m\in\mathcal{S}(n)\}$ induced by the successor nodes covers $\mathcal{J}_{n}$ , that is

[TABLE]

since a path that contains a node $n$ also contains some successor node $m$ . In the deterministic case, the set of iterations $\mathcal{J}_{n}$ equals $\mathbb{Z}_{+}$ for every node $n$ , since all nodes are visited in the forward step. In the analysis of the Sampled-SLDP algorithm, we need to refer to optimal solutions of nodes that do not belong to a given forward path, even if in practice they are not computed. We still use the same notation $(x_{n}^{k},y_{n}^{k},z_{n}^{k})$ to refer to an optimal solution of node $n$ and iteration $k$ .

Following the same organization of the previous sections, we would like to prove that for each node $n$ there is a subset $\mathcal{K}_{n}$ of $\mathcal{J}_{n}$ such that the following limit holds:

[TABLE]

where $\{x_{n}^{k}\}_{k\in\mathcal{K}_{n}}$ is a subsequence of policy states converging to a limit point $x_{n}^{*}$ . However, the main obstacle in proving such a lemma is the induction step, since we need to control the difference between $\overline{Q}_{n}(x_{n}^{k})$ and $\overline{Q}_{n}^{k}(x_{n}^{k})$ using inequality (30) or some variation, as in the proof of Lemma 9. Inequality (30) cannot be used directly in the proof, because there we implicitly used that $\mathcal{K}_{m}$ equals $\mathcal{K}_{n}$ for every successor node $m$ to be able to use the induction hypothesis (31).

In order to ensure convergence of the Sampled-SLDP algorithm, we consider an additional step to stabilize the policy states obtained in the forward step. Instead of computing reverse norm cuts at every new forward solution $x_{n}^{k}$ , we check if the new feasible point is more than $\delta>0$ away from all previous forward solutions $x_{n}^{1},\dots,x_{n}^{k-1}$ . If this is the case, then we update the expected cost-to-go function $\overline{\mathfrak{Q}}_{n}^{k}(\cdot)$ with the reverse norm cut centered at the new policy state $x_{n}^{k}$ ; otherwise we improve it at the closest previous forward solution $x_{n}^{j}$ , see Algorithm 3. Note that after a finite number of iterations the forward incoming state $x_{n}^{k}$ becomes trapped in a finite number of possibilities, since node $n$ will be visited an infinite number of times and the set of feasible policy states $\operatorname{proj}_{x}X_{n}$ is compact. We also show in Lemma 11 that the expected cost-to-go approximation $\overline{\mathfrak{Q}}_{n}^{k}$ converges in a finite number of iterations to a Lipschitz function $\overline{\mathfrak{U}}_{n}$ , which is an $\varepsilon$ -approximation of the true expected cost-to-go function $\overline{Q}_{n}$ at any cluster point $x_{n}^{*}$ .
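The stabilization test described above can be sketched in a few lines. This is only an illustration of the selection rule, not a transcription of Algorithm 3: the function name, the use of the $L^{1}$ norm and the tuple encoding of states are assumptions.

```python
def select_cut_center(x_new, previous_states, delta):
    """Choose where to build the next reverse-norm cut.

    Returns (center, is_new): the new state itself if it is more than `delta`
    away from every previous state, otherwise the closest previous state.
    """
    def dist(a, b):
        return sum(abs(ai - bi) for ai, bi in zip(a, b))

    if not previous_states:
        return tuple(x_new), True
    closest = min(previous_states, key=lambda c: dist(x_new, c))
    if dist(x_new, closest) > delta:
        return tuple(x_new), True
    return closest, False


previous = [(0.0,), (1.0,)]
near = select_cut_center((1.1,), previous, delta=0.25)  # snapped to (1.0,)
far = select_cut_center((0.5,), previous, delta=0.25)   # kept as a new center
print(near, far)
```

Since accepted centers are pairwise more than $\delta$ apart and the state set is compact, only finitely many of them can ever be created, which is the mechanism behind the finite convergence claimed in Lemma 11.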

Lemma 11.

With probability one, the sequence of expected cost-to-go approximations $\{\overline{\mathfrak{Q}}_{n}^{k}\}_{k\in\mathbb{N}}$ generated by Algorithm 3 converges to a Lipschitz function $\overline{\mathfrak{U}}_{n}$ with constant $\widetilde{L}_{n}$ after a finite number of iterations. Moreover, the following relationships hold for every node $n$ of the tree:

[TABLE]

where $\mathcal{K}_{n}$ is a subset of indices from $\mathcal{J}_{n}$ such that the sequence of policy states $\{x_{n}^{k}\}_{k\in\mathcal{K}_{n}}$ converges, $x_{n}^{*}$ is the corresponding limit point, $\widetilde{L}$ is the maximum Lipschitz constant $\widetilde{L}_{n}$ and $\rho$ is the maximum penalty constant $\rho_{n}$ over all nodes $n\in\mathcal{N}$ .

Proof.

We start by proving the finite convergence of ${\overline{\mathfrak{Q}}}_{n}^{k}$ to ${\overline{\mathfrak{U}}}_{n}$ , and the limit in (32) by backward induction on the tree. In the last stage this result is trivial since both functions $\overline{\mathfrak{Q}}_{n}^{k}$ and $\overline{Q}_{n}$ are identically zero.

Let $n$ be a node such that the statement (32) holds for every successor node $m\in\mathcal{S}(n)$ . Recall that the updating rule of the reverse norm cut has the form:

[TABLE]

where the expected cost-to-go approximation $\overline{Q}_{n}^{k+1}$ is the weighted average of the cost-to-go approximations $Q_{m}^{k+1}$ over the successor nodes $m\in\mathcal{S}(n)$ . By the induction hypothesis, after a finite number of iterations we obtain

[TABLE]

In other words, both the cost-to-go $Q_{m}^{k}$ and the expected cost-to-go $\overline{Q}_{n}^{k}$ approximations stabilize after a finite number of iterations, so denote by $U_{m}$ and $\overline{U}_{n}$ the corresponding limits, respectively. Since the number of different incoming states $x_{n}^{k}$ is also finite, this implies that the number of different possible reverse norm cuts to update $\overline{\mathfrak{Q}}_{n}^{k}$ is also finite. Then, $\overline{\mathfrak{Q}}_{n}^{k}$ converges to a function $\overline{\mathfrak{U}}_{n}$ in a finite number of iterations.

Now, let’s prove inequality (33) by backward induction on the tree. It is trivial at the leaf nodes, so suppose inequality (33) holds for every successor node $m\in\mathcal{S}(n)$ . Since there is a finite number of different possible policy states at the node $n$ , the sequence $\{x_{n}^{k}\}_{k\in\mathcal{K}_{n}}$ converges to $x_{n}^{*}$ in a finite number of iterations, which means that the reverse norm cut $\overline{U}_{n}(x_{n}^{*})-\rho_{n}\cdot\|x_{n}-x_{n}^{*}\|$ is also considered in the expected cost-to-go limit $\overline{\mathfrak{U}}_{n}$ . In particular, we have the following inequalities:

[TABLE]

where the last inequality results from Lemma 8. But the expected cost-to-go approximations $\overline{\mathfrak{U}}_{n}$ and $\overline{U}_{n}$ are equal at $x_{n}^{*}$ , so we obtain the following equation for the difference between $\overline{\mathfrak{U}}_{n}$ and the true expected cost-to-go function $\overline{Q}_{n}$ :

[TABLE]

Now, we have the crucial part of the argument. Because the expected cost-to-go approximations of all nodes stabilize, every incoming state of any successor node $m\in\mathcal{S}(n)$ equals $x_{n}^{*}$ after a large number of iterations. Then, the optimal solution of node $m$ with input state $x_{n}^{*}$ is equal to $u_{m}$ , which is less than $\delta$ away from the final state $x_{m}^{*}$ of node $m$ , by the design of the Sampled-SLDP algorithm. Then, we obtain the following inequalities:

[TABLE]

where the first inequality results from $u_{m}$ being the optimal policy’s state of node $m$ with input state $x_{n}^{*}$ , and the following ones from the Lipschitz property of $\overline{Q}_{m}$ and $\overline{\mathfrak{U}}_{m}$ , respectively. By our induction hypothesis,

[TABLE]

and since $t(m)=t(n)+1$ we get

[TABLE]

because $u_{m}$ and $x_{m}^{*}$ are at most $\delta$ apart. So, the upper bound (36) together with equation (35) concludes the induction step. ∎

Theorem 12.

With probability $1$ , the sequence of lower bounds $\{Q_{1}^{k}\}_{k\in\mathbb{N}}$ generated by Algorithm 3 converges in a finite number of iterations to an $\varepsilon$ -approximation of the true optimal value $Q_{1}$ , where $\varepsilon=(\widetilde{L}+\rho)\cdot\delta\cdot(T-1)$ , and every cluster point of the sequence of feasible policies $\{(x^{k},y^{k},z^{k})\}_{k\in\mathbb{N}}$ generated by the forward step of Algorithm 3 is an $\varepsilon$ -optimal policy.

Proof.

This is a straightforward result of Lemma 11, using the same reasoning as in Theorem 10. ∎
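To get a feel for the size of the tolerance in Theorem 12, consider a worked instance with purely illustrative constants; the values of $\widetilde{L}$ , $\rho$ and $\delta$ below are assumptions made for this example, not values from the experiments. With $T=8$ stages, $\widetilde{L}=1$ , $\rho=2$ and $\delta=0.01$ ,

$$\varepsilon=(\widetilde{L}+\rho)\cdot\delta\cdot(T-1)=(1+2)\cdot 0.01\cdot 7=0.21.$$

Conversely, a target tolerance $\varepsilon$ is met by any $\delta\leq\varepsilon/\big((\widetilde{L}+\rho)(T-1)\big)$ , so the admissible stabilization radius shrinks linearly with the number of stages.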

4 Examples

In this section, we will present two applications of the SLDP algorithm for stochastic optimization. The first is a simple example of a 1-dimensional dynamics with discrete control. Due to its symmetry and relative simplicity, it is possible to evaluate the cost-to-go functions, so that we can understand the behavior of the algorithm in its different forms. The second one has been extracted from [Carøe and Schultz, 1997] and [Ahmed et al., 2004], and is a 2-stage problem, for which enumeration can be performed in order to also evaluate the optimal solution and cost-to-go function.

4.1 Implementation details

The non-convex cuts used in SLDP are represented as inequalities of the form

[TABLE]

where $\lambda=0$ for the reverse-norm cuts, but is needed for the augmented Lagrangian cuts. Incorporating them in the stage problems requires choosing a norm and a MIP formulation of this constraint. For the experiments below, we have used the $L^{1}$ norm

[TABLE]

and each term in (37) is given by the sum $(u_{j}^{+}+u_{j}^{-})$ from the following system:

[TABLE]

The constants $M_{j}$ are large enough so that $\operatorname{proj}_{x_{j}}(X)$ has diameter less than $M_{j}$ , which is ensured by the compactness assumption of $\operatorname{proj}_{x}(X)$ .
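To make the role of the binary variable concrete, the sketch below checks a standard big-M encoding of $|x_{j}-\bar{x}_{j}|$ . It is an assumption that the system used in the experiments has this shape: the constraints $x_{j}-\bar{x}_{j}=u_{j}^{+}-u_{j}^{-}$ , $0\leq u_{j}^{+}\leq M_{j}b_{j}$ , $0\leq u_{j}^{-}\leq M_{j}(1-b_{j})$ with $b_{j}\in\{0,1\}$ , as well as the names `split_abs`, `is_feasible` and `center_j`, are hypothetical and chosen for illustration.

```python
def split_abs(x_j, center_j, M_j):
    """Return (u_plus, u_minus, b) encoding |x_j - center_j| = u_plus + u_minus."""
    d = x_j - center_j
    if abs(d) > M_j:
        raise ValueError("M_j must bound the diameter of the feasible states")
    b = 1 if d >= 0 else 0
    u_plus = d if b == 1 else 0.0
    u_minus = -d if b == 0 else 0.0
    return u_plus, u_minus, b


def is_feasible(x_j, center_j, M_j, u_plus, u_minus, b, tol=1e-9):
    """Check the assumed big-M system for one coordinate."""
    return (
        b in (0, 1)
        and abs((u_plus - u_minus) - (x_j - center_j)) <= tol
        and -tol <= u_plus <= M_j * b + tol
        and -tol <= u_minus <= M_j * (1 - b) + tol
    )


u_plus, u_minus, b = split_abs(0.3, 1.0, 5.0)
ok = is_feasible(0.3, 1.0, 5.0, u_plus, u_minus, b)
# Without the binary switch, both parts could be inflated simultaneously:
inflated_ok = is_feasible(0.3, 1.0, 5.0, u_plus + 1.0, u_minus + 1.0, b)
print(u_plus, u_minus, b, ok, inflated_ok)
```

The binary variable matters because the norm enters the cut with a negative sign: if $u_{j}^{+}$ and $u_{j}^{-}$ could both be positive, their sum would overestimate the distance and the solver could weaken the cut at will.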

Observe that this formulation includes a binary variable (and two continuous variables), per dimension of $x$ , for each new non-convex cut we introduce. This makes each iteration of the SLDP algorithm much more expensive than the previous ones.

One practical implementation of the SLDP method uses augmented Lagrangian cuts, and increases $\rho$ progressively. The augmented Lagrangian cuts are valid by construction; if $\rho$ is not large enough then the cuts might not be tight, but they might fill the non-convex regions of the cost-to-go function faster. Also, in analogy to Strengthened Benders cuts, it is possible to fix both the Lagrange multiplier and the augmenting term, and solve the resulting augmented Lagrangian relaxation. This again yields a valid cut, which we call a strengthened augmented Benders cut.
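
The effect of increasing $\rho$ can be illustrated on a toy 1-dimensional function. The sketch below fixes a multiplier $\lambda$ and a penalty $\rho$, computes the intercept $v=\min_{x}\{q(x)-\lambda(x-\bar{x})+\rho\lvert x-\bar{x}\rvert\}$ by brute force over a grid, and returns the cut $v+\lambda(x-\bar{x})-\rho\lvert x-\bar{x}\rvert$. The grid search stands in for the MILP relaxation solved in practice, and the function $q$, the grid, the point $\bar{x}=0$ and the $\rho$ schedule are all assumptions made for illustration.

```python
def stage_cost(u):
    """Toy nonconvex function min{|u - 1|, |u + 1|}; Lipschitz constant 1."""
    return min(abs(u - 1.0), abs(u + 1.0))


def augmented_cut(q, x_bar, lam, rho, grid):
    """Return (v, cut) where cut(x) = v + lam*(x - x_bar) - rho*|x - x_bar|
    and v is the augmented Lagrangian relaxation value over the grid."""
    v = min(q(x) - lam * (x - x_bar) + rho * abs(x - x_bar) for x in grid)

    def cut(x):
        return v + lam * (x - x_bar) - rho * abs(x - x_bar)

    return v, cut


GRID = [i / 100 for i in range(-300, 301)]  # hypothetical compact state set [-3, 3]
X_BAR = 0.0

intercepts = {}
valid = {}
for rho in (0.25, 0.5, 1.0):  # hypothetical increasing schedule for rho
    v, cut = augmented_cut(stage_cost, X_BAR, 0.0, rho, GRID)
    valid[rho] = all(cut(x) <= stage_cost(x) + 1e-9 for x in GRID)
    intercepts[rho] = v
```

Every cut in the loop lies below `stage_cost` on the grid, whatever the value of `rho`. The intercept only reaches `stage_cost(X_BAR) = 1`, that is, the cut only becomes tight, once `rho` reaches the Lipschitz constant of the toy function.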

All results below were obtained using Julia-0.6.3 [Bezanson et al., 2017] and the Julia packages SDDP.jl [Dowson and Kapelevich, 2017] and SDDiP.jl [Kapelevich, 2018], besides our own Julia implementation for both Lipschitz and strengthened augmented Benders cuts [Freitas Paulo da Costa, 2019], extending SDDP.jl. The computations were performed on an Intel(R) Xeon(R) E5-2603 CPU.

4.2 A 1-dimensional control problem

We consider the following multistage control problem:

[TABLE]

The state variable $x_{t}$ is 1-dimensional, as are the discrete control $c_{t}=\pm 1$ and the uncertainty $\xi_{t}$ . The objective is to minimize the expected displacement away from zero, subject to a decay factor $\beta$ , over the planning horizon $T$ . We fix $T=8$ , $\beta=0.9$ , $x_{0}=2$ , and at each stage $t$ we consider 10 independent scenarios symmetrically sampled around 0.

We will compare the performance and the policy generated by several methods: a convex approximation using Strengthened Benders cuts (shortened as SB), the original SLDP algorithm using reverse-norm cuts (SLDP tents), a modified SLDP algorithm using ALD cuts with increasing augmentation $\rho$ (SLDP ALD), and the SDDiP algorithm [Zou et al., 2018], using two discretization steps: $0.1$ and $0.01$ . The resulting discretized problems for SDDiP don’t have complete continuous recourse, since the state cannot absorb the noise below the discretization level, and we only have a discrete control, so we also add a slack variable and penalize it in the objective function. The original problem, with continuous state, doesn’t need adjustments.

We present in Table 1 the lower bounds, the estimated upper bounds using policy simulations and the computation times after 100 iterations for each method.

The convex approximations stall at a very low lower bound, while the non-convex methods all have better estimates, but they also need significantly more computation time. The SLDP approximations have a very similar performance, with the ALD method requiring slightly more time. SDDiP has a relatively good performance with step $0.1$ , but not with $0.01$ . Observe that the higher lower bound for SDDiP with step $0.1$ also comes with a higher upper bound, which is due to the addition of the penalization term and a loose state discretization. When the discretization step is the smaller $0.01$ , the upper bounds of the simulation agree more closely with the other cases, but we spend $66\%$ more computation time, and the lower bound is much further away.
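
The comparison above can be reproduced directly from the numbers in Table 1. The short sketch below computes, for each method, the relative gap $(\mathrm{UB}-\mathrm{LB})/\mathrm{UB}$ and the time ratio between the two SDDiP runs. The gap definition is our own choice for illustration, and the upper bounds are simulation estimates, so the gaps are only indicative.

```python
# Lower bound, estimated upper bound and time (s) after 100 iterations (Table 1).
TABLE_1 = {
    "SB":         {"lb": 1.167, "ub": 3.453, "time": 12},
    "SLDP tents": {"lb": 3.073, "ub": 3.320, "time": 558},
    "SLDP ALD":   {"lb": 3.085, "ub": 3.313, "time": 605},
    "SDDiP 0.1":  {"lb": 3.420, "ub": 3.823, "time": 1994},
    "SDDiP 0.01": {"lb": 2.370, "ub": 3.490, "time": 3317},
}


def relative_gap(lb, ub):
    """Relative gap (UB - LB) / UB; an illustrative definition."""
    return (ub - lb) / ub


gaps = {name: relative_gap(row["lb"], row["ub"]) for name, row in TABLE_1.items()}
time_ratio = TABLE_1["SDDiP 0.01"]["time"] / TABLE_1["SDDiP 0.1"]["time"]
smallest_gap = min(gaps, key=gaps.get)
```

With these figures the SB gap is about 66% of its upper bound, both SLDP variants are near 7%, and the finer SDDiP discretization takes roughly 1.66 times as long as the coarser one.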

As we can see in figure 5, the future cost functions are nonconvex at all time stages, essentially driven by the discontinuous control $c_{t}$ : the immediate cost is $\min\{{\left\lvert u-1\right\rvert},{\left\lvert u+1\right\rvert}\}$ , where $u=x_{t-1}+\xi_{t}$ .

However, the future cost functions built by the convex Strengthened Benders cuts can’t pierce into the nonconvexities, and become flat over $[-1,1]$ , as depicted in figure 6. This explains why the convex approximations perform so poorly in this case.
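
A small numerical check makes the flattening concrete for the immediate cost $\min\{\lvert u-1\rvert,\lvert u+1\rvert\}$ alone. The sketch below searches over affine minorants $a+bu$ (the shape of a Benders cut) and reports the best value any of them can attain at a given point. The grid, the slope range and the function names are assumptions for illustration; the expected cost-to-go functions of the actual problem are not computed here.

```python
def immediate_cost(u):
    """Immediate cost min{|u - 1|, |u + 1|} from the text."""
    return min(abs(u - 1.0), abs(u + 1.0))


GRID = [i / 100 for i in range(-300, 301)]   # hypothetical range for u
SLOPES = [k / 10 for k in range(-10, 11)]    # candidate slopes b


def best_affine_minorant_at(u0):
    """Largest value at u0 of an affine function a + b*u that stays
    below immediate_cost on the whole grid."""
    best = float("-inf")
    for b in SLOPES:
        # Largest intercept keeping a + b*u <= immediate_cost(u) on the grid.
        a = min(immediate_cost(u) - b * u for u in GRID)
        best = max(best, a + b * u0)
    return best


# Convexity fails: the midpoint value exceeds the average of the endpoints.
midpoint_value = immediate_cost(0.0)
chord_value = 0.5 * (immediate_cost(-1.0) + immediate_cost(1.0))
convex_bound_at_zero = best_affine_minorant_at(0.0)
```

Here `midpoint_value` is 1 while no affine minorant exceeds 0 at $u=0$, so a purely convex approximation cannot see the peak between the two wells.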

The Lipschitz cuts, on the other hand, are indeed able to yield a better approximation of the problem, and approximate the expected cost-to-go functions inside their nonconvexities. The same happens with the value function obtained by SDDiP. In figure 7, we show a comparison of the expected cost-to-go functions thus constructed, using reverse-norm cuts, augmented Lagrangian cuts, SDDiP, and the actual future cost function.

One particular feature of this example is that the future cost functions are continuous in the original variables $x_{t}$ , but since the control is $c_{t}\in\{-1,1\}$ , the stage problems lack the continuous recourse property. For this reason, when one takes the SDDiP discretization of the state variable, one must also include a slack term in the state dynamics. This increases the costs overall, and explains why the value functions estimated with the “coarse” discretization with $\varepsilon=0.1$ are higher than the true expected cost-to-go functions ${\overline{Q}}_{t}$ . The lower bounds ${\overline{\mathfrak{Q}}}_{t}$ obtained with the “fine” discretization with $\varepsilon=0.01$ , on the other hand, “detach” much faster from the respective ${\overline{Q}}_{t}$ as we move back in the stages of the tree.

Finally, we compare in figure 8 the evolution of the lower bounds, both as iterations and time increase.

There, we see that both methods for SLDP were essentially equivalent, maybe except at the beginning, where the ALD’s $\rho$ was probably too small to yield good enough cuts. However, as iterations progressed, and $\rho$ was large enough, the algorithm quickly reached a comparable lower bound. We also include a curve for the time taken per iteration, to highlight the rapid increase in time for the SLDP algorithm, even in a 1-dimensional problem. This same phenomenon happens for SDDiP, on a much smaller scale, but it already starts out with a significantly larger iteration time.

4.3 A 2-dimensional example

We also study another example, taken from [Carøe and Schultz, 1997] and further adapted in [Ahmed et al., 2004]. This is a 2-stage problem, with cost-to-go function

[TABLE]

The random variable $\omega$ lies in $[5,15]^{2}$ , and is approximated using $N^{2}$ points on the square, where $N=2,3,6$ .
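
The text does not say how the $N^{2}$ points are placed. A natural reading, assumed here, is an equidistant lattice on $[5,15]^{2}$ with equal probabilities; the sketch below builds such a scenario set under that assumption, with hypothetical function and parameter names.

```python
from itertools import product


def lattice_scenarios(n, lo=5.0, hi=15.0):
    """Return a list of ((omega_1, omega_2), probability) pairs on an
    equidistant n-by-n lattice over [lo, hi]^2 with uniform weights.
    The lattice layout and the uniform weights are assumptions."""
    if n < 2:
        raise ValueError("need at least two points per axis")
    step = (hi - lo) / (n - 1)
    axis = [lo + i * step for i in range(n)]
    points = list(product(axis, repeat=2))
    prob = 1.0 / len(points)
    return [(point, prob) for point in points]


scenario_sets = {n: lattice_scenarios(n) for n in (2, 3, 6)}
```

Under this assumption the three cases have 4, 9 and 36 second-stage scenarios, with axis values $\{5,15\}$, $\{5,10,15\}$ and $\{5,7,\dots,15\}$ respectively.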

In [Carøe and Schultz, 1997], the authors consider the optimization problem with discrete first-stage decisions:

[TABLE]

The same value function for the second stage can be used for a continuous first-stage problem, as treated by [Ahmed et al., 2004], where one drops the constraint that the first-stage variables belong to $\mathbb{Z}$ .

In this case, the cost-to-go function is discontinuous in $x$ , so for both SLDP and SDDiP we would need to add slack variables and their corresponding penalization in the objective function. Therefore, for this experiment we used only the convex approximations and the ALD cuts, which are valid despite the discontinuities of the value function. The convex approximations were calculated using 100 iterations (but stalled well before that), while the ALD approximations were carried out for 200 iterations. This is more than the number of possible values for the state variable $x$ , in the discrete case.

We give in tables 2 and 3, respectively for the discrete and continuous first stage variables, the lower bounds and computation time required for each method. We also include the true objective for each value of $N$ . In all cases, the convex cuts using Strengthened Benders are not able to close the gap, so we report the remaining gap for the SLDP-ALD as the ratio

[TABLE]

of the remaining gap. In the discrete case of table 2, we obtain exact results for $N=2$ and $N=3$ , and a significant reduction for the case $N=6$ . The fact that it does not converge is due to the need for even larger values of $\rho$ to get tight cuts for every node in the second stage. The continuous case is harder, and we have gaps now at $N=3$ as well. Besides the difficulties of achieving tight cuts as in the discrete case, the algorithm also needs to explore several points in the neighborhood of the optimal solution.

It is remarkable that the times required by the Lipschitz cuts are much smaller in the discrete setting than in the continuous setting, contrary to what happens for the convex case, which solves a harder problem in the first stage and therefore takes slightly more time. This is probably explained by the SLDP algorithm repeatedly constructing cuts at the same points in the discrete setting, so that the resulting stage problems don’t become much more difficult as time passes, as opposed to the continuous case, where the nodes for each cut are probably different. Also note that all those times, and especially the ALD times, are much larger than the ones reported in [Ahmed et al., 2004]. Besides a different computational setting, we don’t exploit the fact that the technology matrix is deterministic.

5 Conclusion

In this paper, we proposed a new algorithm for solving multistage stochastic MILP problems, called Stochastic Lipschitz Dynamic Programming (SLDP). Its major contribution is the inclusion of nonlinear cuts to iteratively underapproximate nonconvex Lipschitz cost-to-go functions. We explored two such families of cuts: (a) the ones induced by reverse-norm penalizations; and (b) augmented Lagrangian cuts, built from norm-augmented Lagrangian duality.

Assuming the Compact State Complete Continuous Recourse conditions, we proved convergence of the algorithm in the full scenario setting. In the sampled case, we provided an approximation method to reach $\varepsilon$ -optimal policies in finite time. Besides asymptotic convergence in the general Stochastic Multistage Lipschitz case, it would be interesting to prove a finite convergence result for Stochastic MILPs in the sampled case by further exploring the structure of the value functions in each stage.

Our experiments suggest that, at least for small-dimensional problems, the performance of the SLDP algorithm is reasonable.

Acknowledgements

This research has been supported in part by the National Science Foundation grant 1633196, the Office of Naval Research grant N00014-18-1-2075, and the COPPETec project IM-21780.

The second author would like to express his gratitude to the Brazilian Independent System Operator (ONS) for its support for this research.

This research project was concluded while the third author was visiting Georgia Tech during a sabbatical leave from UFRJ. He would like to warmly acknowledge the hospitality and the excellent environment of the ISyE institute.

Bibliography

[Ackooij et al., 2018] Ackooij, W., Lopez, I. D., Frangioni, A., Lacalandra, F., and Tahanan, M. (2018). Large-scale unit commitment under uncertainty: an updated literature survey. Annals of Operations Research, 271(1):11–85.
[Ahmed et al., 2004] Ahmed, S., Tawarmalani, M., and Sahinidis, N. V. (2004). A finite branch-and-bound algorithm for two-stage stochastic integer programs. Mathematical Programming, Series A, 100:355–377.
[Bezanson et al., 2017] Bezanson, J., Edelman, A., Karpinski, S., and Shah, V. B. (2017). Julia: A fresh approach to numerical computing. SIAM Review, 59(1):65–98.
[Birge and Louveaux, 2011] Birge, J. R. and Louveaux, F. (2011). Introduction to stochastic programming. Springer Science & Business Media.
[Carøe and Schultz, 1997] Carøe, C. C. and Schultz, R. (1997). Dual decomposition in stochastic integer programming. Operations Research Letters, 24:37–45.
[Cerisola et al., 2012] Cerisola, S., Latorre, J. M., and Ramos, A. (2012). Stochastic dual dynamic programming applied to nonconvex hydrothermal models. European Journal of Operational Research, 218(3):687–697.
[Conejo et al., 2016] Conejo, A. J., Morales, L. B., Kazempour, S. J., and Siddiqui, A. S. (2016). Investment in Electricity Generation and Transmission: Decision Making Under Uncertainty. Springer Publishing Company, Incorporated, 1st edition.
[Costley et al., 2017] Costley, M., Feizollahi, M. J., Ahmed, S., and Grijalva, S. (2017). A rolling-horizon unit commitment framework with flexible periodicity. International Journal of Electrical Power & Energy Systems, 90.
