Learning continuous Q-Functions using generalized Benders cuts

Joseph Warrington

arXiv:1902.07664·math.OC·February 21, 2019·ECC

TL;DR

This paper introduces a model-based algorithm using generalized Benders cuts to approximate the optimal Q-function in continuous control problems, providing finite-iteration guarantees on Bellman error reduction.

Contribution

It presents a novel Benders-based method for continuous Q-function approximation with proven finite-iteration optimality guarantees.

Findings

01

Algorithm converges to arbitrarily small Bellman error in finite steps.

02

Guarantees hold for both fixed and online input selection scenarios.

03

Numerical experiments demonstrate effectiveness on scalar and high-dimensional systems.

Tables

Table I: Performance statistics (mean $\pm$ standard deviation) for Variant B of Algorithm 1; 20 random 8-state, 3-input systems.

| $M$ | Iterations to termination | Computation time (s) | Number of cuts added ¹ |
|---|---|---|---|
| 10 | 217.3 $\pm$ 57.9 | 0.313 $\pm$ 0.120 | 160.8 $\pm$ 31.7 |
| 20 | 485.7 $\pm$ 74.5 | 1.054 $\pm$ 0.296 | 356.0 $\pm$ 71.0 |
| 50 | 1293 $\pm$ 193 | 6.083 $\pm$ 1.724 | 907.0 $\pm$ 163.4 |
| 100 | 2736 $\pm$ 491 | 25.09 $\pm$ 8.60 | 1828 $\pm$ 367 |
| 200 | 6043 $\pm$ 1017 | 114.7 $\pm$ 34.5 | 3696 $\pm$ 626 |
| 500 | 16245 $\pm$ 2527 | 777.1 $\pm$ 215.3 | 9654 $\pm$ 1644 |

Equations

$$
\begin{aligned}
V^{\star}(x) := \inf_{u_{0},u_{1},\ldots}\ & \sum_{t=0}^{\infty}\gamma^{t}\ell(x_{t},u_{t}) && \text{(1a)}\\
\text{s.t.}\ & x_{t+1}=f(x_{t},u_{t}),\quad t=0,1,\ldots, && \text{(1b)}\\
& h(x_{t},u_{t})\leq 0,\quad t=0,1,\ldots, && \text{(1c)}\\
& x_{0}=x. && \text{(1d)}
\end{aligned}
$$

$$Q^{\pi}(x,u)=\ell(x,u)+\sum_{t=1}^{\infty}\gamma^{t}\ell(x_{t},\pi(x_{t})), \tag{2}$$

$$Q^{\star}(x,u)=\ell(x,u)+\gamma\inf_{u^{\prime}\in\mathcal{U}(f(x,u))}Q^{\star}(f(x,u),u^{\prime}) \tag{3}$$

$$\mathcal{T_{Q}}Q(x,u):=\ell(x,u)+\gamma\inf_{u^{\prime}\in\mathcal{U}(f(x,u))}Q(f(x,u),u^{\prime}). \tag{4}$$

$$\pi(x;Q)\in\arg\min_{u\in\mathcal{U}(x)}Q(x,u). \tag{5}$$

$$
\begin{aligned}
& Q_{a}(x,u)\leq Q_{b}(x,u)\quad\forall(x,u)\in\mathcal{X}\times\mathcal{U}\\
& \Rightarrow\ \mathcal{T_{Q}}Q_{a}(x,u)\leq\mathcal{T_{Q}}Q_{b}(x,u)\quad\forall(x,u)\in\mathcal{X}\times\mathcal{U}.
\end{aligned} \tag{6}
$$

$$Q_{I}(x,u)=\max_{i=0,\ldots,I}\{q_{i}(x,u)\}, \tag{7}$$

$$q_{i}(x,u)\leq Q^{\star}(x,u),\quad\forall(x,u)\in\mathcal{X}\times\mathcal{U}.$$

$$q_{I+1}(x,u)$$

$$\text{and}\quad q_{I+1}(\hat{x},\hat{u})$$

$$
\begin{aligned}
\mathcal{T_{Q}}Q_{I}(\hat{x},\hat{u})=\inf_{x^{\prime},u^{\prime}}\ & \ell(\hat{x},\hat{u})+\gamma\max_{i=0,\ldots,I}\{q_{i}(x^{\prime},u^{\prime})\}\\
\text{s.t.}\ & x^{\prime}=f(\hat{x},\hat{u}),\\
& h(x^{\prime},u^{\prime})\leq 0,
\end{aligned}
$$

$$
\begin{aligned}
\mathcal{T_{Q}}Q_{I}(\hat{x},\hat{u})=\inf_{x^{\prime},u^{\prime},\alpha}\ & \ell(\hat{x},\hat{u})+\gamma\alpha && \text{(9a)}\\
\text{s.t.}\ & x^{\prime}=f(\hat{x},\hat{u}), && \text{(9b)}\\
& h(x^{\prime},u^{\prime})\leq 0, && \text{(9c)}\\
& q_{i}(x^{\prime},u^{\prime})\leq\alpha,\quad i=0,\ldots,I. && \text{(9d)}
\end{aligned}
$$

$$
\begin{aligned}
L(x^{\prime},u^{\prime},\alpha,\nu,\lambda_{c},\lambda_{\alpha}):=\ & \ell(\hat{x},\hat{u})+\gamma\alpha+\nu^{\top}(f(\hat{x},\hat{u})-x^{\prime})\\
& +\lambda_{c}^{\top}h(x^{\prime},u^{\prime})+\sum_{i=0}^{I}\lambda_{\alpha,i}(q_{i}(x^{\prime},u^{\prime})-\alpha).
\end{aligned}
$$

$$
\begin{aligned}
J_{D}(\hat{x},\hat{u}):=\sup_{\nu,\lambda_{c},\lambda_{\alpha}}\ & \ell(\hat{x},\hat{u})+\nu^{\top}f(\hat{x},\hat{u})+\xi(\nu,\lambda_{c},\lambda_{\alpha})\\
\text{s.t.}\ & \mathbf{1}^{\top}\lambda_{\alpha}=\gamma,\\
& \lambda_{c}\geq 0,\ \lambda_{\alpha}\geq 0,
\end{aligned} \tag{10}
$$

$$\xi(\nu,\lambda_{c},\lambda_{\alpha}):=\inf_{x^{\prime},u^{\prime}}\Big\{-\nu^{\top}x^{\prime}+\lambda_{c}^{\top}h(x^{\prime},u^{\prime})+\sum_{i=0}^{I}\lambda_{\alpha,i}\,q_{i}(x^{\prime},u^{\prime})\Big\}$$

$$q_{I+1}(x,u):=\ell(x,u)+\hat{\nu}^{\star\top}f(x,u)+\xi(\hat{\nu}^{\star},\hat{\lambda}_{c}^{\star},\hat{\lambda}_{\alpha}^{\star}) \tag{11}$$

$$\ell(\overline{x},\overline{u})+\hat{\nu}^{\star\top}f(\overline{x},\overline{u})+\xi(\hat{\nu}^{\star},\hat{\lambda}_{c}^{\star},\hat{\lambda}_{\alpha}^{\star})\leq J_{D}(\overline{x},\overline{u}).$$

$$\ell(\overline{x},\overline{u})+\hat{\nu}^{\star\top}f(\overline{x},\overline{u})+\xi(\hat{\nu}^{\star},\hat{\lambda}_{c}^{\star},\hat{\lambda}_{\alpha}^{\star})\leq Q^{\star}(\overline{x},\overline{u}),$$

$$\varepsilon(x,u;Q_{I}):=\mathcal{T_{Q}}Q_{I}(x,u)-Q_{I}(x,u).$$

$$Q_{I+1}(\cdot,\cdot)=\max\{q_{I+1}(\cdot,\cdot),Q_{I}(\cdot,\cdot)\}.$$

$$Q_{I+1}(\hat{x},\hat{u})>Q_{I}(\hat{x},\hat{u}),$$

$$\{\varepsilon(x_{m},\pi(x_{m};Q_{I_{m}});Q_{I_{m}})\}_{I_{m}=0}^{\infty}$$

$$Q_{I_{m}+1}(x_{m},\pi(x_{m};Q_{I_{m}}))-Q_{I_{m}}(x_{m},\pi(x_{m};Q_{I_{m}}))=\varepsilon(x_{m},\pi(x_{m};Q_{I_{m}});Q_{I_{m}}).$$

$$Q_{I_{m}+1}(x_{m},\pi(x_{m};Q_{I_{m}}))-Q_{I_{m}}(x_{m},\pi(x_{m};Q_{I_{m}}))\geq\delta.$$

Full text

Learning continuous $Q$ -functions using generalized Benders cuts

Joseph Warrington

The author is with the Automatic Control Laboratory, Swiss Federal Institute of Technology (ETH) Zurich, Physikstrasse 3, 8092 Zurich, Switzerland. Contact: [email protected]

Abstract

$\bm{Q}$ -functions are widely used in discrete-time learning and control to model future costs arising from a given control policy, when the initial state and input are given. Although some of their properties are understood, $\bm{Q}$ -functions generating optimal policies for continuous problems are usually hard to compute. Even when a system model is available, optimal control is generally difficult to achieve except in rare cases where an analytical solution happens to exist, or an explicit exact solution can be computed. It is typically necessary to discretize the state and action spaces, or parameterize the $\bm{Q}$ -function with a basis that can be hard to select a priori. This paper describes a model-based algorithm based on generalized Benders theory that yields ever-tighter outer-approximations of the optimal $\bm{Q}$ -function. Under a strong duality assumption, we prove that the algorithm yields an arbitrarily small Bellman optimality error at any finite number of arbitrary points in the state-input space, in finite iterations. Under additional assumptions, the same guarantee holds when the inputs are determined online by the algorithm’s updating $\bm{Q}$ -function. We demonstrate these properties numerically on scalar and 8-dimensional systems.

I Introduction

Reinforcement learning (RL) and approximate dynamic programming (ADP) commonly employ so-called $Q$ -functions to model the costs incurred in the future evolution of a discrete-time system under a given control policy. The $Q$ -function associated with control policy $u=\pi(x)$ takes a state $\hat{x}$ and input $\hat{u}$ as parameters, and is equal to the stage costs incurred immediately for $(\hat{x},\hat{u})$ plus the costs (typically infinite-horizon with a discount factor) of following policy $\pi$ thereafter.

$Q$ -functions are widely associated with RL (i.e., model-free learning), thanks to work stemming from Watkins’ $Q$ -learning algorithm [17]. However, model-free $Q$ -learning suffers from slow convergence, despite new insights into optimizing the rate [7]. Recent work such as [11] reflects interest in cases where model data, whether known or itself learned, can improve learning performance for difficult control problems. The present paper is motivated by a desire to learn an approximate $Q$ -function to control a system with a known model.

Mathematically, $Q$ -functions have much in common with value (or $V$ -) functions, the chief difference being that they are defined on state-input space rather than on the state alone. Although they are generally more expensive to store, their higher-dimensional domain often makes approximate, finitely-parameterized $Q$ -functions more expressive than $V$ -functions [4, Ch. 2].

It is common to discretize continuous problems in order to obtain a finite parameterization of the $V$ - or $Q$ -function [5]. However, performing even one iteration of the canonical algorithms, such as value iteration, then has an undesirable exponential cost. ADP methods have arisen to find more tractable parameterizations of the continuous $V$ -function. Several are based on continuous extensions of the “linear programming approach” to ADP [6], in which a valid lower bound on the optimal value function is maximized. Examples include the quadratic lower bound in [14], and the polynomial derived using sum-of-squares techniques in [13]. Approximate $V$ -functions represented as the pointwise maximum of multiple lower-bounding functions have been used in [10, 15, 1, 9]. Recent work utilizing a point-wise maximum representation [16] has extended the Benders decomposition argument used for linear multi-stage decision problems in Dual DP (DDP, [12]), to a general nonlinear, infinite-horizon setting.

In this paper we adapt the Benders approach from [16] to learn $Q$ -functions. We define an algorithm that successively produces tighter approximations of a problem’s optimal $Q$ -function from below, and prove convergence results for off-policy and policy-driven learning of the $Q$ -function in this manner. In the former case, $(x,u)$ pairs are pre-selected at the start of the algorithm, whereas in the latter case, only the $x$ points are pre-selected, and the $u$ decisions are made according to the policy derived from the updated $Q$ -function estimate. We then demonstrate the method’s efficacy for test systems.

Section II describes the infinite horizon problem, Section III describes the Benders decomposition approach, and Section IV proposes an algorithm and proves its key properties. Section V presents numerical examples, and Section VI concludes.

II Problem statement

II-A Infinite-horizon control problem

The scope considered is the class of infinite-horizon, discrete-time, deterministic optimal control problems with time-invariant stage cost functions, dynamics, and constraints:

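$$
\begin{aligned}
V^{\star}(x) := \inf_{u_{0},u_{1},\ldots}\ & \sum_{t=0}^{\infty}\gamma^{t}\ell(x_{t},u_{t}) && \text{(1a)}\\
\text{s.t.}\ & x_{t+1}=f(x_{t},u_{t}),\quad t=0,1,\ldots, && \text{(1b)}\\
& h(x_{t},u_{t})\leq 0,\quad t=0,1,\ldots, && \text{(1c)}\\
& x_{0}=x. && \text{(1d)}
\end{aligned}
$$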

For each time step $t$ we denote the state $x_{t}\in\mathcal{X}\subseteq\mathbb{R}^{n_{x}}$ , and the action, or input, $u_{t}\in\mathcal{U}\subseteq\mathbb{R}^{n_{u}}$ . Sets $\mathcal{X}$ and $\mathcal{U}$ are the state and action spaces, and are continuous. Future costs are discounted according to a discount factor $\gamma\in\left(0,1\right]$ , the (non-negative) stage cost function is $\ell:\mathcal{X}\times\mathcal{U}\rightarrow\mathbb{R}_{+}$ , and the dynamics are governed by the mapping $f:\mathcal{X}\times\mathcal{U}\rightarrow\mathcal{X}$ . There are $n_{c}$ state-input constraints (1c), parameterized by a vector-valued mapping $h:\mathcal{X}\times\mathcal{U}\rightarrow\mathbb{R}^{n_{c}}$ . The parametric infimum $V^{\star}(x)$ of problem (1) is referred to as the optimal value function (or optimal $V$ -function) of the problem.

II-B $Q$ -functions

We now define $Q$ -functions and briefly state some of their well-known properties for later use. For more detail, see for example [4, Chapter 2]. Given a policy $\pi:\mathcal{X}\rightarrow\mathcal{U}$ , its associated $Q$ -function, $Q^{\pi}:\mathcal{X}\times\mathcal{U}\rightarrow\mathbb{R}\cup\{+\infty\}$ , is

$$Q^{\pi}(x,u)=\ell(x,u)+\sum_{t=1}^{\infty}\gamma^{t}\ell(x_{t},\pi(x_{t})), \tag{2}$$

in which the relation $x_{t+1}=f(x_{t},\pi(x_{t}))$ holds for $t\geq 1$ , and $x_{1}=f(x,u)$ . The $Q$ -function is the sum of the stage cost incurred for some initial state and input $x$ and $u$ , and the infinite sum of (discounted) costs under policy $\pi$ thereafter.
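Definition (2) can be evaluated approximately by rolling the model forward and truncating the infinite sum. The following is a minimal sketch under stated assumptions: the function name `q_pi`, the truncation length `horizon`, and the test policy `zero_policy` are illustrative choices not taken from the paper, while the system and stage cost are those of the scalar example in Section V.

```python
def q_pi(x, u, policy, f, stage_cost, gamma=1.0, horizon=500):
    """Approximate Q^pi(x, u) in (2) by truncating the infinite sum at `horizon`."""
    total = stage_cost(x, u)      # stage cost incurred immediately at (x, u)
    x_t = f(x, u)                 # x_1 = f(x, u)
    for t in range(1, horizon + 1):
        u_t = policy(x_t)         # inputs follow the policy from t = 1 onwards
        total += gamma**t * stage_cost(x_t, u_t)
        x_t = f(x_t, u_t)
    return total


# Illustrative data: the scalar system and stage cost used in Section V.
f = lambda x, u: 0.9 * x + u
stage_cost = lambda x, u: 0.5 * x**2 + 0.5 * u**2
zero_policy = lambda x: 0.0       # hypothetical policy, chosen for a closed-form check

value = q_pi(1.0, 0.0, zero_policy, f, stage_cost)
```

For this particular policy the state decays as $x_{t}=0.9^{t}$, so the sum is a geometric series with value $0.5/(1-0.81)$, which gives a simple check of the rollout.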

The optimal $Q$ -function, which we denote $Q^{\star}$ , minimizes (2) over policies $\pi$ , and satisfies

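$$Q^{\star}(x,u)=\ell(x,u)+\gamma\inf_{u^{\prime}\in\mathcal{U}(f(x,u))}Q^{\star}(f(x,u),u^{\prime}) \tag{3}$$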

for all $(x,u)\in\mathcal{X}\times\mathcal{U}$ . In (3) we use the notation $\mathcal{U}(x):=\{u\in\mathcal{U}\,:\,h(x,u)\leq 0\}$ . An associated Bellman operator for $Q$ -functions, $\mathcal{T_{Q}}$ , can be defined as

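$$\mathcal{T_{Q}}Q(x,u):=\ell(x,u)+\gamma\inf_{u^{\prime}\in\mathcal{U}(f(x,u))}Q(f(x,u),u^{\prime}). \tag{4}$$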

On the left-hand side, $\mathcal{T_{Q}}Q$ is to be interpreted as a new function with the same domain as $Q$ , and evaluated at $(x,u)$ . Thus, condition (3) can be written $\mathcal{T_{Q}}Q^{\star}(x,u)=Q^{\star}(x,u)$ for all $(x,u)\in\mathcal{X}\times\mathcal{U}$ . If an optimal $Q$ - and $V$ -function exist for problem (1), they are related by $V^{\star}(x)=\inf_{u\in\mathcal{U}(x)}Q^{\star}(x,u)$ , and thus from (3), $Q^{\star}(x,u)=\ell(x,u)+\gamma V^{\star}(f(x,u))$ .
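The operator $\mathcal{T_{Q}}$ maps a function to a new function on the same domain, which is straightforward to mirror in code. The sketch below is illustrative only: it assumes the input constraint $|u|\leq 1$ of the scalar example in Section V, and it replaces the infimum over $\mathcal{U}(f(x,u))$ with a minimum over a finite grid `u_grid`, which is a simplification rather than the paper's method.

```python
def bellman_operator(Q, f, stage_cost, gamma, u_grid):
    """Return the function T_Q Q of (4), with the infimum taken over u_grid."""
    def TQ(x, u):
        x_next = f(x, u)          # successor state f(x, u)
        return stage_cost(x, u) + gamma * min(Q(x_next, v) for v in u_grid)
    return TQ


# Illustrative data: scalar system of Section V, inputs gridded on [-1, 1].
f = lambda x, u: 0.9 * x + u
stage_cost = lambda x, u: 0.5 * x**2 + 0.5 * u**2
u_grid = [(k - 100) / 100 for k in range(201)]

# Apply the operator to Q = stage cost and measure T_Q Q - Q at one point.
TQ0 = bellman_operator(stage_cost, f, stage_cost, 1.0, u_grid)
bellman_error = TQ0(1.0, 0.5) - stage_cost(1.0, 0.5)
```

With $Q=\ell$, the inner minimum is attained at $u^{\prime}=0$, so the difference at $(1, 0.5)$ is $\tfrac{1}{2}f(1,0.5)^{2}=0.98$.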

For any approximate $Q$ -function for which the infimum in (4) is attained, one can define an associated control policy consistent with definition (2):

$$\pi(x;Q)\in\arg\min_{u\in\mathcal{U}(x)}Q(x,u). \tag{5}$$

The attraction of a $Q$ -function is that in a wide range of cases it is simpler to solve (5) than it would be to solve, for the same $x$ , the full infinite-horizon problem (1), or a finite-horizon truncation thereof, as in Model Predictive Control (MPC) [2].

Lastly, for the benefit of developments in Section III, we note it is easy to show that the operator $\mathcal{T_{Q}}$ is monotonic:

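$$
\begin{aligned}
& Q_{a}(x,u)\leq Q_{b}(x,u)\quad\forall(x,u)\in\mathcal{X}\times\mathcal{U}\\
& \Rightarrow\ \mathcal{T_{Q}}Q_{a}(x,u)\leq\mathcal{T_{Q}}Q_{b}(x,u)\quad\forall(x,u)\in\mathcal{X}\times\mathcal{U}.
\end{aligned} \tag{6}
$$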

III Benders cuts

III-A Pointwise maximum representation

Let $Q_{I}:\mathcal{X}\times\mathcal{U}\rightarrow\mathbb{R}$ be a function of the following “pointwise maximum” form,

$$Q_{I}(x,u)=\max_{i=0,\ldots,I}\{q_{i}(x,u)\}, \tag{7}$$

where $I$ is a non-negative integer, and each function $q_{i}:\mathcal{X}\times\mathcal{U}\rightarrow\mathbb{R}$ is known to satisfy

$$q_{i}(x,u)\leq Q^{\star}(x,u),\quad\forall(x,u)\in\mathcal{X}\times\mathcal{U}.$$

Thus $\smash{Q_{I}(x,u)\leq Q^{\star}(x,u)}$ for all $\smash{(x,u)\in\mathcal{X}\times\mathcal{U}}$ . From (5) the control policy associated with $Q_{I}$ is simply $\pi(x;Q_{I})\in\arg\min_{u\in\mathcal{U}(x)}\max_{i=0,\ldots,I}\{q_{i}(x,u)\}$ .
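The pointwise maximum form (7) and its associated policy can be sketched in a few lines. This is an illustration under assumptions: `pointwise_max` and `greedy_policy` are hypothetical helper names, the minimization over $\mathcal{U}(x)$ is replaced by a search over a grid on $[-1,1]$, and `q1` is an example cut of the form $\ell+\nu f+\xi$ with hand-picked values $\nu=0.5$ and $\xi=-0.125$.

```python
def pointwise_max(cuts):
    """Q_I(x, u) = max_i q_i(x, u), as in (7)."""
    return lambda x, u: max(q(x, u) for q in cuts)


def greedy_policy(Q, u_grid):
    """pi(x; Q): a minimizer of Q(x, .) over a grid standing in for U(x)."""
    return lambda x: min(u_grid, key=lambda u: Q(x, u))


u_grid = [(k - 100) / 100 for k in range(201)]

# q0 is the stage cost; q1 is an illustrative cut (values chosen by hand).
q0 = lambda x, u: 0.5 * x**2 + 0.5 * u**2
q1 = lambda x, u: q0(x, u) + 0.5 * (0.9 * x + u) - 0.125

policy_0 = greedy_policy(pointwise_max([q0]), u_grid)
policy_1 = greedy_policy(pointwise_max([q0, q1]), u_grid)
u_before, u_after = policy_0(2.0), policy_1(2.0)
```

At $x=2$ the policy moves from $u=0$ to $u=-0.5$ once the second cut is included, which shows how adding a lower-bounding function changes the minimizer in (5).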

In Section IV we will propose an algorithm that uses $Q_{I}$ to construct an additional function, or “cut” $q_{I+1}$ . Under certain assumptions, the new cut satisfies

[TABLE]

Thus the new function, $Q_{I+1}(x,u):=\max\{Q_{I}(x,u),q_{I+1}(x,u)\}$ , will be a tighter under-approximation of $Q^{\star}$ than $Q_{I}$ . We now derive a Benders-type procedure to achieve this, which is related to that in [16] for $V$ -functions.

III-B Duality in operator $\mathcal{T_{Q}}$

We start by taking the dual of the minimization problem solved inside the operator $\mathcal{T_{Q}}$ at some point $(\hat{x},\hat{u})$ in the state-action space. For a function $Q_{I}$ taking the form (7), the right-hand side of (4) can be written equivalently as

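$$
\begin{aligned}
\mathcal{T_{Q}}Q_{I}(\hat{x},\hat{u})=\inf_{x^{\prime},u^{\prime}}\ & \ell(\hat{x},\hat{u})+\gamma\max_{i=0,\ldots,I}\{q_{i}(x^{\prime},u^{\prime})\}\\
\text{s.t.}\ & x^{\prime}=f(\hat{x},\hat{u}),\\
& h(x^{\prime},u^{\prime})\leq 0,
\end{aligned}
$$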

where the extra variable $x^{\prime}\in\mathbb{R}^{n_{x}}$ is introduced to model the successor state explicitly. An epigraph variable $\alpha\in\mathbb{R}$ can be introduced to replace the inconvenient maximum operator in the objective with $I+1$ separate constraints. This leads to an equivalent problem:

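$$
\begin{aligned}
\mathcal{T_{Q}}Q_{I}(\hat{x},\hat{u})=\inf_{x^{\prime},u^{\prime},\alpha}\ & \ell(\hat{x},\hat{u})+\gamma\alpha && \text{(9a)}\\
\text{s.t.}\ & x^{\prime}=f(\hat{x},\hat{u}), && \text{(9b)}\\
& h(x^{\prime},u^{\prime})\leq 0, && \text{(9c)}\\
& q_{i}(x^{\prime},u^{\prime})\leq\alpha,\quad i=0,\ldots,I. && \text{(9d)}
\end{aligned}
$$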

Assigning the Lagrange multipliers $\nu\in\mathbb{R}^{n_{x}}$ , $\lambda_{c}\in\mathbb{R}^{n_{c}}_{+}$ , and $\lambda_{\alpha}\in\mathbb{R}^{I+1}_{+}$ to constraints (9b), (9c), and (9d) respectively, one can form the Lagrangian,

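$$
\begin{aligned}
L(x^{\prime},u^{\prime},\alpha,\nu,\lambda_{c},\lambda_{\alpha}):=\ & \ell(\hat{x},\hat{u})+\gamma\alpha+\nu^{\top}(f(\hat{x},\hat{u})-x^{\prime})\\
& +\lambda_{c}^{\top}h(x^{\prime},u^{\prime})+\sum_{i=0}^{I}\lambda_{\alpha,i}(q_{i}(x^{\prime},u^{\prime})-\alpha).
\end{aligned}
$$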

Following standard procedure, the dual of (9) is then

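$$
\begin{aligned}
J_{D}(\hat{x},\hat{u}):=\sup_{\nu,\lambda_{c},\lambda_{\alpha}}\ & \ell(\hat{x},\hat{u})+\nu^{\top}f(\hat{x},\hat{u})+\xi(\nu,\lambda_{c},\lambda_{\alpha})\\
\text{s.t.}\ & \mathbf{1}^{\top}\lambda_{\alpha}=\gamma,\\
& \lambda_{c}\geq 0,\ \lambda_{\alpha}\geq 0,
\end{aligned} \tag{10}
$$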

where the function

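$$\xi(\nu,\lambda_{c},\lambda_{\alpha}):=\inf_{x^{\prime},u^{\prime}}\Big\{-\nu^{\top}x^{\prime}+\lambda_{c}^{\top}h(x^{\prime},u^{\prime})+\sum_{i=0}^{I}\lambda_{\alpha,i}\,q_{i}(x^{\prime},u^{\prime})\Big\}$$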

depends only on the multipliers. Although problem (9) may not be convex, the objective of (10) is always concave [3, §5.2], and weak duality implies $J_{D}(\hat{x},\hat{u})\leq\mathcal{T_{Q}}Q_{I}(\hat{x},\hat{u})$ for any choice of parameter $(\hat{x},\hat{u})\in\mathcal{X}\times\mathcal{U}$ .
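Weak duality can be checked by hand in the simplest case. The sketch below is a worked illustration under assumptions, not the paper's implementation: it takes the scalar system and stage cost of Section V with $I=0$ (so the only cut is $q_{0}=\ell$), fixes the feasible multipliers $\lambda_{\alpha,0}=\gamma$ and $\lambda_{c}=0$, and uses the closed form $\xi=-\nu^{2}/(2\gamma)$ that results from minimizing $-\nu x^{\prime}+\tfrac{\gamma}{2}(x^{\prime 2}+u^{\prime 2})$ over $(x^{\prime},u^{\prime})$.

```python
GAMMA = 1.0
f = lambda x, u: 0.9 * x + u
stage_cost = lambda x, u: 0.5 * x**2 + 0.5 * u**2


def primal_value(x_hat, u_hat):
    """T_Q Q_0 at (x_hat, u_hat): problem (9) with the single cut q_0 = stage cost."""
    x_next = f(x_hat, u_hat)
    # The inner minimum over |u'| <= 1 of stage_cost(x', u') is attained at u' = 0.
    return stage_cost(x_hat, u_hat) + GAMMA * 0.5 * x_next**2


def dual_objective(nu, x_hat, u_hat):
    """Dual objective with lambda_alpha = GAMMA and lambda_c = 0 (a feasible choice)."""
    xi = -nu**2 / (2.0 * GAMMA)   # closed-form xi for this special case
    return stage_cost(x_hat, u_hat) + nu * f(x_hat, u_hat) + xi


x_hat, u_hat = 2.0, -0.5
nu_star = GAMMA * f(x_hat, u_hat)             # maximizer of the concave dual objective
gap = primal_value(x_hat, u_hat) - dual_objective(nu_star, x_hat, u_hat)
```

Every value of `nu` gives a lower bound on the primal value, and in this convex example the bound is tight at `nu_star`, so `gap` is zero up to rounding.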

III-C Generalized Benders cut

Given a function $Q_{I}$ of the form (7) such that $Q_{I}\leq Q^{\star}$ , suppose that optimal multipliers $(\hat{\nu}^{\star},\hat{\lambda}_{c}^{\star},\hat{\lambda}_{\alpha}^{\star})$ are attained when (10) is solved with parameter $(\hat{x},\hat{u})$ . These can be used to form a new cut $q_{I+1}(\cdot,\cdot)$ with the following attractive properties.

Lemma III.1.

The function

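$$q_{I+1}(x,u):=\ell(x,u)+\hat{\nu}^{\star\top}f(x,u)+\xi(\hat{\nu}^{\star},\hat{\lambda}_{c}^{\star},\hat{\lambda}_{\alpha}^{\star}) \tag{11}$$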

satisfies $q_{I+1}(x,u)\leq Q^{\star}(x,u)$ for all $(x,u)\in\mathcal{X}\times\mathcal{U}$ .

Proof.

An optimal dual solution $(\hat{\nu}^{\star},\hat{\lambda}_{c}^{\star},\hat{\lambda}_{\alpha}^{\star})$ for parameter $(\hat{x},\hat{u})$ must in general be a suboptimal solution to problem (10) when any other parameter $(\overline{x},\overline{u})\in\mathcal{X}\times\mathcal{U}$ is used, i.e.,

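$$\ell(\overline{x},\overline{u})+\hat{\nu}^{\star\top}f(\overline{x},\overline{u})+\xi(\hat{\nu}^{\star},\hat{\lambda}_{c}^{\star},\hat{\lambda}_{\alpha}^{\star})\leq J_{D}(\overline{x},\overline{u}).$$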

Note that $(\hat{\nu}^{\star},\hat{\lambda}_{c}^{\star},\hat{\lambda}_{\alpha}^{\star})$ is feasible in (10) for all parameters $(\overline{x},\overline{u})$ , as the feasible set is independent of the parameter. From weak duality, $J_{D}(\overline{x},\overline{u})\leq\mathcal{T_{Q}}Q_{I}(\overline{x},\overline{u})$ . As we start with $Q_{I}\leq Q^{\star}$ on its domain, we have from the monotonicity property (6) that $\mathcal{T_{Q}}Q_{I}(\overline{x},\overline{u})\leq\mathcal{T_{Q}}Q^{\star}(\overline{x},\overline{u})$ , and the Bellman optimality condition states that $\mathcal{T_{Q}}Q^{\star}(\overline{x},\overline{u})=Q^{\star}(\overline{x},\overline{u})$ . Combining these relationships we obtain

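$$\ell(\overline{x},\overline{u})+\hat{\nu}^{\star\top}f(\overline{x},\overline{u})+\xi(\hat{\nu}^{\star},\hat{\lambda}_{c}^{\star},\hat{\lambda}_{\alpha}^{\star})\leq Q^{\star}(\overline{x},\overline{u}),$$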

and the result follows simply by noting that $(\overline{x},\overline{u})$ can refer to any $(x,u)\in\mathcal{X}\times\mathcal{U}$ in the argument above. ∎

This proof leverages the (generalized) Benders decomposition argument, which was first developed in [8] to partition a two-stage problem into two subproblems linked by an approximate value function. Here we have used the properties of $Q$ -functions to accommodate the infinite number of stages in problem (1). A similar result was derived for $V$ -functions in [16].

The following properties concern the violation of the Bellman optimality condition (3), or the $Q$ -Bellman error:

$$\varepsilon(x,u;Q_{I}):=\mathcal{T_{Q}}Q_{I}(x,u)-Q_{I}(x,u).$$

Lemma III.2.

If $\varepsilon(x,u;Q_{I})\geq 0$ for all $(x,u)\in\mathcal{X}\times\mathcal{U}$ , then $\varepsilon(x,u;Q_{I+1})\geq 0$ for all $(x,u)\in\mathcal{X}\times\mathcal{U}$ , where

$$Q_{I+1}(\cdot,\cdot)=\max\{q_{I+1}(\cdot,\cdot),Q_{I}(\cdot,\cdot)\}.$$

Proof.

A simple adaptation of [16, Lemma III.3]. ∎

Lemma III.3.

Suppose strong duality holds between problems (9) and (10) and that $\varepsilon(x,u;Q_{I})=\mathcal{T_{Q}}Q_{I}(x,u)-Q_{I}(x,u)\geq 0$ for all $(x,u)\in\mathcal{X}\times\mathcal{U}$ . Then if at some $(\hat{x},\hat{u})$ we have $\mathcal{T_{Q}}Q_{I}(\hat{x},\hat{u})>Q_{I}(\hat{x},\hat{u})$ , a cut there is strictly improving:

$$Q_{I+1}(\hat{x},\hat{u})>Q_{I}(\hat{x},\hat{u}),$$

and the increase is equal to $\varepsilon(\hat{x},\hat{u};Q_{I})$ .

Proof.

If strong duality holds, we have $J_{D}(\hat{x},\hat{u})=\mathcal{T_{Q}}Q_{I}(\hat{x},\hat{u})$ , and the new function $q_{I+1}$ satisfies $q_{I+1}(\hat{x},\hat{u})=\mathcal{T_{Q}}Q_{I}(\hat{x},\hat{u})$ . Since $Q_{I+1}(x,u):=\max\{Q_{I}(x,u),q_{I+1}(x,u)\}$ the result follows. ∎
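The lower-bounding and strict-improvement properties can be observed on the first cut of a simple example. This is a sketch under assumptions rather than the paper's code: it uses the scalar system of Section V with $Q_{0}=q_{0}=\ell$, for which the optimal multiplier is $\hat{\nu}^{\star}=\gamma f(\hat{x},\hat{u})$ with $\xi=-\hat{\nu}^{\star 2}/(2\gamma)$ (input constraint inactive), and the helper names `first_cut` and `bellman_q0` are hypothetical.

```python
GAMMA = 1.0
f = lambda x, u: 0.9 * x + u
stage_cost = lambda x, u: 0.5 * x**2 + 0.5 * u**2


def bellman_q0(x, u):
    """T_Q Q_0 (x, u) for Q_0 = stage cost; the inner minimum is at u' = 0."""
    return stage_cost(x, u) + GAMMA * 0.5 * f(x, u)**2


def first_cut(x_hat, u_hat):
    """Cut q_1 of the form stage_cost + nu * f + xi, generated at (x_hat, u_hat)."""
    nu = GAMMA * f(x_hat, u_hat)      # optimal multiplier for this special case
    xi = -nu**2 / (2.0 * GAMMA)       # closed-form xi for this special case
    return lambda x, u: stage_cost(x, u) + nu * f(x, u) + xi


x_hat, u_hat = 2.0, -0.5
q1 = first_cut(x_hat, u_hat)

# Increase of the approximation at the cut point, and the Bellman error there.
increase = max(q1(x_hat, u_hat), stage_cost(x_hat, u_hat)) - stage_cost(x_hat, u_hat)
error_before = bellman_q0(x_hat, u_hat) - stage_cost(x_hat, u_hat)
```

Here `increase` equals `error_before`, as Lemma III.3 states. Elsewhere the cut stays below $\mathcal{T_{Q}}Q_{0}$, and hence below $Q^{\star}$ by monotonicity, because the difference is $\tfrac{\gamma}{2}(f(x,u)-f(\hat{x},\hat{u}))^{2}\geq 0$.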

Lastly, the following property facilitates a “greedy” cut $q_{I+1}(\cdot,\cdot)$ with respect to some particular $(\hat{x},\hat{u})$ location.

Lemma III.4.

The Benders cut that yields the greatest increase at $(\hat{x},\hat{u})$ , i.e., for which $Q_{I+1}(\hat{x},\hat{u})-Q_{I}(\hat{x},\hat{u})$ is maximized, is that obtained by solving problem (10) at $(\hat{x},\hat{u})$ .

Proof.

The result follows by reversing the roles of $(\hat{x},\hat{u})$ and $(\overline{x},\overline{u})$ in the proof of Lemma III.1. ∎

IV Benders algorithm for Q-Functions

We propose Algorithm 1 as a means of approximating $Q^{\star}$ by generating Benders cuts of the form (11). It starts with $q_{0}=\ell$ , which from (2) trivially lower-bounds $Q^{\star}$ , and by Lemmas III.2 and III.3 guarantees $\varepsilon(x,u;Q_{I})\geq 0$ for all $(x,u)\in\mathcal{X}\times\mathcal{U}$ and for all $I\geq 0$ . New cuts are created at certain points $(x_{m},u_{m})$ , and Variants A and B differ in how these are chosen:

A. Select a list of state-input pairs $\mathcal{Z}_{\rm Alg}:=\{(x_{1},u_{1}),\ldots,(x_{M},u_{M})\}$ a priori, and choose a random $(x_{m},u_{m})$ at each algorithm iteration.

B. Select a list of state space points $\mathcal{X}_{\rm Alg}:=\{x_{1},\ldots,x_{M}\}$ a priori, and within the algorithm pick a random $x_{m}$ , letting $u_{m}$ follow from policy (5) parameterized by $Q_{I}$ .
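The difference between the two variants lies only in how the point for the next cut is chosen, which can be sketched as follows. The function names, the uniform random choice via `random.Random`, and the grid search standing in for policy (5) are assumptions made for illustration; the sampled sets mimic the scalar test of Section V.

```python
import random


def select_point_variant_a(z_alg, rng):
    """Variant A: draw one of the pre-selected pairs (x_m, u_m) at random."""
    return rng.choice(z_alg)


def select_point_variant_b(x_alg, Q, u_grid, rng):
    """Variant B: draw x_m at random, then set u_m = pi(x_m; Q) as in (5)."""
    x_m = rng.choice(x_alg)
    u_m = min(u_grid, key=lambda u: Q(x_m, u))   # grid stand-in for the arg min
    return x_m, u_m


rng = random.Random(0)                            # seeded for reproducibility
u_grid = [(k - 100) / 100 for k in range(201)]
q0 = lambda x, u: 0.5 * x**2 + 0.5 * u**2         # initial approximation Q_0

x_alg = [rng.uniform(0.0, 3.0) for _ in range(50)]
z_alg = [(x, rng.uniform(-1.0, 1.0)) for x in x_alg]

pair_a = select_point_variant_a(z_alg, rng)
pair_b = select_point_variant_b(x_alg, q0, u_grid, rng)
```

With $Q_{0}=\ell$ the policy returns $u_{m}=0$ for every state, so Variant B initially generates cuts along $u=0$ and moves away from it only as the approximation is updated.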

We now state convergence results for both variants.

IV-A Fixed $(x,u)$ pairs

The following results hold for Variant A. We omit the proofs of both, because they carry across with little modification from the $V$ -function results in [16, Thms. III.5 and III.6]:

Theorem IV.1 (Pointwise convergence of $\{Q_{I}(x,u)\}_{I=0}^{\infty}$ ).

For each $\smash{(x,u)\in\mathcal{X}\times\mathcal{U}}$ for which $Q^{\star}(x,u)$ is finite, there exists a limiting value $\smash{Q_{\rm lim}(x,u)\leq Q^{\star}(x,u)}$ such that $\lim_{I\rightarrow\infty}Q_{I}(x,u)=Q_{\rm lim}(x,u)$ .

Theorem IV.2 (Finite termination of Variant A).

Suppose the following conditions are met:

(i) Strong duality holds for the one-stage problem (9) with parameter $(x_{m},u_{m})$ each time it is solved, for each $(x_{m},u_{m})\in\mathcal{Z}_{\rm Alg}$ .

(ii) $Q^{\star}(x_{m},u_{m})$ is finite for each pair $(x_{m},u_{m})\in\mathcal{Z}_{\rm Alg}$ .

Then Variant A of Algorithm 1 terminates in finite iterations with probability $1$ for any tolerance $\varepsilon_{\text{tol}}>0$ .

IV-B Fixed $x$ , policy-driven $u$

Although Variant A has attractive convergence properties, it learns a $Q$ -function based only on performance at pre-selected pairs $(x,u)$ , in the sense of minimizing the $Q$ -Bellman error there. Variant B instead learns a $Q$ -function based on performance at $(x,u)$ pairs in which the $u$ is consistent with the policy derived from the learnt $Q$ -function. One expects this criterion to be more relevant to performance of the final policy, as state-input trajectories will pass closer to these points.

Finite termination of Variant B is our main result, which we now state precisely along with the required assumptions.

Assumption 1.

For each $x_{m}\in\mathcal{X}_{\rm Alg}$ , the set of feasible inputs $\mathcal{U}(x_{m})$ contains an element $\hat{u}$ such that $Q^{\star}(x_{m},\hat{u})<\infty$ .

This assumption implies $V^{\star}(x_{m})$ is finite for each $x_{m}$ . Introducing the notation $\underline{Q}(x_{m}):=\inf_{u\in\mathcal{U}(x_{m})}Q(x_{m},u)$ , the following holds:

Theorem IV.3 (Monotone convergence of $\{\underline{Q}_{I}(x_{m})\}_{I=0}^{\infty}$ ).

Under Assumption 1, the limit $\underline{Q}_{\lim}(x_{m}):=\lim_{I\rightarrow\infty}\underline{Q}_{I}(x_{m})$ exists for each $x_{m}\in\mathcal{X}_{\rm Alg}$ .

Proof.

It follows from Assumption 1 that $\underline{Q}^{\star}(x_{m})<\infty$ , and from Lemma III.1, $Q_{I}(x_{m},u)\leq Q^{\star}(x_{m},u)$ for all $u\in\mathcal{U}(x_{m})$ . Thus, $\underline{Q}_{I}(x_{m})<\infty$ at each iteration $I$ . As the sequence of functions $\{Q_{I}\}_{I=0}^{\infty}$ increases monotonically, the sequence $\{\underline{Q}_{I}(x_{m})\}_{I=0}^{\infty}$ must also increase monotonically. This latter sequence is bounded from above, thus the limit $\lim_{I\rightarrow\infty}\underline{Q}_{I}(x_{m})=\underline{Q}_{\lim}(x_{m})$ exists from the Monotone Convergence Theorem. ∎

An additional performance guarantee for Variant B is available when the following additional assumptions hold.

Assumption 2.

For each $x_{m}\in\mathcal{X}_{\rm Alg}$ , set $\mathcal{U}(x_{m})$ is compact, and each entry of $f(x_{m},u)$ is Lipschitz-continuous on $\mathcal{U}(x_{m})$ .

Assumption 3.

The problem data in (1) is such that the lower-bounding functions $q_{0},q_{1},\ldots$ generated in Variant B:

(i) Maintain strong duality between problems (9) and (10) with parameter $(x_{m},u)$ at each iteration of the algorithm, with $u=\pi(x_{m};Q_{I})$ , for all $x_{m}\in\mathcal{X}_{\rm Alg}$ .

(ii) Are Lipschitz continuous in $u$ with some constant $L_{m}$ common to all functions $q_{i}$ , for each $x_{m}\in\mathcal{X}_{\rm Alg}$ .

Assumptions 2 and 3 must be verified for a given problem. A widespread setting where these hold is the constrained, stable linear-quadratic regulator (LQR); see the Appendix.

Theorem IV.4 (Finite termination of Variant B).

Suppose that in addition to Assumption 1, Assumptions 2 and 3 hold. Then Variant B of Algorithm 1 terminates in finite iterations with probability $1$ for any tolerance $\varepsilon_{\text{tol}}>0$ .

Proof.

Let the sequence of iterations $I$ where a given $m$ is chosen in line 15 of the algorithm be indexed by $I_{m}$ . With probability $1$ , this sequence is infinitely long for each $m$ . We now show that the sequence of $Q$ -Bellman errors

$$\{\varepsilon(x_{m},\pi(x_{m};Q_{I_{m}});Q_{I_{m}})\}_{I_{m}=0}^{\infty}$$

is a Cauchy sequence converging to zero for each $x_{m}\in\mathcal{X}_{\rm Alg}$ . As $\mathcal{U}(x_{m})$ is compact for all $x_{m}\in\mathcal{X}_{\rm Alg}$ , the policy $\pi(x_{m};Q_{I})$ defined in (5) can always be evaluated.

Recall that Lemma III.2 implies $\varepsilon(x_{m},\pi(x_{m};Q_{I_{m}});Q_{I_{m}})\geq 0$ for all $x_{m}$ and $I_{m}$ . Suppose for the sake of contradiction that the sequence $\{\varepsilon(x_{m},\pi(x_{m};Q_{I_{m}});Q_{I_{m}})\}_{I_{m}=0}^{\infty}$ is not a Cauchy sequence converging to zero. Then there must exist some $\delta>0$ for which there is no iteration number beyond which $\varepsilon(x_{m},\pi(x_{m};Q_{I_{m}});Q_{I_{m}})<\delta$ . Whenever point $x_{m}$ is picked in line 15 of the algorithm, the strong duality condition in Assumption 3 and Lemma III.3 together imply that

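$$Q_{I_{m}+1}(x_{m},\pi(x_{m};Q_{I_{m}}))-Q_{I_{m}}(x_{m},\pi(x_{m};Q_{I_{m}}))=\varepsilon(x_{m},\pi(x_{m};Q_{I_{m}});Q_{I_{m}}).$$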

If $\{\varepsilon(x_{m},\pi(x_{m};Q_{I_{m}});Q_{I_{m}})\}_{I_{m}=0}^{\infty}$ is not a Cauchy sequence, there will be an infinite number of occasions on which

$$Q_{I_{m}+1}(x_{m},\pi(x_{m};Q_{I_{m}}))-Q_{I_{m}}(x_{m},\pi(x_{m};Q_{I_{m}}))\geq\delta.$$

Furthermore, Assumption 1 and part (ii) of Assumption 3 together imply that

[TABLE]

as $Q_{I}$ is always a lower bound on $Q^{\star}$ . Compactness of $\mathcal{U}(x_{m})$ implies that the volume of the truncated hypograph

[TABLE]

is finite for each $x_{m}$ ; recall that $q_{0}(\cdot,\cdot)\equiv\ell(\cdot,\cdot)\geq 0$ .

Due to Lipschitz continuity, cut $q_{I_{m}+1}$ decreases the volume of $\mathcal{H}_{m}$ by an amount that is lower bounded by a function of $\delta$ , $L_{m}$ , and the input dimension $n_{u}$ . Thus, this volume cannot be removed infinitely many times from $\mathcal{H}_{m}$ , and we have a contradiction. Cuts made at iterations $I$ where some other index $m^{\prime}\neq m$ is picked in line 15 may also remove some volume from $\mathcal{H}_{m}$ , but this does not affect the argument. Thus $\{\varepsilon(x_{m},\pi(x_{m};Q_{I_{m}});Q_{I_{m}})\}_{I_{m}=0}^{\infty}$ is a Cauchy sequence converging to zero, and Algorithm 1 terminates in finite iterations for any $\varepsilon_{\text{tol}}>0$ . ∎

Therefore, under certain assumptions one need only specify $\mathcal{X}_{\rm Alg}=\{x_{1},\ldots,x_{M}\}$ , and Variant B minimizes the $Q$ -Bellman error at a $u\in\mathcal{U}(x_{m})$ associated with each $x_{m}\in\mathcal{X}_{\rm Alg}$ that is consistent with policy (5). One then expects the optimal $Q$ -function to be learnt more accurately around the policy surface than elsewhere in the state-action space.

V Numerical examples

We now report two numerical tests of Algorithm 1. In both cases, systems were of the class C-LQR described in the Appendix, for which finite termination of Variants A and B is guaranteed by Theorems IV.2 and IV.4 respectively, and lower bounding functions $q_{i}(\cdot,\cdot)$ are quadratic. All tests used the stage cost $\ell(x_{t},u_{t})=\tfrac{1}{2}x_{t}^{\top}x_{t}+\tfrac{1}{2}u_{t}^{\top}u_{t}$ , discount rate $\gamma=1$ , and termination tolerance $\varepsilon_{\text{tol}}=10^{-3}$ , with $h(x_{t},u_{t})$ encoding an input constraint $\|u_{t}\|_{\infty}\leq 1$ . Tests were implemented in Python with subproblems solved using Gurobi 7.0.2, on a computer with an Intel i7 CPU at 2.60 GHz and 16 GB RAM.

Scalar system

For ease of visualization, we used the simple system $x_{t+1}=0.9\,x_{t}+u_{t}$ , with $x_{t},u_{t}\in\mathbb{R}$ , and ran both algorithm variants. In Variant A, $\mathcal{Z}_{\rm Alg}$ contained 50 random states $x$ sampled uniformly from the interval $[0,3]$ , and a random $u\in[-1,1]$ associated with each $x$ . In Variant B the associated inputs were dropped to form $\mathcal{X}_{\rm Alg}$ . Fig. 1 shows convergence of the maximum $Q$ -Bellman error $\max_{x_{m}\in\mathcal{X}_{\rm Alg}}\varepsilon(x_{m},\pi(x_{m};Q_{I});Q_{I})$ , with the mean error $\tfrac{1}{M}\sum_{m=1}^{M}\varepsilon(x_{m},\pi(x_{m};Q_{I});Q_{I})$ shown for comparison. Total time spent generating lower-bounding functions was 243 ms for Variant A and 121 ms for Variant B. For this simple system the optimal policy can be computed as $\pi^{\star}(x)=\min\{1,\max\{-0.5377x,-1\}\}$ . Fig. 2 shows the evolution of $Q_{I}$ at selected iterations; after termination, the policy (5) from Variant B was closer to the optimal policy than that from Variant A. The average closed-loop cost starting from points $x\in\mathcal{X}_{\rm Alg}$ was also lower for Variant B at 2.47991, compared to 2.48716 for Variant A, and 2.47968 for the optimal policy.
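The gain 0.5377 in the optimal policy above agrees with the unconstrained discrete-time LQR gain for this system, which can be reproduced with a short fixed-point iteration on the scalar Riccati equation. This is a minimal sketch; the function name `scalar_lqr_gain` and the iteration count are illustrative assumptions, and the saturation to $[-1,1]$ simply mirrors the expression for $\pi^{\star}$ stated in the text.

```python
def scalar_lqr_gain(a, b, q, r, gamma=1.0, iters=1000):
    """Gain K of u = -K x for cost 0.5*(q x^2 + r u^2), via Riccati iteration."""
    p = q                                         # value function 0.5 * p * x^2
    for _ in range(iters):
        p = (q + gamma * a**2 * p
             - (gamma * a * b * p)**2 / (r + gamma * b**2 * p))
    return gamma * a * b * p / (r + gamma * b**2 * p)


# Scalar system x_{t+1} = 0.9 x_t + u_t with unit state and input weights.
K = scalar_lqr_gain(a=0.9, b=1.0, q=1.0, r=1.0)


def pi_star(x):
    """Saturated linear policy, as stated for the scalar example."""
    return min(1.0, max(-K * x, -1.0))
```

Rounded to four decimal places, `K` reproduces the coefficient in the stated policy, and `pi_star` saturates at $-1$ for large positive states.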

Higher-dimensional systems

We tested Variant B for systems too large for the optimal policy to be computed exactly. Twenty random 8-state, 3-input linear systems were created with $\|A\|\leq 0.99$ . For each system, $M$ points in $\mathcal{X}_{\rm Alg}$ were sampled from a normal distribution with zero mean and covariance equal to $25$ times the identity matrix, for $M\in\{10,20,50,100,200,500\}$ . Table I reports statistics upon termination. The number of iterations is roughly linear in $M$ , while computation time is roughly quadratic. The latter excludes the Bellman error measurement in line 11, on the basis that in practice, the convergence check for which it is used need not be carried out at every iteration.
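The scaling claims can be checked against the mean values in Table I by fitting a straight line in log-log coordinates. The sketch below uses only the reported means; the helper `loglog_slope` is a hypothetical name, and an ordinary least-squares fit is one reasonable choice among several.

```python
import math

# Mean values copied from Table I.
M = [10, 20, 50, 100, 200, 500]
mean_iterations = [217.3, 485.7, 1293, 2736, 6043, 16245]
mean_time_s = [0.313, 1.054, 6.083, 25.09, 114.7, 777.1]


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x), i.e. the fitted exponent."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    den = sum((a - mean_x) ** 2 for a in lx)
    return num / den


iteration_exponent = loglog_slope(M, mean_iterations)
time_exponent = loglog_slope(M, mean_time_s)
```

The fitted exponents come out close to 1.1 for iterations and 2.0 for computation time, consistent with the roughly linear and roughly quadratic growth described above.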

It is likely that other ways of choosing $m$ in line 15, e.g. largest Bellman error, would reduce the number of iterations required, although the assumptions under which finite convergence can be guaranteed may differ. Nevertheless, total times are already modest, and we note that alternative “exact” DP approaches such as value iteration [4, Ch. 2] and explicit MPC [2] are impractically expensive for problems of this size.

VI Conclusion

This paper presented a general algorithm able to learn $Q$ -functions, in the sense of minimizing Bellman error at arbitrary state space locations, for infinite-horizon problems. Convergence results were provided, both for fixed pairs $(x,u)$ and for “policy-driven” pairs $(x,\pi(x))$ . A further variant of Algorithm 1 could augment $\mathcal{X}_{\rm Alg}$ with sequences of states $\{x_{\tau}\}$ following the policy at each iteration, i.e., $x_{\tau+1}=f(x_{\tau},\pi(x_{\tau};Q_{I}))$ . This would potentially learn a $Q$ -function that approaches $Q^{\star}$ around entire trajectories, which is stronger than minimizing $\varepsilon(x_{m},\pi(x_{m};Q_{I});Q_{I})$ in individual locations $x_{m}\in\mathcal{X}_{\rm Alg}$ .

An added attraction of our formulation is that in many cases problem (5) remains convex even when the $Q$ -function is not convex in the state. Future work will investigate such situations, and consider an extension to stochastic systems.

Acknowledgement

The author thanks Rahul Jain of the University of Southern California for valuable discussions on the topic of this paper.

Appendix A Examples for Assumption 3

An example of a class of problems where Assumptions 2 and 3 hold is that which we refer to as C-LQR, for which:

• $f(x,u)=Ax+Bu$;

• $\ell(x,u)=\tfrac{1}{2}x^{\top}Qx+\tfrac{1}{2}u^{\top}Ru$, with $Q,R\succeq 0$;

• $h(x,u)=Dx+Eu-\bar{h}$, defining decoupled state and input constraints, where the latter are compact;

• $\gamma\|A\|<1$, meaning the system is “discounted-asymptotically” stable.
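As a worked instance, the scalar system from the numerical examples can be matched against these conditions. The block below is a sketch under stated assumptions: the constraint convention $h(x,u)\leq 0$, the absence of state constraints, the input set $[-1,1]$ (inferred from the saturation limits of the quoted optimal policy), and a discount factor $0<\gamma\leq 1$ are not confirmed in this appendix, and the scalar cost weights $Q,R\geq 0$ are left unspecified.

```latex
\begin{align*}
f(x,u) &= 0.9\,x + u
  &&\Rightarrow\quad A = 0.9,\; B = 1, \\
-1 \leq u \leq 1
  &\iff
  \begin{bmatrix} 1 \\ -1 \end{bmatrix} u
  - \begin{bmatrix} 1 \\ 1 \end{bmatrix} \leq 0
  &&\Rightarrow\quad D = 0,\; E = \begin{bmatrix} 1 \\ -1 \end{bmatrix},\;
  \bar{h} = \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \\
\gamma\|A\| &= 0.9\,\gamma < 1
  &&\text{for any } 0 < \gamma \leq 1 .
\end{align*}
```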

Proposition A.1.

Any problem of class C-LQR satisfies Assumptions 2 and 3.

Proof.

Assumption 2 is satisfied trivially. The lower-bounding functions in constraint (9d) have the quadratic form

[TABLE]

and thus the problem remains convex at each iteration $I$ . A Slater point exists, namely any feasible $(x^{\prime},u^{\prime})$ together with any $\alpha>q_{i}(x^{\prime},u^{\prime})\,\forall i$ . Thus the strong duality condition in Assumption 3 holds.

To prove Lipschitz continuity in Assumption 3, one must bound the gradient in $u$ -space of the functions $q_{i}(x_{m},\cdot)$ for any given $x_{m}$ . Inspection of problem (9) shows that each new function $q_{I+1}$ depends on the existing functions $q_{0},\ldots,q_{I}$ , and (13) shows that, due to $u$ -compactness, a Lipschitz constant exists if the sequence $\{\|\nu_{I}\|\}_{I=0}^{\infty}$ is bounded. The KKT optimality conditions of (9) include the stationarity equations

[TABLE]

Without loss of generality, one can redefine the system with a linearly scaled input, $B\rightarrow\tilde{B}$ and $R\rightarrow\tilde{R}$ , such that $\|\tilde{B}\|=1$ , and then linearly scale the input constraints in $h(x,u)$ such that $\|E\|=1$ . Triangle inequalities then yield

[TABLE]

As $\mathcal{U}(x_{m})$ is compact and $x_{m}$ is fixed, the norms of $x^{\prime}=Ax_{m}+Bu$ and $u^{\prime}$ are both bounded by some constants $X$ and $U$ respectively. Eliminating $\|\lambda_{c,I+1}\|$ , one obtains

[TABLE]

Thus, $\{\|\nu_{I}\|\}_{I=0}^{\infty}$ can grow no larger than $\frac{\gamma(U\|D\|\cdot\|\tilde{R}\|+X\|Q\|)}{1-\gamma(\|D\|+\|A\|)}$ . As the state and input constraints are decoupled, $\|D\|$ can be made arbitrarily small by scaling the relevant rows of $h(x,u)$ . Thus the denominator can be made strictly positive, and functions $q_{i}$ of the form (13) are Lipschitz continuous. ∎
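The role of the denominator in this argument can be checked numerically. The sketch below simply evaluates the bound as printed above for given norm values; the function name `nu_bound` and the sample numbers are illustrative assumptions, not values from the paper.

```python
def nu_bound(gamma, norm_A, norm_D, norm_Q, norm_R, X, U):
    """Evaluate gamma (U |D| |R| + X |Q|) / (1 - gamma (|D| + |A|)).

    Returns None when the denominator is not strictly positive,
    in which case the expression gives no finite bound.
    """
    denom = 1.0 - gamma * (norm_D + norm_A)
    if denom <= 0.0:
        return None
    return gamma * (U * norm_D * norm_R + X * norm_Q) / denom


# With a large ||D|| the denominator is negative and no bound results.
unscaled = nu_bound(0.95, 0.99, 0.2, 1.0, 1.0, 1.0, 1.0)
# After scaling the state-constraint rows so that ||D|| is small,
# the denominator becomes positive and the bound is finite.
scaled = nu_bound(0.95, 0.99, 0.01, 1.0, 1.0, 1.0, 1.0)
```

Since $\gamma\|A\|<1$ by assumption, a sufficiently small $\|D\|$ always makes the denominator positive, which is the step the proof relies on.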

Bibliography

[1] P. N. Beuchat, J. C. Warrington, and J. Lygeros. Point-wise Maximum Approach to Approximate Dynamic Programming. In IEEE Conference on Decision and Control, Melbourne, Australia, 2017.
[2] F. Borrelli, A. Bemporad, and M. Morari. Predictive Control for Linear and Hybrid Systems. Cambridge Univ. Press, 2017.
[3] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2009.
[4] L. Busoniu, R. Babuska, B. De Schutter, and D. Ernst. Reinforcement learning and dynamic programming using function approximators, volume 39. CRC press, 2010.
[5] C. S. Chow and J. N. Tsitsiklis. An optimal one-way multigrid algorithm for discrete-time stochastic control. IEEE Transactions on Automatic Control, 36(8):898–914, 1991.
[6] D. P. de Farias and B. Van Roy. The Linear Programming Approach to Approximate Dynamic Programming. Operations Research, 51(6):850–865, 2003.
[7] A. M. Devraj and S. Meyn. Zap Q-Learning. Advances in Neural Information Processing Systems (NIPS) 30, pages 2235–2244, 2017.
[8] A. M. Geoffrion. Generalized Benders decomposition. Journal of Optimization Theory and Applications, 10(4):237–260, 1972.
