Exponential Convergence and stability of Howard's Policy Improvement Algorithm for Controlled Diffusions

B. Kerimkulov; D. Šiška; Ł. Szpruch

arXiv:1812.07846·math.OC·May 25, 2020

TL;DR

This paper proves exponential convergence rates and stability for Howard's policy improvement algorithm applied to controlled diffusions, using backward stochastic differential equations to analyze the algorithm's robustness.

Contribution

It establishes the first global convergence rate and stability results for the continuous-time policy improvement algorithm in controlled diffusions.

Findings

01. Proves exponential convergence rate of the policy improvement algorithm.

02. Shows stability under perturbations in PDE solutions and maximization accuracy.

03. Introduces a novel proof technique using backward stochastic differential equations.

Equations

$$dX_{s}=b^{\alpha}(s,X_{s})\,ds+\sigma(s,X_{s})\,dW_{s},\quad s\in[t,T],\quad X_{t}=x. \tag{1}$$

$$J(t,x,\alpha):=\mathbb{E}\Big[\int_{t}^{T}f^{\alpha}(s,X_{s}^{t,x,\alpha})\,ds+g(X_{T}^{t,x,\alpha})\Big] \tag{2}$$

$$v(t,x)=\sup_{\alpha\in\mathcal{A}}J(t,x,\alpha). \tag{3}$$

$$\partial_{t}v+\frac{1}{2}\operatorname{tr}(\sigma\sigma^{\top}D_{x}^{2}v)+\sup_{a\in A}\big(b^{a}D_{x}v+f^{a}\big)=0\ \text{ on }[0,T)\times\mathbb{R}^{d},\qquad v(T,x)=g(x)\ \text{ for }x\in\mathbb{R}^{d}. \tag{4}$$

$$a^{\ast}(t,x)=\arg\max_{a\in A}\big(b^{a}(t,x)(D_{x}v)(t,x)+f^{a}(t,x)\big).$$

$$\partial_{t}v^{n}+\frac{1}{2}\operatorname{tr}(\sigma\sigma^{\top}D_{x}^{2}v^{n})+b^{a^{n}}D_{x}v^{n}+f^{a^{n}}=0\ \text{ on }[0,T)\times\mathbb{R}^{d},\qquad v^{n}(T,\cdot)=g\ \text{ on }\mathbb{R}^{d}. \tag{5}$$

$$a^{n+1}(t,x)=\arg\max_{a\in A}\big[(b^{a}D_{x}v^{n}+f^{a})(t,x)\big]. \tag{6}$$

$$a^{n}(t,x)=\arg\max_{a\in A}\big[(b^{a}D_{x}v^{n-1}+f^{a})(t,x)\big].$$

$$\partial_{t}v^{n}+\frac{1}{2}\operatorname{tr}(\sigma\sigma^{\top}D_{x}^{2}v^{n})+b^{a^{n}}D_{x}v^{n-1}+f^{a^{n}}=0\ \text{ on }[0,T)\times\mathbb{R}^{d},\qquad v^{n}(T,\cdot)=g\ \text{ on }\mathbb{R}^{d}. \tag{8}$$

$$\|\phi\|_{\mathbb{H}^{2}_{\gamma}}:=\Big(\mathbb{E}\int_{0}^{T}e^{\gamma s}|\phi_{s}|^{2}\,ds\Big)^{\frac{1}{2}}.$$

$$\|\phi\|_{\mathcal{S}^{2}}:=\mathbb{E}\Big[\sup_{0\leq r\leq T}|\phi_{r}|^{2}\Big]<\infty.$$

$$(\phi\bullet W)_{t}:=\int_{0}^{t}\phi_{s}\,dW_{s}.$$

$$\mathcal{E}(M)_{t}:=\exp\Big(M_{t}-\frac{1}{2}\langle M\rangle_{t}\Big).$$

$$b:A\times[0,T]\times\mathbb{R}^{d}\to\mathbb{R}^{d}\quad\text{and}\quad\sigma:[0,T]\times\mathbb{R}^{d}\to\mathbb{R}^{d\times d^{\prime}}.$$

$$|b^{a}(t,x)-b^{a}(t,y)|+|\sigma(t,x)-\sigma(t,y)|\leq K|x-y|$$

$$|\sigma(t,x)|\leq K(1+|x|),\qquad|b^{a}(t,x)|\leq K(1+|x|+|a|).$$

$$f:A\times[0,T]\times\mathbb{R}^{d}\to\mathbb{R}\quad\text{and}\quad g:\mathbb{R}^{d}\to\mathbb{R}$$

$$|g(x)-g(y)|+|f^{a}(t,x)-f^{a}(t,y)|\leq K|x-y|$$

$$|f^{a}(t,x)|\leq K(1+|x|+|a|),\qquad|g(x)|\leq K.$$

$$\partial_{t}v+\frac{1}{2}\operatorname{tr}(\sigma\sigma^{\top}D_{x}^{2}v)+\sup_{a\in A}\big(b^{a}D_{x}v+f^{a}\big)=0\ \text{ on }[0,T)\times\mathbb{R}^{d},\qquad v(T,x)=g(x)\ \text{ for }x\in\mathbb{R}^{d}.$$

$$a(t,x,z):=\arg\max_{a\in A}\big(b^{a}(t,x)\sigma^{-1}(t,x)z+f^{a}(t,x)\big). \tag{14}$$

$$|b^{a}(t,x)-b^{a^{\prime}}(t,x)|\leq\theta|a-a^{\prime}|$$

$$|(b^{a}\sigma^{-1})(t,x)|<K. \tag{16}$$

$$|a(t,x,z)-a(t,x,z^{\prime})|\leq\theta|z-z^{\prime}|,$$

$$|a(t,x,z)-a(t,x^{\prime},z)|\leq K|x-x^{\prime}|\quad\text{and}\quad|a(t,0,0)|\leq K.$$

$$|f^{a}(t,x)-f^{a^{\prime}}(t,x)|\leq\theta|a-a^{\prime}|\quad\forall t\in[0,T],\ \forall x\in\mathbb{R}^{d},\ \forall a,a^{\prime}\in A.$$

$$|f^{a(t,x,z)}(t,x)-f^{a(t,x,z^{\prime})}(t,x)|\leq\theta|z-z^{\prime}|$$

$$|f^{a(t,x,0)}(t,x)|\leq(K+K^{2})(1+|x|).$$

$$dX_{s}=b^{a(s,X_{s},\sigma(s,X_{s})D_{x}v(s,X_{s}))}(s,X_{s})\,ds+\sigma(s,X_{s})\,dW_{s},\quad s\in[t,T],\quad X_{t}=x$$

$$\widehat{W}_{s}:=W_{s}+\int_{0}^{s}b^{\alpha^{\ast}_{r}}(r,X_{r})\sigma^{-1}(r,X_{r})\,dr$$

Full text

Exponential Convergence and stability of Howard’s Policy Improvement Algorithm for Controlled Diffusions

B. Kerimkulov

Maxwell Institute Graduate School in Analysis and its Applications, Edinburgh, UK.

D. Šiška

School of Mathematics, University of Edinburgh and Vega Protocol

Ł. Szpruch

School of Mathematics, University of Edinburgh and Alan Turing Institute

(Date: 9th March 2024)

Abstract.

Optimal control problems are inherently hard to solve as the optimization must be performed simultaneously with updating the underlying system. Starting from an initial guess, Howard’s policy improvement algorithm separates the step of updating the trajectory of the dynamical system from the optimization, and iterations of this should converge to the optimal control. In the discrete space-time setting this is often the case and even rates of convergence are known. In the continuous space-time setting of controlled diffusions the algorithm consists of solving a linear PDE followed by a maximization problem. This has been shown to converge in some situations; however, no global rate is known. The first main contribution of this paper is to establish a global rate of convergence for the policy improvement algorithm and a variant, called here the gradient iteration algorithm. The second main contribution is the proof of stability of the algorithms under perturbations to both the accuracy of the linear PDE solution and the accuracy of the maximization step. The proof technique is new in this context as it uses the theory of backward stochastic differential equations.

Key words and phrases:

Policy Improvement Algorithm, Stochastic Control, Backward Stochastic Differential Equation

2010 Mathematics Subject Classification:

93E20, 60H30, 65N12, 49L20

Supported by the Maxwell Institute Graduate School in Analysis and its Applications, a Centre for Doctoral Training funded by the UK Engineering and Physical Sciences Research Council (grant EP/L016508/01), the Scottish Funding Council, Heriot-Watt University and the University of Edinburgh.

1. Introduction

Stochastic control problems arise naturally in a range of applications in engineering, economics, and finance. Apart from very specific cases such as linear-quadratic control in engineering or the Merton portfolio optimization task in finance, stochastic control problems typically have no closed form solutions and have to be solved numerically. In this paper we consider the policy iteration algorithm and gradient iteration algorithm; see Algorithms 1 and 2. These are effectively a linearization method for the inherently nonlinear problem and play an essential role in numerical solutions of stochastic control problems.

We will consider the continuous space, continuous time problem where the controlled system is modeled by an $\mathbb{R}^{d}$ -valued diffusion process. Let $W$ be a $d^{\prime}$ -dimensional Wiener martingale on a filtered probability space $(\Omega,\mathcal{F},(\mathcal{F}_{t})_{t\geq 0},\mathbb{P})$ . Let us fix a finite time $T\in(0,\infty)$ and consider the controlled SDE

$$dX_{s}=b^{\alpha}(s,X_{s})\,ds+\sigma(s,X_{s})\,dW_{s},\quad s\in[t,T],\quad X_{t}=x. \tag{1}$$

Here $\alpha=(\alpha_{s})$ is a control belonging to the space of admissible controls $\mathcal{A}$ , valued in $A\subseteq\mathbb{R}^{m}$ , and we will write $X^{t,x,\alpha}$ to denote the solution of (1) which starts from $x$ at time $t$ while being controlled by $\alpha$ . We shall consider the gain functional in the form

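$$J(t,x,\alpha):=\mathbb{E}\Big[\int_{t}^{T}f^{\alpha}(s,X_{s}^{t,x,\alpha})\,ds+g(X_{T}^{t,x,\alpha})\Big] \tag{2}$$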

for all $(t,x)\in[0,T]\times\mathbb{R}^{d}$ and $\alpha\in\mathcal{A}$ . The value function $v=v(t,x)$ is given for all $t\in[0,T]$ and $x\in\mathbb{R}^{d}$ by

$$v(t,x)=\sup_{\alpha\in\mathcal{A}}J(t,x,\alpha). \tag{3}$$

We wish to solve the optimization problem, i.e., to find either the value function $v$ or the optimal control $\alpha^{*}$ which achieves the maximum (or, if the supremum cannot be reached by $\alpha\in\mathcal{A}$, then an $\varepsilon$-optimal control $\alpha^{\varepsilon}\in\mathcal{A}$ such that $v(t,x)\leq J(t,x,\alpha^{\varepsilon})+\varepsilon$). It is well known (see, e.g., Krylov [6]) that under reasonable assumptions the value function satisfies the Bellman PDE:

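$$\partial_{t}v+\frac{1}{2}\operatorname{tr}(\sigma\sigma^{\top}D_{x}^{2}v)+\sup_{a\in A}\big(b^{a}D_{x}v+f^{a}\big)=0\ \text{ on }[0,T)\times\mathbb{R}^{d},\qquad v(T,x)=g(x)\ \text{ for }x\in\mathbb{R}^{d}. \tag{4}$$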

Moreover (again see Krylov [6]), it is sufficient to consider Markovian controls, i.e., processes $\alpha_{s}=a(s,X_{s}^{t,x,\alpha})$ for some measurable function $a:[0,T]\times\mathbb{R}^{d}\to A$ . Thus if we have obtained the value function, then we can find the optimal control (if it exists) as

$$a^{\ast}(t,x)=\arg\max_{a\in A}\big(b^{a}(t,x)(D_{x}v)(t,x)+f^{a}(t,x)\big).$$

It is rarely possible to find a closed form solution to (4) and so various approximations have to be employed. One may, for example, choose to use a finite difference method to discretize (4) and indeed this has been widely studied; see, e.g., [12] or [14] and references therein. This results in a high dimensional nonlinear system of equations that still retains the structure of (4). To solve this nonlinear system one may apply Howard’s policy improvement algorithm. The rate of convergence would then follow from results available on discrete space-time control problems. However, checking that the assumptions required for convergence are satisfied is not straightforward, and moreover it depends on the discretization scheme used.

An alternative approach is to linearize (4) and to iterate. The classical approach is the Bellman–Howard policy improvement/iteration algorithm. The algorithm is initialized with a “guess” of the Markovian control. Given a Markovian control strategy at step $n$ one solves a linear PDE with the given control fixed and then one uses the solution to the linear PDE to update the Markovian control. In this paper we will show that this policy improvement algorithm (see Algorithm 1) and a variant which we call the gradient iteration algorithm (see Algorithm 2) converge, under appropriate assumptions, exponentially fast.

Iterative algorithms for the solution of optimal control problems go back to the work of Bellman [1, 2] where the value iteration algorithms for finite space-time problems are developed and their convergence is shown. Howard [3] proposed the policy improvement algorithm in the context of the discrete space-time Markovian decision process. Puterman and Brumelle’s paper [4] was one of the first results on the convergence properties of policy iteration for MDP problems. The abstract function space setting employed in the paper applies to both discrete and continuous settings. Their main observation is that the policy iteration can be viewed as a type of Newton’s method. Hence similar convergence results to those known for Newton’s method follow: in particular, if the initial guess is in a neighborhood of the true solution, then the convergence will be quadratic. Puterman [5] applied this in a setting very similar to that of this paper to prove quadratic convergence in the neighborhood of the limit. Santos and Rust [9] consider the discrete time but continuous space and controls setting. They extend the results of Puterman and Brumelle [4] to show global convergence, but without global rate, and quadratic local convergence rate of policy iteration and superlinear local convergence under more general conditions. In the case of stochastic control problems with jump-diffusion processes, Bäuerle and Rieder [17] have proved a convergence result for Howard’s policy improvement algorithm with the help of martingale techniques. In the fully discrete space and time setting Bokanowski, Maroso, and Zidani [13] have shown global superlinear convergence, under a monotonicity assumption on the matrices defining the control problem. Convergence of policy iteration has been recently proved by Jacka and Mijatović [18] and Jacka, Mijatović, and Siraj [19]. Further, Maeda and Jacka [20] have shown quadratic local convergence of the policy iteration algorithm for the time-independent control problem. The local quadratic convergence is similar to the result of Puterman [5] but the specific control problem is different and moreover they employ a completely different technique based on Schauder estimates for linear PDEs.

The main contributions of this paper are to establish a global rate of convergence and stability for the policy iteration algorithm and a variant, which we call the gradient iteration algorithm. The analysis is carried out using backward stochastic differential equations (BSDEs) and to the best knowledge of the authors this is the first time BSDEs have been used to study convergence of the policy iteration algorithm. The assumptions required for this are effectively Lipschitz dependence in the drift, diffusion, instantaneous payoff, and terminal payoff functions and independence of the diffusion matrix on the control; see (1). The stability results show that the policy iteration remains stable even if the linear PDE is solved only approximately and even if the maximization step is performed approximately. Moreover they allow one to devise computationally efficient algorithms as they show that in the initial steps it is sufficient to solve the linear PDE with very low accuracy, and a highly accurate PDE solver is only required for the final few iterations of the algorithms.

The paper is organized as follows. In Section 2 we introduce all the assumptions and notation used throughout the paper. In Sections 3 and 4 we state and prove the results concerning convergence of the gradient iteration algorithm and the policy improvement algorithm, respectively. Section 5 justifies the name “policy improvement algorithm” in that it shows that the value functions increase monotonically with iterations and it also shows that the algorithm converges under weaker assumptions than those required for obtaining the rate. Sections 6 and 7 prove the stability of the algorithms. In Section 8 we present an example that fits the setting of this paper. Finally, in Appendix A, we collect several known results from the theory of BSDEs that are essential for the proofs.

We would like to emphasize that Algorithm 1 and Algorithm 2 are different, although they look rather similar. In Algorithm 1, $v^{n}$ is the value function for the Markov control $a^{n}$ , since it solves the PDE (5). In Algorithm 2 $v^{n}$ is not the value function for the Markov control $a^{n}$ . This is due to the term $b^{a^{n}}D_{x}v^{n-1}$ in the linear PDE (8).
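Written side by side, the linear PDEs solved at step $n$ make the difference explicit. In (5) the drift multiplies the gradient of the unknown $v^{n}$, whereas in (8) it multiplies the gradient of the previous iterate $v^{n-1}$ and therefore enters as a known source term:

```latex
% Algorithm 1 (policy improvement), linear PDE (5):
\[
\partial_{t}v^{n}+\tfrac{1}{2}\operatorname{tr}(\sigma\sigma^{\top}D_{x}^{2}v^{n})
+b^{a^{n}}D_{x}v^{n}+f^{a^{n}}=0
\ \text{ on } [0,T)\times\mathbb{R}^{d},\qquad v^{n}(T,\cdot)=g .
\]
% Algorithm 2 (gradient iteration), linear PDE (8):
\[
\partial_{t}v^{n}+\tfrac{1}{2}\operatorname{tr}(\sigma\sigma^{\top}D_{x}^{2}v^{n})
+b^{a^{n}}D_{x}v^{n-1}+f^{a^{n}}=0
\ \text{ on } [0,T)\times\mathbb{R}^{d},\qquad v^{n}(T,\cdot)=g .
\]
```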

2. Assumptions and Notation

We fix a finite horizon $T\in(0,\infty)$. We assume that for some $m\in\mathbb{N}$ we have $A\subseteq\mathbb{R}^{m}$ such that $0\in A$. This is the space where the control processes $\alpha$ take values. We fix a filtered probability space $(\Omega,\mathcal{F},\mathbb{F}=(\mathcal{F}_{t})_{0\leq t\leq T},\mathbb{P})$. Let $W=(W_{t})_{t\in[0,T]}$ be a $d^{\prime}$-dimensional Wiener martingale on this space. Moreover, we have the following:

(i)

For $\gamma>0$ and a predictable process $\phi$ let us define

$$\|\phi\|_{\mathbb{H}^{2}_{\gamma}}:=\Big(\mathbb{E}\int_{0}^{T}e^{\gamma s}|\phi_{s}|^{2}\,ds\Big)^{\frac{1}{2}}.$$

For $\gamma=0$ we will write $\|\cdot\|_{\mathbb{H}^{2}}$. We will use $\mathbb{H}^{2}$ to denote the set of all predictable processes $\phi$ such that $\|\phi\|_{\mathbb{H}^{2}}<\infty$. Note that the norm $\|\cdot\|_{\mathbb{H}^{2}}$ is equivalent to the norm $\|\cdot\|_{\mathbb{H}^{2}_{\gamma}}$ for any $\gamma\geq 0$.

(ii)

Let $\mathcal{S}^{2}$ be the set of real valued $\mathbb{F}$-adapted continuous processes $\phi$ on $[0,T]$ such that

$$\|\phi\|_{\mathcal{S}^{2}}:=\mathbb{E}\Big[\sup_{0\leq r\leq T}|\phi_{r}|^{2}\Big]<\infty.$$

(iii)

For adapted processes $\phi$ such that $\int_{0}^{t}|\phi_{s}|^{2}\,ds<\infty$ almost surely we will define

$$(\phi\bullet W)_{t}:=\int_{0}^{t}\phi_{s}\,dW_{s}.$$

(iv)

For any continuous local martingale $M$ let $(\langle M\rangle_{t})_{t\in[0,T]}$ denote the quadratic variation process and moreover let

$$\mathcal{E}(M)_{t}:=\exp\Big(M_{t}-\frac{1}{2}\langle M\rangle_{t}\Big).$$
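The equivalence of the weighted and unweighted norms noted in (i) follows from a two-sided bound on the exponential weight. The short derivation below is a standard argument added for convenience; it is not taken from the paper.

```latex
% For \gamma \ge 0 and s \in [0,T] we have 1 \le e^{\gamma s} \le e^{\gamma T}, hence
\[
\|\phi\|_{\mathbb{H}^{2}}^{2}
=\mathbb{E}\int_{0}^{T}|\phi_{s}|^{2}\,ds
\le\mathbb{E}\int_{0}^{T}e^{\gamma s}|\phi_{s}|^{2}\,ds
=\|\phi\|_{\mathbb{H}^{2}_{\gamma}}^{2}
\le e^{\gamma T}\,\|\phi\|_{\mathbb{H}^{2}}^{2},
\]
% and taking square roots gives
\[
\|\phi\|_{\mathbb{H}^{2}}
\le\|\phi\|_{\mathbb{H}^{2}_{\gamma}}
\le e^{\gamma T/2}\,\|\phi\|_{\mathbb{H}^{2}} .
\]
```

The equivalence constant $e^{\gamma T/2}$ depends on the horizon $T$, which is finite throughout.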

We are given measurable functions

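$$b:A\times[0,T]\times\mathbb{R}^{d}\to\mathbb{R}^{d}\quad\text{and}\quad\sigma:[0,T]\times\mathbb{R}^{d}\to\mathbb{R}^{d\times d^{\prime}}.$$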

The state of the system is governed by the controlled SDE (1).

Assumption 2.1.

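The functions $b$ and $\sigma$ are continuous in $t$. There exists $K\geq 0$ such that $\forall x,y\in\mathbb{R}^{d},\forall a\in A,\forall t\in[0,T]$,

$$|b^{a}(t,x)-b^{a}(t,y)|+|\sigma(t,x)-\sigma(t,y)|\leq K|x-y|$$

and

$$|\sigma(t,x)|\leq K(1+|x|),\qquad|b^{a}(t,x)|\leq K(1+|x|+|a|).$$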

Under Assumption 2.1 we know that for any $(t,x)\in[0,T]\times\mathbb{R}^{d}$ and for any progressively measurable $A$ -valued control process $\alpha=(\alpha_{s})$ there is a unique strong solution to (1) which we denote $(X^{t,x,\alpha}_{s})_{s\in[t,T]}$ . Let

$$f:A\times[0,T]\times\mathbb{R}^{d}\to\mathbb{R}\quad\text{and}\quad g:\mathbb{R}^{d}\to\mathbb{R}$$

be two given measurable functions. Let us assume the following for the running gain function $f$ and the terminal gain function $g$ appearing in (2).

Assumption 2.2.

There is a constant $K\geq 0$ such that $\forall x,y\in\mathbb{R}^{d},\forall a\in A,\forall t\in[0,T]$

$$|g(x)-g(y)|+|f^{a}(t,x)-f^{a}(t,y)|\leq K|x-y|$$

and

$$|f^{a}(t,x)|\leq K(1+|x|+|a|),\qquad|g(x)|\leq K.$$

Under Assumption 2.2 the gain functional $J$ given by (2) and the value function $v$ given by (3) are well defined. Moreover, the value function $v$ satisfies the Bellman equation (with derivatives existing almost everywhere, see Krylov [6, Chapter 4], or in the sense of viscosity solutions, see, e.g., Pham [15] or Fleming and Soner [11])

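$$\partial_{t}v+\frac{1}{2}\operatorname{tr}(\sigma\sigma^{\top}D_{x}^{2}v)+\sup_{a\in A}\big(b^{a}D_{x}v+f^{a}\big)=0\ \text{ on }[0,T)\times\mathbb{R}^{d},\qquad v(T,x)=g(x)\ \text{ for }x\in\mathbb{R}^{d}.$$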

Let us now state the additional assumptions required for our convergence result.

Assumption 2.3.

Let us define for each fixed $(t,x,z)\in[0,T]\times\mathbb{R}^{d}\times\mathbb{R}^{d}$ the function

$$a(t,x,z):=\arg\max_{a\in A}\big(b^{a}(t,x)\sigma^{-1}(t,x)z+f^{a}(t,x)\big). \tag{14}$$

We assume that the function $a(t,x,z)$ is measurable.

If the function $a\mapsto\left(b^{a}(t,x)\sigma^{-1}(t,x)z+f^{a}(t,x)\right)$ is convex for each fixed $(t,x,z)$ , which is in $[0,T]\times\mathbb{R}^{d}\times\mathbb{R}^{d}$ , one can immediately see that Assumption 2.3 holds. More generally, this assumption can be verified using an appropriate measurable selection theorem. For example, if $A$ is compact, then [7, Proposition D.5] shows that an appropriate measurable selection exists. If $A$ is not compact but $f$ is bounded, then [7, Proposition D.6] gives the same conclusion (using also that $z=D_{x}v(t,x)$ and Remark 2.8).

Assumption 2.4.

There are constants $K,\theta\geq 0$ such that the following hold:

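(1)

(On the drift) For all $t\in[0,T]$, $x\in\mathbb{R}^{d}$, $a,a^{\prime}\in A$,

$$|b^{a}(t,x)-b^{a^{\prime}}(t,x)|\leq\theta|a-a^{\prime}|$$

and for all $t\in[0,T]$, $x\in\mathbb{R}^{d}$, $a\in A$ we have

$$|(b^{a}\sigma^{-1})(t,x)|<K. \tag{16}$$

(2)

(On the control function) For all $t\in[0,T]$, $x,x^{\prime},z,z^{\prime}\in\mathbb{R}^{d}$, $a,a^{\prime}\in A$ we have that

$$|a(t,x,z)-a(t,x,z^{\prime})|\leq\theta|z-z^{\prime}|,$$

$$|a(t,x,z)-a(t,x^{\prime},z)|\leq K|x-x^{\prime}|\quad\text{and}\quad|a(t,0,0)|\leq K.$$

(3)

(On the running reward)

$$|f^{a}(t,x)-f^{a^{\prime}}(t,x)|\leq\theta|a-a^{\prime}|\quad\forall t\in[0,T],\ \forall x\in\mathbb{R}^{d},\ \forall a,a^{\prime}\in A.$$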
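To make the map $a(t,x,z)$ from (14) and the Lipschitz condition in part (2) concrete, here is a minimal Python sketch for a hypothetical one-dimensional problem that is not taken from the paper: $b^{a}=a$, $\sigma=1$, $f^{a}=-a^{2}/2$ and $A=[-1,1]$. For these data the objective $a\mapsto az-a^{2}/2$ is strictly concave, its maximizer is the projection of $z$ onto $A$, and this map is Lipschitz in $z$ with constant 1. All function names are illustrative.

```python
import numpy as np

# Hypothetical toy data (assumptions, not from the paper):
# b^a = a, sigma = 1, f^a = -a^2 / 2, A = [-1, 1].
A_MIN, A_MAX = -1.0, 1.0

def objective(a, z):
    """a -> b^a * sigma^{-1} * z + f^a for the toy data."""
    return a * z - 0.5 * a**2

def argmax_closed_form(z):
    """Maximizer of the concave objective over A: projection of z onto A."""
    return np.clip(z, A_MIN, A_MAX)

def argmax_brute_force(z, n_grid=2001):
    """Grid search over A, used only to cross-check the closed form."""
    grid = np.linspace(A_MIN, A_MAX, n_grid)
    values = objective(grid[None, :], np.asarray(z)[:, None])
    return grid[np.argmax(values, axis=1)]

z = np.linspace(-3.0, 3.0, 121)
closed = argmax_closed_form(z)
brute = argmax_brute_force(z)

# Largest gap between the closed form and the grid search.
max_gap = float(np.max(np.abs(closed - brute)))
# Empirical Lipschitz constant of z -> a(z) on the sample points.
lipschitz = float(np.max(np.abs(np.diff(closed)) / np.diff(z)))
print(max_gap, lipschitz)
```

In this toy case the constant $\theta$ in part (2) can be taken equal to 1; for other data the maximizer generally has no closed form and the Lipschitz property has to be checked separately.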

Remark 2.5.

Under Assumptions 2.2 and 2.4 we have that for all $t\in[0,T]$ , $x,z,z^{\prime}\in\mathbb{R}^{d}$ the following hold:

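$$|f^{a(t,x,z)}(t,x)-f^{a(t,x,z^{\prime})}(t,x)|\leq\theta|z-z^{\prime}|$$

and

$$|f^{a(t,x,0)}(t,x)|\leq(K+K^{2})(1+|x|).$$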

Under Assumptions 2.1, 2.2, 2.3, and 2.4 there is an optimal control process and this fact will be used to prove the main results.

Remark 2.6.

Due to results of Krylov [6] we know that (4) has a unique solution and moreover the map $[0,T]\times\mathbb{R}^{d}\ni(t,x)\mapsto D_{x}v(t,x)\in\mathbb{R}^{d}$ is bounded; see [6, Chapter 4, section 1, Theorem 1]. Hence, by Assumptions 2.3 and 2.4 we know that $(t,x)\mapsto a(t,x,\sigma(t,x)D_{x}v(t,x))$ is jointly measurable and Lipschitz in $x$ . Thus, for each $(t,x)\in[0,T]\times\mathbb{R}^{d}$ , the SDE

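$$dX_{s}=b^{a(s,X_{s},\sigma(s,X_{s})D_{x}v(s,X_{s}))}(s,X_{s})\,ds+\sigma(s,X_{s})\,dW_{s},\quad s\in[t,T],\quad X_{t}=x$$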

has a unique solution $X^{t,x}$ . Then by the verification theorem, the process $\alpha^{\ast}_{s}:=a(s,X_{s},\sigma(s,X_{s})D_{x}v(s,X_{s}))$ is the optimal control process for (3).

All the proofs will be completed in a new measure $\hat{\mathbb{P}}$ on $(\Omega,\mathcal{F})$ given in the following lemma. We will use $\hat{\mathbb{E}}$ to denote the expectation under the measure $\hat{\mathbb{P}}$ .

Lemma 2.7.

Let Assumptions 2.1 and 2.2 together with (16) hold. Let $(t,x)\in[0,T]\times\mathbb{R}^{d}$ . Let $X=X^{t,x,\alpha^{\ast}}$ be the solution to the SDE (1) started from $(t,x)$ and controlled by the optimal control process $\alpha^{\ast}$ . Then $d\hat{\mathbb{P}}:=\mathcal{E}((b^{\alpha^{\ast}}\sigma^{-1})(\cdot,X)\bullet W)_{T}\,d\mathbb{P}$ is a probability measure equivalent to $\mathbb{P}$ and the process

$$\widehat{W}_{s}:=W_{s}+\int_{0}^{s}b^{\alpha^{\ast}_{r}}(r,X_{r})\sigma^{-1}(r,X_{r})\,dr$$

is a $\hat{\mathbb{P}}$ -Wiener process.

Proof.

This is an immediate consequence of (16) and Girsanov’s theorem. ∎
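The role of the bound (16) is that a bounded integrand satisfies Novikov's condition, so the stochastic exponential is a true martingale with expectation one and therefore defines a probability density. The following Python sketch checks this by Monte Carlo in the simplest case of a constant integrand `theta`; the values of `theta`, `T` and `n_paths` are hypothetical choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, T, n_paths = 0.5, 1.0, 200_000   # hypothetical integrand, horizon, sample size

# For M = theta * W the quadratic variation is <M>_T = theta^2 * T,
# so the stochastic exponential is exp(theta * W_T - theta^2 * T / 2).
W_T = rng.normal(0.0, np.sqrt(T), size=n_paths)
density = np.exp(theta * W_T - 0.5 * theta**2 * T)

# The sample mean should be close to 1 and every sample is positive.
mean_density = float(density.mean())
print(mean_density)
```

Positivity of the density gives equivalence of the two measures, and the unit expectation gives total mass one.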

Remark 2.8.

From Krylov [6, Chapter 4, section 1, Theorem 1] we get that there is a constant $C>0$ such that for all $(t,x)\in[0,T)\times\mathbb{R}^{d}$ we have that $|D_{x}v(t,x)|\leq C$ .

3. Convergence of gradient iteration algorithm

The following theorem gives the convergence result for Algorithm 2.

Theorem 3.1.

Let Assumptions 2.1, 2.2, 2.3, and 2.4 hold. Let $v$ be the solution to (4) and let $(v^{n})_{n\in\mathbb{N}}$ be the approximation sequence given by Algorithm 2. Then there is $q\in(0,1)$ depending only on $K,\theta,T$ and the initial guess $v^{0}=v^{0}(t,x)$ such that for all $(t,x)\in[0,T]\times\mathbb{R}^{d}$ there exists $C=C(t,x)$ such that

[TABLE]

The main idea of the proof consists of noticing that Algorithm 2 can be seen as an iteration on the level of BSDEs. Using Lemma A.2 we see that on the level of BSDEs this iteration is contractive. Finally we need to use known results on the connection between BSDEs and solutions to the HJB equation.
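To see what a contraction factor $q\in(0,1)$ means in practice, suppose, purely as an illustration, that an error quantity $e_{n}$ satisfies $e_{n}\leq Cq^{n}$. Then the number of iterations needed to reach a tolerance grows only logarithmically in the tolerance. The Python sketch below computes that count; the function name and the sample values of `C`, `q` and `tol` are assumptions made for the example.

```python
import math

def iterations_needed(C: float, q: float, tol: float) -> int:
    """Smallest n >= 0 with C * q**n <= tol, for C > 0, 0 < q < 1, tol > 0."""
    if not (C > 0 and 0 < q < 1 and tol > 0):
        raise ValueError("need C > 0, 0 < q < 1 and tol > 0")
    if tol >= C:
        return 0
    n = math.ceil(math.log(tol / C) / math.log(q))
    # Guard against floating-point rounding at the boundary.
    while C * q**n > tol:
        n += 1
    while n > 0 and C * q**(n - 1) <= tol:
        n -= 1
    return n

# Example with hypothetical constants C = 1, q = 1/2 and tolerance 1e-3.
print(iterations_needed(1.0, 0.5, 1e-3))
```

Halving the tolerance adds a fixed number of iterations, roughly $\log 2/\log(1/q)$, rather than multiplying the work.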

Proof of Theorem 3.1.

We prove the main result in several steps. First, we show how to rewrite the gradient iteration algorithm as an iteration on the level of BSDEs. On the $n$ th step of the algorithm we need to solve the linear PDE with Lipschitz continuous coefficients (8). Let $v^{n}$ be the solution to (8) and recall that

[TABLE]

Since we are working with the linear PDE with Lipschitz continuous coefficients, we have $v^{n}$ in $C^{1,2}([0,T)\times\mathbb{R}^{d})$ . Let $X=X^{t,x,\alpha^{\ast}}$ be the solution to the SDE (1) started from $(t,x)$ and controlled by the optimal control process $\alpha^{\ast}$ ; see Remark 2.6. From Itô’s formula we then get that

[TABLE]

Let

[TABLE]

and

[TABLE]

Then we may write

[TABLE]

Let $\hat{\mathbb{P}}$ and $\widehat{W}$ be given by Lemma 2.7. Hence (18) becomes

[TABLE]

Consider now the following BSDE:

[TABLE]

where the superscript means that the forward process started from $(t,x)$ . Hence, we can define

[TABLE]

Therefore by (14) we have

[TABLE]

Thus, by Pham [15, Theorem 6.3.3], the function $w=w(t,x)$ solves the HJB equation (4). Notice that here is the crucial point where the fact that we use the optimal control $\alpha^{*}$ plays a role. Indeed with other control processes we couldn’t claim that $w$ solves the HJB equation. By uniqueness of the viscosity solution to the HJB equation (see the strong comparison principle from [15, Theorem 4.4.5]), we can conclude that $w=v$ and therefore $w$ is the value function of our stochastic control problem. Therefore, the BSDE (20) is the BSDE corresponding to the value function. Notice that (20) is a quadratic BSDE, since in the generator we have a product of two Lipschitz functions which depend on $Z$ . The existence of the solution to (20) under our assumptions can be obtained by applying Theorem A.9 in the case when the terminal cost is bounded for our stochastic control problem.

Using Remark 2.8, the fact that $\sigma^{-1}(s,X_{s})Z_{s}=D_{x}v(s,X_{s})$ , and Assumption 2.4, for all $s\in[t,T]$ we get that

[TABLE]

Moreover, recalling $\xi=g(X_{T})$ , we get that $\xi\in L^{2}(\Omega,\mathcal{F}_{T},\hat{\mathbb{P}})$ from the higher moment estimates for the solution of the SDE and from the Lipschitz property of $g$ . Similarly $F_{s}(0)\in\hat{\mathbb{H}}^{2}$ by Assumption 2.4. We may thus apply Lemma A.2 and hence, due to (19) and (22), we have $q\in(0,1)$ and $\gamma\geq 0$ such that for all $t\in[0,T]$

[TABLE]

Therefore, from (21) and (17), we have $v(t,x)=Y^{t,x}_{t}$ and $v^{n}(t,x)=Y^{n,t,x}_{t}$ and by (23) we obtain

[TABLE]

Hence

[TABLE]

This finishes the proof. ∎

4. Convergence of policy improvement

Theorem 4.1.

Let Assumptions 2.1, 2.2, 2.3, and 2.4 hold. Let $v$ be the solution to (4) and let $(v^{n})_{n\in\mathbb{N}}$ be the approximation sequence given by Algorithm 1. Then there is $q\in(0,1)$ depending only on $K,\theta,T$ and the initial guess $v^{0}=v^{0}(t,x)$ such that for all $(t,x)\in[0,T]\times\mathbb{R}^{d}$ there exists $C=C(t,x)$ such that

[TABLE]

The proof of Theorem 4.1 is similar to that of Theorem 3.1 except that the iteration on the level of BSDEs is nonstandard.

Proof of Theorem 4.1.

Let $v^{n}$ be the solution to (5) and recall that

[TABLE]

As before, let $X=X^{t,x,\alpha^{\ast}}$ be the solution to the SDE (1) started from $(t,x)$ and controlled by the optimal control process $\alpha^{\ast}$ ; see Remark 2.6. By Itô’s formula

[TABLE]

Let

[TABLE]

Recalling that the control $\alpha^{\ast}$ and the associated diffusion $X$ are fixed we can write

[TABLE]

Let $\hat{\mathbb{P}}$ and $\widehat{W}$ be given by Lemma 2.7. Then (24) becomes

[TABLE]

Similarly as in Theorem 3.1, consider the BSDE

[TABLE]

In the same way we can show that $v(t,x)=Y^{t,x}_{t}$ is the value function of our stochastic control problem. As before, from Krylov [6, Chapter 4, section 1, Theorem 1], we get that there is a constant $C>0$ such that for all $(t,x)\in[0,T)\times\mathbb{R}^{d}$ we have that $|D_{x}v(t,x)|\leq C$. Moreover, as before, using Remark 2.8, the fact that $\sigma^{-1}(s,X_{s})Z_{s}=D_{x}v(s,X_{s})$, and Assumption 2.4, for all $s\in[t,T]$ we get that

[TABLE]

Finally we note that $\xi\in L^{2}(\Omega,\mathcal{F}_{T},\hat{\mathbb{P}})$ and $F_{s}(0,0)\in\hat{\mathbb{H}}^{2}$ , so by Lemma A.5, together with (25) and (26), we have $q\in(0,1)$ and $\gamma\geq 0$ such that for all $t\in[0,T]$

[TABLE]

Similarly as before, using (28), we conclude that

[TABLE]

This concludes the proof of the theorem. ∎

Remark 4.2.

Consider briefly the situation where the diffusion coefficient also depends on the control, i.e., $\sigma=\sigma^{a}(t,x)$ . After applying Itô’s formula to $v^{n}(s,X_{s})$ and substituting the solution to the linear PDE for $v^{n}$ we get

[TABLE]

The resulting object can be seen as a second order BSDE (2BSDE). Analysis of 2BSDEs goes beyond the scope of this paper.

Remark 4.3.

Let us briefly consider the infinite-time-horizon control problem. In this case we consider a constant $\lambda>0$ and the gain functional:

[TABLE]

It is known that the Bellman PDE for the value function is

[TABLE]

The linear PDE from the iteration of the policy improvement algorithm then is

[TABLE]

where

[TABLE]

After applying Itô’s formula we get

[TABLE]

Let $Y^{n}_{t}:=v^{n}(X_{t})$ and $Z^{n}_{t}:=\sigma(X_{t})D_{x}v^{n}(X_{t})$ . Then after change of measure we may write

[TABLE]

Let

[TABLE]

Hence (29) becomes

[TABLE]

To proceed, we need a suitable contraction-type inequality for this infinite time horizon BSDE. Buckdahn and Peng [8] studied infinite time horizon BSDEs and have proved existence and uniqueness of their solutions for sufficiently large values of $\lambda$ . To get the required contraction-type inequality we can use similar calculations as in Fuhrman and Tessitore [10, Theorems 3.2 and 3.7], where they use Banach’s fixed point theorem to show existence and uniqueness of solutions to infinite time horizon BSDEs. Hence, for sufficiently large $\lambda>0$ we would obtain results analogous to Theorem 4.1 as well as the other theorems in the article.

5. Policy improvement

We want to show that the policy obtained at each step of Algorithm 1 is an improvement on the one from the previous step. This is formulated as Theorem 5.1 below. Note that we do not require Assumption 2.4 here.

Theorem 5.1.

Let Assumptions 2.1, 2.2, and 2.3 hold. Assume that there exists $K\geq 0$ such that $\forall t\in[0,T]$ , $\forall x\in\mathbb{R}^{d}$ , and $\forall a\in A$

[TABLE]

Fix $n\in\mathbb{N}$ . Let $v^{n}$ and $v^{n+1}$ be the solutions of (5) at steps $n$ and $n+1$ of the algorithm. Then for all $t\in[0,T]$ , $x\in\mathbb{R}^{d}$ it holds that

[TABLE]

Proof.

Let $X=X^{t,x,\alpha^{\ast}}$ be the solution to the SDE (1) started from $(t,x)$ and controlled by the optimal control process $\alpha^{\ast}$ ; see Remark 2.6. Then, as in the proof of Theorem 4.1, we get that for $k=n,n+1$ with $Y^{k}=Y^{k,t,x}=v^{k}(\cdot,X^{t,x,\alpha^{\ast}})$ and with $Z^{k}=Z^{k,t,x}=(\sigma D_{x}v^{k})(\cdot,X^{t,x,\alpha^{\ast}})$ we have the BSDE representation

[TABLE]

where

[TABLE]

Let us denote for $s\in[t,T]$ and $z\in\mathbb{R}^{d}$

[TABLE]

Hence, notice that by the definition of $a^{n+1}$ (see (6)), we have for all $s\in[t,T]$ that

[TABLE]

Therefore by the comparison principle for BSDEs (see Lemma A.6), we get

[TABLE]

Hence, we have

[TABLE]

∎
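The monotonicity in Theorem 5.1 has a discrete analogue that is easy to observe numerically. The Python sketch below runs the evaluate/improve loop on a hypothetical one-dimensional problem that is not taken from the paper: $b^{a}=a$, $\sigma=1$, $f^{a}(t,x)=-a^{2}/2-|x|$, $g(x)=\cos x$ and $A=[-1,1]$. The linear PDE is discretized by implicit Euler in time and upwind differences in space with reflecting boundaries; this scheme is an assumption of the sketch, not the authors' method. The improvement step maximizes the discretized Hamiltonian exactly, which is what makes successive discrete value functions nondecreasing.

```python
import numpy as np

# Hypothetical toy problem (assumptions, not from the paper):
# b^a = a, sigma = 1, f^a(t, x) = -a^2/2 - |x|, g(x) = cos(x), A = [-1, 1].
T, K, N = 1.0, 20, 41                  # horizon, time steps, space nodes
x = np.linspace(-2.0, 2.0, N)
dt, h = T / K, x[1] - x[0]
g = np.cos(x)

def one_sided(v):
    """Forward and backward difference quotients, zero at the reflecting edges."""
    dplus, dminus = np.zeros(N), np.zeros(N)
    dplus[:-1] = (v[1:] - v[:-1]) / h
    dminus[1:] = (v[1:] - v[:-1]) / h
    return dplus, dminus

def generator(a):
    """Upwind matrix for (1/2) d_xx + a d_x with the control a frozen."""
    up = 0.5 / h**2 + np.maximum(a, 0.0) / h     # rate of moving right
    down = 0.5 / h**2 + np.maximum(-a, 0.0) / h  # rate of moving left
    up[-1] = 0.0                                 # reflection at the right edge
    down[0] = 0.0                                # reflection at the left edge
    return np.diag(-(up + down)) + np.diag(up[:-1], 1) + np.diag(down[1:], -1)

def evaluate(policy):
    """Policy evaluation: implicit Euler for the linear PDE with a frozen policy."""
    v = np.empty((K + 1, N))
    v[K] = g
    for k in range(K - 1, -1, -1):
        a = policy[k]
        f = -0.5 * a**2 - np.abs(x)
        v[k] = np.linalg.solve(np.eye(N) - dt * generator(a), v[k + 1] + dt * f)
    return v

def improve(v):
    """Policy improvement: exact maximizer of the discretized Hamiltonian."""
    policy = np.empty((K, N))
    for k in range(K):
        dplus, dminus = one_sided(v[k])
        right = np.clip(dplus, 0.0, 1.0)      # best control among a >= 0
        left = np.clip(dminus, -1.0, 0.0)     # best control among a <= 0
        gain_right = right * dplus - 0.5 * right**2
        gain_left = left * dminus - 0.5 * left**2
        policy[k] = np.where(gain_right >= gain_left, right, left)
    return policy

def policy_iteration(n_iter=8):
    """Alternate evaluation and improvement, starting from the zero control."""
    policy, values = np.zeros((K, N)), []
    for _ in range(n_iter):
        values.append(evaluate(policy))
        policy = improve(values[-1])
    return values

values = policy_iteration()
increments = [values[n + 1] - values[n] for n in range(len(values) - 1)]
# Sup-norm of successive differences; the increments themselves are nonnegative.
print([float(np.max(np.abs(d))) for d in increments])
```

The nonnegativity of the increments relies on the matrix `I - dt * generator(a)` having a nonnegative inverse, which plays the role that the comparison principle for BSDEs plays in the proof above.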

Remark 5.2.

It is perhaps interesting to note that the comparison principle for BSDEs cannot be used to deduce that in the gradient iteration algorithm we have an “improvement” at each step. Indeed, let us write the BSDE representation of the two steps of gradient iteration for $n,n+1\in\mathbb{N}$ ,

[TABLE]

and

[TABLE]

where

[TABLE]

In order to apply a comparison principle for BSDEs (see Lemma A.6), we would need to have $F_{s}(Z^{n-1}_{s})\leq F_{s}(Z^{n}_{s})$ . Nevertheless we observe that

[TABLE]

Similarly,

[TABLE]

From the above calculations we have no way to conclude that $F_{s}(Z^{n-1}_{s})\leq F_{s}(Z^{n}_{s})$ . Thus the gradient iteration algorithm is not guaranteed to be improving the policy with each step.

6. Stability under Perturbations to Solution of the Linear PDE

In this section we study a stability property of the policy improvement algorithm under perturbations to solutions of the linear PDE (5) since in practical applications one will only solve this equation approximately. Of course the maximization step (6) of Algorithm 1 can now be performed only with this approximate solution, thus feeding the errors into further iterations.

Let $\varepsilon$ be a parameter (or a set of parameters), which determines the accuracy of our approximation to the solution of the linear PDE (5). Let $\pi^{n}_{\varepsilon}$ be the policy at iteration $n$ obtained from an approximate solution to the linear PDE. Let $v_{\varepsilon}^{n}$ denote the solution to

[TABLE]

At step $n$ of Algorithm 1 we approximate the solution to the equation above (this is PDE (5) but with $\pi^{n}_{\varepsilon}$ replacing $a^{n}$ everywhere). We will denote such an approximation by $\tilde{v}^{n}_{\varepsilon}$ . The policy function for the next iteration step is then given by

[TABLE]

recalling that the function $a=a(t,x,z)$ was defined in (14). We need to assume that $(t,x)\mapsto D_{x}\tilde{v}_{\varepsilon}^{n}$ is bounded, so that $\pi_{\varepsilon}^{n+1}$ is Lipschitz in $x$ and hence the solution to (31) is in $C^{1,2}([0,T]\times\mathbb{R}^{d})$ . This assumption is not really a restriction, since the gradient of the value function is known to be bounded under our assumptions; see Krylov [6, Chapter 4, section 1, Theorem 1] and also Remark 2.6. Any reasonable approximation should retain this property.

Theorem 6.1.

Let Assumptions 2.1, 2.2, 2.3, and 2.4 hold. Let $(v^{n})_{n\in\mathbb{N}}$ be the approximation sequence given by Algorithm 1. Let $(v^{n}_{\varepsilon})_{n\in\mathbb{N}}$ be the approximation sequence given by (31). Let $\alpha^{*}$ and $X^{t,x,\alpha^{*}}$ be the optimal control process for (3) and the associated diffusion started from $(t,x)\in[0,T]\times\mathbb{R}^{d}$ . Assume that $D_{x}\tilde{v}_{\varepsilon}^{n}$ is uniformly bounded. Define

[TABLE]

Then there is $q\in(0,1)$ and $\gamma>0$ , depending only on $K,\theta,T$ , such that for all $(t,x)\in[0,T]\times\mathbb{R}^{d}$ there exists $C=C(t,x)$ such that

[TABLE]

Proof.

Let $X=X^{t,x,\alpha^{\ast}}$ be the solution to the SDE (1) started from $(t,x)$ and controlled by the optimal control process $\alpha^{\ast}$ ; see Remark 2.6. By applying Itô’s formula to $v^{n}_{\varepsilon}$ we get

[TABLE]

Let us denote

[TABLE]

and

[TABLE]

where $\tilde{v}^{n-1}_{\varepsilon}$ is an approximate solution to the corresponding PDE. Then using this notation, we may write

[TABLE]

Let $\hat{\mathbb{P}}$ and $\widehat{W}$ be given by Lemma 2.7. Then the above equation becomes

[TABLE]

We want to study the difference between $(Y^{n}_{\varepsilon},Z^{n}_{\varepsilon})$ and $(Y^{n},Z^{n})$ , where $(Y^{n},Z^{n})$ solves the BSDE (25).

[TABLE]

where $(Y,Z)$ solves the BSDE (26). Due to (27) and

[TABLE]

we can apply Lemma A.5. Hence, there is $\tilde{q}\in(0,1/2)$ and $\gamma>0$ such that

[TABLE]

and

[TABLE]

Therefore we continue the estimate (32)

[TABLE]

This concludes the proof of the theorem. ∎
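The mechanism behind stability estimates of this kind can be illustrated with a scalar recursion. The sketch below is not the bound of Theorem 6.1; it only assumes, hypothetically, that an error sequence satisfies $e_{n+1}\leq q\,e_{n}+\varepsilon$ with $q\in(0,1)$ and a fixed per-step perturbation $\varepsilon$ , and checks the resulting geometric bound. The names `perturbed_contraction` and `geometric_bound` are introduced here for illustration only.

```python
def perturbed_contraction(q, e0, eps, n_steps):
    """Iterate the worst case e_{n+1} = q * e_n + eps and return [e_0, ..., e_{n_steps}]."""
    errors = [e0]
    for _ in range(n_steps):
        errors.append(q * errors[-1] + eps)
    return errors


def geometric_bound(q, e0, eps, n):
    """Upper bound q^n * e0 + eps / (1 - q), valid for 0 < q < 1."""
    return q**n * e0 + eps / (1.0 - q)


# Hypothetical values: contraction factor, initial error, per-step perturbation.
q, e0, eps = 0.5, 1.0, 1e-3
errs = perturbed_contraction(q, e0, eps, 20)
```

The initial error is damped geometrically, while the perturbations accumulate to at most $\varepsilon/(1-q)$ , so the iteration does not amplify a uniformly small per-step error.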

7. Stability under Perturbation of the Maximization

In this section we study a stability property of the gradient iteration algorithm under perturbations to the maximization procedure (7). Let $\bar{v}^{n}$ be the solution to the corresponding PDE at iteration $n$ of the gradient iteration algorithm, where instead of obtaining the control function corresponding to the exact maximum

[TABLE]

we only solve this maximization problem approximately and so we are dealing with a control function of the form

[TABLE]

where the function $\varepsilon=\varepsilon(t,x,z)$ determines the accuracy of our approximation.

Theorem 7.1.

Let Assumptions 2.1, 2.2, 2.3, and 2.4 hold. Let $(v^{n})_{n\in\mathbb{N}}$ be the approximation sequence given by Algorithm 2. Let $(\bar{v}^{n})_{n\in\mathbb{N}}$ be the approximation sequence given by the perturbations to the maximization procedure and assume that $v^{0}=\bar{v}^{0}$ . Let $\alpha^{\ast}$ and $X^{t,x,\alpha^{\ast}}$ be the optimal control process for (3) and the associated diffusion started from $(t,x)\in[0,T]\times\mathbb{R}^{d}$ . Define

[TABLE]

Then there is $q\in(0,1)$ and $\gamma>0$ , depending only on $K,\theta,T$ , such that for all $(t,x)\in[0,T]\times\mathbb{R}^{d}$ there exists $C=C(t,x)$ such that

[TABLE]

Proof.

Let $X=X^{t,x,\alpha^{\ast}}$ be the solution to the SDE (1) started from $(t,x)$ and controlled by the optimal control process $\alpha^{\ast}$ ; see Remark 2.6. As in the proof of Theorem 3.1, after the change of measure given by Lemma 2.7 we obtain two BSDEs. The first BSDE arises from the perturbations of the maximization:

[TABLE]

where

[TABLE]

The second BSDE arises from the gradient iteration algorithm with the maximization performed exactly:

[TABLE]

where

[TABLE]

We want to study the difference between $(\bar{Y}^{n},\bar{Z}^{n})$ and $(Y^{n},Z^{n})$ . Hence, notice that

[TABLE]

where $(Y,Z)$ solves (20). Therefore, since

[TABLE]

we can apply Lemma A.7 so that there is $q\in(0,1)$ and $\gamma>0$ such that

[TABLE]

Now we need to estimate the second term of the right-hand side (RHS). Notice that by Assumption 2.4 the following holds:

[TABLE]

Hence by (35) we have

[TABLE]

By inequalities (33), (34), (36), and the result of Theorem 3.1 and since $\bar{Y}^{t,x,n}_{t}=\bar{v}^{n}(t,x),\,Y^{t,x,n}_{t}=v^{n}(t,x)$ as well as $Z^{0}=\bar{Z}^{0}$ , we conclude that

[TABLE]

∎

We obtain the same result for the policy improvement algorithm.

Theorem 7.2.

Let Assumptions 2.1, 2.2, 2.3, and 2.4 hold. Let $(v^{n})_{n\in\mathbb{N}}$ be the approximation sequence given by Algorithm 1. Let $(\bar{v}^{n})_{n\in\mathbb{N}}$ be the approximation sequence given by the perturbations to the maximization procedure. Let $\alpha^{\ast}$ and $X^{t,x,\alpha^{\ast}}$ be the optimal control process for (3) and the associated diffusion started from $(t,x)\in[0,T]\times\mathbb{R}^{d}$ . Define

[TABLE]

Then there is $q\in(0,1)$ and $\gamma>0$ , depending only on $K,\theta,T$ , such that for all $(t,x)\in[0,T]\times\mathbb{R}^{d}$ there is $C=C(t,x)$ such that

[TABLE]

Proof.

Let $X=X^{t,x,\alpha^{\ast}}$ be the solution to the SDE (1) started from $(t,x)$ and controlled by the optimal control process $\alpha^{\ast}$ ; see Remark 2.6. As in the proof of Theorem 4.1, after the change of measure we obtain two BSDEs, the first from the perturbed maximization and the second from the policy improvement algorithm with the maximization performed exactly:

[TABLE]

where

[TABLE]

Similarly, we want to study the difference between $(\bar{Y}^{n},\bar{Z}^{n})$ and $(Y^{n},Z^{n})$ . Hence, notice that

[TABLE]

where $(Y,Z)$ solves (4.1). Therefore, since

[TABLE]

we can apply Lemma A.8 so that there is $q\in(0,1)$ and $\gamma>0$ such that

[TABLE]

Now we need to estimate the second term of the RHS. Notice that by Assumption 2.4 we have that

[TABLE]

Hence by (39) we have

[TABLE]

By inequalities (37), (38), (40), by the result of Theorem 4.1, and by $\bar{Y}^{t,x,n}_{t}=\bar{v}^{n}(t,x)$ , $Y^{t,x,n}_{t}=v^{n}(t,x)$ we conclude that

[TABLE]

∎

8. Example

In this section we consider an example in which Assumptions 2.1, 2.2, 2.3, and 2.4 hold. Let $t\mapsto s(t)$ and $t\mapsto k(t)$ be continuous functions for $t\in[0,T]$ . Consider the state governed by the controlled SDE

[TABLE]

and consider the cost functional

[TABLE]

The aim is to maximize $J$ over admissible controls $\alpha\in\mathcal{A}$ . The value function $v=\sup_{\alpha\in\mathcal{A}}J(t,x,\alpha)$ satisfies the Bellman PDE

[TABLE]

with the terminal condition $v(T,x)=g(x):=\arctan(x)$ . Hence, the optimal control is

[TABLE]

It is easy to check that Assumptions 2.1, 2.2, 2.3, and 2.4 hold for this problem. Therefore, the Bellman PDE becomes

[TABLE]

We can solve this problem using the policy improvement algorithm by approximating the Bellman PDE with a sequence of linear PDEs:

Step 1. Make an initial choice of control $a^{0}(t,x)$ .

Step 2. For $n=0,1,\dots$ :

• Evaluation step: Find a solution $v^{n}=v^{n}(t,x)$ to the linear PDE

[TABLE]

• Improvement step: Find a new policy $a^{n+1}=a^{n+1}(t,x)$ such that

[TABLE]

Step 3. Iterate the process until the control updates no longer change.

One can do similar calculations in the case of the gradient iteration algorithm.

We solve (41) and (42) by the finite difference method. For simplicity, let us choose $s(t)=1$ and $k(t)=1$ for all $t\in[0,T]$ . Figure 8.1 shows the logarithm of the error between the value function obtained at every step of the iterative methods (the policy improvement algorithm and the gradient iteration algorithm) and the value function obtained from the solution of the Bellman PDE. This shows the fast convergence of the policy improvement method for our example in one dimension. In Figure 8.2, we can see that after only a few steps the policies obtained from the policy improvement algorithm are close to the exact one. Finally, in Figure 8.3, we plot the value function and the policy from the solution of the Bellman PDE.
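Steps 1 to 3 above can be sketched with an explicit finite difference scheme. The code below is a minimal illustration under assumptions that are not taken from the example: it uses a hypothetical one-dimensional model $dX=a\,dt+\sigma\,dW$ with running reward $-a^{2}/2$ , controls in $[-1,1]$ , a truncated spatial domain with Dirichlet boundary values, and the terminal condition $g(x)=\arctan(x)$ . The grid sizes and all function names are illustrative choices.

```python
import numpy as np

# Hypothetical 1-D test problem (an assumption, not the example's model):
#   dX = a dt + SIGMA dW, running reward -a^2/2, terminal reward arctan(x), a in [-1, 1].
SIGMA, T = 1.0, 1.0
X = np.linspace(-3.0, 3.0, 61)   # truncated spatial grid
NT = 200                         # number of time steps
DX, DT = X[1] - X[0], T / NT
# DT <= DX^2 / SIGMA^2 and DX <= SIGMA^2 (for |a| <= 1) keep the explicit scheme monotone.


def grad(u):
    """Central difference approximation of u_x at the interior nodes."""
    return (u[2:] - u[:-2]) / (2.0 * DX)


def step_back(u_next, a):
    """One explicit backward step of v_t + 0.5*SIGMA^2 v_xx + a v_x - a^2/2 = 0."""
    u = u_next.copy()            # boundary nodes keep the Dirichlet value arctan(x)
    lap = (u_next[2:] - 2.0 * u_next[1:-1] + u_next[:-2]) / DX**2
    u[1:-1] += DT * (0.5 * SIGMA**2 * lap + a * grad(u_next) - 0.5 * a**2)
    return u


def greedy(u_next):
    """Maximiser of a*p - a^2/2 over [-1, 1], where p is the discrete gradient."""
    return np.clip(grad(u_next), -1.0, 1.0)


def evaluate(policy):
    """Evaluation step: solve the linear PDE for a fixed policy (shape NT x interior nodes)."""
    v = np.empty((NT + 1, X.size))
    v[NT] = np.arctan(X)
    for k in range(NT - 1, -1, -1):
        v[k] = step_back(v[k + 1], policy[k])
    return v


def improve(v):
    """Improvement step: pointwise maximisation using the gradient of the current iterate."""
    return np.array([greedy(v[k + 1]) for k in range(NT)])


def solve_bellman():
    """Reference solution of the discretised Bellman equation (maximise in every step)."""
    v = np.empty((NT + 1, X.size))
    v[NT] = np.arctan(X)
    for k in range(NT - 1, -1, -1):
        v[k] = step_back(v[k + 1], greedy(v[k + 1]))
    return v


def policy_improvement(n_iter=8):
    policy = np.zeros((NT, X.size - 2))   # Step 1: initial control a^0 = 0
    iterates = []
    for _ in range(n_iter):               # Step 2: evaluate, then improve
        v = evaluate(policy)
        iterates.append(v)
        policy = improve(v)
    return iterates


iterates = policy_improvement()
v_ref = solve_bellman()
errors = [float(np.max(np.abs(v_ref - v))) for v in iterates]
```

Because the scheme is monotone, each iterate of this sketch dominates the previous one and stays below the discrete Bellman solution, mirroring the improvement property of Theorem 5.1. An implicit scheme would remove the time-step restriction at the cost of a linear solve per step.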

Appendix A Some results from the theory of BSDEs

We fix a finite horizon $T\in(0,\infty)$ . We fix a filtered probability space $(\Omega,\mathcal{F},\mathbb{F}=(\mathcal{F}_{t})_{0\leq t\leq T},\mathbb{P})$ . Let there be a $d^{\prime}$ -dimensional Wiener martingale on this space.

Lemma A.1.

Let $F:\Omega\times[0,T]\times\mathbb{R}^{d}\to\mathbb{R}$ be a measurable function that satisfies the following conditions: The process $(F_{t}(0))_{t\in[0,T]}$ is in $\mathbb{H}^{2}$ . Moreover there is a constant $\theta>0$ such that

[TABLE]

Then, for every $\xi\in L^{2}(\Omega,\mathcal{F}_{T})$ and $z\in\mathbb{H}^{2}$ , there is a unique solution $(Y,Z)\in\mathcal{S}^{2}\times\mathbb{H}^{2}$ to

[TABLE]

Proof.

This follows immediately from, e.g., Pham [15, Theorem 6.2.1]. ∎

Lemma A.2.

Let $F:\Omega\times[0,T]\times\mathbb{R}^{d}\to\mathbb{R}$ satisfy the hypothesis of Lemma A.1. Fix $\xi\in L^{2}(\Omega,\mathcal{F}_{T})$ . Let $\Phi:\mathbb{H}^{2}\ni z\mapsto(Y,Z)\in\mathcal{S}^{2}\times{\mathbb{H}}^{2}$ , where $(Y,Z)$ is the unique solution to (43). Moreover assume that for $z^{1},z^{2}\in\mathbb{H}^{2}$ the following condition holds: there is a constant $\theta>0$ such that

[TABLE]

Then there is $\gamma>0$ and $q\in(0,1)$ such that for $(Y^{i},Z^{i}):=\Phi(z^{i})$ , $i=1,2$ , and any $t\in[0,T]$ we have

[TABLE]

The proof is well known and is included, e.g., as part of Pham [15, Proof of Theorem 6.2.1]. We provide it here for the convenience of the reader. Before we proceed, we need to make the following observation.

Remark A.3.

Assume that $Y\in\mathcal{S}^{2}$ and $Z\in\mathbb{H}^{2}$ and let

[TABLE]

Then $\sup_{t\leq T}|M_{t}|\in L^{1}(\Omega,\mathcal{F}_{T})$ and hence $M_{t}$ is a uniformly integrable martingale. Indeed, from the Burkholder–Davis–Gundy inequality and the Young inequality we get

[TABLE]

Proof of Lemma A.2.

Consider $\gamma>0$ which we will fix later. We denote $\delta z:=z^{1}-z^{2}$ , $\delta Z:=Z^{1}-Z^{2}$ , $\delta Y:=Y^{1}-Y^{2}$ , and $\delta F:=F(z^{1})-F(z^{2})$ . We then apply Itô’s formula to $e^{\gamma t}|\delta Y_{t}|^{2}$ :

[TABLE]

Due to Remark A.3, the stochastic integral vanishes by taking expectation. Hence

[TABLE]

By the Lipschitz property of the generator and by the Young inequality we continue our estimate, noting that for any $\varepsilon>0$ , we have

[TABLE]

Choose $\varepsilon$ such that $\gamma=\varepsilon\theta$ . Thus

[TABLE]

Hence, from (45) we have that for $\gamma>\theta^{2}$ and any $t\in[0,T]$

[TABLE]

where $q=\frac{\theta^{2}}{\gamma}\in(0,1)$ . This concludes the proof of the lemma. ∎
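The choice $\gamma=\varepsilon\theta$ can be checked with a short computation. The sketch below assumes that condition (44) is the Lipschitz bound $|F_{s}(z^{1}_{s})-F_{s}(z^{2}_{s})|\leq\theta|z^{1}_{s}-z^{2}_{s}|$ ; under that assumption, the Young inequality $2ab\leq\varepsilon a^{2}+b^{2}/\varepsilon$ gives the constant $q$ used in the proof.

```latex
\begin{align*}
2\,|\delta Y_s|\,|\delta F_s|
  &\le 2\theta\,|\delta Y_s|\,|\delta z_s|
   \le \varepsilon\theta\,|\delta Y_s|^{2}
      + \frac{\theta}{\varepsilon}\,|\delta z_s|^{2},\\
\gamma=\varepsilon\theta
  &\;\Longrightarrow\;
  \frac{\theta}{\varepsilon}=\frac{\theta^{2}}{\gamma}=q .
\end{align*}
```

With this choice the term $\varepsilon\theta|\delta Y_{s}|^{2}$ is absorbed by the term $\gamma|\delta Y_{s}|^{2}$ produced by Itô’s formula, and $q<1$ holds exactly when $\gamma>\theta^{2}$ .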

Lemma A.4.

Let $F:\Omega\times[0,T]\times\mathbb{R}^{d}\times\mathbb{R}^{d}\to\mathbb{R}$ be a measurable function such that the process $(F_{t}(0,0))_{t\in[0,T]}$ is in $\mathbb{H}^{2}$ and such that there are $\theta,K>0$ so that for all $t\in[0,T]$ , $z,Z\in\mathbb{R}^{d}$ we have

[TABLE]

If $\xi\in L^{2}(\Omega,\mathcal{F}_{T})$ and $z\in\mathbb{H}^{2}$ , then there is a unique solution $(Y,Z)\in\mathcal{S}^{2}\times\mathbb{H}^{2}$ to

[TABLE]

Proof.

This follows immediately from, e.g., Pham [15, Theorem 6.2.1]. ∎

Lemma A.5.

Let $F:\Omega\times[0,T]\times\mathbb{R}^{d}\times\mathbb{R}^{d}\to\mathbb{R}$ satisfy the hypothesis of Lemma A.4. Fix $\xi\in L^{2}(\Omega,\mathcal{F}_{T})$ . Let $\Phi:\mathbb{H}^{2}\ni z\mapsto(Y,Z)\in\mathcal{S}^{2}\times{\mathbb{H}}^{2}$ , where $(Y,Z)$ is the unique solution to (46). Moreover assume that for $z^{1},z^{2}\in\mathbb{H}^{2}$ the following condition holds: there are constants $\theta,K>0$ such that

[TABLE]

where $(Y^{i},Z^{i}):=\Phi(z^{i})$ , $i=1,2$ . Then there is $\gamma>0$ and $q\in(0,1)$ such that for any $t\in[0,T]$ we have

[TABLE]

Moreover, there is $\gamma>0$ such that $q\in(0,1/2)$ .

Proof.

Consider $\gamma>0$ which we will fix later. We denote $\delta z:=z^{1}-z^{2}$ , $\delta Z:=Z^{1}-Z^{2}$ , and $\delta Y:=Y^{1}-Y^{2}$ . We then apply Itô’s formula to $e^{\gamma t}|\delta Y_{t}|^{2}$ :

[TABLE]

The expectation of the stochastic integral is $0$ due to Remark A.3. Hence, by taking expectation we derive from the equality above that

[TABLE]

By the Lipschitz property of the generator and by the Young inequality we observe that, for any $\varepsilon>0$ ,

[TABLE]

Take $\gamma>0$ sufficiently large so that $\tilde{q}:=\max(\frac{(\theta+K)K}{\gamma},\frac{(\theta+K)\theta}{\gamma})\in(0,1/2)$ . Choose $\varepsilon$ such that $\gamma=(\theta+K)\varepsilon$ . Thus

[TABLE]

Dividing by $1-\tilde{q}\in(1/2,1)$ we obtain

[TABLE]

where $q:=\frac{\tilde{q}}{1-\tilde{q}}$ . Since $0<\tilde{q}<1/2$ we have that $q\in(0,1)$ . Therefore from (47) we have for any $t\in[0,T]$

[TABLE]

By choosing $\gamma$ such that $\tilde{q}\in(0,1/3)$ we get that $q\in(0,1/2)$ . This concludes the proof of the lemma. ∎
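The dependence of the contraction constant on $\gamma$ , $\theta$ , and $K$ in this proof can be made concrete with a small numerical sketch. It only evaluates the formulas for $\tilde{q}$ and $q$ stated above; the function names and the sample values of $\theta$ , $K$ , and $\gamma$ are illustrative assumptions.

```python
def lemma_a5_constants(theta, K, gamma):
    """Return (q_tilde, q) with q_tilde = (theta + K) * max(theta, K) / gamma
    and q = q_tilde / (1 - q_tilde), as in the proof of Lemma A.5."""
    q_tilde = max((theta + K) * K / gamma, (theta + K) * theta / gamma)
    if not 0.0 < q_tilde < 0.5:
        raise ValueError("gamma is too small: q_tilde must lie in (0, 1/2)")
    return q_tilde, q_tilde / (1.0 - q_tilde)


def gamma_for(theta, K, q_tilde_target):
    """Smallest gamma that achieves the prescribed value of q_tilde."""
    return (theta + K) * max(theta, K) / q_tilde_target


# Hypothetical sample values: theta = 1, K = 2, gamma = 30.
q_tilde, q = lemma_a5_constants(1.0, 2.0, 30.0)
```

Since $q=\tilde{q}/(1-\tilde{q})$ is increasing in $\tilde{q}$ , taking $\gamma$ larger always improves the contraction constant, and $\tilde{q}<1/3$ is exactly what is needed for $q<1/2$ .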

We now state a comparison principle for BSDEs.

Lemma A.6.

Consider the following BSDEs:

[TABLE]

Assume that $\xi^{i}\in L^{2}(\Omega,\mathcal{F}_{T})$ , $i=1,2$ , and $\xi^{1}\leq\xi^{2}$ a.s. Let $\phi^{i}:\Omega\times[0,T]\times\mathbb{R}^{d}\rightarrow\mathbb{R}$ , $i=1,2$ , be such that for all $z\in\mathbb{R}^{d}$ the processes $(\phi^{i}(t,z))_{t\in[0,T]}$ are progressively measurable, $\phi^{i}(t,0)\in\mathbb{H}^{2}$ , and such that there is $\theta>0$ so that for all $t\in[0,T]$ , $z,z^{\prime}\in\mathbb{R}^{d}$ we have

[TABLE]

Moreover, suppose that for $Z^{1},Z^{2}\in\mathbb{H}^{2}$ it holds that

[TABLE]

Then $Y_{t}^{1}\leq Y_{t}^{2}$ for all $0\leq t\leq T$ a.s.

Proof.

This follows from, e.g., Pham [15, Theorem 6.2.2]. ∎

The following two lemmas are auxiliary results we need in Section 7.

Lemma A.7.

Let $F,\bar{F}:\Omega\times[0,T]\times\mathbb{R}^{d}\to\mathbb{R}$ be measurable functions and let $F$ satisfy the hypotheses of Lemmas A.1 and A.2. Fix $\xi\in L^{2}(\Omega,\mathcal{F}_{T})$ . Let $\bar{z},z,Z,\bar{Z}\in\mathbb{H}^{2}$ and $Y,\bar{Y}\in\mathcal{S}^{2}$ be such that

[TABLE]

and

[TABLE]

Then there is $\gamma>0$ and $q\in(0,1)$ such that for $t\in[0,T]$ we have

[TABLE]

Proof.

Consider $\gamma>0$ which we will fix later. We denote $\tilde{Y}:=\bar{Y}-Y$ , $\tilde{Z}:=\bar{Z}-Z$ , and $\tilde{z}:=\bar{z}-z$ . We then apply Itô’s formula to $e^{\gamma t}|\tilde{Y}_{t}|^{2}$ :

[TABLE]

Due to Remark A.3, the stochastic integral vanishes by taking expectation. Hence

[TABLE]

Notice that due to (44) for all $s\in[t,T]$ it holds that

[TABLE]

Then by the Young inequality we continue our estimate (48), noting that for any $\delta>0$ , we have

[TABLE]

Fix $\gamma>(1+\theta)\theta$ and $q=(1+\theta)\theta/\gamma$ . Let $\delta=\gamma/(1+\theta)$ . Then

[TABLE]

This concludes the proof of the lemma. ∎

Lemma A.8.

Let $\bar{F}:\Omega\times[0,T]\times\mathbb{R}^{d}\to\mathbb{R}$ be a measurable function and let $F$ satisfy the hypotheses of Lemmas A.4 and A.5. Fix $\xi\in L^{2}(\Omega,\mathcal{F}_{T})$ . Let $\bar{z},z,\bar{Z},Z\in\mathbb{H}^{2}$ and $\bar{Y},Y\in\mathcal{S}^{2}$ be such that

[TABLE]

and

[TABLE]

Then there is $\gamma>0$ and $q\in(0,1)$ such that for $t\in[0,T]$ we have

[TABLE]

Proof.

Consider $\gamma>0$ which we will fix later. We denote $\tilde{Y}:=\bar{Y}-Y$ , $\tilde{Z}:=\bar{Z}-Z$ , and $\tilde{z}:=\bar{z}-z$ . We then apply Itô’s formula to $e^{\gamma t}|\tilde{Y}_{t}|^{2}$ :

[TABLE]

Due to Remark A.3, the stochastic integral vanishes by taking expectation. Hence

[TABLE]

Notice that by assumptions of the lemma for all $s\in[t,T]$ it holds that

[TABLE]

Then by the Young inequality for any $\delta>0$ , we have

[TABLE]

Let us take $\gamma>0$ sufficiently large so that $\tilde{q}:=\max(\frac{(1+\theta+K)K}{\gamma},\frac{(1+\theta+K)\theta}{\gamma})\in(0,1/2)$ . Let $\delta:=\gamma/(1+\theta+K)$ so that

[TABLE]

Dividing by $(1-\tilde{q})\in(1/2,1)$ we obtain

[TABLE]

where $q:=\frac{\tilde{q}}{1-\tilde{q}}$ . Since $0<\tilde{q}<1/2$ we have that $q\in(0,1)$ . ∎

A.1. BSDE with drivers of quadratic growth.

Since we use BSDE theory in the proof of the main result, we present some results on BSDEs with drivers of quadratic growth. We refer to [16].

Consider the following system:

[TABLE]

Theorem A.9 (Theorem 3.6 in [16]).

Let $b:[0,T]\times\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}$ and $\sigma:[0,T]\times\mathbb{R}^{d}\rightarrow\mathbb{R}^{d\times d^{\prime}}$ be Lipschitz continuous with Lipschitz constant $C$ and $|b(t,0)|\leq C$ and $|\sigma(t,0)|\leq C$ for all $t\in[0,T]$ . Let $g:\mathbb{R}^{d}\rightarrow\mathbb{R}$ and $f:[0,T]\times\mathbb{R}^{d}\times\mathbb{R}^{d}\rightarrow\mathbb{R}$ be measurable functions and let us assume that there exists a constant $C$ such that for all $r\in\mathbb{R}^{+},t\in[0,T],x,x^{\prime}\in\mathbb{R}^{d}$ , and $z,z^{\prime}\in\mathbb{R}^{d}$

[TABLE]

There exists a solution $(Y,Z)$ of the Markovian BSDE (49) in $\mathcal{S}^{2}\times\mathbb{H}^{2}$ and this solution is unique among solutions $(Y,Z)\in\mathcal{S}^{2}\times\mathbb{H}^{2}$ such that $Y$ is bounded. Moreover, we have

[TABLE]

and

[TABLE]

where

[TABLE]

here the supremum is taken over all stopping times in $[0,T]$ .

Bibliography


[1] R. Bellman, Functional equations in the theory of dynamic programming. V. Positivity and quasi-linearity, Proc. Natl. Acad. Sci. USA, 41 (1955), pp. 743–746.
[2] R. Bellman, Dynamic Programming, Princeton University Press, Princeton, NJ, USA, 1957.
[3] R. A. Howard, Dynamic Programming and Markov Processes, MIT Press, Cambridge, MA, 1960.
[4] M. L. Puterman and S. L. Brumelle, On the convergence of policy iteration in stationary dynamic programming, Math. Oper. Res., 4 (1979), pp. 60–69.
[5] M. L. Puterman, On the convergence of policy iteration for controlled diffusions, J. Optim. Theory Appl., 33 (1981), pp. 137–144.
[6] N. V. Krylov, Controlled Diffusion Processes, Springer, New York, 1980.
[7] O. Hernandez-Lerma and J. Lasserre, Discrete-Time Markov Control Processes, Springer, New York, 1996.
[8] R. Buckdahn and S. Peng, Ergodic Backward SDE and Associated PDE, R. C. Dalang, M. Dozzi and F. Russo eds., Progr. Probab. 45, Birkhäuser, Basel, 1999.
