Generalized Second Order Value Iteration in Markov Decision Processes

Chandramouli Kamanchi; Raghuram Bharadwaj Diddigi; Shalabh Bhatnagar

arXiv:1905.03927·cs.LG·September 21, 2021


TL;DR

This paper introduces a second order value iteration method for Markov Decision Processes that accelerates convergence to the optimal value function by applying Newton-Raphson to the successive relaxation scheme, with proven convergence and demonstrated effectiveness.

Contribution

It proposes a novel second order value iteration algorithm based on Newton-Raphson applied to successive relaxation, improving convergence speed in MDPs.

Findings

01 Proves global convergence of the method

02 Demonstrates second order convergence rate

03 Shows improved efficiency through experiments


Tables

Table I: Comparison of Average Error for different values of $N$ on the 10 states and 5 actions setting at the end of 50 iterations. For the G-SOVI algorithm, the relaxation parameter is chosen to be the optimal relaxation parameter $w^{*}$, i.e., $w=w^{*}$.

| Value of N | Standard Value Iteration | Standard SOVI | G-SOVI |
|---|---|---|---|
| N=20 | 0.1009 $\pm$ 0.0026 | 0.1205 $\pm$ 0.0372 | 0.1093 $\pm$ 0.0818 |
| N=25 | | 0.0822 $\pm$ 0.0273 | 0.0648 $\pm$ 0.0217 |
| N=30 | | 0.0611 $\pm$ 0.0211 | 0.0494 $\pm$ 0.017 |
| N=35 | | 0.0484 $\pm$ 0.0168 | 0.0397 $\pm$ 0.0136 |

Table II: Comparison of Average Error in G-SOVI for different values of $w$ on the 10 states and 5 actions setting at the end of 50 iterations. The value of $N$ is 35.

| Value of w | G-SOVI |
|---|---|
| w = 1 (Standard SOVI) | 0.04838 $\pm$ 0.017 |
| w = 1.00001 | 0.04838 $\pm$ 0.017 |
| w = 1.0001 | 0.04837 $\pm$ 0.017 |
| w = 1.001 | 0.04830 $\pm$ 0.017 |
| w = 1.01 | 0.0476 $\pm$ 0.017 |
| w = 1.05 | 0.0448 $\pm$ 0.016 |
| w = 1.1 | 0.0417 $\pm$ 0.014 |
| w = $w^{*}$ | 0.0397 $\pm$ 0.014 |

Table III: Comparison of Average Error across four settings at the end of 10 iterations with $N=35$. For the G-SOVI algorithm, the relaxation parameter is chosen to be the optimal relaxation parameter $w^{*}$, i.e., $w=w^{*}$.

| Setting | Standard Value Iteration | Standard SOVI | G-SOVI |
|---|---|---|---|
| States = 30, Actions = 10 | 6.471 $\pm$ 0.07 | 0.087 $\pm$ 0.01 | 0.079 $\pm$ 0.01 |
| States = 50, Actions = 10 | 6.587 $\pm$ 0.07 | 0.114 $\pm$ 0.01 | 0.108 $\pm$ 0.01 |
| States = 80, Actions = 10 | 6.754 $\pm$ 0.03 | 0.141 $\pm$ 0.01 | 0.136 $\pm$ 0.01 |
| States = 100, Actions = 10 | 6.772 $\pm$ 0.03 | 0.152 $\pm$ 0.01 | 0.148 $\pm$ 0.01 |

Table IV: Per-iteration execution time of algorithms across four settings in seconds, with the relaxation parameter in G-SOVI chosen as $w=w^{*}$.

| Setting | Standard Value Iteration | Standard SOVI | G-SOVI |
|---|---|---|---|
| States = 30, Actions = 10 | 0.0008 $\pm$ 0.00 | 0.0154 $\pm$ 0.01 | 0.0267 $\pm$ 0.01 |
| States = 50, Actions = 10 | 0.0009 $\pm$ 0.00 | 0.0242 $\pm$ 0.00 | 0.0488 $\pm$ 0.00 |
| States = 80, Actions = 10 | 0.0011 $\pm$ 0.00 | 0.0532 $\pm$ 0.00 | 0.0988 $\pm$ 0.01 |
| States = 100, Actions = 10 | 0.0026 $\pm$ 0.00 | 0.1202 $\pm$ 0.01 | 0.1343 $\pm$ 0.01 |

Table V: Average Error vs Computational Time (rounded off to the nearest millisecond). Initial Q-values for algorithms are assigned random integers between 60 and 70. The discount factor is set to 0.99. G-SOVI is run with $w=1.00001$.

| Configuration | Computational Time (in seconds) | Standard Value Iteration | Standard SOVI | G-SOVI |
|---|---|---|---|---|
| 10 States, 5 Actions | 0.01 | 25.485 $\pm$ 2.21 | 3.930 $\pm$ 0.92 | 3.885 $\pm$ 0.94 |
| 20 States, 5 Actions | 0.02 | 18.291 $\pm$ 0.77 | 5.444 $\pm$ 0.51 | 5.473 $\pm$ 0.50 |
| 30 States, 5 Actions | 0.03 | 7.327 $\pm$ 0.20 | 7.111 $\pm$ 0.32 | 7.118 $\pm$ 0.33 |

Equations

$$\mathbb{E}\Big[\sum_{t=0}^{\infty}\gamma^{t}r(s_{t},\pi(s_{t}),s_{t+1})\mid s_{0}=i\Big].$$

$$V^{*}(i)=\max_{a\in A}\Big\{\sum_{j=1}^{M}p(j|i,a)\big(r(i,a,j)+\gamma V^{*}(j)\big)\Big\},\ \forall i\in S.$$

$$V_{n}(i)=\max_{a\in A}\Big\{\sum_{j=1}^{M}p(j|i,a)\big(r(i,a,j)+\gamma V_{n-1}(j)\big)\Big\},\ n\geq 1,\ \forall i\in S.$$

$$V^{*}=TV^{*},$$

$$(TV)(i)=\max_{a\in A}\Big\{r(i,a)+\gamma\sum_{j=1}^{M}p(j|i,a)V(j)\Big\},$$

$$V^{*}=\lim_{n\rightarrow\infty}V_{n}=TV^{*}.$$

$$Q^{*}(i,a):=r(i,a)+\gamma\sum_{j=1}^{M}p(j|i,a)V^{*}(j).$$

$$V^{*}(i)=\max_{a\in A}Q^{*}(i,a).$$

$$Q^{*}(i,a)=r(i,a)+\gamma\sum_{j=1}^{M}p(j|i,a)\max_{b\in A}Q^{*}(j,b).$$

$$\pi(i)=\arg\max_{a\in A}Q^{*}(i,a).$$

$$w^{*}=\frac{1}{1-\gamma\min_{i,a}p(i|i,a)}.$$

$$T_{w}(V)=w(TV)+(1-w)V,$$

$$Q_{w}(i,a)=w\Big(r(i,a)+\gamma\sum_{j=1}^{M}p(j|i,a)\max_{b\in A}Q_{w}(j,b)\Big)+(1-w)\max_{c\in A}Q_{w}(i,c),$$

$$\max_{a\in A}Q_{w}(i,a)=\max_{a\in A}Q^{*}(i,a),\ \forall i\in S.$$

$$Q_{n}(i,a)=w\Big(r(i,a)+\gamma\sum_{j=1}^{M}p(j|i,a)\max_{b\in A}Q_{n-1}(j,b)\Big)+(1-w)\max_{c\in A}Q_{n-1}(i,c),\ \forall(i,a).$$

$$x_{n}=x_{n-1}-J_{F}^{-1}(x_{n-1})F(x_{n-1}),\ n\geq 1,$$

$$Q_{n}(i,a)=w\bigg(r(i,a)+\gamma\sum_{j=1}^{M}p(j|i,a)\frac{1}{N}\log\sum_{b=1}^{|A|}e^{NQ_{n-1}(j,b)}\bigg)+(1-w)\frac{1}{N}\log\sum_{b=1}^{|A|}e^{NQ_{n-1}(i,b)},\ n\geq 1,$$

$$UQ(i,a)=w\bigg(r(i,a)+\gamma\sum_{j=1}^{M}p(j|i,a)\frac{1}{N}\log\sum_{b=1}^{|A|}e^{NQ(j,b)}\bigg)+(1-w)\frac{1}{N}\log\sum_{b=1}^{|A|}e^{NQ(i,b)}.$$

$$Q_{n}(i,a)=UQ_{n-1}(i,a).$$

$$\begin{cases}w\gamma p(k|i,a)\dfrac{e^{NQ_{n}(k,c)}}{\sum_{b\in A}e^{NQ_{n}(k,b)}}, & (k,c)\neq(i,a),\\[2ex] \big(1-w+w\gamma p(k|i,a)\big)\dfrac{e^{NQ_{n}(k,c)}}{\sum_{b\in A}e^{NQ_{n}(k,b)}}, & (k,c)=(i,a).\end{cases}$$

$$\begin{aligned}\big|f(x)-g_{N}(x)\big| &=\bigg|x_{i_{*}}-\frac{1}{N}\log\Big[\Big(\sum_{i=1}^{d}e^{N(x_{i}-x_{i_{*}})}\Big)e^{Nx_{i_{*}}}\Big]\bigg|\\ &=\bigg|\frac{1}{N}\log\bigg(\sum_{i=1}^{d}e^{N(x_{i}-x_{i_{*}})}\bigg)\bigg|\\ &\leq\bigg|\frac{\log d}{N}\bigg|\rightarrow 0\text{ as }N\rightarrow\infty.\end{aligned}$$

$$(UQ)(i,a)=w\bigg(r(i,a)+\gamma\sum_{j=1}^{M}p(j|i,a)\frac{1}{N}\log\sum_{b=1}^{|A|}e^{NQ(j,b)}\bigg)+\frac{(1-w)}{N}\log\sum_{b=1}^{|A|}e^{NQ(i,b)}.$$

$$q(k|i,a)=\begin{cases}\dfrac{w\gamma p(k|i,a)}{(1-w+w\gamma)}, & k\neq i,\\[2ex] \dfrac{1-w+w\gamma p(i|i,a)}{(1-w+w\gamma)}, & k=i.\end{cases}$$

$$\begin{aligned}\big|(UP)(i,a)-(UQ)(i,a)\big| &=(1-w+w\gamma)\Bigg|\mathbb{E}\Bigg[\frac{\log\sum_{b=1}^{|A|}e^{NP(j,b)}}{N}-\frac{\log\sum_{b=1}^{|A|}e^{NQ(j,b)}}{N}\Bigg]\Bigg|\\ &=(1-w+w\gamma)\Bigg|\mathbb{E}\Bigg[\Bigg(\frac{e^{N\xi(j,.)}}{\sum_{b\in A}e^{N\xi(j,b)}}\Bigg)^{T}\Big(P(j,.)-Q(j,.)\Big)\Bigg]\Bigg|\\ &\leq(1-w+w\gamma)\mathbb{E}\Bigg[\Bigg|\Bigg(\frac{e^{N\xi(j,.)}}{\sum_{b\in A}e^{N\xi(j,b)}}\Bigg)^{T}\Big(P(j,.)-Q(j,.)\Big)\Bigg|\Bigg]\end{aligned}$$


Full text

Generalized Second Order Value Iteration in Markov Decision Processes

Chandramouli Kamanchi∗, Raghuram Bharadwaj Diddigi∗, Shalabh Bhatnagar

∗ Equal Contribution. The authors are with the Department of Computer Science and Automation, Indian Institute of Science (IISc), Bengaluru 560012, India. (E-mails: {chandramouli, raghub, shalabh}@iisc.ac.in). Raghuram Bharadwaj was supported by a fellowship grant from the Centre for Networked Intelligence (a Cisco CSR initiative) of the Indian Institute of Science, Bangalore. Shalabh Bhatnagar was supported by the J.C.Bose Fellowship, a project from DST under the ICPS Program and the RBCCPS, IISc.

Abstract

Value iteration is a fixed point iteration technique utilized to obtain the optimal value function and policy in a discounted reward Markov Decision Process (MDP). Here, a contraction operator is constructed and applied repeatedly to arrive at the optimal solution. Value iteration is a first order method and therefore it may take a large number of iterations to converge to the optimal solution. Successive relaxation is a popular technique that can be applied to solve a fixed point equation. It has been shown in the literature that, under a special structure of the MDP, successive over-relaxation technique computes the optimal value function faster than standard value iteration. In this work, we propose a second order value iteration procedure that is obtained by applying the Newton-Raphson method to the successive relaxation value iteration scheme. We prove the global convergence of our algorithm to the optimal solution asymptotically and show the second order convergence. Through experiments, we demonstrate the effectiveness of our proposed approach.

©2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. This paper is accepted for publication at IEEE Transactions on Automatic Control. DOI: 10.1109/TAC.2021.3112851

I Introduction

In a discounted reward Markov Decision Process [2], the objective is to maximize the expected cumulative discounted reward. Reinforcement Learning (RL) deals with the algorithms for solving an MDP problem when the model information (i.e., probability transition matrix and reward function) is unknown. RL algorithms instead make use of state and reward samples and estimate the optimal value function and policy. Due to the success of deep learning [12], RL algorithms in combination with deep neural networks have been successfully deployed to solve many real world problems and games [13]. However, there is ongoing research for improving the sample-efficiency as well as convergence of RL algorithms [10].

Many RL algorithms can be viewed as stochastic approximation [3] variants of the Bellman equation [1] in MDPs. For example, the popular Q-learning algorithm [19] can be viewed as a stochastic fixed point iteration to solve the Q-Bellman equation. Therefore, we believe that in order to improve the performance of RL algorithms, a promising first step would be to propose faster algorithms for solving MDPs when the model information is known. To this end, we propose a second order value iteration technique that has global convergence, which is a desirable property. In [17], the successive over-relaxation technique is applied to the Bellman equation to obtain a faster value iteration algorithm. In this work, we propose a Generalized Second Order Value Iteration (G-SOVI) method for computing the optimal value function and policy when the model information is known. This is achieved by the application of Newton-Raphson method to the Successive relaxation variant of the Q-Bellman Equation (henceforth denoted as SQBE). The key differences between G-SOVI and standard SOVI algorithms are the incorporation of relaxation parameter $w$ and the construction of single-stage reward as discussed in Section IV.

Note that we cannot directly apply the Newton-Raphson method to SQBE because of the presence of $\max(.)$ operator in the equation, which is not differentiable. Therefore, we approximate the $\max$ operator by a smooth function $g_{N}$ [14], where $N$ is a given parameter. This approximation allows us to apply the second order method thereby ensuring faster rate of convergence.

The solution obtained by our second order technique on the modified SQBE may be different from the solution of the original MDP problem because of the approximation of the $\max$ operator by $g_{N}$ . However, we show that our proposed algorithm converges to the actual solution as $N\xrightarrow{}\infty$ .

We show through numerical experiments that, given a finite number of iterations, our proposed algorithm computes a solution that is closer to the actual solution than the one obtained from standard value iteration. Moreover, under a special structure of the MDP, the solution is better than that of standard second-order value iteration [18].

II Related Work And Our Contributions

Value Iteration and Policy Iteration are two classical numerical techniques employed for solving MDP problems. In [16], it has been shown that the Newton-Kantorovich method (also known as the Newton-Raphson method) applied to the exact Bellman equation gives rise to the policy iteration scheme. In Section 2.5 of [18], a second-order value iteration technique (we refer to it as standard SOVI) is proposed by applying the Newton-Kantorovich method to the smooth (soft-max) Bellman equation, and remarks about the second-order rate and global convergence are provided. However, convergence analysis is not discussed for the SOVI technique. Approximate Newton methods have been proposed in [6] for policy optimization in MDPs. A detailed analysis of the Hessian of the objective function is provided and algorithms are derived. In recent times, the smooth Bellman equation has been successfully used in the development of many RL algorithms. For instance, in [9], a soft Q-learning algorithm has been proposed that learns the maximum entropy policies. The algorithm makes use of the smooth Q-Bellman equation with an additional entropy term. In [4], the SBEED (Smoothed Bellman Error Embedding) algorithm has been proposed, which computes the optimal policy by formulating the smooth Bellman equation as a primal-dual optimization problem. In [5], a matrix-gain learning algorithm, namely Zap Q-learning, has been proposed, which is seen to have similar performance as the Newton-Raphson method. Very recently, an accelerated value iteration technique was proposed in [8] by applying Nesterov's accelerated gradient technique to value iteration.

We now summarize the main contributions of our paper:

• We propose a generalized second order Q-value iteration algorithm that is derived from the successive relaxation technique as well as the Newton-Raphson method. In fact, we show that standard SOVI is a special case of our proposed algorithm.

• We prove the global convergence of our algorithm and provide a second order convergence rate result.

• We derive a bound on the error defined in terms of the value function obtained by our proposed method and the actual value function and show that the error vanishes asymptotically.

• Through experimental evaluation, we further confirm that our proposed technique provides a better near-optimal solution compared to that of the value iteration procedure when run for the same (finite) number of iterations.

III Background and Preliminaries

A discounted reward Markov Decision Process (MDP) is characterized via a tuple $(S,A,p,r,\gamma)$ where $S=\{1,2,\cdots,i,\ldots,j,\cdots,M\}$ denotes the set of states, $A=\{a_{1},\ldots,a_{K}\}$ denotes the set of actions, $p$ is the transition probability rule i.e., $p(j|i,a)$ denotes the probability of transition from state $i$ to state $j$ when action $a$ is chosen. Also, $r(i,a,j)$ denotes the single-stage reward obtained in state $i$ when action $a$ is chosen and the next state is $j$ . Finally, $0\leq\gamma<1$ denotes the discount factor. The objective in an MDP is to learn an optimal policy $\pi:S\xrightarrow{}A$ , where $\pi(i)$ denotes the action to be taken in state $i$ , that maximizes the cumulative discounted reward objective given by:

$$\mathbb{E}\Big[\sum_{t=0}^{\infty}\gamma^{t}r(s_{t},\pi(s_{t}),s_{t+1})\mid s_{0}=i\Big]. \tag{1}$$

In (1), $s_{t}$ is the state at time $t$ and $\mathbb{E}[.]$ is the expectation taken over the entire trajectory of states obtained over times $t=1,\ldots,\infty$ . Let $V^{*}(.)$ be the value function with $V^{*}(i)$ being the value of state $i$ that represents the total discounted reward obtained starting from state $i$ and following the optimal policy $\pi$ . The value function can be obtained by solving the Bellman equation [2] given by:

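$$V^{*}(i)=\max_{a\in A}\Big\{\sum_{j=1}^{M}p(j|i,a)\big(r(i,a,j)+\gamma V^{*}(j)\big)\Big\},\ \forall i\in S. \tag{2}$$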

We assume here for simplicity that all actions are feasible in every state. Value iteration is a popular numerical scheme employed to obtain the value function and so the optimal policy. It works as follows: An initial estimate of the value function $V_{0}$ is selected arbitrarily and a sequence of $V_{n},\leavevmode\nobreak\ n\geq 1$ is generated in an iterative fashion as below:

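$$V_{n}(i)=\max_{a\in A}\Big\{\sum_{j=1}^{M}p(j|i,a)\big(r(i,a,j)+\gamma V_{n-1}(j)\big)\Big\},\ n\geq 1,\ \forall i\in S. \tag{3}$$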

Let $\zeta$ denote the set of all bounded functions from $S$ to $\mathbb{R}$ . Note that equation (2) can be rewritten as:

$$V^{*}=TV^{*},$$

where the operator $T:\zeta\xrightarrow{}\zeta$ is defined by:

$$(TV)(i)=\max_{a\in A}\Big\{r(i,a)+\gamma\sum_{j=1}^{M}p(j|i,a)V(j)\Big\},$$

and $r(i,a)=\displaystyle\sum_{j=1}^{M}p(j|i,a)r(i,a,j)$ is the expected single-stage reward in state $i$ when action $a$ is chosen. It is easy to see that $T$ is a sup-norm contraction map with contraction factor $\gamma$, i.e., the discount factor. Therefore, from the contraction mapping theorem, it is clear that the value iteration scheme given by equation (3) converges to the optimal value function, i.e.,

$$V^{*}=\lim_{n\rightarrow\infty}V_{n}=TV^{*}.$$
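The fixed point iteration described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the array layout `p[a, i, j]` for $p(j|i,a)$, the expected-reward array `r[i, a]`, and the toy sizes are assumptions made here for the example.

```python
import numpy as np

def value_iteration(p, r, gamma, n_iters=500):
    """Repeatedly apply T. p[a, i, j] = p(j | i, a); r[i, a] = expected reward."""
    V = np.zeros(p.shape[1])
    for _ in range(n_iters):
        # (TV)(i) = max_a { r(i, a) + gamma * sum_j p(j | i, a) V(j) }
        V = np.max(r + gamma * np.einsum("aij,j->ia", p, V), axis=1)
    return V

# Hypothetical random MDP used only for illustration.
rng = np.random.default_rng(0)
M, K, gamma = 4, 3, 0.9
p = rng.random((K, M, M))
p /= p.sum(axis=2, keepdims=True)   # normalise each row into a distribution
r = rng.random((M, K))

V_star = value_iteration(p, r, gamma)

# At the fixed point, one more application of T should leave V unchanged.
TV = np.max(r + gamma * np.einsum("aij,j->ia", p, V_star), axis=1)
residual = np.max(np.abs(TV - V_star))
```

Because $T$ contracts with factor $\gamma$, the max-norm error shrinks by at least that factor per sweep, which is why a few hundred sweeps suffice here for $\gamma=0.9$.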

Let $Q^{*}(i,a)$ with $(i,a)\in S\times A$ , be defined as:

$$Q^{*}(i,a):=r(i,a)+\gamma\sum_{j=1}^{M}p(j|i,a)V^{*}(j). \tag{6}$$

Here $Q^{*}(i,a)$ is the optimal Q-value function associated with state $i$ and action $a$ . It denotes the total discounted reward obtained starting from state $i$ upon taking action $a$ and following the optimal policy in subsequent states. Then from (2), it is clear that

$$V^{*}(i)=\max_{a\in A}Q^{*}(i,a). \tag{7}$$

Therefore, the equation (6) can be re-written as follows:

$$Q^{*}(i,a)=r(i,a)+\gamma\sum_{j=1}^{M}p(j|i,a)\max_{b\in A}Q^{*}(j,b). \tag{8}$$

This is known as the Q-Bellman equation. We obtain the optimal policy by letting

$$\pi(i)=\arg\max_{a\in A}Q^{*}(i,a). \tag{9}$$

In [17], a modified value iteration algorithm is proposed based on the idea of successive relaxation. Let us define

$$w^{*}=\frac{1}{1-\gamma\min_{i,a}p(i|i,a)}. \tag{10}$$

Note that $w^{*}\geq 1$ . For $0<w\leq w^{*},$ we define a modified Bellman operator as follows:

$$T_{w}(V)=w(TV)+(1-w)V, \tag{11}$$

where $w$ is called the ‘relaxation’ parameter. It is easy to see that the fixed point of $T_{w}$ is also the optimal value function of the MDP (fixed point of $T$). Moreover, it is shown in [17] that the contraction factor of $T_{w}$ is $1-w+w\gamma$. Under a special structure of the MDP, i.e., with $p(i|i,a)>0,\ \forall i,a$, we have $w^{*}>1$ (strictly greater than $1$). Then, the relaxation parameter $w$ can be chosen in three possible ways:

1. If $0<w<1,$ then the contraction factor of $T_{w}$ is more than the contraction factor of $T$.

2. If $w=1,$ then $T=T_{w}$ and hence the contraction factors of both the operators are the same.

3. If $1<w\leq w^{*},$ the contraction factor of $T_{w}$ is less than the contraction factor of $T$. This implies that the fixed point iteration utilizing (11) generates the optimal value function faster than the standard value iteration.
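The three cases above can be checked numerically. The sketch below computes $w^{*}$ from the self-transition probabilities and compares the relaxed iteration $T_{w}$ with plain value iteration after the same number of sweeps. The random MDP, the array layout `p[a, i, j]`, and the extra weight added to self-loops are assumptions made for this example only.

```python
import numpy as np

rng = np.random.default_rng(1)
M, K, gamma = 5, 3, 0.9
# Hypothetical MDP with boosted self-loops so that p(i | i, a) > 0 and w* > 1.
p = rng.random((K, M, M)) + 0.5 * np.eye(M)
p /= p.sum(axis=2, keepdims=True)
r = rng.random((M, K))              # non-negative expected rewards

def T(V):
    return np.max(r + gamma * np.einsum("aij,j->ia", p, V), axis=1)

# w* = 1 / (1 - gamma * min_{i,a} p(i | i, a))
w_star = 1.0 / (1.0 - gamma * np.diagonal(p, axis1=1, axis2=2).min())

def iterate(w, n_iters):
    V = np.zeros(M)
    for _ in range(n_iters):
        V = w * T(V) + (1.0 - w) * V    # T_w(V)
    return V

V_ref = iterate(1.0, 2000)              # effectively the fixed point of T
err_vi = np.max(np.abs(iterate(1.0, 30) - V_ref))
err_sor = np.max(np.abs(iterate(w_star, 30) - V_ref))
```

With these settings `err_sor` should not exceed `err_vi`, in line with the smaller contraction factor $1-w+w\gamma$ for $w>1$; the size of the gap depends on how large the self-transition probabilities are.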

In [11], a successive relaxation Q-Bellman equation (we call the Generalized Q-Bellman equation) is constructed as follows:

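$$Q_{w}(i,a)=w\Big(r(i,a)+\gamma\sum_{j=1}^{M}p(j|i,a)\max_{b\in A}Q_{w}(j,b)\Big)+(1-w)\max_{c\in A}Q_{w}(i,c), \tag{12}$$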

where $0<w\leq w^{*}$. It has been shown that, although the Q-values obtained by (12) can be different from the optimal Q-values, the optimal value functions are still the same. That is,

$$\max_{a\in A}Q_{w}(i,a)=\max_{a\in A}Q^{*}(i,a),\ \forall i\in S. \tag{13}$$

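The Generalized Q-values ($Q_{w}$ in (12)) are obtained as follows. An initial estimate $Q_{0}$ of $Q_{w}$ is arbitrarily selected and a sequence of $Q_{n},\ n\geq 1$ is obtained according to:

$$Q_{n}(i,a)=w\Big(r(i,a)+\gamma\sum_{j=1}^{M}p(j|i,a)\max_{b\in A}Q_{n-1}(j,b)\Big)+(1-w)\max_{c\in A}Q_{n-1}(i,c),\ \forall(i,a). \tag{14}$$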

It is shown in [11] that the Q-values obtained by (14) converge to the generalized Q-values $Q_{w}$. In this way, we obtain the optimal value function and optimal policy using the successive relaxation Q-value iteration scheme. In this work, our objective is to approximate the generalized Q-Bellman equation and apply the Newton-Raphson second order technique to solve for the optimal value function. Recall that we cannot apply the second order method directly to equation (12) as the $\max(.)$ operator on the RHS is not differentiable. Before we propose our algorithm, we briefly discuss Newton’s second order technique [15] for solving a non-linear system of equations.
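A short sketch of the successive relaxation Q-value iteration may help make the scheme concrete. It runs the iteration once with $w=w^{*}$ and once with $w=1$ (ordinary Q-value iteration) and compares the resulting value functions. The toy MDP and the array layout `p[a, i, j]` are assumptions of this example, not taken from [11].

```python
import numpy as np

rng = np.random.default_rng(2)
M, K, gamma = 5, 3, 0.9
# Hypothetical MDP with p(i | i, a) > 0 so that w* > 1.
p = rng.random((K, M, M)) + 0.5 * np.eye(M)
p /= p.sum(axis=2, keepdims=True)
r = rng.random((M, K))
w_star = 1.0 / (1.0 - gamma * np.diagonal(p, axis1=1, axis2=2).min())

def sr_q_iteration(w, n_iters=3000):
    Q = np.zeros((M, K))
    for _ in range(n_iters):
        V = Q.max(axis=1)                                  # max_b Q(j, b)
        backup = r + gamma * np.einsum("aij,j->ia", p, V)  # r(i,a) + gamma * E[max_b Q]
        Q = w * backup + (1.0 - w) * V[:, None]            # + (1 - w) max_c Q(i, c)
    return Q

Q_w = sr_q_iteration(w_star)
Q_star = sr_q_iteration(1.0)       # w = 1 gives the usual Q-value iteration

# The Q-tables may differ, but their row-wise maxima should agree.
value_gap = np.max(np.abs(Q_w.max(axis=1) - Q_star.max(axis=1)))
q_gap = np.max(np.abs(Q_w - Q_star))
```

Here `value_gap` should be at the level of floating point error, while `q_gap` is generally non-zero, which mirrors the statement that the generalized Q-values can differ from $Q^{*}$ even though the value functions coincide.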

Consider a function $F:\mathbb{R}^{d}\xrightarrow{}\mathbb{R}^{d}$ that is twice differentiable. Suppose we are interested in finding a root of $F$ i.e., a point $x$ such that $F(x)=0$ . The Newton-Raphson method can be applied to find a solution here. We select an initial point $x_{0}$ and then proceed as follows:

$$x_{n}=x_{n-1}-J_{F}^{-1}(x_{n-1})F(x_{n-1}),\ n\geq 1, \tag{15}$$

where $J_{F}(x)$ is the Jacobian of the function $F$ evaluated at point $x$ . Under suitable hypotheses it can be shown that the procedure (15) leads to second order convergence.
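As a generic illustration of the Newton-Raphson update, the sketch below solves a small two-dimensional system. The test system and the function names are hypothetical and chosen only because the root is known in closed form; the linear system is solved directly instead of forming the inverse Jacobian.

```python
import numpy as np

def newton_raphson(F, J, x0, n_iters=20):
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        # Solve J_F(x) y = F(x) and step, instead of computing J_F(x)^{-1}.
        x = x - np.linalg.solve(J(x), F(x))
    return x

# Hypothetical system: x0^2 + x1^2 = 4 and x0 = x1, with root (sqrt(2), sqrt(2)).
def F(x):
    return np.array([x[0] ** 2 + x[1] ** 2 - 4.0, x[0] - x[1]])

def J(x):
    return np.array([[2.0 * x[0], 2.0 * x[1]],
                     [1.0, -1.0]])

root = newton_raphson(F, J, x0=[1.0, 2.0])
```

Unlike this toy problem, where convergence depends on the starting point, the paper establishes convergence of its Newton iteration from any initial $Q_{0}$.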

In the next section, we construct a function $F$ for our problem and apply the Newton-Raphson method to find the optimal value function and policy pair.

IV Proposed Algorithm

We construct our modified SQBE as follows. We first approximate the $\max(.)$ operator, i.e., the function $f(x)=\max^{d}_{i=1}{x_{i}}$, where $x=(x_{1},\ldots,x_{d})$, with $g_{N}(x)=\frac{1}{N}\log\displaystyle\sum^{d}_{i=1}e^{Nx_{i}},$ as the $\max(.)$ operator is not differentiable. We note here that $g_{N}(x)$ is a smooth approximation of the $\max$ operator $f(x)$, as shown in Lemma 1. Then equation (14) can be rewritten as follows:

$$Q_{n}(i,a)=w\bigg(r(i,a)+\gamma\sum_{j=1}^{M}p(j|i,a)\frac{1}{N}\log\sum_{b=1}^{|A|}e^{NQ_{n-1}(j,b)}\bigg)+(1-w)\frac{1}{N}\log\sum_{b=1}^{|A|}e^{NQ_{n-1}(i,b)},\ n\geq 1, \tag{16}$$

starting with an initial $Q_{0}$ (arbitrarily chosen in general). Therefore our modified Successive Q-Bellman (SQB) operator $U:\mathbb{R}^{|S|\times|A|}\xrightarrow{}\mathbb{R}^{|S|\times|A|}$ is defined as follows. For $0<w\leq w^{*}$ ,

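$$UQ(i,a)=w\bigg(r(i,a)+\gamma\sum_{j=1}^{M}p(j|i,a)\frac{1}{N}\log\sum_{b=1}^{|A|}e^{NQ(j,b)}\bigg)+(1-w)\frac{1}{N}\log\sum_{b=1}^{|A|}e^{NQ(i,b)}.$$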

The numerical scheme (16) is thus

$$Q_{n}(i,a)=UQ_{n-1}(i,a).$$

Finally, by an application of the Newton-Raphson method to $U$ , our Generalized Second Order Value Iteration (G-SOVI) is obtained as described in Algorithm 1. Note that setting $w=1$ in Step 4 of the algorithm yields the standard SOVI algorithm.

Remark 1.

Note that in our case, the function $F$ in equation (15) corresponds to $F(Q)=Q-UQ$ and $J_{F}(Q)=I-J_{U}(Q)$ is a $|S\times A|\times|S\times A|$ dimensional matrix.

Remark 2.

Note that directly computing $\big{(}I-J_{U}(Q)\big{)}^{-1}(Q-UQ)$ would involve $O(|S|^{3}|A|^{3})$ complexity. This computation could be carried out by solving the system $(I-J_{U}(Q))Y=Q-UQ$ for $Y$ to avoid numerical stability issues. Moreover the per-iteration time complexity of the Algorithm 1 is also $O(|S|^{3}|A|^{3})$ .

Remark 3.

Note that G-SOVI reduces to standard SOVI in the case $w=1.$ Moreover the computational complexity for both the algorithms is the same.

Remark 4.

The space required for storing the Jacobian matrix $J_{U}$ is $|S|^{2}|A|^{2}$. Hence the space complexity of Algorithm 1 is $O(|S|^{2}|A|^{2})$.
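The remarks above can be put together into a compact sketch of a G-SOVI style iteration. This is an illustration assembled from the definition of $U$ and Remarks 1 and 2, not a transcription of Algorithm 1. The Jacobian entries are obtained here by differentiating $U$ directly, which gives $\big(w\gamma p(k|i,a)+(1-w)\mathbb{1}\{k=i\}\big)$ times the soft-max weight of $Q(k,\cdot)$ at action $c$; that closed form, the toy MDP, the array layout `p[a, i, j]`, and all function names are assumptions of this sketch.

```python
import numpy as np

def g_N(Q, N):
    """Row-wise (1/N) log sum_b exp(N Q(i, b)), with the max factored out for stability."""
    m = Q.max(axis=1)
    return m + np.log(np.exp(N * (Q - m[:, None])).sum(axis=1)) / N

def softmax_N(Q, N):
    Z = np.exp(N * (Q - Q.max(axis=1, keepdims=True)))
    return Z / Z.sum(axis=1, keepdims=True)

def U(Q, p, r, gamma, w, N):
    G = g_N(Q, N)
    return w * (r + gamma * np.einsum("aij,j->ia", p, G)) + (1.0 - w) * G[:, None]

def jacobian_U(Q, p, gamma, w, N):
    M, K = Q.shape
    # coef[i, a, k] = w * gamma * p(k | i, a) + (1 - w) * 1{k == i}
    coef = w * gamma * np.transpose(p, (1, 0, 2)) + (1.0 - w) * np.eye(M)[:, None, :]
    # Entry ((i, a), (k, c)) = coef[i, a, k] * softmax_N(Q(k, .))[c]
    return (coef[..., None] * softmax_N(Q, N)[None, None]).reshape(M * K, M * K)

def gsovi(Q0, p, r, gamma, w, N, n_iters=50):
    Q = Q0.astype(float)
    M, K = Q.shape
    for _ in range(n_iters):
        F = (Q - U(Q, p, r, gamma, w, N)).reshape(-1)
        # Newton step: solve (I - J_U(Q)) Y = Q - UQ rather than inverting.
        Y = np.linalg.solve(np.eye(M * K) - jacobian_U(Q, p, gamma, w, N), F)
        Q = Q - Y.reshape(M, K)
    return Q

# Hypothetical toy problem.
rng = np.random.default_rng(3)
M, K, gamma, N = 6, 3, 0.9, 35
p = rng.random((K, M, M)) + 0.5 * np.eye(M)
p /= p.sum(axis=2, keepdims=True)
r = rng.random((M, K))
w = 1.0 / (1.0 - gamma * np.diagonal(p, axis1=1, axis2=2).min())   # w = w*

Q0 = rng.integers(10, 21, size=(M, K))
Q_fix = gsovi(Q0, p, r, gamma, w, N)
residual = np.max(np.abs(Q_fix - U(Q_fix, p, r, gamma, w, N)))
```

Passing `w=1.0` to the same functions gives the soft-max Newton iteration without relaxation. Each step solves a dense $|S||A|\times|S||A|$ linear system, which is where the cubic per-iteration cost mentioned in Remark 2 comes from.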

V Convergence Analysis

In this section we study the convergence analysis of our algorithm. Note that the norm considered in the following analysis is the max-norm, i.e., $\|x\|:=\max_{1\leq i\leq d}|x_{i}|.$ Throughout this section, it is assumed that the relaxation parameter $w$ satisfies $0<w\leq w^{*},$ where $w^{*}$ is as defined in (10).

Lemma 1.

Suppose $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ and $f(x):=\displaystyle\max\{x_{1},x_{2},\cdots,x_{d}\}.$ Let $g_{N}:\mathbb{R}^{d}\rightarrow\mathbb{R}$ be defined as $g_{N}(x):=\frac{1}{N}\log\displaystyle\sum_{i=1}^{d}e^{Nx_{i}}.$ Then $\displaystyle\sup_{x\in\mathbb{R}^{d}}\big{|}f(x)-g_{N}(x)\big{|}\longrightarrow 0$ as $N\longrightarrow\infty.$

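*Proof:* Let $x_{i_{*}}=\max\{x_{1},x_{2},\cdots,x_{d}\}$ (where $i_{*}$ denotes the corresponding $\arg\max$). Now

$$\begin{aligned}\big|f(x)-g_{N}(x)\big| &=\bigg|x_{i_{*}}-\frac{1}{N}\log\Big[\Big(\sum_{i=1}^{d}e^{N(x_{i}-x_{i_{*}})}\Big)e^{Nx_{i_{*}}}\Big]\bigg|\\ &=\bigg|\frac{1}{N}\log\bigg(\sum_{i=1}^{d}e^{N(x_{i}-x_{i_{*}})}\bigg)\bigg|\\ &\leq\bigg|\frac{\log d}{N}\bigg|\rightarrow 0\text{ as }N\rightarrow\infty.\end{aligned}$$

Note that the inequality follows from the definition of $x_{i_{*}}=\max\{x_{1},x_{2},\cdots,x_{d}\}$ and the fact that $e^{N(x_{i}-x_{i_{*}})}\leq 1\text{ for }1\leq i\leq d$ (since $x_{i}\leq x_{i_{*}}$ $\forall i$). Hence $\displaystyle\sup_{x\in\mathbb{R}^{d}}\big|f(x)-g_{N}(x)\big|\rightarrow 0$ as $N\rightarrow\infty$ with the rate $\frac{1}{N}.$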
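The bound in Lemma 1 is easy to check numerically. The sketch below evaluates $g_{N}$ by factoring out the maximum, exactly as in the proof, and compares the gap $g_{N}(x)-f(x)$ against $\log d/N$ for a few values of $N$. The input vector is a hypothetical example.

```python
import numpy as np

def g_N(x, N):
    """(1/N) log sum_i exp(N x_i), computed with the max factored out to avoid overflow."""
    x = np.asarray(x, dtype=float)
    m = x.max()
    return m + np.log(np.sum(np.exp(N * (x - m)))) / N

x = np.array([1.0, 1.5, 0.2, 1.5])          # hypothetical input with a tied maximum
gaps = {N: g_N(x, N) - x.max() for N in (1, 10, 100, 1000)}
bounds = {N: np.log(len(x)) / N for N in gaps}
```

The gap is always non-negative, since the sum inside the logarithm contains the term $e^{0}=1$, and it shrinks at the rate $1/N$ stated in the lemma.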

Lemma 2.

Let $U:\mathbb{R}^{|S|\times|A|}\rightarrow\mathbb{R}^{|S|\times|A|}$ be defined as follows.

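$$(UQ)(i,a)=w\bigg(r(i,a)+\gamma\sum_{j=1}^{M}p(j|i,a)\frac{1}{N}\log\sum_{b=1}^{|A|}e^{NQ(j,b)}\bigg)+\frac{(1-w)}{N}\log\sum_{b=1}^{|A|}e^{NQ(i,b)}.$$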

Then $U$ is a max-norm contraction.

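*Proof:* Given $(i,a)$, let $q(.|i,a):\{1,2,\cdots,|S|\}\rightarrow[0,1]$ be defined as follows.

$$q(k|i,a)=\begin{cases}\dfrac{w\gamma p(k|i,a)}{(1-w+w\gamma)}, & k\neq i,\\[2ex] \dfrac{1-w+w\gamma p(i|i,a)}{(1-w+w\gamma)}, & k=i.\end{cases}$$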

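Observe that $q$ is a probability mass function. Let $\mathbb{E}[.]$ denote the expectation with respect to $q$, and let $\xi(j,.)$ denote a point that lies on the line joining $P(j,.)$ and $Q(j,.)$. Now for $P,Q\in\mathbb{R}^{|S|\times|A|}$, we have

$$\begin{aligned}\big|(UP)(i,a)-(UQ)(i,a)\big| &=(1-w+w\gamma)\Bigg|\mathbb{E}\Bigg[\frac{\log\sum_{b=1}^{|A|}e^{NP(j,b)}}{N}-\frac{\log\sum_{b=1}^{|A|}e^{NQ(j,b)}}{N}\Bigg]\Bigg|\\ &=(1-w+w\gamma)\Bigg|\mathbb{E}\Bigg[\Bigg(\frac{e^{N\xi(j,.)}}{\sum_{b\in A}e^{N\xi(j,b)}}\Bigg)^{T}\Big(P(j,.)-Q(j,.)\Big)\Bigg]\Bigg|\\ &\leq(1-w+w\gamma)\mathbb{E}\Bigg[\Bigg|\Bigg(\frac{e^{N\xi(j,.)}}{\sum_{b\in A}e^{N\xi(j,b)}}\Bigg)^{T}\Big(P(j,.)-Q(j,.)\Big)\Bigg|\Bigg]\\ &\leq(1-w+w\gamma)\|P-Q\|.\end{aligned}$$

The last inequality holds because the soft-max weights are non-negative and sum to one. Hence $U$ is a contraction with contraction factor $(1-w+w\gamma)$. Here, the second equality follows from an application of the mean value theorem in multivariate calculus.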

Lemma 3.

Let $Q_{w}$ be as in equation (12) and let $Q^{\prime}$ be the fixed point of $U$. Then $\|Q_{w}-Q^{\prime}\|\leq\frac{(1-w+w\gamma)}{Nw(1-\gamma)}\log(|A|).$

*Proof:* From equation (12), we have

[TABLE]

Now $Q^{\prime}$ is the unique fixed point of $U$ (unique by virtue of Lemma 2), so

[TABLE]

where $Z$ is a random variable with probability mass function as $q$ and the expectation above is taken with respect to the law given by probability mass function $q$ . Let $c=\arg\max_{b\in A}Q^{\prime}(Z,b)$ i.e. $Q^{\prime}(Z,c)=\max_{b\in A}Q^{\prime}(Z,b).$ Now

[TABLE]

This completes the proof. This lemma shows that the approximation error $\|Q_{w}-Q^{\prime}\|\rightarrow 0$ as $N\rightarrow\infty.$

Remark 5.

It is easy to see that in the case of $w>1$

[TABLE]

This shows that the approximation error of G-SOVI is smaller than standard SOVI in the case of $w>1$ .
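To get a feel for the size of the bound in Lemma 3, it can simply be evaluated for representative parameters. The values below ($\gamma=0.9$, $N=35$, five actions, and $w=1.2$ as a stand-in for a relaxation parameter larger than one) are illustrative choices made here, and the function name is hypothetical.

```python
import math

def approx_error_bound(w, gamma, N, num_actions):
    """Right-hand side of Lemma 3: (1 - w + w*gamma) / (N * w * (1 - gamma)) * log|A|."""
    return (1.0 - w + w * gamma) / (N * w * (1.0 - gamma)) * math.log(num_actions)

bound_sovi = approx_error_bound(w=1.0, gamma=0.9, N=35, num_actions=5)   # standard SOVI
bound_gsovi = approx_error_bound(w=1.2, gamma=0.9, N=35, num_actions=5)  # assumed w > 1
bound_large_N = approx_error_bound(w=1.2, gamma=0.9, N=350, num_actions=5)
```

Since $(1-w+w\gamma)/w = 1/w-1+\gamma$ is below $\gamma$ whenever $w>1$, the bound for $w>1$ is smaller than the bound for $w=1$, and both decay like $1/N$.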

We now invoke the following theorem from [15] to show the global convergence of our second order value iteration.

Theorem 1 (Global Newton Theorem).

Suppose that $F:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}$ is continuous, component-wise concave on $\mathbb{R}^{d}$ , differentiable and that $F^{\prime}(x)$ is non-singular and $F^{\prime}(x)^{-1}\geq 0$ , i.e. each entry of $F^{\prime}(x)^{-1}$ is non-negative, for all $x\in\mathbb{R}^{d}.$ Assume, further, that $F(x)=0$ has a unique solution $x^{*}$ and that $F^{\prime}$ is continuous on $\mathbb{R}^{d}.$ Then for any $x_{0}\in\mathbb{R}^{d}$ the Newton iterates given by (15) converge to $x^{*}.$

Remark 6.

Note that the above theorem is stated for convex $F$ in [15]. However, the theorem holds true even for concave $F$ .

Theorem 2.

Let $Q^{\prime}$ be the fixed point of the operator $U$ . G-SOVI converges to $Q^{\prime}$ for any choice of initial point $Q_{0}.$

*Proof:* G-SOVI computes the zeros of the equation $Q-UQ=0.$ So we appeal to Theorem 1 with the choice of $F$ as $I-U:\mathbb{R}^{|S|\times|A|}\rightarrow\mathbb{R}^{|S|\times|A|}$ where $(I-U)(Q)(i,a)=Q(i,a)-wr(i,a)-(1-w+\gamma w)\mathbb{E}\bigg[\frac{1}{N}\log\displaystyle\sum_{b\in A}e^{NQ(Z,b)}\bigg].$ It is enough to verify the hypotheses of Theorem 1 for $I-U.$ Clearly $I-U$ is continuous, component-wise concave and differentiable with $(I-U)^{\prime}(Q)=I-J_{U}(Q)$ where

[TABLE]

• each entry in the $(i,a)^{\text{th}}$ row is non-negative.

• the sum of the entries in the $(i,a)^{\text{th}}$ row is

[TABLE]

So $J_{U}(Q)=(1-w+w\gamma)\Phi$ for a $|S\times A|\times|S\times A|$ dimensional transition probability matrix $\Phi$ . It is easy to see that $(I-J_{U}(Q))^{-1}$ exists (see Lemma 4) with the power series expansion

[TABLE]

Moreover, since each entry in $\Phi$ is non-negative, $\Phi\geq 0$. Hence $\big(I-J_{U}(Q)\big)^{-1}\geq 0.$ Also, from Lemma 2 it is clear that the equation $Q-UQ=0$ has a unique solution. This completes the proof.

Lemma 4.

$\left\|\big(I-J_{U}(Q)\big)^{-1}\right\|\leq\frac{1}{w(1-\gamma)}$

*Proof:* Note that

[TABLE]

for a given transition probability matrix $\Phi$. Now suppose that $\lambda$ is an eigenvalue of $I-(1-w+w\gamma)\Phi$. Then

[TABLE]

From $1-(1-w+w\gamma)>0$ , we have

[TABLE]

where $\sigma(I-J_{U}(Q))$ is the spectrum of $I-J_{U}(Q)$ . Hence for any $Q$ , $\big{(}I-J_{U}(Q)\big{)}^{-1}$ exists and we have

[TABLE]

This completes the proof.
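Both the non-negativity of the inverse used in Theorem 2 and the norm bound of Lemma 4 can be checked numerically at an arbitrary point $Q$. In the sketch below the Jacobian is built from the structure $J_{U}(Q)=(1-w+w\gamma)\Phi$, using entries obtained by differentiating $U$ directly. The toy MDP, the array layout `p[a, i, j]`, and the choice of $w$ halfway between $1$ and $w^{*}$ are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(4)
M, K, gamma, N = 5, 3, 0.9, 35
# Hypothetical MDP with positive self-transition probabilities.
p = rng.random((K, M, M)) + 0.5 * np.eye(M)
p /= p.sum(axis=2, keepdims=True)
w_star = 1.0 / (1.0 - gamma * np.diagonal(p, axis1=1, axis2=2).min())
w = 0.5 * (1.0 + w_star)               # some relaxation parameter in (1, w*)

Q = rng.normal(size=(M, K))            # arbitrary point at which to evaluate J_U
S = np.exp(N * (Q - Q.max(axis=1, keepdims=True)))
S /= S.sum(axis=1, keepdims=True)      # soft-max weights over actions, per state

# coef[i, a, k] = w * gamma * p(k | i, a) + (1 - w) * 1{k == i}
coef = w * gamma * np.transpose(p, (1, 0, 2)) + (1.0 - w) * np.eye(M)[:, None, :]
J = (coef[..., None] * S[None, None]).reshape(M * K, M * K)

inv = np.linalg.inv(np.eye(M * K) - J)
inf_norm = np.abs(inv).sum(axis=1).max()   # matrix norm induced by the max-norm
bound = 1.0 / (w * (1.0 - gamma))
row_sums = J.sum(axis=1)
```

Every row of `J` sums to $1-w+w\gamma$, so the Neumann series for the inverse has constant row sums $1/(w(1-\gamma))$; in this example the bound of Lemma 4 is therefore attained up to rounding.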

The following theorem is an adaptation from [15].

Theorem 3.

G-SOVI has second order convergence.

*Proof:* Recall that $F(Q)=Q-UQ.$ Let $Q^{*}$ be the unique solution of $F(Q)=0$ and $\{Q_{n}\}$ be the sequence of iterates generated by G-SOVI. Define $e_{n}=\|Q_{n}-Q^{*}\|$ and $G(Q)=Q-F^{\prime}(Q)^{-1}F(Q)$. As $Q^{*}$ satisfies $Q^{*}=UQ^{*}$, it is a fixed point of $G$. It is enough to show that $e_{n+1}\leq ke^{2}_{n}$ for a constant $k.$ It can be shown that for our particular choice of $F$, $F^{\prime}$ is Lipschitz (with Lipschitz constant, say, $L$).

[TABLE]

Utilizing the above properties we have

[TABLE]

where $\beta=L\|F^{\prime}(Q)\|^{-1}\leq\frac{L}{w(1-\gamma)}$ (from Lemma 4) and $k=\frac{3}{2}\beta$ .

VI Experiments

In this section, we describe the experimental results of our proposed G-SOVI algorithm and compare the same with standard SOVI and value iteration. For this purpose, we use the Python MDP toolbox [7] for generating the MDP and implementing standard value iteration. The code for our experiments is available at: https://github.com/raghudiddigi/G-SOVI. The generated MDPs satisfy $p(i|i,a)>0,\ \forall i,a$ in order to ensure that $w^{*}>1$. We consider the error as defined below to be the metric for comparison between algorithms. Error for a given algorithm at iteration $i$, denoted $E(i)$, is calculated as follows. We collect the max-norm difference between the optimal value function and the value function estimate at iteration $i$. That is,

[TABLE]

where $V^{*}$ is the optimal value function of the MDP and $Q^{i}(.,.)$ is the Q-value function estimate of MDP at iteration $i$ . Also, for any state $j$ , $a^{*}_{j}=\displaystyle\arg\max_{a\in A}Q^{i}(j,a)$ .
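Read literally, the description above amounts to taking the greedy action in each state under the current Q-estimate, reading off the corresponding Q-value, and measuring the max-norm distance to $V^{*}$. A minimal sketch of that reading follows; the function name and the two-state example values are hypothetical.

```python
import numpy as np

def max_norm_error(V_star, Q_i):
    """Max over states j of |V*(j) - Q_i(j, a*_j)|, with a*_j the greedy action under Q_i."""
    a_star = Q_i.argmax(axis=1)
    V_est = Q_i[np.arange(Q_i.shape[0]), a_star]
    return np.max(np.abs(V_star - V_est))

# Hypothetical two-state, two-action example.
V_star = np.array([1.0, 2.0])
Q_i = np.array([[0.5, 0.9],
                [2.5, 1.0]])
err = max_norm_error(V_star, Q_i)
```

Since the greedy Q-value equals the row-wise maximum, this is the same as the max-norm distance between $V^{*}$ and $\max_{a}Q^{i}(\cdot,a)$.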

First, we generate $100$ independent MDPs each with $10$ states, $5$ actions and we set the discount factor to be $0.9$ in each case. We run all the algorithms for $50$ iterations. The initial Q-values of the algorithms are assigned random integers between 10 and 20 (which are far away from the optimal value function). In Table I, we indicate the average error value (error averaged over $100$ MDPs) at the end of $50$ iterations for all the algorithms, wherein for G-SOVI, we set $w=w^{*}$ as the relaxation parameter. We observe that standard SOVI and G-SOVI with $N=25,30,$ and $35$ have low error at the end of $50$ iterations compared to the standard value iteration. Moreover, for these values of $N$, the average error is the least for our proposed G-SOVI algorithm. Also, we find that the higher the value of $N$, the smaller the error between the G-SOVI value function and the optimal value function.

In Table II, we report the performance of G-SOVI for different feasible values of the successive relaxation parameter $w$ across the same $100$ MDPs generated previously (for Table I). The optimal successive relaxation parameter $w^{*}$ here lies between $1.1$ and $1.5$. Recall that G-SOVI exhibits faster convergence than standard SOVI (first row of Table II) for any value of $w$ that satisfies $1<w\leq w^{*}$. From Table II, we can conclude that G-SOVI with $w\in(1,w^{*}]$ converges at least as fast as standard SOVI. Moreover, the higher the value of $w$, the better the performance, provided the algorithm is run for a sufficient number of iterations.
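The range of feasible $w$ depends on $w^{*}$, whose formula is not restated in this section. The sketch below assumes the definition commonly used for successive over-relaxation in MDPs, $w^{*}=\min_{i,a}\frac{1}{1-\gamma\,p(i|i,a)}$, which is consistent with the statement that $p(i|i,a)>0$ ensures $w^{*}>1$. The function name and the self-loop probabilities are hypothetical.

```python
def optimal_relaxation(self_loop_probs, gamma):
    """Assumed definition: w* = min over (i, a) of 1 / (1 - gamma * p(i|i,a))."""
    return min(
        1.0 / (1.0 - gamma * p)
        for row in self_loop_probs
        for p in row
    )


# self_loop_probs[i][a] = p(i | i, a); hypothetical values for 2 states, 2 actions.
self_loop_probs = [[0.2, 0.5], [0.3, 0.4]]
w_star = optimal_relaxation(self_loop_probs, gamma=0.9)
print(w_star)

# If some state-action pair has no self-loop, the minimum collapses to 1.
w_degenerate = optimal_relaxation([[0.0, 0.5]], gamma=0.9)
print(w_degenerate)
```

Under this assumed definition the smallest self-loop probability determines $w^{*}$, so a single pair with $p(i|i,a)=0$ removes any room for over-relaxation.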

In Table III, we present the results of the three algorithms on four different settings, averaged over $10$ MDPs. Standard SOVI and G-SOVI are run with $N=35$. All the algorithms are run for $10$ iterations. We observe that standard SOVI and G-SOVI have lower error than standard value iteration. Moreover, the difference here is much more pronounced than in Table I, where the algorithms are run for $50$ iterations. Recall that the SOVI and G-SOVI algorithms with a fixed $N$ give near-optimal value functions. The advantage of our proposed algorithms is that the Q-value iterates converge rapidly to the near-optimal Q-values. This can also be observed in Figure 1, where we present the convergence of the algorithms over $50$ iterations on the setting with $100$ states and $10$ actions. The SOVI and G-SOVI algorithms converge rapidly to a value and then stay constant. In fact, their error remains lower than that of standard value iteration up to iteration $45$. Moreover, G-SOVI computes a solution with lower error than that obtained by SOVI.
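The role of a fixed $N$ can be illustrated with a smooth surrogate for the max operator. This section does not restate the approximation used, so the sketch assumes a log-sum-exp form, $\frac{1}{N}\log\sum_{a}e^{N x_{a}}$, which overestimates the true maximum by at most $\frac{\log|A|}{N}$. The function `smooth_max` and the Q-values are hypothetical.

```python
import math


def smooth_max(values, n):
    """Log-sum-exp surrogate for max(values); larger n gives a tighter fit."""
    m = max(values)  # subtract the maximum first for numerical stability
    return m + math.log(sum(math.exp(n * (v - m)) for v in values)) / n


# Hypothetical Q-values for one state with 5 actions.
q_row = [10.0, 12.0, 11.9, 9.0, 11.5]
gaps = []
for n in (25, 30, 35):
    gap = smooth_max(q_row, n) - max(q_row)
    bound = math.log(len(q_row)) / n
    gaps.append(gap)
    print(n, gap, bound)
```

Under this assumption the gap to the true maximum never vanishes for finite $N$, which matches the observation that the iterates settle at a near-optimal rather than an exactly optimal value, and that larger $N$ reduces the residual error.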

In Table IV, we report the per-iteration execution time of the algorithms across the four settings considered above. We can see that, due to the Hessian inversion operation in the second-order techniques, the standard SOVI and G-SOVI algorithms take more time per iteration than the standard value iteration algorithm.

Recall that the advantage of second-order methods is that, even though their per-iteration computation is higher than that of first-order methods, the total number of iterations needed to achieve a desired error threshold is in general much lower. Hence, they are capable of achieving lower error in the same computational time. We demonstrate this in Table V for three settings. We select the parameters of this experiment (i.e., number of states and actions, values of $N$ and $w$, number of iterations) such that the second-order methods compute better solutions than the standard value iteration scheme. (The value $w=1.00001$ respects the constraint $w\leq w^{*}$ in all three settings.) For example, consider the setting with $10$ states and $5$ actions (first row of Table V). Standard value iteration is run for $50$ iterations. Its per-iteration time is $0.2$ ms, which results in an overall computational time of $0.0002\times 50=0.01$ seconds. On the other hand, the SOVI techniques (standard SOVI and G-SOVI) are run for just $3$ iterations. Their per-iteration time is $0.0033$ seconds, and hence their overall computational time is also approximately $0.01$ seconds ($0.0033\times 3\approx 0.01$). We observe that, in $0.01$ seconds, the SOVI methods achieve lower error than standard value iteration. Similarly, in the other two settings in Table V, the second-order SOVI algorithms achieve lower error than standard value iteration when run for $0.02$ and $0.03$ seconds, respectively.
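The equal-time comparison in the first row of Table V comes down to simple arithmetic, reproduced here as a small check. The per-iteration times and iteration counts are the ones quoted in the text; the helper name `total_time` is hypothetical.

```python
def total_time(per_iteration_seconds, iterations):
    """Overall wall-clock budget for a fixed number of iterations."""
    return per_iteration_seconds * iterations


vi_time = total_time(0.0002, 50)   # standard value iteration: 0.2 ms x 50
sovi_time = total_time(0.0033, 3)  # SOVI / G-SOVI: 3.3 ms x 3

print(round(vi_time, 4), round(sovi_time, 4))  # 0.01 0.0099
```

The second-order budget of $0.0099$ seconds rounds to the $0.01$ seconds quoted for the first-order method, so both are compared at essentially the same computational time.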

It is important to note that this advantage need not hold in general for MDPs with a large number of states and actions, since the overhead of computing the Hessian inverse grows with the size of the MDP and increases the overall computational time. If techniques that speed up the matrix operations can be deployed, G-SOVI would be preferred over standard value iteration for computing the optimal value function.

VII Conclusion

In this work, we have proposed a generalized second-order value iteration scheme based on the Newton-Raphson method for faster convergence to a near-optimal value function in discounted reward MDP problems. The first step involved constructing a differentiable Bellman equation through an approximation of the $\max(\cdot)$ operator. We then applied the second-order Newton method to arrive at the proposed algorithm. We proved bounds on the approximation error and showed second-order convergence to the optimal value function. Finally, approaches geared towards easing the computational burden of problems with large state and action spaces, such as those based on approximate dynamic programming, can be developed in the context of G-SOVI schemes in the future.

Bibliography

[1] Richard Bellman. Dynamic programming. Science, 153(3731):34–37, 1966.
[2] Dimitri P Bertsekas and John N Tsitsiklis. Neuro-dynamic programming, volume 5. Athena Scientific, Belmont, MA, 1996.
[3] Vivek S Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge Univ. Press, 2008.
[4] Bo Dai, Albert Shaw, Lihong Li, Lin Xiao, Niao He, Zhen Liu, Jianshu Chen, and Le Song. SBEED: Convergent reinforcement learning with nonlinear function approximation. arXiv preprint arXiv:1712.10285, 2017.
[5] Adithya M Devraj and Sean Meyn. Zap Q-learning. In Advances in Neural Information Processing Systems, pages 2235–2244, 2017.
[6] Thomas Furmston, Guy Lever, and David Barber. Approximate Newton methods for policy search in Markov decision processes. The Journal of Machine Learning Research, 17(1):8055–8105, 2016.
[7] GitHub. Python MDP toolbox. https://github.com/sawcordwell/pymdptoolbox.
[8] Vineet Goyal and Julien Grand-Clement. A first-order approach to accelerated value iteration. arXiv preprint arXiv:1905.09963, 2019.
