TL;DR
This paper introduces a second order value iteration method for Markov Decision Processes that accelerates convergence to the optimal value function by applying Newton-Raphson to the successive relaxation scheme, with proven convergence and demonstrated effectiveness.
Contribution
It proposes a novel second order value iteration algorithm based on Newton-Raphson applied to successive relaxation, improving convergence speed in MDPs.
Findings
Proves global convergence of the method
Demonstrates second order convergence rate
Shows improved efficiency through experiments
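Table I: Average error at the end of the run for different values of N. Standard value iteration does not depend on N, so its single reported value is repeated in each row.

| Value of N | Standard Value Iteration | Standard SOVI | G-SOVI |
|---|---|---|---|
| N=20 | 0.1009 ± 0.0026 | 0.1205 ± 0.0372 | 0.1093 ± 0.0818 |
| N=25 | 0.1009 ± 0.0026 | 0.0822 ± 0.0273 | 0.0648 ± 0.0217 |
| N=30 | 0.1009 ± 0.0026 | 0.0611 ± 0.0211 | 0.0494 ± 0.017 |
| N=35 | 0.1009 ± 0.0026 | 0.0484 ± 0.0168 | 0.0397 ± 0.0136 |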
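Table II: Average error of G-SOVI for different values of the relaxation parameter w. The first row corresponds to standard SOVI.

| Value of w | G-SOVI |
|---|---|
| | 0.04838 ± 0.017 |
| | 0.04838 ± 0.017 |
| | 0.04837 ± 0.017 |
| | 0.04830 ± 0.017 |
| | 0.0476 ± 0.017 |
| | 0.0448 ± 0.016 |
| | 0.0417 ± 0.014 |
| | 0.0397 ± 0.014 |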
Table III: Average error of the three algorithms on four settings.

| Setting | Standard Value Iteration | Standard SOVI | G-SOVI |
|---|---|---|---|
| States = 30, Actions = 10 | 6.471 ± 0.07 | 0.087 ± 0.01 | 0.079 ± 0.01 |
| States = 50, Actions = 10 | 6.587 ± 0.07 | 0.114 ± 0.01 | 0.108 ± 0.01 |
| States = 80, Actions = 10 | 6.754 ± 0.03 | 0.141 ± 0.01 | 0.136 ± 0.01 |
| States = 100, Actions = 10 | 6.772 ± 0.03 | 0.152 ± 0.01 | 0.148 ± 0.01 |
Table IV: Per-iteration execution time of the three algorithms on the same four settings.

| Setting | Standard Value Iteration | Standard SOVI | G-SOVI |
|---|---|---|---|
| States = 30, Actions = 10 | 0.0008 ± 0.00 | 0.0154 ± 0.01 | 0.0267 ± 0.01 |
| States = 50, Actions = 10 | 0.0009 ± 0.00 | 0.0242 ± 0.00 | 0.0488 ± 0.00 |
| States = 80, Actions = 10 | 0.0011 ± 0.00 | 0.0532 ± 0.00 | 0.0988 ± 0.01 |
| States = 100, Actions = 10 | 0.0026 ± 0.00 | 0.1202 ± 0.01 | 0.1343 ± 0.01 |
Table V: Error achieved by the three algorithms when run for the same overall computational time.

| Configuration | Computational time (s) | Standard Value Iteration | Standard SOVI | G-SOVI |
|---|---|---|---|---|
| 10 States, 5 Actions | 0.01 | 25.485 ± 2.21 | 3.930 ± 0.92 | 3.885 ± 0.94 |
| 20 States, 5 Actions | 0.02 | 18.291 ± 0.77 | 5.444 ± 0.51 | 5.473 ± 0.50 |
| 30 States, 5 Actions | 0.03 | 7.327 ± 0.20 | 7.111 ± 0.32 | 7.118 ± 0.33 |
Generalized Second Order Value Iteration in Markov Decision Processes
Chandramouli Kamanchi*, Raghuram Bharadwaj Diddigi*, Shalabh Bhatnagar
* Equal contribution. The authors are with the Department of Computer Science and Automation, Indian Institute of Science (IISc), Bengaluru 560012, India. (E-mails: {chandramouli, raghub, shalabh}@iisc.ac.in). Raghuram Bharadwaj was supported by a fellowship grant from the Centre for Networked Intelligence (a Cisco CSR initiative) of the Indian Institute of Science, Bangalore. Shalabh Bhatnagar was supported by the J.C. Bose Fellowship, a project from DST under the ICPS Program and the RBCCPS, IISc.
Abstract
Value iteration is a fixed point iteration technique utilized to obtain the optimal value function and policy in a discounted reward Markov Decision Process (MDP). Here, a contraction operator is constructed and applied repeatedly to arrive at the optimal solution. Value iteration is a first order method and therefore it may take a large number of iterations to converge to the optimal solution. Successive relaxation is a popular technique that can be applied to solve a fixed point equation. It has been shown in the literature that, under a special structure of the MDP, successive over-relaxation technique computes the optimal value function faster than standard value iteration. In this work, we propose a second order value iteration procedure that is obtained by applying the Newton-Raphson method to the successive relaxation value iteration scheme. We prove the global convergence of our algorithm to the optimal solution asymptotically and show the second order convergence. Through experiments, we demonstrate the effectiveness of our proposed approach.
©2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. This paper is accepted for publication at IEEE Transactions on Automatic Control. DOI: 10.1109/TAC.2021.3112851
I Introduction
In a discounted reward Markov Decision Process [2], the objective is to maximize the expected cumulative discounted reward. Reinforcement Learning (RL) deals with the algorithms for solving an MDP problem when the model information (i.e., probability transition matrix and reward function) is unknown. RL algorithms instead make use of state and reward samples and estimate the optimal value function and policy. Due to the success of deep learning [12], RL algorithms in combination with deep neural networks have been successfully deployed to solve many real world problems and games [13]. However, there is ongoing research for improving the sample-efficiency as well as convergence of RL algorithms [10].
Many RL algorithms can be viewed as stochastic approximation [3] variants of the Bellman equation [1] in MDPs. For example, the popular Q-learning algorithm [19] can be viewed as a stochastic fixed point iteration to solve the Q-Bellman equation. Therefore, we believe that in order to improve the performance of RL algorithms, a promising first step would be to propose faster algorithms for solving MDPs when the model information is known. To this end, we propose a second order value iteration technique that has global convergence, which is a desirable property. In [17], the successive over-relaxation technique is applied to the Bellman equation to obtain a faster value iteration algorithm. In this work, we propose a Generalized Second Order Value Iteration (G-SOVI) method for computing the optimal value function and policy when the model information is known. This is achieved by the application of Newton-Raphson method to the Successive relaxation variant of the Q-Bellman Equation (henceforth denoted as SQBE). The key differences between G-SOVI and standard SOVI algorithms are the incorporation of relaxation parameter and the construction of single-stage reward as discussed in Section IV.
Note that we cannot directly apply the Newton-Raphson method to SQBE because of the presence of the max operator in the equation, which is not differentiable. Therefore, we approximate the max operator by a smooth function $g_N$ [14], where $N$ is a given parameter. This approximation allows us to apply the second order method, thereby ensuring a faster rate of convergence.
The solution obtained by our second order technique on the modified SQBE may be different from the solution of the original MDP problem because of the approximation of the max operator by $g_N$. However, we show that our proposed algorithm converges to the actual solution as $N \to \infty$.
We show through numerical experiments that, given a finite number of iterations, our proposed algorithm computes a solution that is closer to the actual solution than the one obtained from standard value iteration. Moreover, under a special structure of the MDP, the solution is better than that of standard second-order value iteration [18].
II Related Work And Our Contributions
Value Iteration and Policy Iteration are two classical numerical techniques employed for solving MDP problems. In [16], it has been shown that the Newton-Kantorovich method (also known as the Newton-Raphson method) applied to the exact Bellman equation gives rise to the policy iteration scheme. In Section 2.5 of [18], a second-order value iteration technique (we refer to it as standard SOVI) is proposed by applying the Newton-Kantorovich method to the smooth (soft-max) Bellman equation, and remarks about the second-order rate and global convergence are provided. However, convergence analysis is not discussed for the SOVI technique. Approximate Newton methods have been proposed in [6] for policy optimization in MDPs. A detailed analysis of the Hessian of the objective function is provided and algorithms are derived. In recent times, the smooth Bellman equation has been successfully used in the development of many RL algorithms. For instance, in [9], a soft Q-learning algorithm has been proposed that learns the maximum entropy policies. The algorithm makes use of the smooth Q-Bellman equation with an additional entropy term. In [4], the SBEED (Smoothed Bellman Error Embedding) algorithm has been proposed, which computes the optimal policy by formulating the smooth Bellman equation as a primal-dual optimization problem. In [5], a matrix-gain learning algorithm, namely Zap Q-learning, has been proposed, which is seen to have similar performance to the Newton-Raphson method. Very recently, an accelerated value iteration technique has been proposed in [8] by applying Nesterov's accelerated gradient technique to value iteration.
We now summarize the main contributions of our paper:
- We propose a generalized second order Q-value iteration algorithm that is derived from the successive relaxation technique as well as the Newton-Raphson method. In fact, we show that standard SOVI is a special case of our proposed algorithm.
- We prove the global convergence of our algorithm and provide a second order convergence rate result.
- We derive a bound on the error defined in terms of the value function obtained by our proposed method and the actual value function and show that the error vanishes asymptotically.
- Through experimental evaluation, we further confirm that our proposed technique provides a better near-optimal solution compared to that of the value iteration procedure when run for the same (finite) number of iterations.
III Background and Preliminaries
A discounted reward Markov Decision Process (MDP) is characterized via a tuple where denotes the set of states, denotes the set of actions, is the transition probability rule i.e., denotes the probability of transition from state to state when action is chosen. Also, denotes the single-stage reward obtained in state when action is chosen and the next state is . Finally, denotes the discount factor. The objective in an MDP is to learn an optimal policy , where denotes the action to be taken in state , that maximizes the cumulative discounted reward objective given by:
[TABLE]
In (1), is the state at time and is the expectation taken over the entire trajectory of states obtained over times . Let be the value function with being the value of state that represents the total discounted reward obtained starting from state and following the optimal policy . The value function can be obtained by solving the Bellman equation [2] given by:
[TABLE]
We assume here for simplicity that all actions are feasible in every state. Value iteration is a popular numerical scheme employed to obtain the value function and so the optimal policy. It works as follows: An initial estimate of the value function is selected arbitrarily and a sequence of is generated in an iterative fashion as below:
[TABLE]
Let denote the set of all bounded functions from to . Note that equation (2) can be rewritten as:
[TABLE]
where the operator is defined by:
[TABLE]
and is the expected single-stage reward in state when action is chosen. It is easy to see that is a sup-norm contraction map with contraction factor , i.e., the discount factor. Therefore, from the contraction mapping theorem, it is clear that the value iteration scheme given by equation (III) converges to the optimal value function, i.e.,
[TABLE]
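As a minimal sketch of the value iteration scheme described above, the following pure-Python example applies the Bellman operator repeatedly until successive iterates agree in the max-norm. The two-state, two-action MDP (`P`, `R`, `gamma`) and the function names are hypothetical illustrations, not taken from the paper or its code.

```python
def bellman_operator(V, P, R, gamma):
    """Apply the Bellman optimality operator once.

    P[i][a][j] is the probability of moving from state i to state j under
    action a; R[i][a] is the expected single-stage reward.
    """
    n_states = len(V)
    return [
        max(
            R[i][a] + gamma * sum(P[i][a][j] * V[j] for j in range(n_states))
            for a in range(len(R[i]))
        )
        for i in range(n_states)
    ]


def value_iteration(P, R, gamma, tol=1e-10, max_iter=10_000):
    """Iterate V <- TV from V = 0 until the max-norm change drops below tol."""
    V = [0.0] * len(P)
    for n in range(1, max_iter + 1):
        V_next = bellman_operator(V, P, R, gamma)
        gap = max(abs(x - y) for x, y in zip(V, V_next))
        V = V_next
        if gap < tol:
            break
    return V, n


# Hypothetical toy MDP (assumption, for illustration only).
P = [
    [[0.9, 0.1], [0.2, 0.8]],  # state 0: actions 0 and 1
    [[0.5, 0.5], [0.1, 0.9]],  # state 1: actions 0 and 1
]
R = [[1.0, 0.0], [0.0, 2.0]]
gamma = 0.9

V_star, iters = value_iteration(P, R, gamma)
```

Because the operator is a max-norm contraction with factor `gamma`, the number of iterations needed grows roughly like the logarithm of the tolerance divided by the logarithm of `gamma`, which is the first order behaviour that the rest of the paper aims to improve on.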
Let with , be defined as:
[TABLE]
Here is the optimal Q-value function associated with state and action . It denotes the total discounted reward obtained starting from state upon taking action and following the optimal policy in subsequent states. Then from (2), it is clear that
[TABLE]
Therefore, the equation (6) can be re-written as follows:
[TABLE]
This is known as the Q-Bellman equation. We obtain the optimal policy by letting
[TABLE]
In [17], a modified value iteration algorithm is proposed based on the idea of successive relaxation. Let us define
[TABLE]
Note that . For we define a modified Bellman operator as follows:
[TABLE]
where $w$ is called the 'relaxation' parameter. It is easy to see that the fixed point of the modified Bellman operator is also the optimal value function of the MDP (the fixed point of the standard Bellman operator). Moreover, it is shown in [17] that the contraction factor of the modified operator is $(1-w+\gamma w)$. Under a special structure of the MDP, i.e., when every state-action pair has a positive self-transition probability, the quantity $w^*$ defined in (10) is strictly greater than $1$. Then, the relaxation parameter can be chosen in three possible ways:
1. If $0 < w < 1$, then the contraction factor of the modified operator is more than the contraction factor of the standard Bellman operator.
2. If $w = 1$, then the two operators coincide and hence their contraction factors are the same.
3. If $1 < w \le w^*$, the contraction factor of the modified operator is less than the contraction factor of the standard Bellman operator. This implies that the fixed point iteration utilizing (11) generates the optimal value function faster than the standard value iteration.
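The effect of the relaxation parameter can be illustrated with a small sketch. It assumes the form $w^* = 1/(1-\gamma\min_{i,a} p(i \mid i,a))$ for the largest admissible relaxation parameter and the relaxed update $V(i) \leftarrow \max_a \{ w\,(r(i,a) + \gamma\sum_j p(j \mid i,a)V(j)) + (1-w)V(i) \}$, as in the successive over-relaxation literature [17]. The toy MDP and function names are hypothetical.

```python
def optimal_relaxation(P, gamma):
    """Assumed form of w*: 1 / (1 - gamma * min_{i,a} p(i | i, a))."""
    min_self = min(P[i][a][i] for i in range(len(P)) for a in range(len(P[i])))
    return 1.0 / (1.0 - gamma * min_self)


def relaxed_value_iteration(P, R, gamma, w, tol=1e-10, max_iter=10_000):
    """Fixed point iteration with the relaxed operator; w = 1 is standard VI."""
    n_states = len(P)
    V = [0.0] * n_states
    for n in range(1, max_iter + 1):
        V_next = [
            max(
                w * (R[i][a] + gamma * sum(P[i][a][j] * V[j] for j in range(n_states)))
                + (1.0 - w) * V[i]
                for a in range(len(P[i]))
            )
            for i in range(n_states)
        ]
        gap = max(abs(x - y) for x, y in zip(V, V_next))
        V = V_next
        if gap < tol:
            break
    return V, n


# Hypothetical toy MDP with positive self-transition probabilities.
P = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.1, 0.9]],
]
R = [[1.0, 0.0], [0.0, 2.0]]
gamma = 0.9

w_star = optimal_relaxation(P, gamma)             # strictly greater than 1 here
V_vi, iters_vi = relaxed_value_iteration(P, R, gamma, w=1.0)
V_sor, iters_sor = relaxed_value_iteration(P, R, gamma, w=w_star)
contraction_vi = gamma
contraction_sor = 1.0 - w_star + gamma * w_star   # smaller than gamma when w* > 1
```

Both runs reach the same value function; on this toy example the over-relaxed run needs fewer iterations, in line with its smaller contraction factor.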
In [11], a successive relaxation Q-Bellman equation (we call the Generalized Q-Bellman equation) is constructed as follows:
[TABLE]
where . It has been shown that, although the Q-values obtained by (III) can be different from the optimal Q-values, the optimal value functions are still the same. That is,
[TABLE]
The Generalized Q-values ( in (III)) are obtained as follows. An initial estimate of is arbitrarily selected and a sequence of is obtained according to:
[TABLE]
It is shown in [11] that the Q-values obtained by (III) converge to the generalized Q-values. In this way, we obtain the optimal value function and optimal policy using the successive relaxation Q-value iteration scheme. In this work, our objective is to approximate the generalized Q-Bellman equation and apply the Newton-Raphson second order technique to solve for the optimal value function. Recall that we cannot apply the second order method directly to the equation (III) as the max operator on the RHS is not differentiable. Before we propose our algorithm, we briefly discuss Newton's second order technique [15] for solving a non-linear system of equations.
Consider a function $F:\mathbb{R}^{d}\to\mathbb{R}^{d}$ that is twice differentiable. Suppose we are interested in finding a root of $F$, i.e., a point $x^{*}$ such that $F(x^{*})=0$. The Newton-Raphson method can be applied to find a solution here. We select an initial point $x_{0}$ and then proceed as follows:
$$x_{n+1} = x_{n} - J_{F}(x_{n})^{-1}F(x_{n}), \quad n \geq 0, \tag{15}$$
where $J_{F}(x_{n})$ is the Jacobian of the function $F$ evaluated at the point $x_{n}$. Under suitable hypotheses it can be shown that the procedure (15) leads to second order convergence.
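To make the Newton-Raphson iteration concrete, here is a small sketch on a hypothetical two-dimensional system (a circle of radius 2 intersected with the line $x_1 = x_2$), chosen only for illustration. The Newton step is obtained by solving a linear system with the Jacobian instead of forming its inverse.

```python
def F(x):
    """Hypothetical test system with a root at (sqrt(2), sqrt(2))."""
    return [x[0] ** 2 + x[1] ** 2 - 4.0, x[0] - x[1]]


def jacobian(x):
    """Jacobian of F at x."""
    return [[2.0 * x[0], 2.0 * x[1]], [1.0, -1.0]]


def solve_2x2(A, b):
    """Solve A y = b for a 2x2 matrix A by Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [
        (b[0] * A[1][1] - A[0][1] * b[1]) / det,
        (A[0][0] * b[1] - b[0] * A[1][0]) / det,
    ]


def newton(x0, n_iter=8):
    """Run n_iter Newton steps x <- x - J(x)^{-1} F(x), recording residuals."""
    x = list(x0)
    residuals = []
    for _ in range(n_iter):
        fx = F(x)
        residuals.append(max(abs(v) for v in fx))
        step = solve_2x2(jacobian(x), fx)  # J(x) step = F(x)
        x = [x[0] - step[0], x[1] - step[1]]
    return x, residuals


root, residuals = newton([3.0, 1.0])
```

Printing `residuals` shows the characteristic second order behaviour: once the iterate is close to the root, the residual is roughly squared at every step.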
In the next section, we construct a function for our problem and apply the Newton-Raphson method to find the optimal value function and policy pair.
IV Proposed Algorithm
We construct our modified SQBE as follows. We first approximate the max operator, i.e., the function $f(x)=\max_{1\leq i\leq d}x_{i}$, where $x\in\mathbb{R}^{d}$, with $g_{N}(x)=\frac{1}{N}\log\sum_{i=1}^{d}e^{Nx_{i}}$, as the max operator is not differentiable. We note here that $g_{N}$ is a smooth approximation of the max operator, as shown in Lemma 1. Then the equation (III) can be rewritten as follows:
[TABLE]
starting with an initial (arbitrarily chosen in general). Therefore our modified Successive Q-Bellman (SQB) operator is defined as follows. For ,
[TABLE]
The numerical scheme (IV) is thus
[TABLE]
Finally, by an application of the Newton-Raphson method to the equation $Q-UQ=0$, our Generalized Second Order Value Iteration (G-SOVI) is obtained as described in Algorithm 1. Note that setting $w=1$ in Step 4 of the algorithm yields the standard SOVI algorithm.
Remark 1.
Note that in our case, the function $F$ in equation (15) corresponds to $I-U$, and its Jacobian $I-J_{U}(Q)$ is a square matrix with one row and one column for every state-action pair.
Remark 2.
Note that directly computing $\big(I-J_{U}(Q)\big)^{-1}(Q-UQ)$ by matrix inversion has a cost that is cubic in the number of state-action pairs. This computation could instead be carried out by solving the linear system $\big(I-J_{U}(Q)\big)Y=Q-UQ$ for $Y$ to avoid numerical stability issues. Moreover, the per-iteration time complexity of Algorithm 1 is of the same order.
Remark 3.
Note that G-SOVI reduces to standard SOVI in the case $w=1$. Moreover, the computational complexity of both algorithms is the same.
Remark 4.
The space required for storing the Jacobian matrix is quadratic in the number of state-action pairs. Hence the space complexity of Algorithm 1 is of the same order.
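The remarks above can be tied together in a short sketch of a G-SOVI-style iteration. It is a reading of the text, not the authors' Algorithm 1 or their released code: it assumes the smoothed operator $UQ(i,a) = w\,r(i,a) + \sum_k c_k\, g_N(Q(k,\cdot))$ with weights $c_k = w\gamma\,p(k \mid i,a) + (1-w)\mathbb{1}\{k=i\}$, which sum to $1-w+\gamma w$ in agreement with the expression used in the proof of Theorem 2. All function names, the zero initialisation and the toy MDP are hypothetical.

```python
import math


def smooth_max(row, N):
    """g_N(x) = (1/N) log sum_b exp(N x_b), computed stably."""
    m = max(row)
    return m + math.log(sum(math.exp(N * (q - m)) for q in row)) / N


def softmax(row, N):
    """Gradient of g_N: softmax weights with parameter N."""
    m = max(row)
    e = [math.exp(N * (q - m)) for q in row]
    s = sum(e)
    return [v / s for v in e]


def solve(A, b):
    """Solve A y = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[k]] for k, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    y = [0.0] * n
    for r in range(n - 1, -1, -1):
        y[r] = (M[r][n] - sum(M[r][c] * y[c] for c in range(r + 1, n))) / M[r][r]
    return y


def weights(P, gamma, w, i, a):
    """Weight on next state k: w*gamma*p(k|i,a) + (1-w)*1{k=i} (assumed form)."""
    return [w * gamma * P[i][a][k] + (1.0 - w) * (k == i) for k in range(len(P))]


def apply_U(Q, P, R, gamma, w, N):
    """Smoothed successive-relaxation Q-Bellman operator (assumed form)."""
    g = [smooth_max(row, N) for row in Q]
    return [
        [w * R[i][a] + sum(c * gk for c, gk in zip(weights(P, gamma, w, i, a), g))
         for a in range(len(Q[i]))]
        for i in range(len(Q))
    ]


def gsovi(P, R, gamma, w, N, n_iter=20):
    """Newton iteration Q <- Q - (I - J_U(Q))^{-1} (Q - UQ), started at Q = 0."""
    S, A = len(P), len(P[0])
    Q = [[0.0] * A for _ in range(S)]
    for _ in range(n_iter):
        UQ = apply_U(Q, P, R, gamma, w, N)
        sm = [softmax(row, N) for row in Q]
        resid, I_minus_J = [], []
        for i in range(S):
            for a in range(A):
                coef = weights(P, gamma, w, i, a)
                row = [-coef[k] * sm[k][c] for k in range(S) for c in range(A)]
                row[i * A + a] += 1.0
                I_minus_J.append(row)
                resid.append(Q[i][a] - UQ[i][a])
        y = solve(I_minus_J, resid)  # linear solve instead of explicit inversion
        Q = [[Q[i][a] - y[i * A + a] for a in range(A)] for i in range(S)]
    return Q


# Hypothetical toy MDP; 0.2 is its smallest self-transition probability.
P = [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.1, 0.9]]]
R = [[1.0, 0.0], [0.0, 2.0]]
gamma = 0.9
w_star = 1.0 / (1.0 - gamma * 0.2)
Q = gsovi(P, R, gamma, w=w_star, N=35)
V = [max(row) for row in Q]
```

Calling `gsovi` with `w=1.0` gives the standard SOVI iteration under the same assumptions. The linear solve follows the suggestion in Remark 2; a production implementation would likely use a numerical library rather than the hand-written elimination shown here.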
V Convergence Analysis
In this section we study the convergence of our algorithm. Note that the norm considered in the following analysis is the max-norm, i.e., $\|x\|:=\max_{i}|x_{i}|$. Throughout this section, it is assumed that the relaxation parameter $w$ satisfies $0<w\leq w^{*}$, where $w^{*}$ is as defined in (10).
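Lemma 1.
Suppose $f:\mathbb{R}^{d}\to\mathbb{R}$ and $g_{N}:\mathbb{R}^{d}\to\mathbb{R}$. Let $f(x):=\max_{1\leq i\leq d}x_{i}$ and let $g_{N}$ be defined as $g_{N}(x):=\frac{1}{N}\log\sum_{i=1}^{d}e^{Nx_{i}}$. Then $\sup_{x\in\mathbb{R}^{d}}\big|f(x)-g_{N}(x)\big|\longrightarrow 0$ as $N\to\infty$.
Proof: Let $x_{i^{*}}:=\max_{1\leq i\leq d}x_{i}$ (where $i^{*}$ denotes the corresponding index). Now
$$\big|f(x)-g_{N}(x)\big| = \Big|x_{i^{*}}-\frac{1}{N}\log\sum_{i=1}^{d}e^{Nx_{i}}\Big| = \frac{1}{N}\log\sum_{i=1}^{d}e^{N(x_{i}-x_{i^{*}})} \leq \frac{\log d}{N}.$$
Note that the inequality follows from the definition of $x_{i^{*}}$ and the fact that $e^{N(x_{i}-x_{i^{*}})}\leq 1$ (since $x_{i}\leq x_{i^{*}}$). Hence $\sup_{x\in\mathbb{R}^{d}}\big|f(x)-g_{N}(x)\big|\rightarrow 0$ as $N\to\infty$ with the rate $1/N$.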
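The uniform bound behind Lemma 1, namely $0 \le g_N(x) - \max_i x_i \le (\log d)/N$, can be checked numerically. The sketch below samples random vectors (the dimension, range and seed are arbitrary choices for illustration) and records the largest observed gap for a few values of $N$.

```python
import math
import random


def smooth_max(x, N):
    """g_N(x) = (1/N) log sum_i exp(N x_i), computed stably."""
    m = max(x)
    return m + math.log(sum(math.exp(N * (xi - m)) for xi in x)) / N


random.seed(0)
d = 5
gaps, bounds = {}, {}
for N in (1, 5, 20, 35):
    worst = 0.0
    for _ in range(1000):
        x = [random.uniform(-10.0, 10.0) for _ in range(d)]
        worst = max(worst, smooth_max(x, N) - max(x))
    gaps[N] = worst               # largest observed g_N(x) - max(x)
    bounds[N] = math.log(d) / N   # uniform bound, decaying like 1/N
```

The bound is attained when all coordinates of $x$ are equal, so it cannot be improved uniformly in $x$; for vectors with a clear maximiser the gap is much smaller.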
Lemma 2.
Let $U$ be defined as follows.
[TABLE]
Then $U$ is a max-norm contraction.
*Proof: *Given , let be defined as follows.
[TABLE]
Observe that is a probability mass function. Let denote the expectation with respect to , and denotes the point that lies on the line joining and Now for , we have
[TABLE]
Hence $U$ is a contraction with contraction factor $(1-w+\gamma w)$. Here, the second equality follows from an application of the mean value theorem in multivariate calculus.
Lemma 3.
Let be as in equation (III) and be fixed point of respectively. Then
*Proof: *From equation (III), we have
[TABLE]
Now is the unique fixed point of (unique by virtue of Lemma 2), so
[TABLE]
where is a random variable with probability mass function as and the expectation above is taken with respect to the law given by probability mass function . Let i.e. Now
[TABLE]
[TABLE]
[TABLE]
This completes the proof. This lemma shows that the approximation error vanishes as $N\to\infty$.
Remark 5.
It is easy to see that in the case of
[TABLE]
This shows that the approximation error of G-SOVI is smaller than standard SOVI in the case of .
We now invoke the following theorem from [15] to show the global convergence of our second order value iteration.
Theorem 1 (Global Newton Theorem).
Suppose that $F:\mathbb{R}^{d}\to\mathbb{R}^{d}$ is continuous, component-wise concave on $\mathbb{R}^{d}$, differentiable and that $J_{F}(x)$ is non-singular and $J_{F}(x)^{-1}\geq 0$, i.e. each entry of $J_{F}(x)^{-1}$ is non-negative, for all $x\in\mathbb{R}^{d}$. Assume, further, that $F(x)=0$ has a unique solution $x^{*}$ and that $J_{F}$ is continuous on $\mathbb{R}^{d}$. Then for any $x_{0}\in\mathbb{R}^{d}$ the Newton iterates given by (15) converge to $x^{*}$.
Remark 6.
Note that the above theorem is stated for convex $F$ in [15]. However, the theorem holds true even for concave $F$.
Theorem 2.
G-SOVI converges to the fixed point of the operator $U$ for any choice of the initial point.
Proof: G-SOVI computes the zeros of the equation $Q-UQ=0$. So we appeal to Theorem 1 with the choice of $F$ as $I-U$, where $(I-U)(Q)(i,a)=Q(i,a)-wr(i,a)-(1-w+\gamma w)\mathbb{E}\Big[\frac{1}{N}\log\sum_{b\in A}e^{NQ(Z,b)}\Big]$. It is enough to verify the hypothesis of Theorem 1 for $I-U$. Clearly $I-U$ is continuous, component-wise concave and differentiable with $J_{I-U}(Q)=I-J_{U}(Q)$, where
[TABLE]
Note that is a dimensional matrix with and . Now observe that
- each entry in the $(i,a)$-th row is non-negative.
- the sum of the entries in the $(i,a)$-th row is
$$1-w+\gamma w.$$
So $J_{U}(Q)=(1-w+\gamma w)P$ for a transition probability matrix $P$ with one row and one column per state-action pair. It is easy to see that $\big(I-J_{U}(Q)\big)^{-1}$ exists (see Lemma 4) with the power series expansion
$$\big(I-J_{U}(Q)\big)^{-1}=\sum_{k=0}^{\infty}(1-w+\gamma w)^{k}P^{k}.$$
Moreover, since each entry in $P$ is non-negative, $P^{k}\geq 0$ for every $k$. Hence $\big(I-J_{U}(Q)\big)^{-1}\geq 0$. Also from Lemma 2 it is clear that the equation $Q-UQ=0$ has a unique solution. This completes the proof.
Lemma 4.
$\left\|\big(I-J_{U}(Q)\big)^{-1}\right\|\leq\frac{1}{w(1-\gamma)}$
*Proof: *Note that
[TABLE]
for a given transition probability matrix . Now suppose that is an eigen-value of then
[TABLE]
From , we have
[TABLE]
where is the spectrum of . Hence for any $Q$, $\big(I-J_{U}(Q)\big)^{-1}$ exists and we have
[TABLE]
This completes the proof.
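One way to see where the constant in Lemma 4 comes from (a Neumann series argument offered as a sketch, which is not necessarily the route taken in the paper's proof) is to assume $J_U(Q) = (1-w+\gamma w)P$ with $P$ a transition probability matrix, so that $\|P\| = 1$ in the max-norm, and $0 \le 1-w+\gamma w < 1$:

```latex
\begin{aligned}
\left\|\big(I-J_{U}(Q)\big)^{-1}\right\|
  &= \Big\|\sum_{k=0}^{\infty}(1-w+\gamma w)^{k}P^{k}\Big\|
   \le \sum_{k=0}^{\infty}(1-w+\gamma w)^{k}\,\|P\|^{k} \\
  &= \frac{1}{1-(1-w+\gamma w)}
   = \frac{1}{w(1-\gamma)} .
\end{aligned}
```

For $w = 1$ this reduces to the familiar $1/(1-\gamma)$ bound, and a larger admissible $w$ tightens it.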
The following theorem is an adaptation from [15].
Theorem 3.
G-SOVI has second order convergence.
*Proof: *Recall that Let be the unique solution of and be the sequence of iterates generated by G-SOVI. Define and . As satisfies , it is a fixed point of . It is enough to show that for a constant It can be shown that for our particular choice of , is Lipschitz (with Lipschitz constant, say, ).
[TABLE]
Utilizing the above properties we have
[TABLE]
where (from Lemma 4) and .
VI Experiments
In this section, we describe the experimental results of our proposed G-SOVI algorithm and compare it with standard SOVI and value iteration. For this purpose, we use the Python MDP toolbox [7] for generating the MDPs and implementing standard value iteration (the code for our experiments is available at https://github.com/raghudiddigi/G-SOVI). The generated MDPs satisfy the special structure discussed in Section III in order to ensure that $w^{*}>1$. We consider the error as defined below to be the metric for comparison between algorithms. The error for a given algorithm at a given iteration is the max-norm difference between the optimal value function and the value function estimate at that iteration. That is,
[TABLE]
where is the optimal value function of the MDP and is the Q-value function estimate of MDP at iteration . Also, for any state , .
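The error metric can be written as a one-line helper. The sketch assumes that the value estimate at a state is obtained from the Q-value estimate as $\max_a Q_n(i,a)$; the function name and the numbers in the usage line are hypothetical and do not come from the paper's experiments.

```python
def value_error(V_star, Q_n):
    """Max-norm gap between V* and the estimate V_n(i) = max_a Q_n(i, a)."""
    return max(abs(v - max(q_row)) for v, q_row in zip(V_star, Q_n))


# Hypothetical numbers for illustration only.
err = value_error([15.8, 18.0], [[15.0, 15.5], [17.0, 18.2]])
```

Here the per-state gaps are about 0.3 and 0.2, so the reported error is the larger of the two.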
First, we generate independent MDPs each with states, actions and we set the discount factor to be in each case. We run all the algorithms for iterations. The initial Q-values of the algorithms are assigned random integers between 10 and 20 (which are far away from the optimal value function). In Table I, we indicate the average error value (error averaged over MDPs) at the end of iterations for all the algorithms, wherein for G-SOVI, we set as the relaxation parameter. We observe that standard SOVI and G-SOVI with and have low error at the end of iterations compared to the standard value iteration. Moreover, the average error is the least for our proposed G-SOVI algorithm. Also, we find that higher the value of , the smaller is the error between the G-SOVI value function and the optimal value function.
In Table II, we report the performance of G-SOVI for different values of feasible successive relaxation parameters across the same MDPs generated previously (in Table I). The optimal successive relaxation parameter here lies between and . Recall that G-SOVI exhibits faster convergence for any value of that satisfies when compared to standard SOVI (first row of Table II). From Table II, we can conclude that G-SOVI with performs at least as fast as the standard SOVI. Moreover, the higher the value of , the better is the performance, when the algorithm is run for a sufficient number of iterations.
In Table III, we present the results of the three algorithms on four different settings, averaged over MDPs. The standard SOVI and G-SOVI are run with . All the algorithms are run for iterations. We observe that standard SOVI and G-SOVI have low error compared to the standard value iteration. Moreover, the difference here is much more pronounced than in Table I, where algorithms are run for iterations. Recall that the SOVI and G-SOVI algorithms with a fixed give near-optimal value functions. The advantage of using our proposed algorithms is that the Q-value iterates converge to the near-optimal Q-values rapidly. This can also be observed in Figure 1, where we present the convergence of algorithms over iterations on states and actions setting. The SOVI and G-SOVI algorithms converge rapidly to a value and stay constant. In fact, we observe here that the error is less than that obtained by the standard value iteration till iterations. Moreover, G-SOVI computes a solution that gives lower error as compared to that obtained by SOVI.
In Table IV, we indicate the per-iteration execution time of the algorithms across the four settings considered above. We can see that, due to the Hessian inversion operation in the second-order techniques, the standard SOVI and G-SOVI algorithms take more time than the standard value iteration algorithm.
Recall that the advantage of second-order methods is that even though the per-iteration computation is higher compared to the first-order methods, the total number of iterations needed to achieve a desired error threshold is much lower in general. Hence, they are capable of achieving lower error in the same computational time. We demonstrate this in Table V for three settings. We select the parameters of this experiment (i.e., the number of states and actions, the algorithm parameters, and the number of iterations) such that the second-order methods compute better solutions compared to the standard value iteration scheme (the value of $w$ respects the constraint $0<w\leq w^{*}$ in all three settings). For example, consider the 10 states and 5 actions setting (first row of Table V). The standard value iteration is run for iterations. Its per-iteration time is ms which results in an overall computational time of 0.01 seconds. On the other hand, SOVI techniques (standard SOVI and the G-SOVI) are run for just iterations. However, their per-iteration time is seconds and hence the overall computational time is 0.01 seconds. We observe that, in 0.01 seconds, the SOVI methods achieve lower error compared to the standard value iteration. Similarly, in the other two settings in Table V, we see that the second order SOVI algorithms achieve lower error compared to the standard value iteration when run for 0.02 and 0.03 seconds, respectively.
It is important to note that this advantage need not hold, in general, for MDPs with a large number of states and actions, as the overhead of computing the Hessian inverse in large MDPs is higher and affects the overall computational time. If one could deploy techniques to improve the computation time for matrix operations, G-SOVI would be preferred over the standard value iteration for computing the optimal value function.
VII Conclusion
In this work, we have proposed a generalized second-order value iteration scheme based on the Newton-Raphson method for faster convergence to a near-optimal value function in discounted reward MDP problems. The first step involved constructing a differentiable Bellman equation through an approximation of the max operator. We then applied the second order Newton method to arrive at the proposed algorithm. We derived bounds on the approximation error and showed second order convergence to the optimal value function. Finally, approaches geared towards easing the computational burden associated with solving problems involving large state and action spaces, such as those based on approximate dynamic programming, can be developed in the context of G-SOVI schemes in the future.
References
- [1] Richard Bellman. Dynamic programming. Science, 153(3731):34–37, 1966.
- [2] Dimitri P Bertsekas and John N Tsitsiklis. Neuro-dynamic programming, volume 5. Athena Scientific, Belmont, MA, 1996.
- [3] Vivek S Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge Univ. Press, 2008.
- [4] Bo Dai, Albert Shaw, Lihong Li, Lin Xiao, Niao He, Zhen Liu, Jianshu Chen, and Le Song. SBEED: Convergent reinforcement learning with nonlinear function approximation. arXiv preprint arXiv:1712.10285, 2017.
- [5] Adithya M Devraj and Sean Meyn. Zap Q-learning. In Advances in Neural Information Processing Systems, pages 2235–2244, 2017.
- [6] Thomas Furmston, Guy Lever, and David Barber. Approximate Newton methods for policy search in Markov decision processes. The Journal of Machine Learning Research, 17(1):8055–8105, 2016.
- [7] GitHub. Python MDP toolbox. https://github.com/sawcordwell/pymdptoolbox.
- [8] Vineet Goyal and Julien Grand-Clement. A first-order approach to accelerated value iteration. arXiv preprint arXiv:1905.09963, 2019.
