Classical Policy Gradient: Preserving Bellman's Principle of Optimality

Philip S. Thomas; Scott M. Jordan; Yash Chandak; Chris Nota; James Kostas

arXiv:1906.03063·cs.LG·June 10, 2019

TL;DR

This paper introduces a new objective function for finite-horizon episodic Markov decision processes that aligns more closely with Bellman's principle of optimality, along with its gradient expression.

Contribution

It presents a new objective function for finite-horizon MDPs, called the classical objective function, together with an expression for its gradient.

Findings

01 New objective function better captures Bellman's principle
02 Derived explicit gradient expression for the new objective
03 Practical algorithms for maximizing the new objective remain an open problem

Abstract

We propose a new objective function for finite-horizon episodic Markov decision processes that better captures Bellman's principle of optimality, and provide an expression for the gradient of the objective.

Equations

$$\nabla\rho(\theta)=\mathbf{E}\left[\sum_{t=0}^{\infty}\gamma^{t}q_{\theta}(S_{t},A_{t})\frac{\partial\ln\big(\pi(S_{t},A_{t},\theta)\big)}{\partial\theta}\right]$$

$$\nabla f(\theta)=\mathbf{E}\left[\frac{1}{h}\sum_{t=0}^{h-1}q_{\theta}(S_{t},A_{t})\sum_{i=0}^{t}w(i,t)\frac{\partial\ln\big(\pi(S_{i},A_{i},\theta)\big)}{\partial\theta}\middle|\theta\right]$$

Taxonomy

Topics: Reinforcement Learning in Robotics · Economic theories and models · Advanced Bandit Algorithms Research

Full text

Classical Policy Gradient: Preserving Bellman’s Principle of Optimality

Philip S. Thomas, Scott M. Jordan, Yash Chandak, Chris Nota, and James Kostas

University of Massachusetts Amherst, College of Information and Computer Sciences

In 1954, Richard Bellman wrote [1]:

Principle of Optimality: An optimal policy has the property that whatever the initial state and initial decisions are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decisions.

This principle of optimality has endured at the foundation of reinforcement learning research, and is central to what remains the classical definition of an optimal policy [2]. Classical reinforcement learning algorithms like $Q$-learning [3] embody this principle by striving to act optimally in every state that occurs, regardless of when the state occurs.

The start-state objective function, $\rho(\theta)=\mathbf{E}[\sum_{t=0}^{\infty}\gamma^{t}R_{t}|\theta]$, prioritizes making decisions optimally in the initial state, not necessarily in the states resulting from the first decisions. (We adopt notational standard MDPNv1 [4].) These two goals (optimizing decisions in the initial state and optimizing decisions in subsequent states) can be conflicting when using function approximation, particularly when $\gamma$ is small and the initial state distribution has limited support. So, maximizing $\rho$ does not preserve the principle of optimality.

Let $q_{\theta}(s,a)=\mathbf{E}[\sum_{k=0}^{\infty}\gamma^{k}R_{t+k}|S_{t}{=}s,A_{t}{=}a,\theta]$ so that

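$$\nabla\rho(\theta)=\mathbf{E}\left[\sum_{t=0}^{\infty}\gamma^{t}q_{\theta}(S_{t},A_{t})\frac{\partial\ln\big(\pi(S_{t},A_{t},\theta)\big)}{\partial\theta}\right]. \tag{1}$$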

The $\gamma^{t}$ term in (1) discounts the importance of optimal behavior in states that occur at later times. Algorithms purported to update $\theta$ following estimates of $\nabla\rho(\theta)$ typically drop this $\gamma^{t}$ term, since including it or setting $\gamma=1$ results in poor performance. As a result, these algorithms do not capture the essence of $\rho$, do not maximize $\rho$, and are not stochastic gradient algorithms [5].
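To make the role of the $\gamma^{t}$ term concrete, the following Python sketch forms a single-episode estimate of the sum inside (1), with and without that weight. It is an illustration only, not the authors' code: the function names are hypothetical, and the Monte Carlo return $\sum_{k}\gamma^{k}R_{t+k}$ is assumed as a stand-in for $q_{\theta}(S_{t},A_{t})$.

```python
def discounted_returns(rewards, gamma):
    """Return G_t = sum_k gamma^k * R_{t+k} for every step t of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def start_state_gradient_estimate(scores, rewards, gamma, keep_gamma_t=True):
    """Single-episode estimate of the sum inside (1).

    scores[t] is the vector d ln pi(S_t, A_t, theta) / d theta (assumed given).
    The Monte Carlo return G_t stands in for q_theta(S_t, A_t) (assumption).
    With keep_gamma_t=False the gamma^t weight is dropped, which is the
    common variant that the text says is not a gradient of rho.
    """
    returns = discounted_returns(rewards, gamma)
    estimate = [0.0] * len(scores[0])
    for t, (score, g) in enumerate(zip(scores, returns)):
        weight = gamma ** t if keep_gamma_t else 1.0
        for j, component in enumerate(score):
            estimate[j] += weight * g * component
    return estimate
```

For rewards `[1, 1, 1]`, `gamma = 0.5`, and one-hot score vectors (one parameter per time step), keeping the weight gives `[1.75, 0.75, 0.25]` while dropping it gives `[1.75, 1.5, 1.0]`: with the weight, later time steps contribute progressively less to the update.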

We propose a different objective function for finite-horizon episodic Markov decision processes that better captures the principle of optimality, and provide an expression for its gradient. This new objective, which we call the classical objective function, has the form $f(\theta)=\sum_{s\in\mathcal{S}}d_{\theta}(s)v_{\theta}(s)$, where $d_{\theta}$ is a distribution over $\mathcal{S}$ and $v_{\theta}(s)=\mathbf{E}[\sum_{k=0}^{\infty}\gamma^{k}R_{t+k}|S_{t}{=}s,\theta]$. This form harks back to the classical definition of an optimal policy, particularly if $d_{\theta}$ has full support on $\mathcal{S}$ and does not depend on $\theta$, in which case $f$ preserves the partial ordering on policies used in the classical definition of an optimal policy.

In model-free reinforcement learning, the agent is not free to sample states from an arbitrary distribution, which makes estimating $f$ or its gradient challenging with such a $d_{\theta}$. So, we trade off similarity to the classical definition of an optimal policy against the practicality of estimating the objective function and its gradient, and define $d_{\theta}$ to be the on-policy distribution for episodic tasks [2, page 199], but with some probability shifted to the terminal absorbing state: $d_{\theta}(s)=\frac{1}{h}\sum_{t=0}^{h-1}\Pr(S_{t}{=}s|\theta)$, where $h$ is the horizon. This captures the spirit of classical algorithms like $Q$-learning using function approximation: updates to function approximators occur when states are encountered, and are not discounted.
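With this choice of $d_{\theta}$, the objective can be rewritten as $f(\theta)=\frac{1}{h}\sum_{t=0}^{h-1}\mathbf{E}[v_{\theta}(S_{t})|\theta]$, so it can be estimated from on-policy episodes by averaging the discounted return from every time step. The Python sketch below illustrates this under stated assumptions that are not in the original: the function name is hypothetical, episodes are given as plain reward lists, and an episode that ends before the horizon is padded with zero rewards to represent the terminal absorbing state.

```python
def classical_objective_estimate(episodes, gamma, horizon):
    """Monte Carlo estimate of f(theta) = (1/h) * sum_t E[v_theta(S_t)].

    episodes: list of reward lists, one per on-policy episode, each of
              length <= horizon (hypothetical input format).
    Episodes shorter than the horizon are padded with zero reward, which
    models time spent in the terminal absorbing state (assumption).
    """
    total = 0.0
    for rewards in episodes:
        padded = list(rewards) + [0.0] * (horizon - len(rewards))
        running = 0.0
        for t in reversed(range(horizon)):
            running = padded[t] + gamma * running  # discounted return from step t
            total += running                       # one sample of v_theta(S_t)
    return total / (horizon * len(episodes))
```

Every time step contributes with equal weight $1/h$, which mirrors the statement that updates occur when states are encountered and are not discounted by when they occur.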

In the supplementary material we show that

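$$\nabla f(\theta)=\mathbf{E}\left[\frac{1}{h}\sum_{t=0}^{h-1}q_{\theta}(S_{t},A_{t})\sum_{i=0}^{t}w(i,t)\frac{\partial\ln\big(\pi(S_{i},A_{i},\theta)\big)}{\partial\theta}\middle|\theta\right], \tag{2}$$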

where $w(i,t)=1$ if $i\neq t$, $w(i,t)=(1-\gamma^{t+1})/(1-\gamma)$ if $i=t$ and $\gamma<1$, and $w(i,t)=t+1$ if $i=t$ and $\gamma=1$.
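The weight $w(i,t)$ and the double sum in the gradient of $f$ can be sketched in a few lines of Python. This is a simple REINFORCE-like illustration under assumptions, not the authors' implementation: the function names are hypothetical, the score vectors $\partial\ln(\pi(S_{t},A_{t},\theta))/\partial\theta$ are taken as given, and the Monte Carlo return is assumed as a stand-in for $q_{\theta}(S_{t},A_{t})$.

```python
def w(i, t, gamma):
    """Weight w(i, t) from the text, defined for 0 <= i <= t."""
    if i != t:
        return 1.0
    if gamma == 1.0:
        return float(t + 1)
    return (1.0 - gamma ** (t + 1)) / (1.0 - gamma)


def classical_gradient_estimate(scores, rewards, gamma):
    """Single-episode estimate of the quantity inside the expectation.

    scores[t]:  vector d ln pi(S_t, A_t, theta) / d theta (assumed given).
    rewards[t]: reward R_t; both lists have length h, the horizon.
    """
    h = len(rewards)
    # Monte Carlo returns stand in for q_theta(S_t, A_t) (assumption).
    q_hat = [0.0] * h
    running = 0.0
    for t in reversed(range(h)):
        running = rewards[t] + gamma * running
        q_hat[t] = running

    dim = len(scores[0])
    estimate = [0.0] * dim
    prefix = [0.0] * dim  # sum of scores[i] over i < t, each with weight 1
    for t in range(h):
        w_tt = w(t, t, gamma)
        for j in range(dim):
            estimate[j] += q_hat[t] * (prefix[j] + w_tt * scores[t][j]) / h
            prefix[j] += scores[t][j]
    return estimate
```

Keeping a running prefix sum of the score vectors avoids recomputing the inner $i$-summation at every $t$, so the estimate costs time linear in the horizon.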

The techniques that make estimation of $\nabla\rho$ effective, and which have been developed over 27 years [6], do not necessarily carry over to estimating $\nabla f$. For example, it is not clear how baselines and control variates (and thus actor-critics) should be leveraged. Developing practical algorithms for (approximately) maximizing $f$ is an open problem; we have only had success with simple REINFORCE-like algorithms.

Notice that $f$ is not an ideal objective since, like $\rho$, it does not preserve the partial ordering on policies used in the classical definition of an optimal policy, and examples exist wherein it prescribes unreasonable behavior. Still, $f$ presents a new direction for policy gradient research, opening new questions like: 1) Are policy gradient algorithms for $\rho$ that drop the $\gamma^{t}$ term better viewed as algorithms for optimizing $f$? 2) How should baselines and control variates be leveraged when optimizing $f$? 3) Can practical (linear-time and generalized [7]) natural gradient algorithms be derived? (Our experiments with such methods have hitherto been unsuccessful.) 4) Do alternate forms for $\nabla f$ facilitate gradient estimation? For example, one could write the $t$-summation over $\partial\ln(\pi(S_{t},A_{t},\theta))/\partial\theta$ and the inner $i$-summation over $q_{\theta}(S_{i},A_{i})$, so that the $i$-summation can be expressed as a new value function that measures the expected sum of state-values rather than the expected sum of rewards. Such a value function might be approximated using a new TD-like algorithm, and might allow for actor-critics for the classical objective. 5) What are the relationships between $f$, $\rho$, and the average reward objective? For example, notice that when $\gamma=0$, $f$ is equivalent to $\rho$ with $\gamma=1$.
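The closing observation about $\gamma=0$ can be checked directly from the definitions of $f$, $d_{\theta}$, and $v_{\theta}$. The short derivation below is a sketch under two assumptions not stated in the original: the convention $0^{0}=1$, and rewards that are zero once the terminal absorbing state is reached, so that the infinite sum in $\rho$ stops at $t=h-1$.

```latex
% With \gamma = 0, only the k = 0 term of v_\theta survives:
v_{\theta}(s) = \mathbf{E}[R_{t} \mid S_{t}{=}s, \theta].

% Substituting d_\theta and v_\theta into f:
f(\theta)
  = \sum_{s\in\mathcal{S}} d_{\theta}(s)\, v_{\theta}(s)
  = \frac{1}{h}\sum_{t=0}^{h-1}\sum_{s\in\mathcal{S}}
      \Pr(S_{t}{=}s \mid \theta)\,\mathbf{E}[R_{t} \mid S_{t}{=}s, \theta]
  = \frac{1}{h}\,\mathbf{E}\Big[\sum_{t=0}^{h-1} R_{t} \,\Big|\, \theta\Big].

% The last expectation is \rho(\theta) evaluated with \gamma = 1:
f(\theta)\big|_{\gamma=0} = \frac{1}{h}\,\rho(\theta)\big|_{\gamma=1}.
```

The two objectives differ only by the positive constant $1/h$, so they induce the same ordering on policies and have the same maximizers.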

Bibliography

[1] R. Bellman. The theory of dynamic programming. Bulletin of the American Mathematical Society, 60(6):503–515, 1954.
[2] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 2nd edition, 2018.
[3] C. Watkins. Learning From Delayed Rewards. PhD thesis, University of Cambridge, England, 1989.
[4] P. S. Thomas and B. Okal. A notation for Markov decision processes. arXiv preprint arXiv:1512.09075v2, 2016.
[5] C. Nota and P. S. Thomas. Is the policy gradient a gradient? Unpublished, 2019.
[6] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992.
[7] P. S. Thomas. GeNGA: A generalization of natural gradient ascent with positive and negative convergence results. In ICML, 2014.