Revisit Policy Optimization in Matrix Form

Sitao Luan; Xiao-Wen Chang; Doina Precup

arXiv:1909.09186·cs.LG·September 23, 2019·6 cites

Open Access

TL;DR

This paper revisits policy optimization in tabular reinforcement learning by disentangling the policy from the environment dynamics in matrix form, which makes optimization over the policy more straightforward and suggests a possible extension to model-based RL.

Contribution

It leverages the matrix notation of \cite{wang2007dual} to separate the effect of the policy from that of the environment, places the policy gradient theorem and TRPO in a more general framework, and points to a potential extension to model-based reinforcement learning.

Findings

01. Reformulation of policy evaluation in matrix form.

02. Unified framework for policy gradient and TRPO.

03. Potential extension to model-based RL.

Abstract

In the tabular case, when the reward and environment dynamics are known, policy evaluation can be written as $V_{π} = (I - γ P_{π})^{-1} r_{π}$, where $P_{π}$ is the state transition matrix under policy $π$ and $r_{π}$ is the reward signal under $π$. The difficulty is that $P_{π}$ and $r_{π}$ both depend on $π$, so they change every time $π$ is updated. In this paper, we leverage the notation from \cite{wang2007dual} to disentangle $π$ from the environment dynamics, which makes optimization over the policy more straightforward. We show that the policy gradient theorem \cite{sutton2018reinforcement} and TRPO \cite{schulman2015trust} can be put into a more general framework, and that such notation has good potential to be extended to model-based reinforcement…
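The closed-form evaluation in the abstract can be illustrated with a small numerical sketch. The abstract does not spell out the disentangled notation, so the factorization used here, $P_{π} = Π P$ and $r_{π} = Π r$ with a policy matrix $Π$ of shape $|S| \times |S||A|$, is an assumed reading of the dual-representation notation, and the helper names (`policy_matrix`, `evaluate_policy`) and the toy MDP numbers are hypothetical.

```python
import numpy as np


def policy_matrix(pi):
    """Build the |S| x |S||A| matrix Pi with Pi[s, s*|A| + a] = pi[s, a].

    Assumed layout: state-action pairs are indexed as s * |A| + a.
    """
    n_states, n_actions = pi.shape
    Pi = np.zeros((n_states, n_states * n_actions))
    for s in range(n_states):
        Pi[s, s * n_actions:(s + 1) * n_actions] = pi[s]
    return Pi


def evaluate_policy(pi, P, r, gamma):
    """Solve V_pi = (I - gamma * P_pi)^{-1} r_pi for a tabular MDP.

    pi: (|S|, |A|) action probabilities per state.
    P:  (|S||A|, |S|) policy-independent transition matrix.
    r:  (|S||A|,) policy-independent reward vector.
    """
    Pi = policy_matrix(pi)
    P_pi = Pi @ P          # state-to-state transitions under pi
    r_pi = Pi @ r          # expected one-step reward under pi
    n_states = pi.shape[0]
    # Solve the linear system instead of forming the inverse explicitly.
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)


# Hypothetical 2-state, 2-action MDP; rows of P are ordered (s, a).
P = np.array([
    [0.9, 0.1],  # s=0, a=0
    [0.2, 0.8],  # s=0, a=1
    [0.7, 0.3],  # s=1, a=0
    [0.1, 0.9],  # s=1, a=1
])
r = np.array([1.0, 0.0, 0.0, 2.0])
pi = np.array([[0.5, 0.5], [0.5, 0.5]])  # uniform random policy
gamma = 0.9

V = evaluate_policy(pi, P, r, gamma)
print(V)
```

Under this reading, `P` and `r` stay fixed when the policy changes and only `Pi` is rebuilt, which is the separation the abstract describes.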

Taxonomy

Topics: Optimization and Search Problems · Reinforcement Learning in Robotics · Scheduling and Optimization Algorithms

Methods: Trust Region Policy Optimization