Policy Gradient for Continuing Tasks in Non-stationary Markov Decision Processes

Santiago Paternain; Juan Andres Bazerque; Alejandro Ribeiro

arXiv:2010.08443 · cs.LG · October 19, 2020 · 1 citation

TL;DR

This paper develops an online policy gradient algorithm for non-stationary Markov decision processes. The algorithm keeps learning and adapting during deployment without stationarity assumptions, and is demonstrated on a navigation task.

Contribution

It provides a convergence proof for an online policy gradient method in non-stationary MDPs, extending policy gradient methods to continuing, cyclic tasks.

Findings

01. The algorithm converges to critical points of the initial value function.

02. It successfully learns cyclic navigation tasks in non-stationary environments.

03. Theoretical results are supported by numerical experiments.

Abstract

Reinforcement learning considers the problem of finding policies that maximize an expected cumulative reward in a Markov decision process with unknown transition probabilities. In this paper we consider the problem of finding optimal policies assuming that they belong to a reproducing kernel Hilbert space (RKHS). To that end we compute unbiased stochastic gradients of the value function, which we use as ascent directions to update the policy. A major drawback of policy gradient-type algorithms is that they are limited to episodic tasks unless stationarity assumptions are imposed. This prevents these algorithms from being fully implemented online, which is a desirable property for systems that need to adapt to new tasks and/or environments in deployment. The main requirement for a policy gradient algorithm to work is that the estimate of the gradient at any point in time is an ascent…
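To make the idea of a stochastic ascent step on an RKHS policy concrete, here is a minimal Python sketch. It is not the authors' algorithm. It assumes a Gaussian policy whose mean is a kernel expansion, a Gaussian kernel, and a value estimate `q_estimate` supplied by the caller. The class `KernelPolicy`, its parameters, and the way the value estimate is obtained are all hypothetical choices made for illustration.

```python
import numpy as np


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel; the kernel choice is an assumption of this sketch."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * bandwidth ** 2)))


class KernelPolicy:
    """Hypothetical Gaussian policy whose mean h(s) = sum_i w_i k(s, c_i) lives in an RKHS."""

    def __init__(self, action_dim, sigma=0.5, bandwidth=1.0):
        self.action_dim = action_dim
        self.sigma = sigma          # fixed exploration noise (assumed)
        self.bandwidth = bandwidth  # kernel bandwidth (assumed)
        self.centers = []           # states visited so far
        self.weights = []           # one action-sized weight per center

    def mean(self, state):
        out = np.zeros(self.action_dim)
        for center, weight in zip(self.centers, self.weights):
            out += weight * gaussian_kernel(state, center, self.bandwidth)
        return out

    def sample(self, state, rng):
        noise = self.sigma * rng.standard_normal(self.action_dim)
        return self.mean(state) + noise

    def ascent_step(self, state, action, q_estimate, step_size):
        # For a Gaussian policy, the functional gradient of log pi(a|s) with
        # respect to h is k(state, .) * (action - h(state)) / sigma^2.
        # Scaling it by a value estimate gives a stochastic ascent direction;
        # the step adds one new kernel center at the visited state.
        action = np.asarray(action, dtype=float)
        residual = action - self.mean(state)
        weight = step_size * q_estimate * residual / self.sigma ** 2
        self.centers.append(np.asarray(state, dtype=float))
        self.weights.append(weight)
```

Each update adds a kernel center, so the representation grows with the number of steps. A practical implementation would need some way to prune or compress the centers, which this sketch leaves out.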

Taxonomy

Topics: Reinforcement Learning in Robotics · Distributed Sensor Networks and Detection Algorithms · Advanced Bandit Algorithms Research