Policy Gradient for Continuing Tasks in Non-stationary Markov Decision Processes
Santiago Paternain, Juan Andres Bazerque, Alejandro Ribeiro

TL;DR
This paper develops an online policy gradient algorithm for non-stationary Markov decision processes, enabling continuous learning and adaptation without stationarity assumptions, demonstrated through a navigation task example.
Contribution
It provides a convergence proof for an online policy gradient method in non-stationary MDPs, extending applicability to continuing, cyclic tasks.
Findings
The algorithm converges to critical points of the initial value function.
It successfully learns cyclic navigation tasks in non-stationary environments.
Theoretical results are supported by numerical experiments.
Abstract
Reinforcement learning considers the problem of finding policies that maximize an expected cumulative reward in a Markov decision process with unknown transition probabilities. In this paper we consider the problem of finding optimal policies assuming that they belong to a reproducing kernel Hilbert space (RKHS). To that end, we compute unbiased stochastic gradients of the value function, which we use as ascent directions to update the policy. A major drawback of policy gradient-type algorithms is that they are limited to episodic tasks unless stationarity assumptions are imposed. This prevents these algorithms from being fully implemented online, which is a desirable property for systems that need to adapt to new tasks and/or environments in deployment. The main requirement for a policy gradient algorithm to work is that the estimate of the gradient at any point in time is an ascent…
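As a rough illustration of the kind of update the abstract describes, the sketch below takes one stochastic ascent step on a Gaussian policy whose mean is a kernel expansion, so it lives in an RKHS. It is not the authors' algorithm. The class `KernelGaussianPolicy`, the method `ascent_step`, the scalar action, the Gaussian kernel and the bandwidth are all assumptions made for the example. The value estimate `q_estimate` is assumed to be supplied by the caller, and the paper's own estimator and its online, non-episodic sampling scheme are not reproduced here.

```python
import math
import random


def gaussian_kernel(x, y, bandwidth=0.5):
    """Gaussian (RBF) kernel between two states given as tuples of floats."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


class KernelGaussianPolicy:
    """Hypothetical policy: a ~ N(h(s), sigma^2), h(s) = sum_i w_i k(s_i, s)."""

    def __init__(self, sigma=0.5, bandwidth=0.5):
        self.sigma = sigma
        self.bandwidth = bandwidth
        self.centers = []  # states visited so far (kernel centers)
        self.weights = []  # one scalar weight per center

    def mean(self, state):
        """Evaluate the kernel expansion h(state); zero before any update."""
        return sum(
            w * gaussian_kernel(c, state, self.bandwidth)
            for c, w in zip(self.centers, self.weights)
        )

    def sample(self, state, rng):
        """Draw a scalar action from the Gaussian policy at this state."""
        return rng.gauss(self.mean(state), self.sigma)

    def ascent_step(self, state, action, q_estimate, step_size):
        """One stochastic gradient ascent step in function space.

        For a Gaussian policy, the gradient of log pi(a|s) with respect to h
        is k(s, .) * (a - h(s)) / sigma^2. Scaling it by a value estimate
        gives a stochastic ascent direction (up to a constant factor), and
        adding it to h appends one new kernel center.
        """
        coeff = step_size * q_estimate * (action - self.mean(state)) / self.sigma ** 2
        self.centers.append(tuple(state))
        self.weights.append(coeff)


# Usage: sample an action, then update with an externally supplied estimate.
rng = random.Random(0)
policy = KernelGaussianPolicy()
state = (0.0, 0.0)
action = policy.sample(state, rng)
policy.ascent_step(state, action, q_estimate=1.0, step_size=0.1)
```

Each step adds one kernel center, so the representation grows with the number of updates. A practical implementation would need some form of pruning or sparsification, which this sketch omits.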
Taxonomy
Topics: Reinforcement Learning in Robotics · Distributed Sensor Networks and Detection Algorithms · Advanced Bandit Algorithms Research
