Batch Reinforcement Learning with a Nonparametric Off-Policy Policy Gradient

Samuele Tosatto; João Carvalho; Jan Peters

arXiv:2010.14771·cs.LG·June 9, 2021

TL;DR

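This paper introduces a nonparametric Bellman equation for off-policy reinforcement learning that can be solved in closed form. Differentiating the solution with respect to the policy parameters yields a policy gradient estimate that avoids the high variance of importance sampling and the high bias of semi-gradient methods, improving sample efficiency in control tasks.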

Contribution

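It proposes a nonparametric Bellman equation whose closed-form solution is differentiable with respect to the policy parameters, giving a policy gradient estimate that needs neither importance sampling (high variance) nor a semi-gradient approximation (high bias).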

Findings

01

Outperforms baseline methods in sample efficiency

02

Provides low-variance, low-bias policy gradient estimates

03

Effective on classical control tasks

Abstract

Off-policy Reinforcement Learning (RL) holds the promise of better data efficiency as it allows sample reuse and potentially enables safe interaction with the environment. Current off-policy policy gradient methods suffer from either high bias or high variance, often delivering unreliable estimates. The price of inefficiency becomes evident in real-world scenarios such as interaction-driven robot learning, where the success of RL has been rather limited, and a very high sample cost hinders straightforward application. In this paper, we propose a nonparametric Bellman equation, which can be solved in closed form. The solution is differentiable w.r.t. the policy parameters and gives access to an estimation of the policy gradient. In this way, we avoid the high variance of importance sampling approaches, and the high bias of semi-gradient methods. We empirically analyze the quality of our…
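To make the abstract's idea concrete, here is a minimal sketch of a kernel-based Bellman equation that is solved in closed form and then differentiated with respect to a policy parameter. It is a simplified illustration under stated assumptions, not the authors' exact estimator: the Nadaraya-Watson weighting, the linear policy `theta * s`, the toy dynamics and reward, the bandwidths, and every function name are hypothetical. The gradient is approximated by central finite differences for brevity; an automatic differentiation library could differentiate the same expression.

```python
import numpy as np

def gaussian_kernel(x, centers, bandwidth):
    """Gaussian kernel between every query in x and every center (both 1-D)."""
    x, centers = np.asarray(x), np.asarray(centers)
    return np.exp(-0.5 * ((x[:, None] - centers[None, :]) / bandwidth) ** 2)

def nw_weights(states, actions, data_s, data_a, bw_s, bw_a):
    """Row-normalized kernel weights of each query (s, a) over the dataset."""
    k = gaussian_kernel(states, data_s, bw_s) * gaussian_kernel(actions, data_a, bw_a)
    return k / k.sum(axis=1, keepdims=True)

def estimated_return(theta, data, start_states, gamma=0.9, bw_s=0.5, bw_a=0.5):
    """Kernel-based return estimate for the assumed policy a = theta * s."""
    s, a, r, s_next = data
    # P[i, j]: weight of sample j when the policy acts in next state i.
    P = nw_weights(s_next, theta * s_next, s, a, bw_s, bw_a)
    # Closed-form solution of the linear system q = r + gamma * P q.
    q = np.linalg.solve(np.eye(len(r)) - gamma * P, r)
    # Average the value estimate over the start states.
    eps0 = nw_weights(start_states, theta * start_states, s, a, bw_s, bw_a)
    return float(np.mean(eps0 @ q))

def policy_gradient(theta, data, start_states, h=1e-5, **kwargs):
    """Central finite-difference approximation of d(return)/d(theta)."""
    up = estimated_return(theta + h, data, start_states, **kwargs)
    down = estimated_return(theta - h, data, start_states, **kwargs)
    return (up - down) / (2 * h)

# Toy off-policy batch: random behavior actions, assumed linear dynamics.
rng = np.random.default_rng(0)
s = rng.uniform(-1.0, 1.0, 200)
a = rng.uniform(-1.0, 1.0, 200)
s_next = 0.9 * s + 0.5 * a
r = -(s ** 2) - 0.1 * a ** 2
data = (s, a, r, s_next)
start_states = np.linspace(-1.0, 1.0, 11)

theta = -0.3
print("estimated return:", estimated_return(theta, data, start_states))
print("estimated gradient:", policy_gradient(theta, data, start_states))
```

Because the policy enters only through the kernel weights, the estimate is a smooth function of `theta` and no importance ratios are involved. A simple sanity check of the closed form: with a constant reward c, the row-stochastic weights give a return of exactly c / (1 - gamma).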
