Batch Reinforcement Learning with a Nonparametric Off-Policy Policy Gradient
Samuele Tosatto, João Carvalho, Jan Peters

TL;DR
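This paper introduces a nonparametric Bellman equation for off-policy reinforcement learning that can be solved in closed form. Differentiating the solution with respect to the policy parameters yields a policy gradient estimate that avoids both the high variance of importance sampling and the high bias of semi-gradient methods, improving sample efficiency in control tasks.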
Contribution
It proposes a nonparametric Bellman equation whose closed-form solution is differentiable with respect to the policy parameters, giving a policy gradient estimate without the high variance of importance sampling or the high bias of semi-gradient methods.
Findings
Outperforms baseline methods in sample efficiency
Provides low-variance policy gradient estimates while avoiding the high bias of semi-gradient methods
Effective on classical control tasks
Abstract
Off-policy Reinforcement Learning (RL) holds the promise of better data efficiency as it allows sample reuse and potentially enables safe interaction with the environment. Current off-policy policy gradient methods either suffer from high bias or high variance, delivering often unreliable estimates. The price of inefficiency becomes evident in real-world scenarios such as interaction-driven robot learning, where the success of RL has been rather limited, and a very high sample cost hinders straightforward application. In this paper, we propose a nonparametric Bellman equation, which can be solved in closed form. The solution is differentiable w.r.t the policy parameters and gives access to an estimation of the policy gradient. In this way, we avoid the high variance of importance sampling approaches, and the high bias of semi-gradient methods. We empirically analyze the quality of our…
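To make the closed-form idea in the abstract concrete, here is a minimal sketch, not the paper's estimator. It assumes a kernel-smoothed Bellman equation over the dataset samples, q = r + gamma * W(theta) q, where W(theta) holds row-normalised Gaussian kernel weights between each next state (with the policy's action) and the recorded state-action pairs. The linear policy a = theta * s, the bandwidths, the toy data, and the objective (mean of q over the dataset) are all illustrative assumptions.

```python
import numpy as np

def kernel_weights(theta, s_next, s, a, h_s=0.3, h_a=0.3):
    """Row-normalised Gaussian weights W(theta) and their derivative dW/dtheta."""
    a_next = theta * s_next                          # hypothetical linear policy a = theta * s
    ds = (s_next[:, None] - s[None, :]) / h_s        # scaled state differences
    da = (a_next[:, None] - a[None, :]) / h_a        # scaled action differences
    k = np.exp(-0.5 * (ds**2 + da**2))               # unnormalised kernel values
    dk = k * (-da / h_a) * s_next[:, None]           # derivative of k w.r.t. theta
    z = k.sum(axis=1, keepdims=True)
    w = k / z
    dw = (dk - w * dk.sum(axis=1, keepdims=True)) / z  # quotient rule for k / z
    return w, dw

def value_and_gradient(theta, s, a, r, s_next, gamma=0.9):
    """Solve q = r + gamma * W q in closed form and differentiate the solution."""
    n = len(r)
    w, dw = kernel_weights(theta, s_next, s, a)
    A = np.eye(n) - gamma * w                        # invertible: W is row-stochastic, gamma < 1
    q = np.linalg.solve(A, r)                        # q = (I - gamma W)^{-1} r
    dq = np.linalg.solve(A, gamma * dw @ q)          # from differentiating (I - gamma W) q = r
    return q.mean(), dq.mean()                       # stand-in objective: mean of q over samples

# Toy off-policy dataset (assumed for illustration only).
rng = np.random.default_rng(0)
n = 50
s = rng.uniform(-1.0, 1.0, n)
a = rng.uniform(-1.0, 1.0, n)
s_next = 0.9 * s + 0.1 * a
r = -(s**2) - 0.1 * a**2

theta = -0.5
J, dJ = value_and_gradient(theta, s, a, r, s_next)
print("objective:", J, "gradient w.r.t. theta:", dJ)
```

In this sketch the gradient comes from differentiating the linear system itself rather than from reweighting trajectories, so no importance ratios appear anywhere; the cost is one extra linear solve.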
