Learning Pessimism for Robust and Efficient Off-Policy Reinforcement Learning

Edoardo Cetin; Oya Celiktutan

arXiv:2110.03375 · cs.LG · March 7, 2023 · 1 citation

TL;DR

This paper introduces Generalized Pessimism Learning (GPL), which learns a penalty alongside the critic to counteract overestimation bias in off-policy deep reinforcement learning, and reports state-of-the-art results on proprioceptive and pixel-based benchmarks.

Contribution

The paper proposes GPL, a learnable penalty trained alongside the critic with dual TD-learning, a new procedure that estimates and minimizes the magnitude of the target-return bias at trivial computational cost.

Findings

01 Achieves state-of-the-art results on proprioceptive and pixel-based benchmarks

02 Counteracts overestimation bias throughout training without overly pessimistic targets

03 Integrates with popular off-policy algorithms

Abstract

Off-policy deep reinforcement learning algorithms commonly compensate for overestimation bias during temporal-difference learning by utilizing pessimistic estimates of the expected target returns. In this work, we propose Generalized Pessimism Learning (GPL), a strategy employing a novel learnable penalty to enact such pessimism. In particular, we propose to learn this penalty alongside the critic with dual TD-learning, a new procedure to estimate and minimize the magnitude of the target returns bias with trivial computational cost. GPL enables us to accurately counteract overestimation bias throughout training without incurring the downsides of overly pessimistic targets. By integrating GPL with popular off-policy algorithms, we achieve state-of-the-art results in both competitive proprioceptive and pixel-based benchmarks.
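
The idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes two critics, a penalty equal to a learnable scale `beta` times the absolute disagreement between the critics, and a simple update that raises `beta` when critic predictions exceed their targets on average. The function names, the penalty form, and the update rule are all hypothetical choices made for illustration.

```python
from statistics import mean


def pessimistic_target(reward, q1_next, q2_next, beta, gamma=0.99):
    """TD target with an uncertainty penalty scaled by beta.

    Assumed form (not taken from the source): the penalty is beta times
    the absolute disagreement between two critics' next-state estimates.
    """
    q_mean = 0.5 * (q1_next + q2_next)
    penalty = beta * abs(q1_next - q2_next)
    return reward + gamma * (q_mean - penalty)


def dual_update_beta(beta, q_preds, targets, lr=0.1):
    """One hypothetical update of beta from an estimate of the target bias.

    The mean gap between critic predictions and their TD targets serves as
    the bias estimate: a positive gap suggests overestimation, so beta is
    raised; a negative gap suggests excessive pessimism, so beta is lowered.
    """
    bias_estimate = mean([q - t for q, t in zip(q_preds, targets)])
    return beta + lr * bias_estimate


if __name__ == "__main__":
    beta = 0.5
    target = pessimistic_target(1.0, q1_next=2.0, q2_next=4.0, beta=beta, gamma=0.5)
    print("target:", target)
    beta = dual_update_beta(beta, q_preds=[3.0, 5.0], targets=[2.0, 4.0])
    print("updated beta:", beta)
```

Under this assumed penalty, `beta = 0.5` makes the penalized value equal to the minimum of the two critic estimates, since (a + b)/2 - |a - b|/2 = min(a, b). A fixed `beta` therefore corresponds to a fixed level of pessimism, whereas a learned `beta` can change over the course of training.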

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Adaptive Dynamic Programming Control