Learning Pessimism for Robust and Efficient Off-Policy Reinforcement Learning
Edoardo Cetin, Oya Celiktutan

TL;DR
This paper introduces Generalized Pessimism Learning (GPL), a method that learns a penalty to counteract overestimation bias in off-policy deep reinforcement learning, achieving state-of-the-art results on proprioceptive and pixel-based benchmarks.
Contribution
The paper proposes GPL, a learnable penalty trained alongside the critic with dual TD-learning, a new procedure that estimates and minimizes the magnitude of the target returns bias at trivial computational cost.
Findings
Achieves state-of-the-art results on both proprioceptive and pixel-based benchmarks
Counteracts overestimation bias throughout training without the downsides of overly pessimistic targets
Integrates with popular off-policy algorithms
Abstract
Off-policy deep reinforcement learning algorithms commonly compensate for overestimation bias during temporal-difference learning by utilizing pessimistic estimates of the expected target returns. In this work, we propose Generalized Pessimism Learning (GPL), a strategy employing a novel learnable penalty to enact such pessimism. In particular, we propose to learn this penalty alongside the critic with dual TD-learning, a new procedure to estimate and minimize the magnitude of the target returns bias with trivial computational cost. GPL enables us to accurately counteract overestimation bias throughout training without incurring the downsides of overly pessimistic targets. By integrating GPL with popular off-policy algorithms, we achieve state-of-the-art results in both competitive proprioceptive and pixel-based benchmarks.
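The idea of a learnable penalty on the target returns can be illustrated with a minimal sketch. This is not the authors' implementation: the penalty form (critic disagreement scaled by a scalar `beta`), the update rule for `beta`, and the function names `pessimistic_target` and `update_beta` are assumptions chosen for illustration, and the paper should be consulted for the actual dual TD-learning objective.

```python
import numpy as np

def pessimistic_target(reward, q1_next, q2_next, beta, gamma=0.99):
    """TD target with a pessimism penalty scaled by beta.

    Assumption: the penalty is beta * |q1 - q2|, i.e. the disagreement
    between two critics, subtracted from their mean prediction.
    """
    q_mean = 0.5 * (q1_next + q2_next)
    penalty = beta * np.abs(q1_next - q2_next)
    return reward + gamma * (q_mean - penalty)

def update_beta(beta, q_pred, target, lr=0.01):
    """Hypothetical dual-style update for beta.

    Assumption: the mean difference between targets and critic
    predictions serves as a bias estimate. If targets look too high on
    average, beta grows (more pessimism); if too low, beta shrinks.
    """
    bias_estimate = np.mean(target - q_pred)
    return beta + lr * bias_estimate

# Toy batch of two transitions (made-up numbers).
reward = np.array([1.0, 0.0])
q1_next = np.array([2.0, 4.0])
q2_next = np.array([3.0, 1.0])

target = pessimistic_target(reward, q1_next, q2_next, beta=0.5, gamma=1.0)
new_beta = update_beta(0.5, q_pred=np.array([2.5, 0.5]), target=target)
```

In this sketch, fixing `beta = 0.5` makes the target equal to the reward plus the discounted minimum of the two critics, since mean(q1, q2) - 0.5 * |q1 - q2| = min(q1, q2). A learned `beta` can move away from that fixed level of pessimism as training progresses.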
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Adaptive Dynamic Programming Control
