Off-OAB: Off-Policy Policy Gradient Method with Optimal Action-Dependent Baseline

Wenjia Meng, Qian Zheng, Long Yang, Yilong Yin, and Gang Pan

arXiv:2405.02572·cs.LG·May 7, 2024

TL;DR

This paper introduces Off-OAB, an off-policy policy gradient method with an optimal action-dependent baseline that reduces the variance of the gradient estimator and improves sample efficiency in reinforcement learning, outperforming state-of-the-art methods on most OpenAI Gym and MuJoCo tasks.

Contribution

The paper proposes a novel off-policy policy gradient method with an optimal action-dependent baseline that minimizes variance while maintaining unbiasedness, with practical approximations for efficiency.

Findings

01 Outperforms state-of-the-art methods on most OpenAI Gym and MuJoCo tasks.

02 Effectively reduces variance of the off-policy policy gradient estimator.

03 Enhances sample efficiency in reinforcement learning training.

Abstract

Policy-based methods have achieved remarkable success in solving challenging reinforcement learning problems. Among these methods, off-policy policy gradient methods are particularly important because they can benefit from off-policy data. However, these methods suffer from the high variance of the off-policy policy gradient (OPPG) estimator, which results in poor sample efficiency during training. In this paper, we propose an off-policy policy gradient method with the optimal action-dependent baseline (Off-OAB) to mitigate this variance issue. Specifically, this baseline maintains the OPPG estimator's unbiasedness while theoretically minimizing its variance. To enhance practical computational efficiency, we design an approximated version of this optimal baseline. Utilizing this approximation, our method (Off-OAB) aims to decrease the OPPG estimator's variance during policy…
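To make the variance claim concrete, the following is a minimal sketch on a toy three-armed bandit, not the paper's method. It uses a plain constant baseline (the expected reward under the target policy) rather than the optimal action-dependent baseline, whose form is not given in this summary. The function name `oppg_moments`, the reward values, and both policies are hypothetical choices made for illustration. Because the action space is tiny, the mean and variance of the importance-weighted estimator are computed exactly by enumeration rather than by sampling.

```python
import numpy as np

def softmax(theta):
    """Numerically stable softmax over a 1-D parameter vector."""
    z = np.exp(theta - theta.max())
    return z / z.sum()

def oppg_moments(theta, mu, rewards, baseline):
    """Exact mean and total variance of the single-sample estimator
    g(a) = rho(a) * (r(a) - baseline) * grad_theta log pi(a),  a ~ mu,
    for a softmax target policy pi and behavior policy mu (toy bandit)."""
    pi = softmax(theta)
    k = len(theta)
    grad_log_pi = np.eye(k) - pi          # row a holds e_a - pi
    rho = pi / mu                         # importance weights pi(a) / mu(a)
    g = (rho * (rewards - baseline))[:, None] * grad_log_pi
    mean = mu @ g                         # expectation over a ~ mu
    second_moment = mu @ (g ** 2)
    total_var = float(np.sum(second_moment - mean ** 2))
    return mean, total_var

# Hypothetical problem setup (assumed values, chosen only for illustration).
theta = np.array([0.2, -0.1, 0.4])       # target-policy logits
mu = np.array([0.5, 0.3, 0.2])           # behavior policy
rewards = np.array([10.0, 11.0, 12.0])   # deterministic reward per arm

pi = softmax(theta)
expected_reward = float(pi @ rewards)
true_grad = pi * (rewards - expected_reward)  # exact gradient of E_pi[r]

mean_plain, var_plain = oppg_moments(theta, mu, rewards, baseline=0.0)
mean_base, var_base = oppg_moments(theta, mu, rewards, baseline=expected_reward)

print("unbiased without baseline:", np.allclose(mean_plain, true_grad))
print("unbiased with baseline:   ", np.allclose(mean_base, true_grad))
print("variance shrinks:         ", var_base < var_plain)
```

In this toy setting both estimators have the same expectation, since a constant baseline multiplies a score function whose mean under the target policy is zero, while the baseline removes the large reward offset that inflates the second moment. An action-dependent baseline generally needs additional care to stay unbiased, which is the part the paper addresses and this sketch does not.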

Taxonomy

Topics: Risk and Portfolio Optimization