The Optimal Reward Baseline for Gradient-Based Reinforcement Learning

Lex Weaver; Nigel Tao

arXiv:1301.2315 · cs.LG · January 14, 2013 · 164 cites

Open Access

TL;DR

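This paper shows that adding a reward baseline to gradient-based reinforcement learning changes the variance of the gradient estimate without introducing further bias. Near the zero-bias, high-variance parameterization, the variance-minimizing constant baseline equals the long-term average expected reward, and using it improves algorithm performance.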

Contribution

It identifies the long-term average expected reward as the optimal (variance-minimizing) constant reward baseline for policy-gradient methods, reducing variance without introducing further bias.

Findings

01. Optimal baseline equals long-term average reward.

02. Variance is reduced without bias.

03. Improved performance in experiments.

Abstract

There exist a number of reinforcement learning algorithms which learn by climbing the gradient of expected reward. Their long-run convergence has been proved, even in partially observable environments with non-deterministic actions, and without the need for a system model. However, the variance of the gradient estimator has been found to be a significant practical problem. Recent approaches have discounted future rewards, introducing a bias-variance trade-off into the gradient estimate. We incorporate a reward baseline into the learning system, and show that it affects variance without introducing further bias. In particular, as we approach the zero-bias, high-variance parameterization, the optimal (or variance minimizing) constant reward baseline is equal to the long-term average expected reward. Modified policy-gradient algorithms are presented, and a number of experiments demonstrate their…
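As a rough illustration of the baseline idea in the abstract, the Python sketch below computes the exact mean and variance of a one-sample policy-gradient estimate, (r - b) * d/dtheta log pi(a), for a two-action problem. It is a toy one-step setting chosen for brevity, not the partially observable environments or the modified algorithms of the paper. The sigmoid policy, the reward values, and the names `gradient_moments` and `average_reward` are all assumptions made for this example. It compares only a zero baseline with the average reward and does not attempt to show optimality.

```python
import math


def action_one_probability(theta):
    """Probability of choosing action 1 under an assumed sigmoid policy."""
    return 1.0 / (1.0 + math.exp(-theta))


def average_reward(theta, rewards):
    """Expected one-step reward under the policy (hypothetical helper)."""
    p = action_one_probability(theta)
    return (1.0 - p) * rewards[0] + p * rewards[1]


def gradient_moments(theta, rewards, baseline):
    """Exact mean and variance of (r - baseline) * d/dtheta log pi(a).

    The expectation is taken by enumerating both actions, so no sampling
    noise enters the comparison.
    """
    p = action_one_probability(theta)
    probs = (1.0 - p, p)
    mean = 0.0
    second_moment = 0.0
    for action, prob in enumerate(probs):
        # Score function of the sigmoid policy: d/dtheta log pi(action).
        score = action - p
        estimate = (rewards[action] - baseline) * score
        mean += prob * estimate
        second_moment += prob * estimate * estimate
    return mean, second_moment - mean * mean


# Illustrative values only; none of these come from the paper.
theta = 0.5
rewards = (1.0, 2.0)

b_avg = average_reward(theta, rewards)
mean_zero, var_zero = gradient_moments(theta, rewards, 0.0)
mean_avg, var_avg = gradient_moments(theta, rewards, b_avg)

print(f"baseline = 0:              mean={mean_zero:.4f}  variance={var_zero:.4f}")
print(f"baseline = average reward: mean={mean_avg:.4f}  variance={var_avg:.4f}")
```

Under these assumptions the two means agree for any constant baseline, because the expected score is zero, while the variance is smaller with the average-reward baseline than with no baseline.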

Taxonomy

Topics: Reinforcement Learning in Robotics · Adaptive Dynamic Programming Control · Advanced Bandit Algorithms Research