Revisiting Estimation Bias in Policy Gradients for Deep Reinforcement Learning

Haoxuan Pan; Deheng Ye; Xiaoming Duan; Qiang Fu; Wei Yang; Jianping He; Mingfei Sun

arXiv:2301.08442·cs.LG·February 13, 2023

TL;DR

This paper analyzes estimation bias in policy gradients for deep reinforcement learning, focusing on the state distribution shift that arises when gradient estimates ignore the discount factor. It shows that a small learning rate, an adaptive optimizer, and KL regularization mitigate this bias, and supports the analysis with experiments.

Contribution

The paper extends the analysis of policy gradient bias, previously limited to tabular and softmax policies, to deep RL with parameterized policies, and offers practical strategies for reducing the bias so that policy optimization is more robust to it.

Findings

01. Smaller learning rates improve robustness to the bias.

02. Adaptive optimizers such as Adam mitigate the effects of the bias.

03. KL regularization significantly reduces the estimation bias.

Abstract

We revisit the estimation bias in policy gradients for the discounted episodic Markov decision process (MDP) from a Deep Reinforcement Learning (DRL) perspective. The objective is formulated theoretically as the expected returns discounted over the time horizon. One of the major policy gradient biases is the state distribution shift: the state distribution used to estimate the gradients differs from the theoretical formulation in that it does not take into account the discount factor. Existing discussions of the influence of this bias in the literature have been limited to the tabular and softmax cases. Therefore, in this paper, we extend the discussion to the DRL setting where the policy is parameterized, and demonstrate theoretically how this bias can lead to suboptimal policies. We then discuss why the empirically inaccurate implementations with a shifted state distribution can still be effective. We show…
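
The state distribution shift described in the abstract can be illustrated with a minimal sketch. It is not the authors' code; the function names `discounted_returns` and `gradient_weights`, the toy reward sequence, and the REINFORCE-style framing are assumptions made for illustration. In the gradient of the discounted objective, the term for step t carries a factor of gamma^t, which down-weights states visited late in the episode. Common implementations drop that factor, so every visited state counts equally.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Return G_t = sum over k >= t of gamma^(k - t) * r_k for each step t."""
    returns = np.zeros(len(rewards), dtype=float)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def gradient_weights(rewards, gamma, discount_states=True):
    """Per-step weights that multiply grad log pi(a_t | s_t) in a
    REINFORCE-style estimate (hypothetical helper for illustration).

    discount_states=True  -> gamma^t * G_t (matches the discounted objective)
    discount_states=False -> G_t only (the shifted state distribution)
    """
    returns = discounted_returns(rewards, gamma)
    if discount_states:
        return (gamma ** np.arange(len(rewards))) * returns
    return returns

# Toy episode: three steps, reward 1 at each step (assumed values).
rewards = [1.0, 1.0, 1.0]
gamma = 0.5

theory = gradient_weights(rewards, gamma, discount_states=True)
practice = gradient_weights(rewards, gamma, discount_states=False)

print("with gamma^t:   ", theory)
print("without gamma^t:", practice)
```

In this toy episode the two weightings agree at step 0 and diverge afterwards: later steps receive larger weight when the gamma^t factor is omitted, which is the mismatch the abstract refers to.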

Taxonomy

Topics: Reinforcement Learning in Robotics · Age of Information Optimization · Smart Grid Energy Management

Methods: Adam · Softmax