Correcting discount-factor mismatch in on-policy policy gradient methods

Fengdi Che; Gautham Vasan; A. Rupam Mahmood

arXiv:2306.13284·cs.LG·June 26, 2023

Open Access

TL;DR

This paper identifies a discrepancy in on-policy policy gradient methods related to discounting, proposes a novel correction to improve learning stability and performance, and demonstrates its effectiveness on standard benchmarks.

Contribution

It introduces a new distribution correction method for policy gradients that addresses discount factor mismatch, improving stability and performance.

Findings

01. The correction reduces variance compared to previous methods.

02. The method improves policy performance on OpenAI Gym and DeepMind benchmarks.

03. It helps avoid suboptimal policies in environments with similar states.

Abstract

The policy gradient theorem gives a convenient form of the policy gradient in terms of three factors: an action value, a gradient of the action likelihood, and a state distribution involving discounting called the discounted stationary distribution. But commonly used on-policy methods based on the policy gradient theorem ignore the discount factor in the state distribution, which is technically incorrect and may even cause degenerate learning behavior in some environments. An existing solution corrects this discrepancy by using $\gamma^{t}$ as a factor in the gradient estimate. However, this solution is not widely adopted and does not work well in tasks where the later states are similar to earlier states. We introduce a novel distribution correction to account for the discounted stationary distribution that can be plugged into many existing gradient estimators. Our correction…
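To make the mismatch concrete, here is a minimal sketch of the existing $\gamma^{t}$ fix mentioned in the abstract, applied to a REINFORCE-style estimate. It is not the paper's novel correction, which the truncated abstract does not specify. The function names, the `correct_state_distribution` flag, and the toy reward sequence are all assumptions made for illustration. The code computes only the scalar weight that would multiply each step's gradient of the log action likelihood.

```python
import numpy as np


def discounted_returns(rewards, gamma):
    """Return G_t = sum_{k >= t} gamma^(k - t) * r_k for every step t."""
    returns = np.zeros(len(rewards))
    running = 0.0
    # Accumulate from the last step backwards so each G_t reuses G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def reinforce_weights(rewards, gamma, correct_state_distribution):
    """Scalar weight on grad log pi(a_t | s_t) at each step (hypothetical helper).

    With correct_state_distribution=False, this mimics the common practice of
    discounting the return but not the state distribution. With True, it
    applies the existing gamma^t factor described in the abstract.
    """
    weights = discounted_returns(rewards, gamma)
    if correct_state_distribution:
        weights = weights * gamma ** np.arange(len(rewards))
    return weights


# Toy episode (assumed values): three steps, reward 1 at each step.
rewards = np.array([1.0, 1.0, 1.0])
gamma = 0.5

uncorrected = reinforce_weights(rewards, gamma, correct_state_distribution=False)
corrected = reinforce_weights(rewards, gamma, correct_state_distribution=True)

print("uncorrected:", uncorrected)
print("corrected:  ", corrected)
```

In this toy episode the $\gamma^{t}$ factor shrinks the weight on later steps geometrically. This illustrates the drawback the abstract raises: later states contribute little to the update even when they resemble earlier ones.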

Taxonomy

Topics: Age of Information Optimization · Stochastic Gradient Optimization Techniques · Privacy-Preserving Technologies in Data