A Temporal-Difference Approach to Policy Gradient Estimation

Samuele Tosatto; Andrew Patterson; Martha White; A. Rupam Mahmood

arXiv:2202.02396 · cs.LG · July 8, 2022

TL;DR

This paper introduces a novel off-policy, model-free policy gradient estimator that reconstructs gradients from start states, avoiding distribution shift issues and improving bias-variance trade-offs in reinforcement learning.

Contribution

The paper proposes a new recursive Bellman equation for gradients and a TD-based gradient critic whose estimates, under certain realizability conditions, are unbiased regardless of the sampling strategy.

Findings

01. Achieves lower bias and variance in gradient estimates

02. Performs better with off-policy data

03. Provides unbiased gradient estimation under certain conditions

Abstract

The policy gradient theorem (Sutton et al., 2000) prescribes the use of a cumulative discounted state distribution under the target policy to approximate the gradient. Most algorithms based on this theorem, in practice, break this assumption, introducing a distribution shift that can cause convergence to poor solutions. In this paper, we propose a new approach of reconstructing the policy gradient from the start state without requiring a particular sampling strategy. The policy gradient calculation in this form can be simplified in terms of a gradient critic, which can be recursively estimated due to a new Bellman equation of gradients. By using temporal-difference updates of the gradient critic from an off-policy data stream, we develop the first estimator that sidesteps the distribution shift issue in a model-free way. We prove that, under certain realizability conditions, our…
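
To make the idea of a gradient critic concrete, here is a minimal tabular sketch in Python. It is not the authors' implementation. The recursion it uses is not stated on this page; it is what one obtains by differentiating the standard Bellman equation for Q with respect to the policy parameters, namely Gamma(s, a) = gamma * E[Q(s', a') * grad log pi(a'|s') + Gamma(s', a')]. All names (`gradient_critic`, `policy_gradient_from_start`, the random MDP) are illustrative assumptions. The sketch also solves the recursion exactly with a known transition model, purely to check the identity against finite differences, whereas the abstract describes a model-free estimator learned with temporal-difference updates.

```python
import numpy as np

def softmax_policy(theta):
    # theta: (S, A) logits -> pi: (S, A) action probabilities
    z = theta - theta.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def q_values(P, R, pi, gamma):
    # Solve Q = R + gamma * P_pi Q, with P_pi[(s,a),(s',a')] = P[s,a,s'] * pi[s',a']
    S, A = R.shape
    P_pi = (P[:, :, :, None] * pi[None, None, :, :]).reshape(S * A, S * A)
    q = np.linalg.solve(np.eye(S * A) - gamma * P_pi, R.reshape(S * A))
    return q.reshape(S, A), P_pi

def grad_log_pi(pi):
    # G[s, a, s2, b] = d log pi(a|s) / d theta[s2, b] for a tabular softmax policy
    S, A = pi.shape
    G = np.zeros((S, A, S, A))
    for s in range(S):
        G[s, :, s, :] = np.eye(A) - pi[s][None, :]
    return G

def gradient_critic(P_pi, Q, G, gamma):
    # Exact solution of Gamma = gamma * P_pi (C + Gamma),
    # where C[(s,a), :] = Q(s,a) * grad log pi(a|s)
    n = P_pi.shape[0]
    C = Q.reshape(n, 1) * G.reshape(n, n)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, gamma * P_pi @ C)

def policy_gradient_from_start(mu0, pi, Q, G, Gamma):
    # grad J = E_{s0 ~ mu0, a0 ~ pi}[Q * grad log pi + Gamma], start states only
    n = Gamma.shape[0]
    w = (mu0[:, None] * pi).reshape(n)
    C = Q.reshape(n, 1) * G.reshape(n, n)
    return w @ (C + Gamma)

def objective(theta, P, R, mu0, gamma):
    # J(theta) = E_{s0 ~ mu0, a0 ~ pi}[Q(s0, a0)]
    pi = softmax_policy(theta)
    Q, _ = q_values(P, R, pi, gamma)
    return float(np.sum(mu0[:, None] * pi * Q))

# Hypothetical random MDP, used only for the check
rng = np.random.default_rng(0)
S, A, gamma = 3, 2, 0.9
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((S, A))
mu0 = np.array([1.0, 0.0, 0.0])
theta = rng.normal(size=(S, A))

pi = softmax_policy(theta)
Q, P_pi = q_values(P, R, pi, gamma)
G = grad_log_pi(pi)
Gamma = gradient_critic(P_pi, Q, G, gamma)
grad = policy_gradient_from_start(mu0, pi, Q, G, Gamma).reshape(S, A)

# Central finite differences of J as an independent reference
eps = 1e-6
fd = np.zeros_like(theta)
for s in range(S):
    for a in range(A):
        d = np.zeros_like(theta)
        d[s, a] = eps
        fd[s, a] = (objective(theta + d, P, R, mu0, gamma)
                    - objective(theta - d, P, R, mu0, gamma)) / (2 * eps)

max_abs_error = float(np.max(np.abs(grad - fd)))
```

The gradient here is assembled only from start-state quantities; no discounted state distribution under the target policy is sampled. In a model-free setting the exact linear solve would presumably be replaced by temporal-difference updates toward the same fixed point, which this sketch does not attempt.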

Code & Models

Repositories

samuelepolimi/temporal-difference-gradient
PyTorch · Official

Taxonomy

Topics: Advanced Bandit Algorithms Research · Markov Chains and Monte Carlo Methods · Machine Learning and Algorithms