A nearly Blackwell-optimal policy gradient method

Vektor Dewanto; Marcus Gallagher

arXiv:2105.13609 · cs.LG · July 5, 2022

TL;DR

This paper introduces a policy gradient method for continuing environments that first optimizes the gain (the long-run average reward) and then the bias, which captures transient performance and distinguishes among policies with equal gain.

Contribution

It derives expressions that enable sampling for the gradient of the bias and its preconditioning Fisher matrix, and develops a bi-level (gain-then-bias) optimization algorithm whose key ingredient is an RL-specific logarithmic barrier function.

Findings

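1. Optimizing the bias improves transient performance among policies with equal gain.

2. The method optimizes the gain first and then the bias, rather than relying on a discount factor close to 1.

3. Experimental results provide insights into the fundamental mechanisms of the proposal.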

Abstract

For continuing environments, reinforcement learning (RL) methods commonly maximize the discounted reward criterion with discount factor close to 1 in order to approximate the average reward (the gain). However, such a criterion only considers the long-run steady-state performance, ignoring the transient behaviour in transient states. In this work, we develop a policy gradient method that optimizes the gain, then the bias (which indicates the transient performance and is important to capably select from policies with equal gain). We derive expressions that enable sampling for the gradient of the bias and its preconditioning Fisher matrix. We further devise an algorithm that solves the gain-then-bias (bi-level) optimization. Its key ingredient is an RL-specific logarithmic barrier function. Experimental results provide insights into the fundamental mechanisms of our proposal.
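To make the gain/bias distinction concrete, here is a minimal sketch in plain Python. It is not the paper's algorithm, which is a sample-based policy gradient method. It is an exact calculation on a hypothetical two-state chain invented for illustration. The function name `gain_and_bias`, the transition matrices, and the rewards are all assumptions. The sketch also assumes an aperiodic chain, so that the expected reward converges and the gain can be read off the limiting state distribution. Both policies below end up in the same zero-reward absorbing state, so their gains are equal, but they differ in how much reward they collect on the way there, which is what the bias measures.

```python
def gain_and_bias(P, r, start, horizon=2000):
    """Approximate the gain and bias of a fixed policy from state `start`.

    P: row-stochastic transition matrix (list of lists) induced by the policy.
    r: expected one-step reward in each state under the policy.
    Assumes an aperiodic chain whose state distribution converges
    well within `horizon` steps (an assumption of this sketch).
    """
    n = len(r)
    dist = [1.0 if s == start else 0.0 for s in range(n)]
    expected_rewards = []
    for _ in range(horizon):
        # Expected reward at this time step under the current distribution.
        expected_rewards.append(sum(dist[s] * r[s] for s in range(n)))
        # Push the state distribution one step forward.
        dist = [sum(dist[s] * P[s][t] for s in range(n)) for t in range(n)]
    # Gain: long-run expected reward per step (limiting distribution).
    gain = sum(dist[s] * r[s] for s in range(n))
    # Bias: accumulated excess of expected reward over the gain.
    bias = sum(x - gain for x in expected_rewards)
    return gain, bias


# Hypothetical chain: state 0 is transient (reward 1), state 1 is absorbing (reward 0).
rewards = [1.0, 0.0]
P_a = [[0.5, 0.5], [0.0, 1.0]]  # policy A lingers in state 0 with probability 0.5
P_b = [[0.0, 1.0], [0.0, 1.0]]  # policy B leaves state 0 immediately

gain_a, bias_a = gain_and_bias(P_a, rewards, start=0)
gain_b, bias_b = gain_and_bias(P_b, rewards, start=0)
print(f"policy A: gain={gain_a:.3f}, bias={bias_a:.3f}")
print(f"policy B: gain={gain_b:.3f}, bias={bias_b:.3f}")
```

A criterion based on the gain alone cannot separate the two policies here, whereas the bias from state 0 is larger for policy A because it collects more reward before absorption. This is the kind of tie that a gain-then-bias ordering is meant to break.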

Code & Models

Repositories

tttor/nbwpg (PyTorch, official)

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Stochastic Gradient Optimization Techniques