Average-Reward Off-Policy Policy Evaluation with Function Approximation

Shangtong Zhang; Yi Wan; Richard S. Sutton; Shimon Whiteson

arXiv:2101.02808 · cs.LG · October 19, 2022 · 6 cites

TL;DR

This paper introduces two novel off-policy evaluation algorithms for average-reward MDPs with function approximation, achieving convergence without density ratio estimation and demonstrating superior empirical performance.

Contribution

The paper presents the first convergent off-policy linear function approximation algorithms for differential value estimation in average-reward MDPs, and the first such algorithms for reward rate estimation that do not require estimating the density ratio.

Findings

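01. The proposed algorithms outperform density-ratio-based approaches in the paper's experiments.

02. The proposed methods are the first off-policy linear function approximation algorithms with a convergence guarantee for the differential value function, and the first for the reward rate that do not require estimating the density ratio.

03. The nonlinear variants also show empirical advantages.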

Abstract

We consider off-policy policy evaluation with function approximation (FA) in average-reward MDPs, where the goal is to estimate both the reward rate and the differential value function. For this problem, bootstrapping is necessary and, along with off-policy learning and FA, results in the deadly triad (Sutton & Barto, 2018). To address the deadly triad, we propose two novel algorithms, reproducing the celebrated success of Gradient TD algorithms in the average-reward setting. In terms of estimating the differential value function, the algorithms are the first convergent off-policy linear function approximation algorithms. In terms of estimating the reward rate, the algorithms are the first convergent off-policy linear function approximation algorithms that do not require estimating the density ratio. We demonstrate empirically the advantage of the proposed algorithms, as well as their…
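To make the two estimation targets in the abstract concrete, the following is a minimal sketch that computes the reward rate and the differential value function exactly for a small Markov chain with a known model. It is not either of the paper's algorithms, which learn these quantities from off-policy samples with linear function approximation. The transition matrix `P`, the reward vector `r`, and the function name are hypothetical choices for illustration; `P` is assumed to be the ergodic chain induced by the target policy.

```python
import numpy as np

def reward_rate_and_differential_values(P, r):
    """Exact reward rate and differential values for an ergodic chain.

    P: (n, n) state transition matrix under the target policy (assumed ergodic).
    r: (n,) expected one-step reward in each state under the target policy.
    """
    n = P.shape[0]
    # Stationary distribution d solves d^T P = d^T together with sum(d) = 1.
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    d, *_ = np.linalg.lstsq(A, b, rcond=None)
    # The reward rate is the stationary average of the one-step reward.
    rate = float(d @ r)
    # Differential values satisfy v = r - rate + P v. That equation only
    # determines v up to an additive constant, so this sketch picks the
    # solution with d^T v = 0 by adding the rank-one term 1 d^T.
    M = np.eye(n) - P + np.outer(np.ones(n), d)
    v = np.linalg.solve(M, r - rate)
    return rate, v, d

# Hypothetical 3-state chain, not taken from the paper.
P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.9, 0.1],
              [0.1, 0.0, 0.9]])
r = np.array([1.0, 0.0, 2.0])

rate, v, d = reward_rate_and_differential_values(P, r)
print("reward rate:", rate)
print("differential values:", v)
```

The need for bootstrapping mentioned in the abstract shows up in the equation `v = r - rate + P v`: each state's differential value is defined in terms of its successors' values, so a sample-based method without access to `P` has to update estimates from other estimates.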

Code & Models

Repositories

ShangtongZhang/DeepRL
PyTorch · Official

Videos

Average-Reward Off-Policy Policy Evaluation with Function Approximation · SlidesLive

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Optimization and Search Problems

Methods: Feedback Alignment