Optimal Estimation of Off-Policy Policy Gradient via Double Fitted Iteration

Chengzhuo Ni; Ruiqi Zhang; Xiang Ji; Xuezhou Zhang; Mengdi Wang

arXiv:2202.00076 · stat.ML · June 22, 2022

TL;DR

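This paper introduces the double Fitted PG estimation (FPG) algorithm, which estimates the policy gradient of a target policy from a dataset generated by an unknown behavior policy. With linear value function approximation, its estimation error has a tight finite-sample upper bound and is asymptotically normal with a statistically optimal covariance.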

Contribution

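The paper proposes FPG, an off-policy policy gradient estimator that works with an arbitrary policy parameterization, assuming access to a Bellman-complete value function class. For linear value function approximation, it provides a tight finite-sample upper bound on the estimation error, governed by the distribution mismatch measured in feature space, and establishes asymptotic normality with a precise covariance characterization.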

Findings

01

FPG achieves a statistically optimal estimation error, matching the Cramér-Rao bound.

02

FPG outperforms existing importance sampling and variance reduction methods.

03

Empirical results show significant improvements in policy gradient estimation and policy optimization tasks.

Abstract

Policy gradient (PG) estimation becomes a challenge when we are not allowed to sample with the target policy and only have access to a dataset generated by some unknown behavior policy. Conventional methods for off-policy PG estimation often suffer from either significant bias or exponentially large variance. In this paper, we propose the double Fitted PG estimation (FPG) algorithm. FPG can work with an arbitrary policy parameterization, assuming access to a Bellman-complete value function class. In the case of linear value function approximation, we provide a tight finite-sample upper bound on the policy gradient estimation error that is governed by the amount of distribution mismatch measured in feature space. We also establish the asymptotic normality of the FPG estimation error with a precise covariance characterization, which is further shown to be statistically optimal with a matching…
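The abstract does not spell out the iteration itself, so the following is only a minimal sketch of one plausible reading of "double fitted iteration": fit the Q-function and its gradient with respect to the policy parameters by two coupled least-squares regressions on the same off-policy transitions, stepping backward over the horizon. It is not the authors' implementation. Everything specific here is an assumption made for illustration: the tabular one-hot features (a special case of linear value function approximation), the softmax policy, the undiscounted finite horizon with time-homogeneous pooled data, the small ridge term, and all function and variable names.

```python
import numpy as np

def softmax_policy(theta):
    """theta: (S, A) logits -> pi: (S, A) action probabilities."""
    z = theta - theta.max(axis=1, keepdims=True)
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)

def softmax_policy_grad(theta):
    """d pi(a|s) / d theta[s2, b], returned with shape (S, A, S, A)."""
    S, A = theta.shape
    pi = softmax_policy(theta)
    grad = np.zeros((S, A, S, A))
    for s in range(S):
        # Softmax Jacobian pi_a * (1[a == b] - pi_b); zero unless s2 == s.
        grad[s, :, s, :] = np.diag(pi[s]) - np.outer(pi[s], pi[s])
    return grad

def double_fitted_pg(data, theta, init_dist, horizon, ridge=1e-8):
    """Hypothetical sketch: estimate the value and policy gradient of a softmax
    policy from off-policy transitions (s, a, r, s_next).

    Assumes one-hot features over (s, a) and an undiscounted finite horizon.
    """
    S, A = theta.shape
    d = S * A
    pi = softmax_policy(theta)
    dpi = softmax_policy_grad(theta).reshape(S, A, d)

    s, a, r, s_next = (np.asarray(x) for x in zip(*data))
    n = len(r)
    phi = np.zeros((n, d))
    phi[np.arange(n), s * A + a] = 1.0
    # Ridge-regularised least-squares operator, shared by both regressions.
    solve = np.linalg.solve(phi.T @ phi + ridge * np.eye(d), phi.T)

    q = np.zeros((S, A))      # fitted Q at the step after the current one
    dq = np.zeros((S, A, d))  # fitted gradient of Q w.r.t. theta
    for _ in range(horizon):
        # Regression targets are built from the previous iterates only.
        v_next = (pi * q).sum(axis=1)
        dv_next = (np.einsum("sad,sa->sd", dpi, q)
                   + np.einsum("sa,sad->sd", pi, dq))
        q = (solve @ (r + v_next[s_next])).reshape(S, A)
        dq = (solve @ dv_next[s_next]).reshape(S, A, d)

    value = init_dist @ (pi * q).sum(axis=1)
    grad = init_dist @ (np.einsum("sad,sa->sd", dpi, q)
                        + np.einsum("sa,sad->sd", pi, dq))
    return value, grad.reshape(S, A)
```

Under these assumptions the estimator never forms importance weights: the behavior policy enters only through which (state, action) pairs appear in `data`, which is consistent with the abstract's statement that the error is governed by distribution mismatch in feature space. A call such as `double_fitted_pg(data, theta, init_dist, horizon=4)` returns the estimated value and a gradient with the same shape as `theta`.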

Taxonomy

Topics: Machine Learning and Algorithms · Advanced Neural Network Applications

Methods: Convolution · Feature Pyramid Grid · Softmax