Minimax Weight and Q-Function Learning for Off-Policy Evaluation

Masatoshi Uehara; Jiawei Huang; Nan Jiang

arXiv:1910.12809·cs.LG·October 8, 2020·29 cites

TL;DR

This paper introduces two estimators for off-policy evaluation in reinforcement learning: MWL, which directly estimates (marginalized) importance ratios over state-action distributions, and MQL, which directly estimates the Q-function by minimizing average Bellman errors. The two can be combined in a doubly robust manner.

Contribution

The paper proposes two estimators, MWL and MQL, for off-policy evaluation. MWL removes the reliance on knowledge of the behavior policy found in prior work (Liu et al., 2018), and both estimators are analyzed within a unified theoretical framework.

Findings

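01 MWL directly estimates importance ratios over the state-action distributions, without knowledge of the behavior policy.

02 MQL minimizes average Bellman errors and can be combined with MWL in a doubly robust manner.

03 The paper gives sample complexity analyses of MWL and MQL and establishes their asymptotic optimality in the tabular setting.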

Abstract

We provide theoretical investigations into off-policy evaluation in reinforcement learning using function approximators for (marginalized) importance weights and value functions. Our contributions include: (1) A new estimator, MWL, that directly estimates importance ratios over the state-action distributions, removing the reliance on knowledge of the behavior policy as in prior work (Liu et al., 2018). (2) Another new estimator, MQL, obtained by swapping the roles of importance weights and value functions in MWL. MQL has an intuitive interpretation of minimizing average Bellman errors and can be combined with MWL in a doubly robust manner. (3) Several additional results that offer further insights into these methods, including the sample complexity analyses of MWL and MQL, their asymptotic optimality in the tabular setting, how the learned importance weights depend on the choice of the…
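The quantities named in the abstract can be checked numerically on a small tabular problem. The sketch below is not the paper's minimax procedure: it assumes the model is known, computes the exact importance ratio and Q-function in closed form, and then verifies the identities that such estimators rely on. The loss expressions are one common way of writing the marginalized importance sampling and average Bellman error conditions under the normalized discounted value convention, (1 - gamma) times the expected discounted return; they are an assumption here, not quoted from this page. All names (`make_mdp`, `exact_targets`, `mwl_loss`, `mql_loss`, `dr_estimate`) are hypothetical.

```python
import numpy as np

def make_mdp(n_states=4, n_actions=2, seed=0):
    """Random tabular MDP, target policy, and data distribution (all hypothetical)."""
    rng = np.random.default_rng(seed)
    P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
    r = rng.uniform(size=n_states * n_actions)              # reward per (s, a)
    d0 = rng.dirichlet(np.ones(n_states))                   # initial state distribution
    pi = rng.dirichlet(np.ones(n_actions), size=n_states)   # target policy pi[s, a]
    mu = rng.dirichlet(np.ones(n_states * n_actions))       # data distribution over (s, a)
    # T[(s, a), (s', a')] = P(s' | s, a) * pi(a' | s')
    T = (P[:, :, :, None] * pi[None, None, :, :]).reshape(n_states * n_actions, -1)
    nu0 = (d0[:, None] * pi).ravel()                        # initial (s, a) distribution
    return T, r, nu0, mu

def exact_targets(T, r, nu0, mu, gamma):
    """Exact importance ratio d_pi / mu and exact Q-function of the target policy."""
    eye = np.eye(len(r))
    # Normalized discounted occupancy: d = (1 - gamma) * nu0 + gamma * T^T d
    d_pi = (1 - gamma) * np.linalg.solve(eye - gamma * T.T, nu0)
    # Bellman equation: q = r + gamma * T q
    q_star = np.linalg.solve(eye - gamma * T, r)
    return d_pi / mu, q_star

def mwl_loss(w, f, T, nu0, mu, gamma):
    """E_mu[w * (gamma * f(s', pi) - f(s, a))] + (1 - gamma) * E_d0[f(s, pi)]."""
    return mu @ (w * (gamma * (T @ f) - f)) + (1 - gamma) * (nu0 @ f)

def mql_loss(w, q, T, r, mu, gamma):
    """Average Bellman error of q, weighted by w, under the data distribution."""
    return mu @ (w * (r + gamma * (T @ q) - q))

def dr_estimate(w, q, T, r, nu0, mu, gamma):
    """Doubly robust combination: Q-based estimate plus a weighted Bellman-error correction."""
    return (1 - gamma) * (nu0 @ q) + mql_loss(w, q, T, r, mu, gamma)

gamma = 0.9
T, r, nu0, mu = make_mdp()
w_star, q_star = exact_targets(T, r, nu0, mu, gamma)
print("value via weights:   ", mu @ (w_star * r))
print("value via Q-function:", (1 - gamma) * (nu0 @ q_star))
```

With the exact ratio, `mwl_loss` is zero for every test function `f`; with the exact Q-function, `mql_loss` is zero for every weight function `w`. The doubly robust estimate recovers the policy value when either the weights or the Q-function are exact, which is the sense of "doubly robust" illustrated here.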

Videos

Minimax Weight and Q-Function Learning for Off-Policy Evaluation (SlidesLive)

Taxonomy

Topics: Reinforcement Learning in Robotics · Formal Methods in Verification · Evolutionary Algorithms and Applications