Minimax Weight and Q-Function Learning for Off-Policy Evaluation
Masatoshi Uehara, Jiawei Huang, Nan Jiang

TL;DR
This paper introduces two estimators for off-policy evaluation in reinforcement learning: MWL, which directly estimates (marginalized) importance ratios, and MQL, which directly estimates Q-functions. The two can be combined in a doubly robust manner, and the paper analyzes both theoretically.
Contribution
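The paper proposes two novel estimators. MWL estimates importance ratios without relying on knowledge of the behavior policy. MQL is obtained by swapping the roles of importance weights and value functions in MWL. The paper also gives theoretical analyses of both, including sample complexity and asymptotic optimality in the tabular setting.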
Findings
MWL directly estimates importance ratios over the state-action distributions without knowledge of the behavior policy
MQL minimizes average Bellman errors and can be combined with MWL in a doubly robust manner
The paper gives sample complexity analyses of MWL and MQL and shows their asymptotic optimality in the tabular setting
Abstract
We provide theoretical investigations into off-policy evaluation in reinforcement learning using function approximators for (marginalized) importance weights and value functions. Our contributions include: (1) A new estimator, MWL, that directly estimates importance ratios over the state-action distributions, removing the reliance on knowledge of the behavior policy as in prior work (Liu et al., 2018). (2) Another new estimator, MQL, obtained by swapping the roles of importance weights and value functions in MWL. MQL has an intuitive interpretation of minimizing average Bellman errors and can be combined with MWL in a doubly robust manner. (3) Several additional results that offer further insights into these methods, including the sample complexity analyses of MWL and MQL, their asymptotic optimality in the tabular setting, how the learned importance weights depend on the choice of the…
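As a rough illustration of how the two estimators relate, the following Python sketch works through the tabular, population-level case on a made-up three-state MDP. It is not the authors' implementation. The MDP, all variable names, the (1 - gamma) normalization of the policy value, and the use of exact expectations instead of sampled transitions are assumptions made for illustration. The minimax problems are replaced by the linear moment conditions they reduce to when both function classes are tabular, and the doubly robust combination uses a standard form that may differ in detail from the one in the paper.

```python
import numpy as np

# Hypothetical toy MDP; every number here is an illustrative assumption.
n_s, n_a, gamma = 3, 2, 0.9
n = n_s * n_a
rng = np.random.default_rng(0)

P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))  # P[s, a, s'] transition probabilities
r = rng.uniform(size=(n_s, n_a)).reshape(n)       # expected reward for each (s, a)
pi = rng.dirichlet(np.ones(n_a), size=n_s)        # target policy pi[s, a]
d0 = np.full(n_s, 1.0 / n_s)                      # initial state distribution
d_b = rng.dirichlet(np.ones(n))                   # behavior data distribution over (s, a)

# State-action transition matrix under the target policy:
# P_pi[(s, a), (s', a')] = P(s' | s, a) * pi(a' | s').
P_pi = (P[:, :, :, None] * pi[None, None, :, :]).reshape(n, n)
mu0 = (d0[:, None] * pi).reshape(n)               # initial (s, a) distribution under pi
A = np.eye(n) - gamma * P_pi

# MQL-style condition with a tabular discriminator: the average Bellman error
# is zero for every (s, a), i.e. (I - gamma * P_pi) q = r.
q = np.linalg.solve(A, r)

# MWL-style condition with a tabular discriminator: u = d_b * w satisfies
# (I - gamma * P_pi)^T u = (1 - gamma) * mu0, so u is the normalized discounted
# occupancy of pi and w = u / d_b. No behavior policy is used, only d_b.
u = np.linalg.solve(A.T, (1.0 - gamma) * mu0)
w = u / d_b

R_q = (1.0 - gamma) * mu0 @ q   # value estimate from the Q-function
R_w = np.sum(d_b * w * r)       # value estimate from the importance weights


def doubly_robust(w_hat, q_hat):
    """Combine a weight estimate and a Q estimate (standard doubly robust form)."""
    bellman_residual = r + gamma * P_pi @ q_hat - q_hat
    return (1.0 - gamma) * mu0 @ q_hat + np.sum(d_b * w_hat * bellman_residual)


print(R_q, R_w)
print(doubly_robust(w, np.zeros(n)), doubly_robust(np.ones(n), q))
```

In this sketch the two estimates coincide, and the doubly robust combination returns the same value when either the weights or the Q-function are replaced by an arbitrary vector, as long as the other is correct. With sampled data, the expectations over d_b and d0 would be replaced by empirical averages.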
Taxonomy
Topics: Reinforcement Learning in Robotics · Formal Methods in Verification · Evolutionary Algorithms and Applications
