Q-MMR: Off-Policy Evaluation via Recursive Reweighting and Moment Matching
Xiang Li, Nan Jiang

TL;DR
Q-MMR is an off-policy evaluation framework for finite-horizon MDPs that learns one scalar weight per data point through recursive reweighting and moment matching, and comes with a finite-sample guarantee.
Contribution
The paper proposes a scalar weighting method with a finite-sample guarantee under realizability, and connects it to existing techniques such as importance sampling and linear FQE.
Findings
Finite-sample guarantee with dimension-free bound under realizability.
Connections established between Q-MMR and existing methods, including importance sampling and linear FQE.
Theoretical insights into coverage in offline reinforcement learning.
Abstract
We present a novel theoretical framework, Q-MMR, for off-policy evaluation in finite-horizon MDPs. Q-MMR learns a set of scalar weights, one for each data point, such that the reweighted rewards approximate the expected return under the target policy. The weights are learned inductively in a top-down manner via a moment matching objective against a value-function discriminator class. Notably, and perhaps surprisingly, a data-dependent finite-sample guarantee for general function approximation can be established under only the realizability of Q^π, with a dimension-free bound; that is, the error does not depend on the statistical complexity of the function class. We also establish connections to several existing methods, such as importance sampling and linear FQE. Further theoretical analyses shed new light on the nature of coverage, a concept of fundamental importance to offline…
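The idea of scalar weights chosen by moment matching can be illustrated with a deliberately simplified one-step sketch. It is not the paper's recursive algorithm: the horizon is 1, the discriminator class is assumed linear in a feature map, and the weights are taken to be the minimum-norm solution of the matching constraint. All names and data here (`Phi`, `Phi_target`, `moment_matching_weights`, the simulated rewards) are hypothetical. Under these assumptions the reweighted-reward estimate coincides with the least-squares regression estimate, which gives some intuition for the stated connection to linear FQE.

```python
import numpy as np

def moment_matching_weights(Phi, target_moment):
    """Minimum-norm weights w satisfying (1/n) * Phi.T @ w == target_moment.

    Assumption: a linear discriminator class f(s, a) = theta . phi(s, a), so
    matching the moment for every theta reduces to matching feature means.
    """
    n = Phi.shape[0]
    # pinv(Phi.T) @ b is the minimum-norm solution of Phi.T @ w = b.
    return np.linalg.pinv(Phi.T) @ (n * target_moment)

rng = np.random.default_rng(0)
n, d = 200, 3

# Hypothetical logged data: features phi(s_i, a_i) and noisy linear rewards.
Phi = rng.normal(size=(n, d))
rewards = Phi @ np.array([1.0, -0.5, 2.0]) + 0.1 * rng.normal(size=n)

# Hypothetical features phi(s_i, pi(s_i)) under the target policy (simulated).
Phi_target = rng.normal(loc=0.3, size=(n, d))
target_moment = Phi_target.mean(axis=0)

# One scalar weight per data point; the estimate is the reweighted mean reward.
w = moment_matching_weights(Phi, target_moment)
weighted_estimate = float(w @ rewards) / n

# Regression-based estimate for comparison: fit theta by least squares,
# then average the predicted reward over the target-policy features.
theta_hat, *_ = np.linalg.lstsq(Phi, rewards, rcond=None)
regression_estimate = float(target_moment @ theta_hat)

print(weighted_estimate, regression_estimate)
```

The two printed values agree up to floating-point error, because the minimum-norm weights are w = n * pinv(Phi.T) @ target_moment and pinv(Phi.T) is the transpose of pinv(Phi).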
