Low Variance Off-policy Evaluation with State-based Importance Sampling
David M. Bossens, Philip S. Thomas

TL;DR
This paper introduces state-based importance sampling estimators for off-policy evaluation in reinforcement learning. By dropping selected states from the importance weight computation, these estimators reduce variance and, with it, mean squared error.
Contribution
The paper proposes state-based importance sampling methods that lower the variance of off-policy evaluation and apply to several existing estimators, including ordinary, weighted, and per-decision importance sampling.
Findings
State-based estimators consistently reduce variance.
Improved accuracy over traditional importance sampling methods.
Effective across multiple off-policy evaluation techniques.
Abstract
In many domains, the exploration process of reinforcement learning will be too costly as it requires trying out suboptimal policies, resulting in a need for off-policy evaluation, in which a target policy is evaluated based on data collected from a known behaviour policy. In this context, importance sampling estimators provide estimates for the expected return by weighting the trajectory based on the probability ratio of the target policy and the behaviour policy. Unfortunately, such estimators have a high variance and therefore a large mean squared error. This paper proposes state-based importance sampling estimators which reduce the variance by dropping certain states from the computation of the importance weight. To illustrate their applicability, we demonstrate state-based variants of ordinary importance sampling, weighted importance sampling, per-decision importance sampling,…
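The idea in the abstract can be illustrated with a minimal sketch. It assumes tabular policies stored as nested lists of action probabilities and trajectories stored as lists of (state, action, reward) tuples. The names `trajectory_weight`, `is_estimate`, and `dropped_states` are hypothetical and are not taken from the paper. How the set of dropped states is chosen is the subject of the paper and is not reproduced here; in this sketch it is simply an input.

```python
# Illustrative sketch only: names and data layout are assumptions, not the paper's code.

def trajectory_weight(traj, pi_e, pi_b, dropped_states=frozenset()):
    """Product of per-step ratios pi_e(a|s) / pi_b(a|s).

    Steps whose state is in `dropped_states` are skipped, which is the
    state-based variant; with an empty set this is the ordinary weight.
    """
    w = 1.0
    for s, a, _ in traj:
        if s in dropped_states:
            continue  # this state contributes no ratio to the weight
        w *= pi_e[s][a] / pi_b[s][a]
    return w


def discounted_return(traj, gamma=1.0):
    """Sum of gamma^t * r_t over the trajectory."""
    return sum((gamma ** t) * r for t, (_, _, r) in enumerate(traj))


def is_estimate(trajs, pi_e, pi_b, gamma=1.0,
                dropped_states=frozenset(), weighted=False):
    """Ordinary (mean) or weighted (self-normalised) importance sampling estimate."""
    weights = [trajectory_weight(t, pi_e, pi_b, dropped_states) for t in trajs]
    returns = [discounted_return(t, gamma) for t in trajs]
    total = sum(w * g for w, g in zip(weights, returns))
    if weighted:
        return total / sum(weights)
    return total / len(trajs)


# Toy data: two states, two actions, a uniform behaviour policy.
pi_b = [[0.5, 0.5], [0.5, 0.5]]
pi_e = [[0.9, 0.1], [0.2, 0.8]]
trajs = [[(0, 0, 0.0), (1, 1, 1.0)]]

full = is_estimate(trajs, pi_e, pi_b)                          # uses both states
state_based = is_estimate(trajs, pi_e, pi_b, dropped_states={0})  # skips state 0
print(full, state_based)
```

Skipping a state removes one ratio from the product, so the weight varies less across trajectories. Whether this leaves the estimate unbiased depends on which states are dropped, so the toy choice of dropping state 0 here should not be read as a recommendation.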
Taxonomy
Topics: Reinforcement Learning in Robotics
