Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation
Qiang Liu, Lihong Li, Ziyang Tang, Dengyong Zhou

TL;DR
This paper introduces an off-policy estimation method that applies importance sampling directly to stationary state-visitation distributions, avoiding the exploding variance that trajectory-wise importance sampling suffers in infinite-horizon reinforcement learning.
Contribution
The paper proposes a new density ratio estimation approach using a mini-max loss and provides a closed-form solution in RKHS, improving off-policy evaluation in long-horizon settings.
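For orientation, the idea can be sketched for the average-reward case as follows. The symbols $d_\pi$, $d_{\pi_0}$, $w$, $\beta$, $f$, and $k$ are notation assumed here for illustration and do not appear on this page; read this as a sketch under those assumptions rather than the paper's exact statement.

```latex
% Target value as a reweighted expectation over behavior-policy samples
R_\pi = \mathbb{E}_{(s,a)\sim d_{\pi_0}}\!\big[\, w(s)\,\beta(a\mid s)\, r(s,a) \,\big],
\qquad
w(s) = \frac{d_\pi(s)}{d_{\pi_0}(s)},
\qquad
\beta(a\mid s) = \frac{\pi(a\mid s)}{\pi_0(a\mid s)}.

% Stationarity condition that the true ratio w satisfies for every test function f
L(w, f) = \mathbb{E}_{(s,a,s')\sim d_{\pi_0}}\!\big[\, \Delta(w; s,a,s')\, f(s') \,\big] = 0,
\qquad
\Delta(w; s,a,s') = w(s)\,\beta(a\mid s) - w(s').

% Mini-max objective, and its closed form when f ranges over the unit ball
% of an RKHS with kernel k; (\bar s, \bar a, \bar s') is an independent copy
\min_{w}\ \max_{f}\ L(w, f)^2,
\qquad
\max_{\|f\|_{\mathcal{H}} \le 1} L(w, f)^2
= \mathbb{E}\big[\, \Delta(w; s,a,s')\, \Delta(w; \bar s,\bar a,\bar s')\, k(s', \bar s') \,\big].
```

The point of the closed form is that the inner maximization disappears, so the ratio can be fitted by minimizing a single expectation over pairs of sampled transitions.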
Findings
By weighting samples with a single stationary density ratio instead of a product of per-step ratios, the method avoids variance that grows with the horizon.
Theoretical analysis confirms the estimator's consistency and efficiency.
Empirical results demonstrate improved accuracy over existing methods.
Abstract
We consider the off-policy estimation problem of estimating the expected reward of a target policy using samples collected by a different behavior policy. Importance sampling (IS) has been a key technique to derive (nearly) unbiased estimators, but is known to suffer from an excessively high variance in long-horizon problems. In the extreme case of infinite-horizon problems, the variance of an IS-based estimator may even be unbounded. In this paper, we propose a new off-policy estimation method that applies IS directly on the stationary state-visitation distributions to avoid the exploding variance issue faced by existing estimators. Our key contribution is a novel approach to estimating the density ratio of two stationary distributions, with trajectories sampled from only the behavior distribution. We develop a mini-max loss function for the estimation problem, and derive a…
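To make the estimator concrete, here is a minimal Python sketch on a small tabular problem. It is not the paper's method: the stationary density ratio is computed exactly from a known transition model (an assumption made purely for illustration), whereas the paper estimates it from behavior-policy samples alone. The toy MDP, the policies, and all function names are hypothetical.

```python
import numpy as np


def state_chain(P, pi):
    """State-to-state transition matrix induced by policy pi on P[s, a, s']."""
    return np.einsum("sa,sat->st", pi, P)


def stationary_distribution(M):
    """Solve d @ M = d with sum(d) = 1 for a row-stochastic, ergodic matrix M."""
    n = M.shape[0]
    A = np.vstack([M.T - np.eye(n), np.ones((1, n))])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    d, *_ = np.linalg.lstsq(A, b, rcond=None)
    return d


def rollout(P, r, pi0, T, rng):
    """Sample one length-T trajectory under the behavior policy pi0."""
    n_s, n_a, _ = P.shape
    states = np.empty(T, dtype=int)
    actions = np.empty(T, dtype=int)
    s = 0
    for t in range(T):
        a = rng.choice(n_a, p=pi0[s])
        states[t], actions[t] = s, a
        s = rng.choice(n_s, p=P[s, a])
    return states, actions, r[states, actions]


def stationary_is_estimate(states, actions, rewards, w, pi, pi0):
    """Self-normalized average-reward estimate using one weight per sample."""
    weights = w[states] * pi[states, actions] / pi0[states, actions]
    return np.sum(weights * rewards) / np.sum(weights)


rng = np.random.default_rng(0)
n_s, n_a = 3, 2

# Hypothetical toy MDP; mixing with a uniform kernel keeps every chain ergodic.
P = 0.5 * rng.dirichlet(np.ones(n_s), size=(n_s, n_a)) + 0.5 / n_s
r = rng.uniform(size=(n_s, n_a))

pi0 = np.full((n_s, n_a), 0.5)                        # behavior policy
pi = np.array([[0.8, 0.2], [0.3, 0.7], [0.6, 0.4]])   # target policy

d_pi = stationary_distribution(state_chain(P, pi))
d_pi0 = stationary_distribution(state_chain(P, pi0))
w = d_pi / d_pi0  # exact ratio, available here only because P is known

true_value = np.sum(d_pi[:, None] * pi * r)
states, actions, rewards = rollout(P, r, pi0, T=20000, rng=rng)
estimate = stationary_is_estimate(states, actions, rewards, w, pi, pi0)
print(f"true average reward: {true_value:.4f}, estimate: {estimate:.4f}")
```

Each sample carries one bounded weight, so the spread of the weights does not depend on the trajectory length. A trajectory-wise IS estimator would instead multiply the per-step policy ratios along the whole trajectory, and that product is what blows up as the horizon grows.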
Taxonomy
Topics: Age of Information Optimization · Advanced Bandit Algorithms Research · Smart Grid Energy Management
