Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation

Qiang Liu; Lihong Li; Ziyang Tang; Dengyong Zhou

arXiv:1810.12429 · cs.LG · October 31, 2018 · 112 cites

TL;DR

This paper introduces a novel off-policy estimation method that applies importance sampling directly to stationary state distributions, avoiding the variance explosion that trajectory-level importance sampling suffers in long- and infinite-horizon reinforcement learning.
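A minimal sketch of the final estimation step, assuming the stationary density ratio w(s) = d_π(s) / d_π0(s) is already available (estimating it is the hard part the paper addresses). Each logged transition (s_i, a_i, r_i) from the behavior policy π0 is weighted by w(s_i) times the single-step ratio β(a_i|s_i) = π(a_i|s_i) / π0(a_i|s_i), so the weight never becomes a product over the horizon. The function name `stationary_is_estimate` is hypothetical, and the self-normalized form is one common choice rather than a confirmed detail of the authors' code.

```python
import numpy as np

def stationary_is_estimate(w_s, beta, r):
    """Self-normalized estimate of the target policy's average reward.

    w_s  : w(s_i) = d_pi(s_i) / d_pi0(s_i) at each sampled state (assumed known)
    beta : pi(a_i|s_i) / pi0(a_i|s_i), the one-step policy ratio
    r    : observed rewards r_i
    """
    weights = np.asarray(w_s, dtype=float) * np.asarray(beta, dtype=float)
    # Only one state ratio and one action ratio per sample: no horizon-long product.
    return float(np.sum(weights * np.asarray(r, dtype=float)) / np.sum(weights))

print(stationary_is_estimate([2.0, 1.0], [1.0, 0.5], [1.0, 0.0]))  # → 0.8
```

Self-normalizing trades a small bias for lower variance; dividing by the sample count n instead of the weight sum gives the unnormalized variant.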

Contribution

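The paper proposes a new approach to estimating the density ratio of two stationary distributions using a mini-max loss, and derives a closed-form solution when the inner function class is a reproducing kernel Hilbert space (RKHS), improving off-policy evaluation in long-horizon settings.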
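A minimal tabular sketch of the kernelized ratio estimation, under stated assumptions. With Δ(w; s, a, s') = w(s)β(a|s) − w(s') and data drawn from the behavior policy's stationary distribution, the true ratio makes E[Δ · f(s')] = 0 for every test function f. Taking f over the unit ball of an RKHS with kernel k, the squared maximum has the closed form D(w) = E[Δ_i Δ_j k(s'_i, s'_j)] (by the reproducing property). Here we assume finitely many states and the delta kernel k(x, y) = 1[x = y], which collapses D(w) to a small quadratic form; we impose the normalization mean_i w(s_i) = 1 as an explicit constraint. The helpers `simulate`, `stationary`, `kernel_loss_matrix`, `estimate_ratio` and the 3-state toy MDP are all hypothetical illustrations, not the authors' implementation.

```python
import numpy as np

def simulate(P, policy, n_steps, rng, s0=0):
    """Roll out one long trajectory; P[s, a] is the next-state distribution."""
    n_states, n_actions, _ = P.shape
    s_arr = np.empty(n_steps, dtype=int)
    a_arr = np.empty(n_steps, dtype=int)
    sn_arr = np.empty(n_steps, dtype=int)
    s = s0
    for t in range(n_steps):
        a = rng.choice(n_actions, p=policy[s])
        s_next = rng.choice(n_states, p=P[s, a])
        s_arr[t], a_arr[t], sn_arr[t] = s, a, s_next
        s = s_next
    return s_arr, a_arr, sn_arr

def stationary(P, policy):
    """Exact stationary state distribution of the chain induced by `policy`."""
    P_pi = np.einsum("sa,sat->st", policy, P)
    evals, evecs = np.linalg.eig(P_pi.T)
    d = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    return d / d.sum()

def kernel_loss_matrix(s, s_next, beta, n_states):
    """M such that D(w) = w @ M @ w for tabular w and the delta kernel.

    Delta_i = w(s_i) * beta_i - w(s'_i) = (A @ w)_i. With k(x, y) = 1[x == y],
    (1/n^2) sum_ij Delta_i Delta_j k(s'_i, s'_j) = ||E.T @ A @ w||^2 / n^2,
    where E is the one-hot encoding of s' (V-statistic form).
    """
    n = len(s)
    rows = np.arange(n)
    A = np.zeros((n, n_states))
    np.add.at(A, (rows, s), beta)        # + w(s_i) * beta_i
    np.add.at(A, (rows, s_next), -1.0)   # - w(s'_i); add.at handles s_i == s'_i
    E = np.zeros((n, n_states))
    E[rows, s_next] = 1.0
    G = E.T @ A / n
    return G.T @ G

def estimate_ratio(s, s_next, beta, n_states):
    """Minimize D(w) subject to mean_i w(s_i) = 1 by solving the KKT system."""
    M = kernel_loss_matrix(s, s_next, beta, n_states)
    c = np.bincount(s, minlength=n_states) / len(s)  # empirical behavior state frequencies
    kkt = np.zeros((n_states + 1, n_states + 1))
    kkt[:n_states, :n_states] = 2.0 * M
    kkt[:n_states, n_states] = c
    kkt[n_states, :n_states] = c
    rhs = np.zeros(n_states + 1)
    rhs[n_states] = 1.0
    return np.linalg.solve(kkt, rhs)[:n_states]

# Hypothetical toy MDP: 3 states, 2 actions, uniform behavior policy pi0.
rng = np.random.default_rng(0)
P = rng.dirichlet(2.0 * np.ones(3), size=(3, 2))
pi0 = np.full((3, 2), 0.5)
pi = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])
s, a, s_next = simulate(P, pi0, 30_000, rng)
beta = pi[s, a] / pi0[s, a]
w_hat = estimate_ratio(s, s_next, beta, 3)
w_true = stationary(P, pi) / stationary(P, pi0)
print(np.round(w_hat, 3), np.round(w_true, 3))
```

The delta kernel keeps everything at the size of the state space. For continuous states one would swap in another kernel (for example a Gaussian) and work with the n × n Gram matrix over next states, which the tabular identity above sidesteps.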

Findings

01

The method reduces variance in infinite-horizon off-policy estimation.

02

Theoretical analysis confirms the estimator's consistency and efficiency.

03

Empirical results demonstrate improved accuracy over existing methods.

Abstract

We consider the off-policy estimation problem of estimating the expected reward of a target policy using samples collected by a different behavior policy. Importance sampling (IS) has been a key technique to derive (nearly) unbiased estimators, but is known to suffer from an excessively high variance in long-horizon problems. In the extreme case of infinite-horizon problems, the variance of an IS-based estimator may even be unbounded. In this paper, we propose a new off-policy estimation method that applies IS directly on the stationary state-visitation distributions to avoid the exploding variance issue faced by existing estimators. Our key contribution is a novel approach to estimating the density ratio of two stationary distributions, with trajectories sampled from only the behavior distribution. We develop a mini-max loss function for the estimation problem, and derive a…
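To see why trajectory-level IS variance explodes with the horizon, consider a toy single-state problem (an illustrative assumption, not an experiment from the paper) where actions are drawn i.i.d. from π0 and the trajectory weight is W_T = ∏_t π(a_t)/π0(a_t). Because E[W_T] = 1 and E[W_T²] = (E[ρ²])^T, the variance is (E[ρ²])^T − 1, which grows exponentially whenever π ≠ π0. The function names below are hypothetical.

```python
import numpy as np

def trajectory_is_variance(pi, pi0, horizon):
    """Exact variance of W_T = prod_t pi(a_t)/pi0(a_t) with a_t ~ pi0 i.i.d."""
    pi, pi0 = np.asarray(pi, dtype=float), np.asarray(pi0, dtype=float)
    second_moment = np.sum(pi ** 2 / pi0)   # E_pi0[rho^2] for a single step
    return second_moment ** horizon - 1.0   # E[W_T^2] - (E[W_T])^2 with E[W_T] = 1

def monte_carlo_variance(pi, pi0, horizon, n_samples, rng):
    """Sample-based check of the same quantity."""
    actions = rng.choice(len(pi0), size=(n_samples, horizon), p=pi0)
    rho = np.asarray(pi)[actions] / np.asarray(pi0)[actions]
    return np.prod(rho, axis=1).var()

pi0, pi = [0.5, 0.5], [0.9, 0.1]            # here E_pi0[rho^2] = 1.64
for T in (1, 10, 50, 100):
    print(T, trajectory_is_variance(pi, pi0, T))
rng = np.random.default_rng(0)
mc_var_5 = monte_carlo_variance(pi, pi0, 5, 200_000, rng)
```

With these policies the variance grows like 1.64^T, so an infinite horizon gives an unbounded variance. Weighting by the stationary ratio w(s) times a single β(a|s) replaces this product with one bounded factor per sample.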

Taxonomy

Topics: Age of Information Optimization · Advanced Bandit Algorithms Research · Smart Grid Energy Management