Consistent On-Line Off-Policy Evaluation
Assaf Hallak, Shie Mannor

TL;DR
This paper introduces COP-TD(λ,β), an off-policy evaluation algorithm that corrects distribution mismatch bias, converges to on-policy values, and shows promising empirical results over existing methods.
Contribution
It proposes COP-TD(λ,β), a novel off-policy TD algorithm that addresses stationary distribution discrepancies and converges to on-policy values, improving evaluation accuracy.
Findings
COP-TD(λ,β) reduces bias in off-policy evaluation.
The algorithm can be designed to converge to the value that on-policy TD(λ) would reach with the target policy.
Empirically, the algorithm compares favorably with existing on-line OPE methods.
Abstract
The problem of on-line off-policy evaluation (OPE) has been actively studied in the last decade due to its importance both as a stand-alone problem and as a module in a policy improvement scheme. However, most Temporal Difference (TD) based solutions ignore the discrepancy between the stationary distribution of the behavior and target policies and its effect on the convergence limit when function approximation is applied. In this paper we propose the Consistent Off-Policy Temporal Difference (COP-TD(λ, β)) algorithm that addresses this issue and reduces this bias at some computational expense. We show that COP-TD(λ, β) can be designed to converge to the same value that would have been obtained by using on-policy TD(λ) with the target policy. Subsequently, the proposed scheme leads to a related and promising heuristic we call log-COP-TD(λ,…
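The effect of the stationary-distribution mismatch on the convergence limit can be illustrated with a toy calculation. The sketch below is not the paper's algorithm: it assumes a hypothetical two-state chain, a single hand-picked feature per state, and made-up rewards, and it solves for the TD(0) fixed point in closed form under different state weightings. The exact ratio d_π/d_μ is plugged in directly purely for illustration.

```python
# Toy illustration (all numbers hypothetical): the TD(0) fixed point under
# linear function approximation depends on which state distribution weights it.

GAMMA = 0.9
PHI = [2.0, 1.0]      # one scalar feature per state (assumed)
REWARD = [0.0, 1.0]   # reward received in each state (assumed)


def transition_matrix(p_to_state_1):
    """From either state, move to state 1 with the given probability, else to state 0."""
    return [[1.0 - p_to_state_1, p_to_state_1] for _ in range(2)]


def stationary(P, iters=500):
    """Stationary distribution of P by power iteration, starting from uniform."""
    n = len(P)
    d = [1.0 / n] * n
    for _ in range(iters):
        d = [sum(d[s] * P[s][t] for s in range(n)) for t in range(n)]
    return d


def td0_fixed_point(weights, P_target):
    """Scalar theta solving
    sum_s w(s) * phi(s) * (R(s) + gamma * E_target[phi(s')] * theta - phi(s) * theta) = 0.
    """
    n = len(weights)
    num = sum(weights[s] * PHI[s] * REWARD[s] for s in range(n))
    den = sum(
        weights[s] * PHI[s]
        * (PHI[s] - GAMMA * sum(P_target[s][t] * PHI[t] for t in range(n)))
        for s in range(n)
    )
    return num / den


P_pi = transition_matrix(0.9)   # chain induced by the target policy
P_mu = transition_matrix(0.5)   # chain induced by the behavior policy
d_pi = stationary(P_pi)
d_mu = stationary(P_mu)

# On-policy TD(0): states are weighted by the target stationary distribution.
theta_on = td0_fixed_point(d_pi, P_pi)

# Off-policy TD(0) with per-step importance sampling: transitions follow the
# target policy in expectation, but states are still weighted by d_mu.
theta_off = td0_fixed_point(d_mu, P_pi)

# Reweighting each state by d_pi(s) / d_mu(s) restores the on-policy weighting.
ratio = [d_pi[s] / d_mu[s] for s in range(2)]
theta_corrected = td0_fixed_point([d_mu[s] * ratio[s] for s in range(2)], P_pi)

print("on-policy fixed point:      ", theta_on)
print("off-policy (d_mu) weighting:", theta_off)
print("ratio-corrected weighting:  ", theta_corrected)
```

In this toy setting the behavior-weighted and target-weighted fixed points differ noticeably, while the ratio-corrected weighting matches the on-policy one. In an on-line setting the ratio is not known in advance and would have to be estimated.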
Taxonomy
Topics: Smart Grid Energy Management · Water Quality Monitoring Technologies · Optimization and Search Problems
