A Unified Off-Policy Evaluation Approach for General Value Function
Tengyu Xu, Zhuoran Yang, Zhaoran Wang, Yingbin Liang

TL;DR
This paper introduces GenTD, a novel off-policy evaluation algorithm for General Value Functions in reinforcement learning, providing convergence guarantees and efficient joint evaluation of multiple interrelated GVFs.
Contribution
The paper proposes GenTD, the first off-policy GVF evaluation method with convergence guarantees and efficient multi-GVF learning.
Findings
GenTD learns multiple GVFs as efficiently as a single scalar value function.
GenTD is guaranteed to converge to the ground truth GVFs when the function approximation class is sufficiently expressive.
GenTD is the first off-policy GVF evaluation algorithm with global optimality guarantees.
Abstract
General Value Function (GVF) is a powerful tool to represent both the predictive and retrospective knowledge in reinforcement learning (RL). In practice, multiple interrelated GVFs often need to be evaluated jointly with pre-collected off-policy samples. In the literature, the gradient temporal difference (GTD) learning method has been adopted to evaluate GVFs in the off-policy setting, but such an approach may suffer from a large estimation error even if the function approximation class is sufficiently expressive. Moreover, none of the previous work has formally established a convergence guarantee to the ground truth GVFs under the function approximation setting. In this paper, we address both issues through the lens of a class of GVFs with causal filtering, which cover a wide range of RL applications such as reward variance, value gradient, cost in anomaly detection,…
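To make "interrelated GVFs" concrete for the reward-variance application named in the abstract, the sketch below jointly evaluates a value function V and the second moment M of the return on a small Markov reward process, then forms the variance as M - V^2. It is not GenTD and uses no off-policy samples: it is a model-based linear solve of the standard Bellman equations for the first and second moments, assuming a known transition matrix and rewards that depend only on the current state. The function name `joint_evaluate` and the toy transition matrix are hypothetical, chosen only for illustration.

```python
import numpy as np


def joint_evaluate(P, r, gamma):
    """Jointly solve for the value V and return second moment M.

    Assumptions (illustrative, not from the paper): P is a known
    row-stochastic transition matrix under the target policy, r[s] is a
    deterministic reward received in state s, and 0 <= gamma < 1.

    With return G = r(s) + gamma * G', the two moments satisfy
        V = r + gamma * P V
        M = r^2 + 2 * gamma * r * (P V) + gamma^2 * P M
    so M depends on V: the two quantities are interrelated.
    """
    P = np.asarray(P, dtype=float)
    r = np.asarray(r, dtype=float)
    identity = np.eye(len(r))

    # First moment: ordinary value function.
    V = np.linalg.solve(identity - gamma * P, r)

    # Second moment: its "reward" term involves the value function V.
    second_moment_signal = r**2 + 2.0 * gamma * r * (P @ V)
    M = np.linalg.solve(identity - gamma**2 * P, second_moment_signal)

    variance = M - V**2
    return V, M, variance


if __name__ == "__main__":
    # Hypothetical two-state chain: reward 1 in state 0, reward 0 in state 1.
    P = [[0.5, 0.5],
         [0.2, 0.8]]
    r = [1.0, 0.0]
    V, M, variance = joint_evaluate(P, r, gamma=0.9)
    print("V        =", V)
    print("M        =", M)
    print("variance =", variance)
```

In a deterministic chain (for example, an identity transition matrix) the computed variance is zero, which is a quick sanity check on the second-moment equation.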
Taxonomy
Topics: Reinforcement Learning in Robotics · Adaptive Dynamic Programming Control
