Stable and Efficient Policy Evaluation
Daoming Lyu, Bo Liu, Matthieu Geist, Wen Dong, Saad Biaz, Qi Wang

TL;DR
This paper introduces new policy evaluation algorithms that are both off-policy stable and on-policy efficient, addressing longstanding issues in reinforcement learning prediction tasks.
Contribution
The paper proposes novel algorithms based on oblique projection that simultaneously achieve off-policy stability and on-policy efficiency, a combination not previously available.
Findings
Empirical results validate the effectiveness of the proposed algorithms.
The new methods outperform traditional TD and gradient TD algorithms.
Algorithms demonstrate robustness across various domains.
Abstract
Policy evaluation algorithms are essential to reinforcement learning because they predict the performance of a policy. However, two long-standing issues in this prediction problem remain to be tackled: off-policy stability and on-policy efficiency. The conventional temporal difference (TD) algorithm is known to perform very well in the on-policy setting, yet it is not off-policy stable. On the other hand, the gradient TD and emphatic TD algorithms are off-policy stable but not on-policy efficient. This paper introduces novel algorithms that are both off-policy stable and on-policy efficient by using the oblique projection method. Empirical results on various domains validate the effectiveness of the proposed approach.
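The off-policy instability of conventional TD mentioned in the abstract can be illustrated with a small sketch. It is not taken from the paper and does not implement the proposed oblique projection algorithms. It assumes a hypothetical two-state chain with linear features 1 and 2 and zero rewards, in the spirit of the well-known textbook counterexample, and iterates the expected TD(0) update w <- w - alpha * A w, where A = Phi^T D (I - gamma P) Phi. The names `td_key_matrix` and `run_expected_td`, and the values of gamma, alpha, and the step count, are illustrative assumptions.

```python
import numpy as np

def td_key_matrix(phi, P, d, gamma):
    """Return A = Phi^T D (I - gamma P) Phi for linear TD(0)."""
    D = np.diag(d)
    return phi.T @ D @ (np.eye(len(d)) - gamma * P) @ phi

def run_expected_td(phi, P, d, gamma, w0, alpha=0.1, steps=200):
    """Iterate the expected TD(0) update with zero rewards: w <- w - alpha * A w."""
    A = td_key_matrix(phi, P, d, gamma)
    w = np.array(w0, dtype=float)
    for _ in range(steps):
        w = w - alpha * (A @ w)
    return w

# Hypothetical chain: state 0 always moves to state 1, state 1 loops on itself.
phi = np.array([[1.0], [2.0]])          # one feature per state
P = np.array([[0.0, 1.0], [0.0, 1.0]])  # transition matrix under the target policy
gamma = 0.9

d_on = np.array([0.0, 1.0])   # stationary distribution of P (on-policy weighting)
d_off = np.array([1.0, 0.0])  # assumed off-policy weighting: updates only from state 0

A_on = td_key_matrix(phi, P, d_on, gamma)    # positive, so the iteration contracts
A_off = td_key_matrix(phi, P, d_off, gamma)  # negative, so the iteration expands

w_on = run_expected_td(phi, P, d_on, gamma, [1.0])
w_off = run_expected_td(phi, P, d_off, gamma, [1.0])

print("A on-policy :", A_on[0, 0], " final w:", w_on[0])
print("A off-policy:", A_off[0, 0], " final w:", w_off[0])
```

All rewards are zero, so the true solution is w = 0. Under the on-policy weighting the weight shrinks toward zero, while under the off-policy weighting it grows without bound. This is the kind of failure that off-policy stable methods such as gradient TD and emphatic TD are designed to avoid.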
Taxonomy
Topics: Adaptive Dynamic Programming Control · Reinforcement Learning in Robotics · Machine Learning and ELM
