Approximate discounting-free policy evaluation from transient and recurrent states
Vektor Dewanto, Marcus Gallagher

TL;DR
This paper proposes an approach to discounting-free policy evaluation in reinforcement learning that estimates the bias (specifically, its relative value) from both transient and recurrent states. It builds on a seminorm LSTD (least-squares temporal difference) method that supports sampling-based, model-free approximation.
Contribution
It develops a system of approximators for the bias from transient and recurrent states, and derives the minimizer of a seminorm LSTD objective, which enables sampling-based, model-free policy evaluation.
Findings
Effective bias estimation from transient states demonstrated
Seminorm LSTD yields a general unifying procedure for LSTD-based policy value approximators
Experimental results confirm the method's validity
Abstract
To distinguish policies that prescribe good actions from those that prescribe bad actions in transient states, we need to evaluate the so-called bias of a policy from transient states. However, we observe that most (if not all) works in approximate discounting-free policy evaluation thus far are developed for estimating the bias solely from recurrent states. We therefore propose a system of approximators for the bias (specifically, its relative value) from transient and recurrent states. Its key ingredient is a seminorm LSTD (least-squares temporal difference), for which we derive a minimizer expression that enables the sampling-based approximation required in model-free reinforcement learning. This seminorm LSTD also facilitates the formulation of a general unifying procedure for LSTD-based policy value approximators. Experimental results validate the effectiveness of our proposed method.
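To make the notion of "bias from transient states" concrete, the sketch below computes the gain and bias exactly for a small Markov chain with one transient state, using the standard model-based identity h = (I - P + P*)^{-1} (r - g·1). This is not the paper's seminorm LSTD estimator, which is sampling-based and model-free. The transition matrix `P`, the reward vector `r`, and the function name `gain_and_bias` are all hypothetical, chosen only for illustration, and the chain is assumed to be unichain and aperiodic.

```python
import numpy as np

# Hypothetical 3-state Markov chain induced by a fixed policy (an assumption
# for illustration, not taken from the paper).
# State 0 is transient; states 1 and 2 form the single recurrent class.
P = np.array([
    [0.5, 0.5, 0.0],
    [0.0, 0.2, 0.8],
    [0.0, 0.6, 0.4],
])
r = np.array([1.0, 0.0, 2.0])  # hypothetical expected one-step rewards


def gain_and_bias(P, r):
    """Exact gain and bias of a unichain, aperiodic Markov reward process."""
    n = P.shape[0]
    # Stationary distribution: solve pi^T P = pi^T together with sum(pi) = 1.
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.concatenate([np.zeros(n), [1.0]])
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    gain = float(pi @ r)
    # Limiting matrix P*: every row equals pi in the unichain case.
    P_star = np.tile(pi, (n, 1))
    # Bias: h = (I - P + P*)^{-1} (r - gain * 1).
    bias = np.linalg.solve(np.eye(n) - P + P_star, r - gain)
    return pi, gain, bias


pi, gain, bias = gain_and_bias(P, r)
print("stationary distribution:", pi)
print("gain:", gain)
print("bias:", bias)
# Relative value with state 1 as the (arbitrarily chosen) reference state.
print("relative value:", bias - bias[1])
```

The stationary distribution puts zero mass on the transient state 0, so the gain carries no information about what happens there. The bias entry for state 0 is what separates policies that differ only in their transient-state actions, which is the motivation the abstract gives.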
Taxonomy
Topics: Reinforcement Learning in Robotics · Smart Grid Energy Management
