Directly Estimating the Variance of the λ-Return Using Temporal-Difference Methods
Craig Sherstan, Brendan Bennett, Kenny Young, Dylan R. Ashley, Adam White, Martha White, Richard S. Sutton

TL;DR
This paper introduces a simple and robust method for directly estimating the variance of the λ-return in reinforcement learning, improving risk assessment and parameter adaptation during online learning.
Contribution
We propose a novel, simpler approach that estimates the variance of the λ-return directly. It performs at least as well as the more complex existing methods and is more robust in empirical tests.
Findings
The new method is simpler than prior approaches.
It performs at least as well as existing methods.
It demonstrates increased robustness in empirical tests.
Abstract
This paper investigates estimating the variance of a temporal-difference learning agent's update target. Most reinforcement learning methods use an estimate of the value function, which captures how good it is for the agent to be in a particular state and is mathematically expressed as the expected sum of discounted future rewards (called the return). These values can be straightforwardly estimated by averaging batches of returns using Monte Carlo methods. However, if we wish to update the agent's value estimates during learning, before terminal outcomes are observed, we must use a different estimation target called the λ-return, which truncates the return with the agent's own estimate of the value function. Temporal-difference learning methods estimate the expected λ-return for each state, allowing these methods to update online and incrementally, and in most cases…
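The λ-return described in the abstract can be illustrated with a minimal sketch. It is not the paper's variance estimator. It uses the standard recursive definition G_t = R_{t+1} + γ[(1 − λ) v(S_{t+1}) + λ G_{t+1}] and then takes a plain empirical variance over sampled λ-returns as a Monte Carlo reference point. The function name `lambda_returns`, the episode data, and the value estimates are all hypothetical, chosen only for illustration.

```python
from statistics import pvariance


def lambda_returns(rewards, next_values, gamma, lam):
    """Compute the lambda-return for every step of one finished episode.

    rewards[t]     : reward received after step t
    next_values[t] : the agent's value estimate for the state reached
                     after step t (0.0 for the terminal state)
    Uses the backward recursion
        G_t = r_t + gamma * ((1 - lam) * v_next_t + lam * G_{t+1}).
    """
    g = 0.0  # the lambda-return beyond the terminal step is zero
    out = [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * ((1.0 - lam) * next_values[t] + lam * g)
        out[t] = g
    return out


# Hypothetical data: two 3-step episodes from the same start state,
# each paired with assumed value estimates for the successor states.
episodes = [
    ([1.0, 1.0, 1.0], [0.5, 0.5, 0.0]),
    ([0.0, 2.0, 0.0], [0.5, 0.5, 0.0]),
]

# Lambda-return observed from the start state in each episode.
start_returns = [
    lambda_returns(r, v, gamma=0.9, lam=0.9)[0] for r, v in episodes
]

# Empirical (population) variance of those sampled lambda-returns.
empirical_variance = pvariance(start_returns)
```

With `lam=1.0` the recursion reduces to the ordinary discounted Monte Carlo return, and with `lam=0.0` it reduces to the one-step target (reward plus discounted value estimate), which is the sense in which the λ-return truncates the return with the agent's own value estimate.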
Taxonomy
Topics: Fault Detection and Control Systems
