Per-decision Multi-step Temporal Difference Learning with Control Variates
Kristopher De Asis, Richard S. Sutton

TL;DR
This paper introduces per-decision control variates for multi-step temporal difference (TD) learning, which reduce the variance of the updates and improve performance in both on-policy and off-policy settings.
Contribution
The paper proposes a novel per-decision control variate technique for multi-step TD algorithms, demonstrating improved performance over existing methods.
Findings
Significant variance reduction in multi-step TD updates.
Enhanced learning stability in off-policy settings.
Improved convergence speed in experiments.
Abstract
Multi-step temporal difference (TD) learning is an important approach in reinforcement learning, as it unifies one-step TD learning with Monte Carlo methods in a way where intermediate algorithms can outperform either extreme. Multi-step methods address a bias-variance trade-off between reliance on current estimates, which could be poor, and incorporating longer sampled reward sequences into the updates. Especially in the off-policy setting, where the agent aims to learn about a policy different from the one generating its behaviour, the variance in the updates can cause learning to diverge as the number of sampled rewards used in the estimates increases. In this paper, we introduce per-decision control variates for multi-step TD algorithms, and compare them to existing methods. Our results show that including the control variates can greatly improve performance on both on- and off-policy multi-step…
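The abstract does not spell out the update itself. As a rough illustration only, the sketch below computes an n-step return with per-decision importance sampling and a state-value control variate, using the recursive form found in Sutton and Barto's textbook: G_{t:h} = rho_t (R_{t+1} + gamma G_{t+1:h}) + (1 - rho_t) V(S_t), with G_{h:h} = V(S_h). The function name `n_step_return_cv` and its argument layout are hypothetical choices for this example and may not match the paper's exact algorithms.

```python
from typing import Sequence


def n_step_return_cv(
    rewards: Sequence[float],
    values: Sequence[float],
    rhos: Sequence[float],
    gamma: float,
) -> float:
    """n-step return with per-decision importance sampling and a
    state-value control variate (illustrative sketch, not the paper's code).

    rewards[k] is R_{t+k+1} for k = 0..n-1
    values[k]  is V(S_{t+k}) for k = 0..n (the last entry is the bootstrap value)
    rhos[k]    is pi(A_{t+k} | S_{t+k}) / b(A_{t+k} | S_{t+k}) for k = 0..n-1
    """
    n = len(rewards)
    if len(values) != n + 1 or len(rhos) != n:
        raise ValueError("expected len(values) == n + 1 and len(rhos) == n")

    # Start from the bootstrap estimate G_{h:h} = V(S_h) and work backwards.
    g = values[n]
    for k in reversed(range(n)):
        # The (1 - rho) * V(S) term is the control variate: when rho is 0
        # the target falls back to the current estimate instead of to 0.
        g = rhos[k] * (rewards[k] + gamma * g) + (1.0 - rhos[k]) * values[k]
    return g
```

With every ratio equal to 1 (the on-policy case) the control variate term vanishes and the function returns the ordinary n-step return. With a ratio of 0 at the first step it returns V(S_t), so the resulting update leaves the estimate unchanged rather than shrinking it toward zero.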
Taxonomy
Topics: Reinforcement Learning in Robotics · Human Pose and Action Recognition · Fuzzy Logic and Control Systems
