Trajectory-Aware Eligibility Traces for Off-Policy Reinforcement Learning
Brett Daley, Martha White, Christopher Amato, Marlos C. Machado

TL;DR
This paper introduces a multistep operator for off-policy reinforcement learning that can express both per-decision and trajectory-aware methods, proves convergence conditions for it in the tabular setting, and introduces RBIS, a trajectory-aware method that performs robustly in off-policy control.
Contribution
The paper proposes a multistep operator that encompasses existing per-decision and trajectory-aware methods, proves convergence conditions for these methods in the tabular setting, and introduces RBIS for off-policy control.
Findings
Proves convergence conditions for the new operator and existing methods.
Introduces Recency-Bounded Importance Sampling (RBIS), a trajectory-aware method with robust performance in off-policy control.
Unifies per-decision and trajectory-aware approaches under a common framework.
Abstract
Off-policy learning from multistep returns is crucial for sample-efficient reinforcement learning, but counteracting off-policy bias without exacerbating variance is challenging. Classically, off-policy bias is corrected in a per-decision manner: past temporal-difference errors are re-weighted by the instantaneous Importance Sampling (IS) ratio after each action via eligibility traces. Many off-policy algorithms rely on this mechanism, along with differing protocols for cutting the IS ratios to combat the variance of the IS estimator. Unfortunately, once a trace has been fully cut, the effect cannot be reversed. This has led to the development of credit-assignment strategies that account for multiple past experiences at a time. These trajectory-aware methods have not been extensively analyzed, and their theoretical justification remains uncertain. In this paper, we propose a multistep…
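The per-decision mechanism described in the abstract can be illustrated with a small sketch. It is not the authors' implementation. The function names, the example ratios, and the value lam = 0.9 are hypothetical. The first function uses a Retrace-style cut, lam * min(1, ratio), as one example of a per-decision protocol. The second uses a recency-bounded rule of the form beta_t = min(lam**t, beta_{t-1} * ratio_t), which is an assumption about how a trajectory-aware coefficient might look; the summary above does not state any formula.

```python
from typing import List


def per_decision_traces(ratios: List[float], lam: float) -> List[float]:
    """Cumulative trace after each step when every step is cut on its own.

    Each step contributes lam * min(1, ratio) (a Retrace-style cut), and the
    cumulative trace is the running product of those factors.
    """
    traces, beta = [], 1.0
    for rho in ratios:
        beta *= lam * min(1.0, rho)
        traces.append(beta)
    return traces


def recency_bounded_traces(ratios: List[float], lam: float) -> List[float]:
    """Assumed trajectory-aware rule: beta_t = min(lam**t, beta_{t-1} * rho_t).

    The coefficient depends on the whole history through beta_{t-1}, so a
    large ratio can partly undo an earlier partial cut.
    """
    traces, beta = [], 1.0
    for t, rho in enumerate(ratios, start=1):
        beta = min(lam ** t, beta * rho)
        traces.append(beta)
    return traces


if __name__ == "__main__":
    # Hypothetical IS ratios pi(a|s) / mu(a|s) along one trajectory.
    ratios = [0.5, 2.0, 0.0, 3.0]
    print(per_decision_traces(ratios, lam=0.9))
    print(recency_bounded_traces(ratios, lam=0.9))
```

In this sketch the per-decision trace never recovers from the partial cut at the first step, because the ratio of 2.0 at the second step is clipped to 1. The assumed recency-bounded rule climbs back to its lam**t bound at that step. Both traces drop to zero at the third step and stay there, which matches the abstract's point that a fully cut trace cannot be restored.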
Taxonomy
Topics: Reinforcement Learning in Robotics · Smart Grid Energy Management · Advanced Bandit Algorithms Research
