Multi-step Off-policy Learning Without Importance Sampling Ratios
Ashique Rupam Mahmood, Huizhen Yu, Richard S. Sutton

TL;DR
This paper introduces an off-policy reinforcement learning algorithm that eliminates importance sampling ratios by varying the amount of bootstrapping in an action-dependent manner, achieving stability and better performance on challenging tasks.
Contribution
The paper presents the first multi-step off-policy learning algorithm with function approximation that avoids importance sampling ratios, using action-dependent bootstrapping and a two-timescale gradient-based TD update.
Findings
The new algorithm is stable and reduces variance in off-policy learning.
It outperforms existing methods in challenging tasks.
It generalizes prior algorithms like Tree Backup through action-dependent bootstrapping.
Abstract
To estimate the value functions of policies from exploratory data, most model-free off-policy algorithms rely on importance sampling, where the use of importance sampling ratios often leads to estimates with severe variance. It is thus desirable to learn off-policy without using the ratios. However, such an algorithm does not exist for multi-step learning with function approximation. In this paper, we introduce the first such algorithm based on temporal-difference (TD) learning updates. We show that an explicit use of importance sampling ratios can be eliminated by varying the amount of bootstrapping in TD updates in an action-dependent manner. Our new algorithm achieves stability using a two-timescale gradient-based TD update. A prior algorithm based on lookup table representation called Tree Backup can also be retrieved using action-dependent bootstrapping, becoming a special case of…
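The abstract's central claim, that action-dependent bootstrapping can remove the explicit ratio, can be illustrated with a minimal sketch. It assumes the per-step trace weight has the usual form gamma * lambda * rho with rho = pi(a|s) / b(a|s), and that the bootstrapping parameter is chosen as lambda(s, a) = nu * b(a|s). The name nu and all function names here are hypothetical, and the exact parameterization in the paper may differ. Under that assumption the behavior probability cancels, leaving gamma * nu * pi(a|s).

```python
import math


def weights_constant_lambda(pi_probs, b_probs, gamma, lam):
    """Cumulative trace weights with a fixed lambda and explicit ratios."""
    w, out = 1.0, []
    for p, q in zip(pi_probs, b_probs):
        rho = p / q                  # importance sampling ratio
        w *= gamma * lam * rho
        out.append(w)
    return out


def weights_action_dependent(pi_probs, b_probs, gamma, nu):
    """Same recursion, but lambda depends on the action taken.

    Assumption for illustration: lambda(s, a) = nu * b(a|s).
    """
    w, out = 1.0, []
    for p, q in zip(pi_probs, b_probs):
        lam = nu * q                 # action-dependent bootstrapping
        rho = p / q
        w *= gamma * lam * rho       # algebraically gamma * nu * p
        out.append(w)
    return out


def weights_ratio_free(pi_probs, gamma, nu):
    """Equivalent recursion that never forms a ratio."""
    w, out = 1.0, []
    for p in pi_probs:
        w *= gamma * nu * p
        out.append(w)
    return out


# Probabilities of the actions actually taken along one short trajectory
# (made-up numbers: the target policy favors actions the behavior policy rarely takes).
pi_probs = [0.9, 0.8, 0.9]
b_probs = [0.1, 0.2, 0.1]

with_ratios = weights_constant_lambda(pi_probs, b_probs, gamma=0.99, lam=1.0)
action_dep = weights_action_dependent(pi_probs, b_probs, gamma=0.99, nu=1.0)
ratio_free = weights_ratio_free(pi_probs, gamma=0.99, nu=1.0)

print("constant lambda, with ratios:", with_ratios)
print("action-dependent lambda:     ", action_dep)
print("ratio-free form:             ", ratio_free)
```

With a constant lambda the weights grow with the product of ratios, while the action-dependent choice keeps every weight at or below 1 because each factor is a probability scaled by gamma * nu. With nu held constant, the decay gamma * nu * pi(a|s) has the same form as the trace decay used by Tree Backup, which is consistent with the abstract's remark that Tree Backup can be retrieved as a special case.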
Taxonomy
Topics: Machine Learning and Algorithms · Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning
