Multi-step Off-policy Learning Without Importance Sampling Ratios

Ashique Rupam Mahmood; Huizhen Yu; Richard S. Sutton

arXiv:1702.03006 · cs.LG · February 13, 2017 · 22 cites

TL;DR

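This paper introduces a multi-step off-policy reinforcement learning algorithm that avoids explicit importance sampling ratios by varying the amount of bootstrapping in TD updates in an action-dependent manner. A two-timescale gradient-based TD update provides stability, and the algorithm is reported to perform better on challenging tasks.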

Contribution

The paper presents the first multi-step off-policy learning algorithm with function approximation that avoids importance sampling ratios, using action-dependent bootstrapping and a two-timescale gradient-based TD update.

Findings

01. The new algorithm is stable and reduces variance in off-policy learning.

02. It outperforms existing methods in challenging tasks.

03. It generalizes prior algorithms like Tree Backup through action-dependent bootstrapping.

Abstract

To estimate the value functions of policies from exploratory data, most model-free off-policy algorithms rely on importance sampling, where the use of importance sampling ratios often leads to estimates with severe variance. It is thus desirable to learn off-policy without using the ratios. However, such an algorithm does not exist for multi-step learning with function approximation. In this paper, we introduce the first such algorithm based on temporal-difference (TD) learning updates. We show that an explicit use of importance sampling ratios can be eliminated by varying the amount of bootstrapping in TD updates in an action-dependent manner. Our new algorithm achieves stability using a two-timescale gradient-based TD update. A prior algorithm based on lookup table representation called Tree Backup can also be retrieved using action-dependent bootstrapping, becoming a special case of…
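
The cancellation described in the abstract can be illustrated with a small numerical sketch. It assumes the usual off-policy eligibility-trace decay of the form gamma * lambda * rho, with rho = pi(a|s) / mu(a|s), and it assumes one simple action-dependent choice, lambda(s, a) = c * mu(a|s). The constant `c`, the function names, and the example policies are hypothetical and are not taken from the paper; the paper's actual bootstrapping function may differ.

```python
import numpy as np

def decay_with_ratio(gamma, lam, pi, mu):
    """Trace decay written with an explicit ratio: gamma * lambda * (pi / mu)."""
    return gamma * lam * (pi / mu)

def decay_ratio_free(gamma, c, pi):
    """Same quantity when lambda(s, a) = c * mu(a|s): the mu terms cancel."""
    return gamma * c * pi

gamma = 0.9
c = 1.0  # hypothetical constant; c * mu(a|s) must stay within [0, 1]

# Illustrative (made-up) policies over three actions at a single state.
pi = np.array([0.90, 0.05, 0.05])  # target policy
mu = np.array([0.01, 0.49, 0.50])  # behavior policy

lam = c * mu  # action-dependent bootstrapping parameter
with_ratio = decay_with_ratio(gamma, lam, pi, mu)
ratio_free = decay_ratio_free(gamma, c, pi)

# For comparison: a constant lambda = 1 leaves the raw ratio in the decay.
constant_lam = decay_with_ratio(gamma, 1.0, pi, mu)

print("explicit ratio, action-dependent lambda:", with_ratio)
print("ratio-free form:                        ", ratio_free)
print("explicit ratio, constant lambda = 1:    ", constant_lam)
```

Under these assumptions the first two quantities agree, so the update never has to compute pi / mu, and the decay stays bounded by gamma. With a constant lambda, the rarely chosen first action produces a decay far larger than 1, which is the source of the variance the abstract refers to. With c = 1 the decay reduces to gamma * pi(a|s), the form generally associated with Tree Backup.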

Code & Models

Repositories

sinaghiassian/OffpolicyAlgorithms

Taxonomy

Topics: Machine Learning and Algorithms · Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning