Greedy-Step Off-Policy Reinforcement Learning

Yuhui Wang; Qingyuan Wu; Pengcheng He; Xiaoyang Tan

arXiv:2102.11717·cs.LG·December 16, 2021

TL;DR

This paper introduces a novel multi-step Bellman optimality equation and a corresponding value iteration method that efficiently and reliably learns the optimal policy off-policy, achieving state-of-the-art results.

Contribution

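It proposes a new multi-step Bellman optimality equation and a value iteration algorithm that converges to the optimal value function at an exponential contraction rate of $O(\gamma^n)$ with only linear computational complexity, and that can use off-policy data without correction.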

Findings

01

Achieves an exponential contraction rate of $O(\gamma^n)$.

02

Demonstrates state-of-the-art performance on benchmark datasets.

03

Provides reliable and easy-to-implement off-policy algorithms.

Abstract

Most of the policy evaluation algorithms are based on the theories of Bellman Expectation and Optimality Equation, which derive two popular approaches: Policy Iteration (PI) and Value Iteration (VI). However, multi-step bootstrapping is often at cross-purposes with off-policy learning in PI-based methods due to the large variance of multi-step off-policy correction. In contrast, VI-based methods are naturally off-policy but subject to one-step learning. In this paper, we deduce a novel multi-step Bellman Optimality Equation by utilizing a latent structure of multi-step bootstrapping with the optimal value function. Via this new equation, we derive a new multi-step value iteration method that converges to the optimal value function with exponential contraction rate $O(\gamma^{n})$ but only linear computational complexity. Moreover, it can naturally derive a suite of…
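To make the idea of multi-step bootstrapping with a greedy value more concrete, here is a minimal tabular sketch. The abstract excerpt does not state the equation, so the form used here is an assumption: the target takes the maximum, over bootstrapping horizons n = 1..N, of the n-step discounted return plus the discounted greedy value at the n-th next state. The function name `greedy_step_target`, its parameters, and the toy numbers are hypothetical, and terminal-state handling is left out for brevity.

```python
import numpy as np


def greedy_step_target(rewards, next_states, q_table, gamma, max_steps):
    """Assumed greedy multi-step target for one stored trajectory segment.

    rewards[t] is the reward at step t and next_states[t] is the state
    reached after step t. Returns
        max over n in 1..max_steps of
            sum_{t<n} gamma^t * rewards[t] + gamma^n * max_a Q(next_states[n-1], a)
    """
    best = -np.inf
    discounted_return = 0.0
    horizon = min(max_steps, len(rewards))
    for n in range(1, horizon + 1):
        # Running n-step discounted return, updated in O(1) per step.
        discounted_return += gamma ** (n - 1) * rewards[n - 1]
        # Bootstrap from the greedy value at the n-th next state.
        bootstrap = gamma ** n * q_table[next_states[n - 1]].max()
        best = max(best, discounted_return + bootstrap)
    return float(best)


# Toy example: 3 states, 2 actions; only state 2 has a non-zero value.
q = np.zeros((3, 2))
q[2, 1] = 10.0

one_step = greedy_step_target([1.0, 1.0], [1, 2], q, gamma=0.5, max_steps=1)
two_step = greedy_step_target([1.0, 1.0], [1, 2], q, gamma=0.5, max_steps=2)
print(one_step, two_step)  # 1.0 4.0
```

In this toy case the one-step target sees only the zero-valued intermediate state, while allowing a two-step horizon lets the high value of state 2 reach the update immediately. The loop visits each step once, so the cost grows linearly with the horizon. No importance-sampling correction appears in this sketch; whether and why that is safe is the subject of the paper, not something this example demonstrates.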

Taxonomy

Topics: Model Reduction and Neural Networks · Parallel Computing and Optimization Techniques · Stochastic Gradient Optimization Techniques

Methods: Convolution · Experience Replay · Dense Connections · Q-Learning · Double Q-learning · Deep Q-Network · Double DQN