Greedy-Step Off-Policy Reinforcement Learning
Yuhui Wang, Qingyuan Wu, Pengcheng He, Xiaoyang Tan

TL;DR
This paper introduces a novel multi-step Bellman optimality equation and a corresponding value iteration method that efficiently and reliably learns the optimal policy off-policy, achieving state-of-the-art results.
Contribution
It proposes a new multi-step Bellman optimality equation and a value iteration algorithm that converges rapidly and can safely utilize off-policy data without correction.
Findings
Achieves exponential contraction rate $O(\gamma^n)$.
Demonstrates state-of-the-art performance on benchmark datasets.
Provides reliable and easy-to-implement off-policy algorithms.
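
To give a feel for what a $\gamma^n$ contraction buys over the usual one-step rate of $\gamma$, the arithmetic can be sketched as follows. This is an illustrative calculation only: it assumes the error shrinks by exactly the worst-case factor $\gamma^n$ per sweep, and the function name `sweeps_to_tolerance` and the example values are hypothetical, not taken from the paper.

```python
import math


def sweeps_to_tolerance(gamma: float, n_steps: int, tolerance: float) -> int:
    """Smallest k such that (gamma ** n_steps) ** k <= tolerance.

    Assumption: each sweep multiplies the error by exactly gamma ** n_steps,
    i.e. the worst-case contraction bound is treated as tight.
    """
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie strictly between 0 and 1")
    if n_steps < 1:
        raise ValueError("n_steps must be at least 1")
    if not 0.0 < tolerance < 1.0:
        raise ValueError("tolerance must lie strictly between 0 and 1")
    contraction = gamma ** n_steps
    return math.ceil(math.log(tolerance) / math.log(contraction))


# Hypothetical example values: discount 0.99, shrink the error to 0.1%.
one_step = sweeps_to_tolerance(gamma=0.99, n_steps=1, tolerance=1e-3)
five_step = sweeps_to_tolerance(gamma=0.99, n_steps=5, tolerance=1e-3)
print(one_step, five_step)
```

Under these assumptions the number of sweeps falls roughly in proportion to $1/n$, which is why a larger effective step count matters most when $\gamma$ is close to 1.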
Abstract
Most of the policy evaluation algorithms are based on the theories of the Bellman Expectation and Optimality Equations, which derive two popular approaches: Policy Iteration (PI) and Value Iteration (VI). However, multi-step bootstrapping is often at cross-purposes with off-policy learning in PI-based methods due to the large variance of multi-step off-policy correction. In contrast, VI-based methods are naturally off-policy but subject to one-step learning. In this paper, we deduce a novel multi-step Bellman Optimality Equation by utilizing a latent structure of multi-step bootstrapping with the optimal value function. Via this new equation, we derive a new multi-step value iteration method that converges to the optimal value function with exponential contraction rate but only linear computational complexity. Moreover, it can naturally derive a suite of…
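
The abstract above is truncated and does not spell out the new equation, so the following is only a minimal sketch of one plausible reading: the learning target is taken to be the largest of the n-step bootstrapped returns along a stored trajectory, with no importance-sampling correction. That reading, the function names, and the argument layout are assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Sequence


def n_step_targets(
    rewards: Sequence[float],
    bootstrap_values: Sequence[float],
    gamma: float,
) -> List[float]:
    """Return the n-step bootstrapped target for every n = 1..N.

    rewards[t] is the reward received at step t of the trajectory.
    bootstrap_values[t] is assumed to hold max_a Q(s_{t+1}, a), with 0.0
    used for terminal states (a hypothetical convention for this sketch).
    """
    if len(rewards) != len(bootstrap_values) or len(rewards) == 0:
        raise ValueError("rewards and bootstrap_values must be non-empty and equal length")
    targets = []
    discounted_return = 0.0
    discount = 1.0
    for reward, bootstrap in zip(rewards, bootstrap_values):
        discounted_return += discount * reward  # adds gamma^t * r_t
        discount *= gamma                       # now equals gamma^(t+1)
        targets.append(discounted_return + discount * bootstrap)
    return targets


def greedy_step_target(
    rewards: Sequence[float],
    bootstrap_values: Sequence[float],
    gamma: float,
) -> float:
    """Assumed greedy-step target: the maximum over all n-step targets."""
    return max(n_step_targets(rewards, bootstrap_values, gamma))


# Toy trajectory with a delayed reward and an uninformative (all-zero) Q estimate.
delayed = greedy_step_target([0.0, 0.0, 1.0], [0.0, 0.0, 0.0], gamma=0.5)
# Toy trajectory where the behaviour policy takes a bad second action.
guarded = greedy_step_target([1.0, -10.0], [2.0, 0.0], gamma=0.5)
```

In the first toy trajectory the one-step target carries no signal while the longest lookahead does, so the maximum picks up the delayed reward. In the second, the maximum falls back to the one-step target and ignores the poor later action, which is the intuition behind using off-policy data without correction under this reading. Each target costs one pass over the trajectory, in line with the linear complexity the abstract mentions.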
Taxonomy
Topics: Model Reduction and Neural Networks · Parallel Computing and Optimization Techniques · Stochastic Gradient Optimization Techniques
Methods: Convolution · Experience Replay · Dense Connections · Q-Learning · Double Q-learning · Deep Q-Network · Double DQN
