Revisiting Peng's Q(λ) for Modern Reinforcement Learning
Tadashi Kozuno, Yunhao Tang, Mark Rowland, Rémi Munos, Steven Kapturowski, Will Dabney, Michal Valko, David Abel

TL;DR
This paper provides the first theoretical convergence proof for Peng's Q(λ), a non-conservative off-policy reinforcement learning algorithm, and demonstrates its practical effectiveness in complex continuous control tasks.
Contribution
It proves that Peng's Q(λ) converges to an optimal policy, provided the behavior policy slowly tracks a greedy policy, and validates its empirical performance on complex continuous control tasks.
Findings
Peng's Q(λ) converges to an optimal policy when the behavior policy tracks a greedy policy.
Peng's Q(λ) often outperforms conservative algorithms in continuous control tasks.
Theoretical analysis confirms Peng's Q(λ) is both sound and effective.
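The first finding above depends on the behavior policy tracking a greedy policy. A minimal tabular sketch of one way such tracking could look is a mixture update in the style of conservative policy iteration. The function name `track_greedy`, the `step` parameter, and the tabular setting are illustrative assumptions, not the paper's exact scheme.

```python
import numpy as np

def track_greedy(behavior, q_values, step=0.1):
    """Move a tabular behavior policy a small step toward the greedy policy.

    behavior: array of shape (S, A), each row a distribution over actions.
    q_values: array of shape (S, A), current Q-function estimate.
    step:     mixing rate in [0, 1]; small values give slow tracking
              (hypothetical parameter, chosen for illustration).
    """
    behavior = np.asarray(behavior, dtype=float)
    q_values = np.asarray(q_values, dtype=float)
    # One-hot greedy policy: all probability on argmax_a Q(s, a).
    greedy = np.zeros_like(behavior)
    greedy[np.arange(q_values.shape[0]), q_values.argmax(axis=1)] = 1.0
    # Convex combination, so each row remains a valid distribution.
    return (1.0 - step) * behavior + step * greedy
```

With `step=1` the behavior policy jumps straight to the greedy policy, while smaller values change it gradually, which is the regime the convergence condition describes.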
Abstract
Off-policy multi-step reinforcement learning algorithms consist of conservative and non-conservative algorithms: the former actively cut traces, whereas the latter do not. Recently, Munos et al. (2016) proved the convergence of conservative algorithms to an optimal Q-function. In contrast, non-conservative algorithms are thought to be unsafe and have a limited or no theoretical guarantee. Nonetheless, recent studies have shown that non-conservative algorithms empirically outperform conservative ones. Motivated by the empirical results and the lack of theory, we carry out theoretical analyses of Peng's Q(λ), a representative example of non-conservative algorithms. We prove that it also converges to an optimal policy provided that the behavior policy slowly tracks a greedy policy in a way similar to conservative policy iteration. Such a result has been conjectured to be true but…
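To make "non-conservative" concrete, the sketch below computes Peng's Q(λ) targets along a single trajectory using the commonly stated recursion G_t = r_t + γ[(1 − λ) max_a Q(s_{t+1}, a) + λ G_{t+1}]. No importance weights or trace cuts appear, however off-policy the actions were. The function name, argument layout, and default values are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def peng_q_lambda_targets(rewards, next_q_values, gamma=0.99, lam=0.7):
    """Backward recursion for Peng's Q(lambda) targets on one trajectory.

    rewards:       shape (T,), reward r_t received at each step.
    next_q_values: shape (T, A), Q(s_{t+1}, .) for each step; pass zeros
                   for a terminal next state (assumed convention).
    Returns an array of shape (T,) with one target per step.
    """
    rewards = np.asarray(rewards, dtype=float)
    next_max = np.asarray(next_q_values, dtype=float).max(axis=1)
    targets = np.empty(len(rewards))
    # Beyond the last step, bootstrap from the greedy value alone.
    g_next = next_max[-1]
    for t in reversed(range(len(rewards))):
        # Mix the one-step greedy bootstrap with the longer return.
        # The trace is never cut, regardless of the behavior policy.
        bootstrap = (1.0 - lam) * next_max[t] + lam * g_next
        targets[t] = rewards[t] + gamma * bootstrap
        g_next = targets[t]
    return targets
```

Setting `lam=0` recovers the one-step Q-learning target, and `lam=1` gives the uncorrected multi-step return with a greedy bootstrap at the end of the trajectory.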
Taxonomy
Topics: Reinforcement Learning in Robotics · Evolutionary Algorithms and Applications · Optimization and Search Problems
