Online Target Q-learning with Reverse Experience Replay: Efficiently finding the Optimal Policy for Linear MDPs

Naman Agarwal; Syomantak Chaudhuri; Prateek Jain; Dheeraj Nagaraj; Praneeth Netrapalli

arXiv:2110.08440 · cs.LG · October 20, 2021

TL;DR

This paper introduces Q-Rex and Q-RexDaRe, novel Q-learning algorithms with reverse experience replay and data reuse, providing the first non-asymptotic sample complexity bounds for linear MDPs and near-optimal results in tabular settings.

Contribution

It presents the first theoretical analysis of Q-learning with online target networks and reverse experience replay, achieving efficient convergence guarantees for linear and tabular MDPs.

Findings

1. Q-Rex efficiently finds the optimal policy for linear MDPs.

2. Q-RexDaRe achieves near-optimal sample complexity in tabular settings.

3. The paper provides the first non-asymptotic bounds for Q-learning with online target learning and reverse experience replay.

Abstract

Q-learning is a popular Reinforcement Learning (RL) algorithm that is widely used in practice with function approximation (Mnih et al., 2015). In contrast, existing theoretical results are pessimistic about Q-learning. For example, Baird (1995) shows that Q-learning does not converge even with linear function approximation for linear MDPs. Furthermore, even for tabular MDPs with synchronous updates, Q-learning was shown to have sub-optimal sample complexity (Li et al., 2021; Azar et al., 2013). The goal of this work is to bridge the gap between the practical success of Q-learning and the relatively pessimistic theoretical results. The starting point of our work is the observation that in practice, Q-learning is used with two important modifications: (i) training with two networks, called the online network and the target network, simultaneously (online target learning, or OTL), and (ii) experience…
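To make the two ingredients named in the title concrete, here is a minimal tabular sketch, not the paper's Q-Rex algorithm. It assumes a hypothetical 5-state chain environment, a uniformly random behaviour policy, and illustrative hyperparameters; the function names `collect_buffer` and `train` are invented for this example. The online table is updated against a frozen target table, each buffer of transitions is processed in reverse time order, and the target is refreshed only after a batch of buffers.

```python
import numpy as np

def collect_buffer(rng, n_states, buffer_size):
    """Roll out a random policy on a toy chain (assumed environment).

    Returns a list of (s, a, r, s_next, done) tuples in time order.
    """
    buffer, s = [], 0
    for _ in range(buffer_size):
        a = int(rng.integers(2))                  # 0 = left, 1 = right
        s_next = max(s - 1, 0) if a == 0 else s + 1
        done = s_next == n_states - 1             # last state is terminal
        r = 1.0 if done else 0.0                  # reward only at the goal
        buffer.append((s, a, r, s_next, done))
        s = 0 if done else s_next                 # restart after the goal
    return buffer

def train(n_states=5, gamma=0.9, lr=0.5, n_outer=30,
          buffers_per_outer=20, buffer_size=20, seed=0):
    rng = np.random.default_rng(seed)
    q_online = np.zeros((n_states, 2))
    q_target = q_online.copy()                    # frozen copy used for bootstrapping
    for _ in range(n_outer):
        for _ in range(buffers_per_outer):
            buffer = collect_buffer(rng, n_states, buffer_size)
            # Reverse experience replay: newest transition first.
            for s, a, r, s_next, done in reversed(buffer):
                bootstrap = 0.0 if done else gamma * q_target[s_next].max()
                q_online[s, a] += lr * (r + bootstrap - q_online[s, a])
        # Online target learning: refresh the target only between outer rounds.
        q_target = q_online.copy()
    return q_online

if __name__ == "__main__":
    q = train()
    print(np.round(q, 3))
    print("greedy actions in non-terminal states:", q[:-1].argmax(axis=1))
```

On this deterministic toy problem the replay order does not change the result, because every update bootstraps from the frozen target; the reversal is shown only to illustrate the data flow.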

Videos

Online Target Q-learning with Reverse Experience Replay: Efficiently finding the Optimal Policy for Linear MDPs · SlidesLive

Taxonomy

Topics: Age of Information Optimization · Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research

Methods: Experience Replay · Q-Learning