ReversedQ: Opportunities for Faster Q-Learning in Episodic Online Reinforcement Learning
Sofia R. Miskala-Dinc, Aviva Prins

TL;DR
This paper introduces ReversedQ, a set of modifications to Q-learning that substantially speed up learning in episodic MDPs, with empirical improvements over RandomizedQ.
Contribution
The paper proposes ReversedQ, which changes the value-function update order, update frequency, and value-function initialization of RandomizedQ to enable faster learning in episodic Q-learning.
Findings
ReversedQ improves scaled mean cumulative reward in the Bidirectional Diabolical Combination Lock (BDCL) from 9.53% to 78.78%.
ReversedQ improves scaled mean cumulative reward in a chain MDP from 21.76% to 61.81%.
Both gains are measured relative to Wang et al.'s RandomizedQ, the algorithm that ReversedQ modifies.
Abstract
We study model-free Q-learning in finite-horizon episodic Markov Decision Processes (MDPs) with stationary dynamics across episodes. We identify a central issue in nascent model-free posterior-sampling works: the reliance on delayed learning in order to prove theoretical guarantees. In particular, we identify three opportunities for faster learning: (i) value-function update order, (ii) update frequencies, and (iii) value-function initialization. Using Wang et al.'s RandomizedQ as a basis, we illustrate these changes and their individual (as well as cumulative) impact in multiple empirical studies. We find that our combined modifications, termed ReversedQ, improve scaled mean cumulative reward compared to RandomizedQ, from 9.53% to 78.78% in the Bidirectional Diabolical Combination Lock (BDCL), and from 21.76% to 61.81% in a chain MDP.
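To give some intuition for why update order, item (i) in the abstract, can matter, here is a minimal sketch. It is not RandomizedQ or ReversedQ: it uses plain tabular Q-learning with no posterior sampling, on a hypothetical deterministic chain where only the final transition is rewarded. The function names, the learning rate of 1, and the toy environment are all assumptions made for illustration.

```python
import numpy as np

def run_episode(Q, trajectory, reverse):
    """Apply one Q-learning backup per transition, in forward or reverse step order."""
    steps = reversed(trajectory) if reverse else trajectory
    for h, s, a, r, s_next in steps:
        # Q has H + 1 step slices; slice H is never written and acts as the
        # zero terminal value.
        target = r + Q[h + 1, s_next].max()
        Q[h, s, a] = target  # learning rate 1, chosen only to keep the toy simple

def episodes_until_start_value_known(H, reverse):
    """Count episodes until the value of the first step reflects the final reward."""
    Q = np.zeros((H + 1, H + 1, 2))  # indexed by [step, state, action]
    # Hypothetical fixed trajectory: always take action 1 ("right");
    # reward 1.0 is paid only on the last transition.
    trajectory = [(h, h, 1, 1.0 if h == H - 1 else 0.0, h + 1) for h in range(H)]
    for episode in range(1, H + 2):
        run_episode(Q, trajectory, reverse)
        if Q[0, 0, 1] == 1.0:
            return episode
    return None

if __name__ == "__main__":
    print("forward order:", episodes_until_start_value_known(5, reverse=False))
    print("reverse order:", episodes_until_start_value_known(5, reverse=True))
```

In this toy setting, forward-order backups move the final reward back by one step per episode, while reverse-order backups carry it to the first step within a single episode. The paper's actual modifications and their measured effect are described in the abstract above; this sketch only illustrates the general mechanism under the stated assumptions.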
