Empirical Q-Value Iteration
Dileep Kalathil, Vivek S. Borkar, Rahul Jain

TL;DR
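This paper introduces empirical Q-value iteration (EQVI), an algorithm for learning the optimal Q-value function of a discounted-cost MDP with unknown transition kernels. It does not rely on stochastic approximation, comes with a convergence proof and a non-asymptotic sample complexity bound, and in preliminary experiments reaches a ballpark estimate faster than stochastic approximation-based algorithms.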
Contribution
The paper presents the EQVI algorithm, which converges to the optimal Q-value function without using stochastic approximation. It also gives a rate of convergence (a non-asymptotic sample complexity bound) and an asynchronous (online) variant.
Findings
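Preliminary experiments suggest faster convergence to a ballpark estimate of the optimal Q-values than stochastic approximation-based algorithms such as Q-learning.
Convergence to the optimal Q-value function is proven, along with a non-asymptotic sample complexity bound.
An asynchronous (online) version of the algorithm is also shown to work.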
Abstract
We propose a new simple and natural algorithm for learning the optimal Q-value function of a discounted-cost Markov Decision Process (MDP) when the transition kernels are unknown. Unlike the classical learning algorithms for MDPs, such as Q-learning and actor-critic algorithms, this algorithm doesn't depend on a stochastic approximation-based method. We show that our algorithm, which we call the empirical Q-value iteration (EQVI) algorithm, converges to the optimal Q-value function. We also give a rate of convergence or a non-asymptotic sample complexity bound, and also show that an asynchronous (or online) version of the algorithm will also work. Preliminary experimental results suggest a faster rate of convergence to a ball park estimate for our algorithm compared to stochastic approximation-based algorithms.
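To make the idea in the abstract concrete, here is a minimal Python sketch of one way an empirical Q-value iteration can look. It is not the authors' code. It assumes a small tabular MDP and a simulator that can draw next states for any state-action pair, and it replaces the expectation in the Bellman operator with an average over fresh samples at each iteration, with no step-size schedule. The names `make_random_mdp`, `exact_q_iteration`, `empirical_q_iteration`, and `n_samples` are hypothetical, chosen only for this illustration.

```python
import numpy as np

def make_random_mdp(rng, n_states, n_actions):
    """Hypothetical test MDP: random transition kernel P[s, a, s'] and costs c[s, a]."""
    P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
    c = rng.uniform(size=(n_states, n_actions))
    return P, c

def exact_q_iteration(P, c, gamma, iters=500):
    """Reference: Q-value iteration with the true kernel (discounted cost, so min)."""
    Q = np.zeros_like(c)
    for _ in range(iters):
        Q = c + gamma * P @ Q.min(axis=1)
    return Q

def empirical_q_iteration(P, c, gamma, n_samples, iters, rng):
    """Sketch of an empirical variant: P is used only as a simulator.

    Each iteration draws n_samples next states per (s, a) and averages
    min_b Q(s', b) over them instead of taking the exact expectation.
    """
    n_states, n_actions = c.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(iters):
        v = Q.min(axis=1)  # greedy cost-to-go under the current estimate
        Q_next = np.empty_like(Q)
        for s in range(n_states):
            for a in range(n_actions):
                next_states = rng.choice(n_states, size=n_samples, p=P[s, a])
                Q_next[s, a] = c[s, a] + gamma * v[next_states].mean()
        Q = Q_next
    return Q

rng = np.random.default_rng(0)
P, c = make_random_mdp(rng, n_states=3, n_actions=2)
Q_star = exact_q_iteration(P, c, gamma=0.5)
Q_hat = empirical_q_iteration(P, c, gamma=0.5, n_samples=500, iters=50, rng=rng)
print("max abs error:", np.max(np.abs(Q_hat - Q_star)))
```

In this sketch the iterates keep fluctuating around the optimal Q-values because fresh samples are drawn every iteration; a larger `n_samples` tightens that fluctuation at the cost of more simulator calls per iteration.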
Taxonomy
Methods: Q-Learning
