Empirical Q-Value Iteration

Dileep Kalathil; Vivek S. Borkar; Rahul Jain

arXiv:1412.0180·math.OC·January 31, 2019

TL;DR

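This paper introduces empirical Q-value iteration (EQVI), an algorithm for learning the optimal Q-value function of a discounted-cost MDP with unknown transition kernels. It does not rely on stochastic approximation, comes with a convergence proof and a non-asymptotic sample complexity bound, and in preliminary experiments reaches a ballpark estimate faster than stochastic approximation-based algorithms.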

Contribution

The paper presents a novel EQVI algorithm that converges to the optimal Q-value function without stochastic approximation, providing convergence rates and an asynchronous variant.

Findings

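01 Preliminary experiments suggest faster convergence to a ballpark estimate of the optimal Q-values than stochastic approximation-based algorithms such as Q-learning.

02 Convergence to the optimal Q-value function is proven, together with a rate of convergence (a non-asymptotic sample complexity bound).

03 An asynchronous (online) version of the algorithm is also shown to work.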

Abstract

We propose a new, simple, and natural algorithm for learning the optimal Q-value function of a discounted-cost Markov Decision Process (MDP) when the transition kernels are unknown. Unlike classical learning algorithms for MDPs, such as Q-learning and actor-critic algorithms, this algorithm does not depend on a stochastic approximation-based method. We show that our algorithm, which we call the empirical Q-value iteration (EQVI) algorithm, converges to the optimal Q-value function. We also give a rate of convergence, or non-asymptotic sample complexity bound, and show that an asynchronous (or online) version of the algorithm also works. Preliminary experimental results suggest a faster rate of convergence to a ballpark estimate for our algorithm compared to stochastic approximation-based algorithms.
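The idea described in the abstract can be illustrated with a minimal sketch. It assumes a generative simulator for next states and replaces the expectation in the Q-value Bellman update with an average over freshly drawn samples at every iteration, so no step-size sequence is involved. The function names, the argument layout, and the use of an explicit kernel `P` as a stand-in simulator are assumptions made for illustration; the paper's exact update and sampling scheme may differ in details.

```python
import random


def empirical_q_value_iteration(P, cost, gamma, n_samples, n_iters, seed=0):
    """Sketch of an empirical Q-value iteration for a discounted-cost MDP.

    P[s][a] is a list of next-state probabilities. It is used ONLY to
    simulate next states (a hypothetical stand-in for the unknown kernel).
    cost[s][a] is the one-step cost, gamma the discount factor.
    """
    rng = random.Random(seed)
    n_states, n_actions = len(cost), len(cost[0])
    states = list(range(n_states))
    Q = [[0.0] * n_actions for _ in states]
    for _ in range(n_iters):
        # Cost-to-go under the current estimate (min, since costs are minimized).
        V = [min(row) for row in Q]
        Q_next = [[0.0] * n_actions for _ in states]
        for s in states:
            for a in range(n_actions):
                # Fresh next-state samples replace the exact expectation.
                draws = rng.choices(states, weights=P[s][a], k=n_samples)
                avg = sum(V[j] for j in draws) / n_samples
                Q_next[s][a] = cost[s][a] + gamma * avg
        Q = Q_next
    return Q


def exact_q_value_iteration(P, cost, gamma, n_iters):
    """Reference Q-value iteration with the true kernel, for comparison only."""
    n_states, n_actions = len(cost), len(cost[0])
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(n_iters):
        V = [min(row) for row in Q]
        Q = [
            [
                cost[s][a] + gamma * sum(p * v for p, v in zip(P[s][a], V))
                for a in range(n_actions)
            ]
            for s in range(n_states)
        ]
    return Q
```

Because each iterate is built from new samples, the empirical iterates keep fluctuating around the optimal Q-values rather than settling exactly; increasing `n_samples` shrinks that fluctuation, which is the trade-off a sample complexity bound quantifies.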

Taxonomy

Methods: Q-Learning