Optimism in Reinforcement Learning and Kullback-Leibler Divergence
Sarah Filippi (LTCI), Olivier Cappé (LTCI), Aurélien Garivier (LTCI)

TL;DR
This paper advocates for using Kullback-Leibler divergence in optimistic model-based reinforcement learning, introducing KL-UCRL, which matches UCRL2's theoretical guarantees but shows improved empirical performance, especially in less connected MDPs.
Contribution
The paper introduces KL-UCRL, an efficient algorithm leveraging KL divergence for optimism in reinforcement learning, with theoretical regret guarantees and improved empirical results.
Findings
KL-UCRL matches UCRL2's regret bounds.
Numerical experiments show better performance, especially in less connected MDPs.
Geometric analysis explains the improved behavior.
Abstract
We consider model-based reinforcement learning in finite Markov Decision Processes (MDPs), focussing on so-called optimistic strategies. In MDPs, optimism can be implemented by carrying out extended value iterations under a constraint of consistency with the estimated model transition probabilities. The UCRL2 algorithm by Auer, Jaksch and Ortner (2009), which follows this strategy, has recently been shown to guarantee near-optimal regret bounds. In this paper, we strongly argue in favor of using the Kullback-Leibler (KL) divergence for this purpose. By studying the linear maximization problem under KL constraints, we provide an efficient algorithm, termed KL-UCRL, for solving KL-optimistic extended value iteration. Using recent deviation bounds on the KL divergence, we prove that KL-UCRL provides the same guarantees as UCRL2 in terms of regret. However, numerical experiments on…
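To make the "linear maximization problem under KL constraints" concrete, here is a minimal Python sketch, not the paper's own procedure. It assumes the inner problem has the form: maximize q·V over probability vectors q subject to KL(p̂ ‖ q) ≤ ε, where p̂ is the estimated transition vector and V the current value vector. It further assumes every entry of p̂ is strictly positive (zero entries need separate handling that is omitted here) and uses plain bisection as the root finder. Under these assumptions the Lagrangian conditions give q_i ∝ p̂_i / (ν − V_i) for some ν > max V, and the sketch searches for the ν that makes the constraint tight. The function names, the parameter `eps`, and the iteration counts are illustrative choices, not taken from the paper.

```python
import numpy as np


def kl_divergence(p, q):
    """KL(p || q) for strictly positive probability vectors."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))


def kl_optimistic_transition(p_hat, values, eps, n_iter=200):
    """Sketch: maximize q @ values subject to KL(p_hat || q) <= eps.

    Assumes p_hat is strictly positive and sums to 1 (hypothetical
    simplification; zero entries are not handled here).
    """
    p = np.asarray(p_hat, dtype=float)
    v = np.asarray(values, dtype=float)
    if np.any(p <= 0) or not np.isclose(p.sum(), 1.0):
        raise ValueError("p_hat must be a strictly positive probability vector")
    if eps <= 0:
        raise ValueError("eps must be positive")

    span = v.max() - v.min()
    if span == 0.0:
        # Constant values: every q gives the same objective, keep p_hat.
        return p.copy()

    gaps = v.max() - v  # nonnegative; zero at the best next state

    def q_of(delta):
        # Stationary point q_i proportional to p_i / (nu - V_i),
        # written with delta = nu - max(V) > 0 for numerical safety.
        w = p / (delta + gaps)
        return w / w.sum()

    def kl_at(delta):
        return kl_divergence(p, q_of(delta))

    # KL(p_hat || q_of(delta)) decreases as delta grows: bracket the root.
    lo = hi = span
    for _ in range(200):
        if kl_at(lo) > eps:
            break
        lo /= 10.0
    else:
        raise RuntimeError("could not bracket the solution from below")
    for _ in range(200):
        if kl_at(hi) < eps:
            break
        hi *= 2.0
    else:
        raise RuntimeError("could not bracket the solution from above")

    # Bisection on a log scale; hi always stays on the feasible side.
    for _ in range(n_iter):
        mid = np.sqrt(lo) * np.sqrt(hi)
        if kl_at(mid) > eps:
            lo = mid
        else:
            hi = mid
    return q_of(hi)
```

For example, `kl_optimistic_transition([0.5, 0.3, 0.2], [0.0, 1.0, 2.0], 0.1)` returns a vector that shifts probability mass toward the highest-valued state while staying on the boundary of the KL ball around the estimate. In an extended value iteration, a routine of this kind would be called once per state-action pair at every iteration, which is why an efficient solver for this inner problem matters.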
