Provable and Practical: Efficient Exploration in Reinforcement Learning via Langevin Monte Carlo

Haque Ishfaq; Qingfeng Lan; Pan Xu; A. Rupam Mahmood; Doina Precup; Anima Anandkumar; Kamyar Azizzadenesheli

arXiv:2305.18246 · cs.LG · March 19, 2024 · 5 cites

TL;DR

This paper introduces a scalable exploration strategy for reinforcement learning that directly samples from the posterior distribution of the Q function using Langevin Monte Carlo, improving efficiency and effectiveness in deep RL tasks.

Contribution

It proposes a novel Langevin Monte Carlo-based Thompson sampling method for RL, avoiding Gaussian approximations and enabling easy deployment in deep RL with theoretical guarantees.

Findings

01

Achieves a regret bound of $\tilde{O}(d^{3/2} H^{3/2} \sqrt{T})$ in linear MDPs

02

Demonstrates superior or comparable performance on Atari57 exploration tasks

03

Provides a practical and theoretically sound exploration method for deep RL

Abstract

We present a scalable and effective exploration strategy based on Thompson sampling for reinforcement learning (RL). One of the key shortcomings of existing Thompson sampling algorithms is the need to perform a Gaussian approximation of the posterior distribution, which is not a good surrogate in most practical settings. We instead directly sample the Q function from its posterior distribution, by using Langevin Monte Carlo, an efficient type of Markov Chain Monte Carlo (MCMC) method. Our method only needs to perform noisy gradient descent updates to learn the exact posterior distribution of the Q function, which makes our approach easy to deploy in deep RL. We provide a rigorous theoretical analysis for the proposed method and demonstrate that, in the linear Markov decision process (linear MDP) setting, it has a regret bound of $\tilde{O}(d^{3/2} H^{3/2} \sqrt{T})$, where $d$ is the…
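The "noisy gradient descent" idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a linear Q function with a regularized least-squares loss, and the names `lmc_sample`, `make_toy_data`, `greedy_action`, `step_size`, `inv_temp`, and `reg` are hypothetical. Each step takes a gradient step on the loss and adds Gaussian noise, so the iterate behaves approximately like a sample from a distribution proportional to exp(-inv_temp * loss) rather than a point estimate.

```python
import numpy as np


def lmc_sample(phi, y, n_steps=200, step_size=1e-3, inv_temp=1.0, reg=1.0, rng=None):
    """Draw one approximate posterior sample of linear weights via Langevin Monte Carlo.

    Assumed loss (illustrative): 0.5 * ||phi @ w - y||^2 + 0.5 * reg * ||w||^2.
    """
    rng = np.random.default_rng() if rng is None else rng
    w = np.zeros(phi.shape[1])
    for _ in range(n_steps):
        # Gradient of the regularized least-squares loss.
        grad = phi.T @ (phi @ w - y) + reg * w
        # Noisy gradient descent: gradient step plus scaled Gaussian noise.
        noise = rng.standard_normal(w.shape)
        w = w - step_size * grad + np.sqrt(2.0 * step_size / inv_temp) * noise
    return w


def make_toy_data(n=200, d=3, seed=0):
    """Synthetic regression data standing in for (features, target) pairs."""
    rng = np.random.default_rng(seed)
    w_true = rng.standard_normal(d)
    phi = rng.standard_normal((n, d))
    y = phi @ w_true + 0.1 * rng.standard_normal(n)
    return phi, y


def greedy_action(w, action_features):
    """Thompson-style choice: act greedily with respect to one sampled w.

    action_features has shape (num_actions, d); row a is the feature of action a.
    """
    return int(np.argmax(action_features @ w))


if __name__ == "__main__":
    phi, y = make_toy_data()
    rng = np.random.default_rng(1)
    w = lmc_sample(phi, y, rng=rng)
    print("sampled weights:", w)
    print("greedy action:", greedy_action(w, np.eye(phi.shape[1])))
```

Because the loss here is quadratic, the target distribution happens to be Gaussian and the samples should scatter around the ridge-regression solution. The same update applies unchanged to non-quadratic losses, which is where avoiding a Gaussian approximation matters.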

Code & Models

Repositories

hmishfaq/lmc-lsvi (PyTorch, official)

Taxonomy

Topics: Reinforcement Learning in Robotics · Machine Learning and Algorithms · Advanced Bandit Algorithms Research

Methods: Adam