Q-learning with Posterior Sampling

Priyank Agrawal; Shipra Agrawal; Azmat Azati

arXiv:2506.00917·cs.LG·October 30, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces PSQL, a Bayesian posterior sampling-based Q-learning algorithm for reinforcement learning, providing theoretical regret bounds and insights into combining posterior sampling with dynamic programming.

Contribution

It presents a simple Q-learning algorithm using Gaussian posteriors, with regret bounds close to the lower bound, and offers new technical insights into posterior sampling in RL.

Findings

01

Achieves a regret bound of $\tilde{O}(H^{2}\sqrt{SAT})$ in tabular episodic MDPs.

02

Provides technical insights into combining posterior sampling with RL algorithms.

03

Lays groundwork for analyzing posterior sampling in more complex RL settings.

Abstract

Bayesian posterior sampling techniques have demonstrated superior empirical performance in many exploration-exploitation settings. However, their theoretical analysis remains a challenge, especially in complex settings like reinforcement learning. In this paper, we introduce Q-Learning with Posterior Sampling (PSQL), a simple Q-learning-based algorithm that uses Gaussian posteriors on Q-values for exploration, akin to the popular Thompson Sampling algorithm in the multi-armed bandit setting. We show that in the tabular episodic MDP setting, PSQL achieves a regret bound of $\tilde{O}(H^{2}\sqrt{SAT})$, closely matching the known lower bound of $\Omega(H\sqrt{SAT})$. Here, $S$ and $A$ denote the number of states and actions in the underlying Markov Decision Process (MDP), and $T = KH$, with $K$ being the number of episodes and $H$ being the planning horizon. Our work provides several new technical…
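The idea in the abstract of exploring by sampling Q-values from Gaussian posteriors can be illustrated with a minimal sketch. This is not the authors' PSQL algorithm: the function name `psql_sketch`, the posterior standard deviation `sigma / sqrt(n)`, the learning rate `(H + 1) / (H + n)`, the single-sample target, the clipping of the target, and the deterministic rewards are all assumptions chosen for illustration.

```python
import numpy as np

def psql_sketch(P, R, H, episodes, sigma=1.0, seed=0):
    """Tabular episodic Q-learning with Gaussian sampling on Q-values.

    Illustrative only (hypothetical helper, not the paper's algorithm).
    P: (S, A, S) transition probabilities; R: (S, A) rewards in [0, 1].
    """
    rng = np.random.default_rng(seed)
    S, A = R.shape
    q_mean = np.full((H, S, A), float(H))      # optimistic initial means (assumption)
    counts = np.zeros((H, S, A), dtype=int)    # visit counts n_h(s, a)
    returns = np.zeros(episodes)

    def sample_q(h, s):
        # Assumed posterior: N(q_mean, sigma^2 / max(n, 1)) per action.
        std = sigma / np.sqrt(np.maximum(counts[h, s], 1))
        return rng.normal(q_mean[h, s], std)

    for k in range(episodes):
        s = 0                                   # fixed start state (assumption)
        for h in range(H):
            a = int(np.argmax(sample_q(h, s)))  # act greedily on the sample
            r = R[s, a]                         # deterministic reward (assumption)
            s_next = int(rng.choice(S, p=P[s, a]))
            returns[k] += r

            counts[h, s, a] += 1
            n = counts[h, s, a]
            alpha = (H + 1) / (H + n)           # assumed step size, equals 1 at n = 1

            if h + 1 < H:
                # Single sampled next-step value, clipped to the feasible range.
                v_next = float(np.max(sample_q(h + 1, s_next)))
                v_next = min(max(v_next, 0.0), float(H - h - 1))
            else:
                v_next = 0.0                    # episode ends after step H - 1

            q_mean[h, s, a] = (1 - alpha) * q_mean[h, s, a] + alpha * (r + v_next)
            s = s_next
    return q_mean, counts, returns
```

As in Thompson Sampling for bandits, the sampling noise shrinks as a state-action pair is visited more often, so exploration concentrates on pairs whose values are still uncertain. The clipping step is one simple way to keep targets bounded despite the unbounded Gaussian noise; the paper may handle this differently.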

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 2 · Confidence 3

Strengths

Authors discuss the limitations of the analysis of the vanilla PSQL algorithm

Weaknesses

- In my opinion, the empirical results are not sufficiently extensive. It would be interesting, for example, to consider a comparison with the PSRL algorithm, which was shown to outperform Staged-RandQL in a recent study (Tiapkin et al., 2023). Furthermore, the paper lacks a comparison in more complex environments, specifically those with a continuous state space.
- Another interesting direction would be to extend this algorithm to more practical scenarios with a general state space. If this is

Reviewer 02 · Rating 8 · Confidence 4

Strengths

The work is theoretically grounded, algorithmically simple, and provides new insights into the Bayesian interpretation of Q-learning. The regret guarantee is strong, and the analysis tackles key challenges in combining posterior sampling with TD learning.

Weaknesses

Using Gaussian posteriors on Q-values may destroy important structural properties of Q-functions (e.g., boundedness or Bellman consistency), since Gaussian distributions are unbounded. The choice of posterior variance is subtle and strongly affects performance, requiring careful tuning. Moreover, the use of multiple posterior samples for target computation increases the algorithm’s computational complexity, and the theoretically unanalyzed single-sample variant (PSQL*) outperforms the analyzed o

Reviewer 03 · Rating 6 · Confidence 3

Strengths

- An interesting alternative explanation of the UCB-Q-learning learning rate, which arises from the additional entropy regularization in the variational approximation, with a clear intuition of "collapse avoidance" through entropy, motivated by bias in the estimate.
- Strong empirical performance as well as theoretical regret guarantees.

Weaknesses

- Lack of empirical comparison with a usual RandQL. Although this method does not offer the same rigorous guarantees as its staged version, it would be interesting to compare PSQL* and a usual RandQL without stages.
- The regret bound does not match the regret bound of a variance-reduced version of Q-learning (Li et al., 2021).

Taxonomy

Topics: Face and Expression Recognition