Q-learning with Posterior Sampling
Priyank Agrawal, Shipra Agrawal, Azmat Azati

TL;DR
This paper introduces PSQL, a Bayesian posterior sampling-based Q-learning algorithm for reinforcement learning, providing theoretical regret bounds and insights into combining posterior sampling with dynamic programming.
Contribution
It presents a simple Q-learning algorithm using Gaussian posteriors, with regret bounds close to the lower bound, and offers new technical insights into posterior sampling in RL.
Findings
Achieves a regret bound of $\tilde{O}(H^2\sqrt{SAT})$ in tabular episodic MDPs.
Provides technical insights into combining posterior sampling with RL algorithms.
Lays groundwork for analyzing posterior sampling in more complex RL settings.
Abstract
Bayesian posterior sampling techniques have demonstrated superior empirical performance in many exploration-exploitation settings. However, their theoretical analysis remains a challenge, especially in complex settings like reinforcement learning. In this paper, we introduce Q-Learning with Posterior Sampling (PSQL), a simple Q-learning-based algorithm that uses Gaussian posteriors on Q-values for exploration, akin to the popular Thompson Sampling algorithm in the multi-armed bandit setting. We show that in the tabular episodic MDP setting, PSQL achieves a regret bound of $\tilde{O}(H^2\sqrt{SAT})$, closely matching the known lower bound of $\Omega(H\sqrt{SAT})$. Here, $S$ and $A$ denote the number of states and actions in the underlying Markov Decision Process (MDP), and $T = KH$, with $K$ being the number of episodes and $H$ being the planning horizon. Our work provides several new technical…
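To make the idea in the abstract concrete, here is a minimal Python sketch of tabular episodic Q-learning that explores by sampling Q-values from Gaussians centered at the current estimates. It is an illustration under stated assumptions, not the authors' PSQL: the variance schedule `var_scale * H**2 / (n + 1)`, the learning rate `(H + 1) / (H + n)` (borrowed from UCB-style Q-learning), the greedy bootstrap target, the fixed start state, the deterministic rewards, and the names `run_gaussian_q_sampling` and `toy_mdp` are all hypothetical choices made for this example.

```python
import numpy as np


def toy_mdp():
    """Hypothetical 2-state, 2-action MDP used only for illustration."""
    # P[s, a, s'] = transition probability; R[s, a] = deterministic reward in [0, 1].
    P = np.array([
        [[1.0, 0.0], [0.1, 0.9]],   # state 0: action 1 usually reaches state 1
        [[0.0, 1.0], [0.0, 1.0]],   # state 1: absorbing
    ])
    R = np.array([
        [0.0, 0.0],                 # no reward in state 0
        [1.0, 1.0],                 # reward 1 in state 1
    ])
    return P, R


def run_gaussian_q_sampling(P, R, H, K, seed=0, var_scale=1.0):
    """Tabular episodic Q-learning with Gaussian sampling for exploration.

    Returns the table of Q-value means, shape (H + 1, S, A), and the
    per-episode returns, shape (K,).
    """
    rng = np.random.default_rng(seed)
    S, A = R.shape
    q_mean = np.zeros((H + 1, S, A))          # row H stays zero (end of episode)
    counts = np.zeros((H, S, A), dtype=int)   # visit counts per (step, state, action)
    returns = np.zeros(K)

    for k in range(K):
        s = 0  # assumption: every episode starts in state 0
        for h in range(H):
            # Assumed variance schedule: shrinks as the visit count grows.
            std = np.sqrt(var_scale * H**2 / (counts[h, s] + 1.0))
            # Draw one sample per action and act greedily on the samples.
            q_sample = rng.normal(q_mean[h, s], std)
            a = int(np.argmax(q_sample))

            r = R[s, a]
            s_next = rng.choice(S, p=P[s, a])
            returns[k] += r

            counts[h, s, a] += 1
            n = counts[h, s, a]
            lr = (H + 1) / (H + n)            # assumed learning-rate schedule
            # Simplification: bootstrap from the greedy value of the means.
            target = r + q_mean[h + 1, s_next].max()
            q_mean[h, s, a] += lr * (target - q_mean[h, s, a])
            s = s_next

    return q_mean, returns
```

Calling `run_gaussian_q_sampling(*toy_mdp(), H=4, K=200)` returns the learned means and the per-episode returns. Because the samples are Gaussian, individual draws can fall outside the range of achievable returns even though the means stay within it, which is the structural concern raised in the reviews.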
Peer Reviews
Decision: ICLR 2026 Poster
The authors discuss the limitations of the analysis of the vanilla PSQL algorithm.
- In my opinion, the empirical results are not sufficiently extensive. It would be interesting, for example, to consider a comparison with the PSRL algorithm, which was shown to outperform Staged-RandQL in a recent study (Tiapkin et al., 2023). Furthermore, the paper lacks a comparison in more complex environments, specifically those with a continuous state space.
- Another interesting direction would be to extend this algorithm to more practical scenarios with a general state space. If this is
The work is theoretically grounded, algorithmically simple, and provides new insights into the Bayesian interpretation of Q-learning. The regret guarantee is strong, and the analysis tackles key challenges in combining posterior sampling with TD learning.
Using Gaussian posteriors on Q-values may destroy important structural properties of Q-functions (e.g., boundedness or Bellman consistency), since Gaussian distributions are unbounded. The choice of posterior variance is subtle and strongly affects performance, requiring careful tuning. Moreover, the use of multiple posterior samples for target computation increases the algorithm’s computational complexity, and the theoretically unanalyzed single-sample variant (PSQL*) outperforms the analyzed o
- Interesting alternative explanation of the UCB-Q-learning learning rate, which arises from the additional entropy regularization in the variational approximation, with a clear intuition of "collapse avoidance" through entropy due to bias in the estimate.
- Strong empirical performance as well as theoretical regret guarantees.
- Lack of empirical comparison with standard RandQL. Although this method does not offer the same rigorous guarantees as its staged version, it would be interesting to compare PSQL* with standard RandQL without stages.
- The regret bound does not match that of a variance-reduced version of Q-learning (Li et al. 2021).
