Posterior Sampling Reinforcement Learning with Gaussian Processes for Continuous Control: Sublinear Regret Bounds for Unbounded State Spaces

Hamish Flynn; Joe Watson; Ingmar Posner; Jan Peters

arXiv:2603.08287 · stat.ML · March 10, 2026

TL;DR

This paper analyzes the Bayesian regret of Gaussian process posterior sampling reinforcement learning (GP-PSRL) for continuous control and establishes sublinear regret bounds that hold even when the state space is unbounded.

Contribution

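The paper derives the first Bayesian regret bounds for GP-PSRL that achieve a tight dependence on the maximum information gain while accounting for an unbounded state space. The analysis combines a recursive application of the Borell-Tsirelson-Ibragimov-Sudakov inequality with the chaining method.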

Findings

01

Regret bound of order \(H^{3/2}\sqrt{\frac{\text{max info gain}}{T}}\,T\)

02

States visited are contained within a near-constant radius ball with high probability

03

Provides a theoretical foundation for analyzing GP-PSRL in complex, unbounded environments
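
The order stated in finding 01 can be simplified, which makes the sublinearity claim easier to see. In the sketch below, \gamma is an assumed shorthand for the maximum information gain (the listing does not name a symbol for it), and H is treated as fixed while T grows, which is also an assumption here.

```latex
% \gamma: assumed shorthand for the maximum information gain
H^{3/2}\sqrt{\frac{\gamma}{T}}\,T \;=\; H^{3/2}\sqrt{\gamma\,T},
\qquad
\frac{H^{3/2}\sqrt{\gamma\,T}}{T} \;=\; H^{3/2}\sqrt{\frac{\gamma}{T}} \;\to\; 0
\quad \text{whenever } \gamma = o(T).
```

The bound therefore grows more slowly than T whenever the maximum information gain does, for instance when it grows only polylogarithmically in T, as is known for the squared exponential kernel in the GP bandit literature.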

Abstract

We analyze the Bayesian regret of the Gaussian process posterior sampling reinforcement learning (GP-PSRL) algorithm. Posterior sampling is an effective heuristic for decision-making under uncertainty that has been used to develop successful algorithms for a variety of continuous control problems. However, theoretical work on GP-PSRL is limited. All known regret bounds either fail to achieve a tight dependence on a kernel-dependent quantity called the maximum information gain or fail to properly account for the fact that the set of possible system states is unbounded. Through a recursive application of the Borell-Tsirelson-Ibragimov-Sudakov inequality, we show that, with high probability, the states actually visited by the algorithm are contained within a ball of near-constant radius. To obtain tight dependence on the maximum information gain, we use the chaining method to control the…
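
The two quantities the abstract leans on, a sample from a Gaussian process posterior and the information gain of a set of inputs, can be illustrated with a minimal NumPy sketch. This is not the paper's algorithm: it shows only the posterior-sampling step for a toy one-dimensional model and omits the planning step that GP-PSRL would run on the sampled model. The squared exponential kernel, the noise variance, the toy data, and every function name (`rbf_kernel`, `information_gain`, `sample_posterior`) are assumptions made for illustration.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0):
    """Squared exponential kernel between rows of a (n, d) and b (m, d)."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

def information_gain(x, noise_var=0.1, lengthscale=1.0):
    """Information gain 0.5 * log det(I + K / noise_var) for inputs x (n, d)."""
    k = rbf_kernel(x, x, lengthscale)
    _, logdet = np.linalg.slogdet(np.eye(len(x)) + k / noise_var)
    return 0.5 * logdet

def sample_posterior(x_train, y_train, x_test, rng, noise_var=0.1, lengthscale=1.0):
    """Draw one function sample from the GP posterior, evaluated at x_test."""
    k_tt = rbf_kernel(x_train, x_train, lengthscale) + noise_var * np.eye(len(x_train))
    k_ts = rbf_kernel(x_train, x_test, lengthscale)
    k_ss = rbf_kernel(x_test, x_test, lengthscale)
    mean = k_ts.T @ np.linalg.solve(k_tt, y_train)
    cov = k_ss - k_ts.T @ np.linalg.solve(k_tt, k_ts)
    cov = 0.5 * (cov + cov.T)  # symmetrize against round-off
    return rng.multivariate_normal(mean, cov)

# Toy data (assumed): noisy observations of an unknown 1-D function.
rng = np.random.default_rng(0)
x_train = rng.uniform(-2.0, 2.0, size=(20, 1))
y_train = np.sin(x_train[:, 0]) + np.sqrt(0.1) * rng.standard_normal(20)
x_test = np.linspace(-2.0, 2.0, 5)[:, None]

sample = sample_posterior(x_train, y_train, x_test, rng)
gains = [information_gain(x_train[:n]) for n in (5, 10, 20)]
```

The maximum information gain is the largest value `information_gain` can take over all input sets of a given size. The sketch evaluates it only for one fixed set of inputs, so `gains` is a lower bound on that quantity at each size.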

Taxonomy

Topics: Advanced Bandit Algorithms Research · Gaussian Processes and Bayesian Inference · Reinforcement Learning in Robotics