Randomised Bayesian Least-Squares Policy Iteration
Nikolaos Tziortziotis, Christos Dimitrakakis, Michalis Vazirgiannis

TL;DR
This paper presents Bayesian Least-Squares Policy Iteration (BLSPI), which evaluates policies with Bayesian least-squares temporal-difference learning (BLSTD), and its online variant RBLSPI, which improves exploration in reinforcement learning through Thompson sampling-inspired action selection.
Contribution
The paper introduces RBLSPI, an online, model-free policy iteration algorithm that leverages Bayesian uncertainty quantification for enhanced exploration in reinforcement learning.
Findings
RBLSPI balances exploration and exploitation by selecting actions from a value function sampled from the posterior.
Experimental results show improved policy performance.
RBLSPI demonstrates strong exploration capabilities.
Abstract
We introduce Bayesian least-squares policy iteration (BLSPI), an off-policy, model-free policy iteration algorithm that uses the Bayesian least-squares temporal-difference (BLSTD) learning algorithm to evaluate policies. We also propose an online variant of BLSPI, called randomised BLSPI (RBLSPI), that improves its policy based on an incomplete policy evaluation step. In the online setting, the exploration-exploitation dilemma must be addressed, since the agent tries to discover the optimal policy using samples it collects itself. RBLSPI exploits the ability of BLSTD to quantify uncertainty about the value function. Inspired by Thompson sampling, RBLSPI first samples a value function from a posterior distribution over value functions, and then selects actions based on the sampled value function. The effectiveness and the exploration abilities of RBLSPI are demonstrated…
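The sample-then-act step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a linear value function Q(s, a) = phi(s, a)^T w with a Gaussian posterior N(mean, cov) over the weights w, and it takes that posterior as given rather than computing it with BLSTD. All names here (`cholesky`, `sample_weights`, `thompson_action`, `feature_fn`) are hypothetical and chosen for illustration only.

```python
import math
import random


def cholesky(cov):
    """Return lower-triangular L with L L^T = cov (cov must be symmetric positive definite)."""
    n = len(cov)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(cov[i][i] - s)
            else:
                L[i][j] = (cov[i][j] - s) / L[j][j]
    return L


def sample_weights(mean, cov, rng):
    """Draw w ~ N(mean, cov) as w = mean + L z, where z ~ N(0, I)."""
    L = cholesky(cov)
    z = [rng.gauss(0.0, 1.0) for _ in mean]
    return [m + sum(L[i][k] * z[k] for k in range(i + 1))
            for i, m in enumerate(mean)]


def q_value(features, weights):
    """Linear value estimate: dot product of features and weights."""
    return sum(f * w for f, w in zip(features, weights))


def thompson_action(state, actions, feature_fn, mean, cov, rng):
    """Sample one weight vector from the (assumed) posterior, then act greedily on it."""
    w = sample_weights(mean, cov, rng)
    return max(actions, key=lambda a: q_value(feature_fn(state, a), w))


# Illustrative use: two actions with one-hot features and an assumed posterior.
def one_hot_features(state, action):
    return [1.0 if action == i else 0.0 for i in range(2)]


rng = random.Random(0)
posterior_mean = [1.0, 0.0]
posterior_cov = [[4.0, 0.0], [0.0, 4.0]]  # wide posterior: both actions get tried
chosen = thompson_action(None, [0, 1], one_hot_features,
                         posterior_mean, posterior_cov, rng)
```

With a wide posterior the sampled weights vary from draw to draw, so the agent sometimes tries the action with the lower mean estimate. As the posterior concentrates, the choice becomes effectively greedy with respect to the posterior mean.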
Taxonomy
Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Machine Learning and Algorithms
