Reinforcement Learning: a Comparison of UCB Versus Alternative Adaptive Policies
Wesley Cowan, Michael N. Katehakis, Daniel Pirutinsky

TL;DR
This paper compares the performance of classic UCB, a new MDP-DMED policy, and a posterior sampling method for reinforcement learning in Markov decision processes with unknown transitions.
Contribution
It introduces the MDP-DMED policy and provides a comparative analysis of its performance against UCB and posterior sampling methods.
Findings
MDP-DMED outperforms UCB in certain scenarios.
Posterior sampling shows competitive results.
The paper offers insights into the strengths and weaknesses of each policy.
Abstract
In this paper we consider the basic version of Reinforcement Learning (RL) that involves computing optimal data-driven (adaptive) policies for Markovian decision processes with unknown transition probabilities. We provide a brief survey of the state of the art of the area and compare the performance of the classic UCB policy of Burnetas and Katehakis (1997) with a new policy developed herein, which we call MDP-Deterministic Minimum Empirical Divergence (MDP-DMED), and with a method based on posterior sampling (MDP-PS).
Taxonomy
Topics: Reinforcement Learning in Robotics
