Reinforcement Learning: a Comparison of UCB Versus Alternative Adaptive Policies

Wesley Cowan; Michael N. Katehakis; Daniel Pirutinsky

arXiv:1909.06019·cs.LG·September 16, 2019

Open Access

TL;DR

This paper compares the performance of classic UCB, a new MDP-DMED policy, and a posterior sampling method for reinforcement learning in Markov decision processes with unknown transitions.

Contribution

It introduces the MDP-DMED policy and provides a comparative analysis of its performance against UCB and posterior sampling methods.

Findings

01. MDP-DMED outperforms UCB in certain scenarios.

02. Posterior sampling shows competitive results.

03. The paper offers insights into the strengths and weaknesses of each policy.

Abstract

In this paper we consider the basic version of Reinforcement Learning (RL), which involves computing optimal data-driven (adaptive) policies for Markovian decision processes with unknown transition probabilities. We provide a brief survey of the state of the art in the area, and we compare the performance of the classic UCB policy of Burnetas and Katehakis (1997) with a new policy developed herein, which we call MDP-Deterministic Minimum Empirical Divergence (MDP-DMED), and with a method based on posterior sampling (MDP-PS).
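To make the posterior sampling idea concrete, here is a minimal sketch of one decision step for a small MDP with unknown transition probabilities. It is not the paper's MDP-PS policy: the uniform Dirichlet prior, the known reward table `R`, the discounted value iteration, and the function names `sample_transition_model` and `greedy_policy` are all assumptions made for illustration.

```python
import numpy as np

def sample_transition_model(counts, rng):
    """Draw one transition tensor P[s, a, :] from independent Dirichlet posteriors.

    counts[s, a, s2] holds observed transition counts plus a prior pseudo-count
    (a uniform Dirichlet prior is an assumption of this sketch).
    """
    n_states, n_actions, _ = counts.shape
    P = np.empty_like(counts, dtype=float)
    for s in range(n_states):
        for a in range(n_actions):
            P[s, a] = rng.dirichlet(counts[s, a])
    return P

def greedy_policy(P, R, gamma=0.95, n_iter=500):
    """Run value iteration on the sampled model; return the greedy action per state.

    The discounted criterion is an assumption chosen to keep the sketch short.
    """
    V = np.zeros(P.shape[0])
    for _ in range(n_iter):
        Q = R + gamma * (P @ V)  # (S, A, S) @ (S,) -> (S, A)
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

# Hypothetical 3-state, 2-action problem in which action 1 always pays more.
rng = np.random.default_rng(0)
n_states, n_actions = 3, 2
counts = np.ones((n_states, n_actions, n_states))  # prior pseudo-counts only
R = np.zeros((n_states, n_actions))
R[:, 1] = 1.0

P_sample = sample_transition_model(counts, rng)
policy = greedy_policy(P_sample, R)

# After observing a transition (s, a, s2), the posterior update is one increment.
s, a, s2 = 0, int(policy[0]), 2
counts[s, a, s2] += 1
```

Each round samples a plausible model from the posterior, acts optimally for that sample, and then updates the counts with the observed transition. Exploration comes from the randomness of the sampled model rather than from an explicit confidence bonus as in UCB-type policies.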

Taxonomy

Topics: Reinforcement Learning in Robotics