A Bayesian Approach to Learning Bandit Structure in Markov Decision Processes
Kelly W. Zhang, Omer Gottesman, Finale Doshi-Velez

TL;DR
This paper introduces a Bayesian online algorithm that helps determine whether a decision-making environment is better modeled as a contextual bandit (CB) or a Markov decision process (MDP), improving learning efficiency and robustness.
Contribution
It presents a Bayesian hypothesis testing approach that incorporates prior knowledge to adaptively distinguish between CB and MDP environments, interpolating between the two models.
Findings
Lower regret in CB settings compared to MDP algorithms
Effective learning of optimal policy in MDP settings
Robustness to environment misspecification
Abstract
In the reinforcement learning literature, there are many algorithms developed for either contextual bandit (CB) or Markov decision process (MDP) environments. However, when deploying reinforcement learning algorithms in the real world, even with domain expertise, it is often difficult to know whether it is appropriate to treat a sequential decision-making problem as a CB or an MDP. In other words, do actions affect future states, or only the immediate rewards? Making the wrong assumption regarding the nature of the environment can lead to inefficient learning, or even prevent the algorithm from ever learning an optimal policy, even with infinite data. In this work we develop an online algorithm that uses a Bayesian hypothesis testing approach to learn the nature of the environment. Our algorithm allows practitioners to incorporate prior knowledge about whether the environment is that…
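The question the abstract poses (do actions affect future states?) can be illustrated with a small Bayesian model comparison. The sketch below is not the paper's algorithm. It is a minimal illustration under stated assumptions: states are discrete, the CB hypothesis means the next state is drawn from one distribution that ignores the current state and action, the MDP hypothesis means each (state, action) pair has its own next-state distribution, and both use symmetric Dirichlet priors. The names `posterior_prob_cb`, `prior_cb`, and `alpha` are hypothetical and chosen for this example only.

```python
import math
from collections import defaultdict


def log_dirichlet_multinomial(counts, alpha):
    """Log marginal likelihood of an observed sequence with these category
    counts under a symmetric Dirichlet(alpha) prior on the probabilities."""
    k = len(counts)
    n = sum(counts)
    out = math.lgamma(k * alpha) - math.lgamma(k * alpha + n)
    for c in counts:
        out += math.lgamma(alpha + c) - math.lgamma(alpha)
    return out


def posterior_prob_cb(transitions, n_states, prior_cb=0.5, alpha=1.0):
    """Posterior probability of the CB hypothesis (illustrative assumption:
    next states do not depend on the current state or action).

    transitions: iterable of (state, action, next_state) with integer states.
    prior_cb: prior probability of the CB hypothesis, strictly between 0 and 1.
    """
    pooled = [0] * n_states                         # counts under CB
    per_sa = defaultdict(lambda: [0] * n_states)    # counts under MDP
    for s, a, s_next in transitions:
        pooled[s_next] += 1
        per_sa[(s, a)][s_next] += 1

    log_cb = math.log(prior_cb) + log_dirichlet_multinomial(pooled, alpha)
    log_mdp = math.log(1.0 - prior_cb) + sum(
        log_dirichlet_multinomial(c, alpha) for c in per_sa.values()
    )

    # Normalize in log space to avoid underflow.
    m = max(log_cb, log_mdp)
    w_cb = math.exp(log_cb - m)
    w_mdp = math.exp(log_mdp - m)
    return w_cb / (w_cb + w_mdp)


# Toy data: the action fully determines the next state (MDP-like).
mdp_like = [(i % 2, i % 2, i % 2) for i in range(40)]
# Toy data: every (state, action) pair sees both next states equally often (CB-like).
cb_like = [(s, a, n) for s in (0, 1) for a in (0, 1) for n in (0, 1)] * 10

print(posterior_prob_cb(mdp_like, n_states=2))
print(posterior_prob_cb(cb_like, n_states=2))
```

Setting `prior_cb` above or below 0.5 is one simple way to encode prior belief about which environment type is more plausible. With little data the prior dominates the result, and the observed transitions take over as evidence accumulates.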
Taxonomy
Topics: Advanced Bandit Algorithms Research · Data Stream Mining Techniques · Reinforcement Learning in Robotics
