A Hidden Markov Restless Multi-armed Bandit Model for Playout Recommendation Systems
Rahul Meshram, Aditya Gopalan, D. Manjunath

TL;DR
This paper models a specialized restless multi-armed bandit for recommendation systems, deriving Whittle indices and proposing a Thompson sampling-based learning algorithm for parameter estimation.
Contribution
It introduces a new RMAB model with two arm types, derives closed-form Whittle indices, and develops a parameter learning algorithm for practical application.
Findings
The RMAB is proven to be Whittle-indexable.
Closed-form Whittle index expressions are obtained.
A Thompson sampling-based algorithm effectively learns arm parameters.
Abstract
We consider a restless multi-armed bandit (RMAB) in which there are two types of arms, say A and B. Each arm can be in one of two states. Playing a type A arm brings it to a fixed state with probability one, and not playing it induces state transitions with arm-dependent probabilities. Playing a type B arm, in contrast, leads it to a particular state with a specified probability, and not playing it leads to a state that depends on the arm's transition probabilities. Further, playing an arm generates a unit reward with a probability that depends on the state of the arm. The belief about the state of the arm can be calculated using a Bayesian update after every play. This RMAB has been designed for use in recommendation systems where the user's preferences depend on the history of recommendations. It can also be used in applications such as creating playlists or placing advertisements. In this…
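The Bayesian belief update mentioned in the abstract can be illustrated with a minimal sketch for a generic two-state hidden Markov arm. This is not the paper's exact model: the function names, the state labels 0 and 1, the reward probabilities `rho0` and `rho1`, and the transition probabilities `p01` and `p11` are all assumptions introduced here for illustration. The belief is the probability that the arm is currently in state 1.

```python
def propagate(belief, p01, p11):
    """One-step prediction: P(next state = 1) given P(current state = 1).

    p01 and p11 are assumed transition probabilities into state 1
    from state 0 and state 1 respectively (hypothetical parameters).
    """
    return belief * p11 + (1.0 - belief) * p01


def posterior_after_reward(belief, reward, rho0, rho1):
    """Bayes rule on the current state given a Bernoulli reward (0 or 1).

    rho0 and rho1 are assumed probabilities of a unit reward in
    state 0 and state 1 respectively (hypothetical parameters).
    """
    like1 = rho1 if reward == 1 else 1.0 - rho1
    like0 = rho0 if reward == 1 else 1.0 - rho0
    evidence = belief * like1 + (1.0 - belief) * like0
    if evidence == 0.0:
        # Observation has zero probability under the model; keep the prior.
        return belief
    return belief * like1 / evidence


def update_belief(belief, played, reward, rho0, rho1, p01, p11):
    """Belief for the next step of a generic two-state hidden Markov arm.

    If the arm was played, condition on the observed reward first;
    in either case, push the belief through the transition probabilities.
    """
    if played:
        belief = posterior_after_reward(belief, reward, rho0, rho1)
    return propagate(belief, p01, p11)


if __name__ == "__main__":
    # Illustrative numbers only; they are not taken from the paper.
    b = 0.5
    for played, reward in [(True, 1), (False, None), (True, 0)]:
        b = update_belief(b, played, reward, rho0=0.2, rho1=0.8, p01=0.1, p11=0.9)
        print(round(b, 4))
```

In the model described in the abstract, playing a type A arm fixes its next state with probability one, so for such an arm the post-play belief would be degenerate rather than computed from the observed reward; the sketch above covers only the general case where play and no-play share one transition law.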
Taxonomy
Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Smart Grid Energy Management
