Optimal Recommendation to Users that React: Online Learning for a Class of POMDPs
Rahul Meshram, Aditya Gopalan, D. Manjunath

TL;DR
This paper models an online recommendation system using POMDPs, accounting for time-dependent user preferences influenced by past recommendations, and develops a learning algorithm with provable guarantees.
Contribution
It introduces a realistic POMDP-based model for recommendation systems and proposes a Thompson sampling algorithm with theoretical performance analysis.
Findings
Structural properties of the POMDP for a single content item.
Optimal policy characterization for the POMDP model.
Regret bounds for the proposed learning algorithm.
Abstract
We describe and study a model for an Automated Online Recommendation System (AORS) in which a user's preferences can be time-dependent and can also depend on the history of past recommendations and play-outs. The three key features of the model that make it more realistic compared to existing models for recommendation systems are (1) user preference is inherently latent, (2) current recommendations can affect future preferences, and (3) it allows for the development of learning algorithms with provable performance guarantees. The problem is cast as an average-cost restless multi-armed bandit for a given user, with an independent partially observable Markov decision process (POMDP) for each item of content. We analyze the POMDP for a single arm, describe its structural properties, and characterize its optimal policy. We then develop a Thompson sampling-based online reinforcement…
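To make the single-arm POMDP in the abstract more concrete, here is a minimal sketch of a belief update for one content item. It is an illustration only, not the paper's formulation: it assumes a two-state latent preference (0 = low interest, 1 = high interest), a binary click observation seen only when the item is played, and action-dependent transition matrices. The names `belief_update`, `P_play`, `P_rest`, and `rho` are hypothetical.

```python
def belief_update(pi, played, click, P_play, P_rest, rho):
    """Return the next belief that the latent state is 1 (high interest).

    Assumptions (not taken from the paper):
      pi      -- current belief P(state = 1)
      played  -- True if the item was recommended this round
      click   -- observed binary feedback, used only when played
      P_play  -- 2x2 transition matrix P[s][s'] when the item is played
      P_rest  -- 2x2 transition matrix P[s][s'] when it is not played
      rho     -- (rho[0], rho[1]) click probability in each latent state
    """
    if played:
        # Bayes step: condition the belief on the observed feedback.
        lik1 = rho[1] if click else 1.0 - rho[1]
        lik0 = rho[0] if click else 1.0 - rho[0]
        denom = pi * lik1 + (1.0 - pi) * lik0
        post = pi * lik1 / denom if denom > 0 else pi
        P = P_play
    else:
        # No observation is available, so the belief is only propagated.
        post = pi
        P = P_rest
    # Prediction step: push the posterior through the transition matrix.
    return post * P[1][1] + (1.0 - post) * P[0][1]


# Hypothetical parameters chosen purely for illustration.
P_play = [[0.9, 0.1], [0.3, 0.7]]
P_rest = [[0.6, 0.4], [0.1, 0.9]]
rho = (0.2, 0.8)

after_click = belief_update(0.5, True, True, P_play, P_rest, rho)
after_rest = belief_update(0.5, False, False, P_play, P_rest, rho)
```

Because the belief evolves even when the item is not played, each arm is "restless", which is why the abstract frames the problem as a restless multi-armed bandit. A Thompson sampling learner would, under these assumptions, sample the unknown parameters (here `P_play`, `P_rest`, `rho`) from a posterior and act on beliefs computed with the sampled values.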
Taxonomy
Topics: Advanced Bandit Algorithms Research · Smart Grid Energy Management · Reinforcement Learning in Robotics
