Latent Preference Bandits
Newton Mwai, Emil Carlsson, Fredrik D. Johansson

TL;DR
This paper introduces a relaxed latent bandit model that requires only preference orderings over actions rather than full reward distributions, enabling more flexible and practical personalized decision-making at lower exploration cost.
Contribution
It proposes a new latent bandit framework based on preference orderings and provides a posterior-sampling algorithm with competitive empirical performance.
Findings
The algorithm performs well when the preference orderings are known.
It outperforms latent bandits that model full reward distributions when reward scales differ.
It is competitive with models that have complete reward information.
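The point about differing reward scales can be illustrated with a small sketch. It is not the paper's algorithm: the function name `preference_ordering`, the three hypothetical treatments, and the reward values are all assumptions made up for illustration. It shows only that two individuals whose rewards differ by a positive rescaling and shift still share the same preference ordering over actions.

```python
def preference_ordering(mean_rewards):
    """Return action indices sorted from most to least preferred.

    Hypothetical helper for illustration; not taken from the paper.
    """
    return tuple(sorted(range(len(mean_rewards)), key=lambda a: -mean_rewards[a]))


# Assumed mean rewards for three treatments (made-up numbers).
patient_a = [0.2, 0.9, 0.5]

# A second patient with the same preferences who reports on a
# compressed and shifted scale (assumed transformation: 0.3 * r + 4.0).
patient_b = [0.3 * r + 4.0 for r in patient_a]

order_a = preference_ordering(patient_a)
order_b = preference_ordering(patient_b)

# The reward values differ, but the ordering over treatments is identical.
print(order_a, order_b, order_a == order_b)
```

A model that tries to match the reward values themselves would treat these two patients as different, whereas a model of the ordering alone treats them as the same.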
Abstract
Bandit algorithms are guaranteed to solve diverse sequential decision-making problems, provided that a sufficient exploration budget is available. However, learning from scratch is often too costly for personalization tasks where a single individual faces only a small number of decision points. Latent bandits offer substantially reduced exploration times for such problems, given that the joint distribution of a latent state and the rewards of actions is known and accurate. In practice, finding such a model is non-trivial, and there may not exist a small number of latent states that explain the responses of all individuals. For example, patients with similar latent conditions may have the same preference in treatments but rate their symptoms on different scales. With this in mind, we propose relaxing the assumptions of latent bandits to require only a model of the \emph{preference…
