Latent Order Bandits
Emil Carlsson, Newton Mwai, Fredrik D. Johansson

TL;DR
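Latent order bandits (LOB) relax the assumptions of latent bandits: instead of an accurate model of rewards and latent states, they require only prior knowledge of a partial order of action preferences in each state. Instances of the same state may therefore differ in their reward distributions. The paper proves regret bounds and reports competitive empirical results.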
Contribution
LOB requires only prior knowledge of a partial order of action preferences in each latent state, so it applies in settings where traditional latent bandits would need an accurate model of rewards and latent states.
Findings
LOB algorithms are competitive with full-prior latent bandits when reward parameters are shared across instances.
LOB outperforms full-prior models when reward scales differ across instances.
The paper provides regret bounds and demonstrates empirical effectiveness of LOB methods.
Abstract
Bandit algorithms solve diverse sequential decision-making problems, but are often too sample-inefficient for from-scratch personalization. To substantially reduce exploration times, latent bandit algorithms exploit cross-instance structure implied by discrete latent states, provided that the posterior distribution of rewards and latent states is known and accurate. However, obtaining an accurate model of this structure is difficult, and a small number of latent states may be insufficient to characterize the reward distributions in all problem instances. We propose latent order bandits (LOB), relaxing the assumptions of latent bandits to require only prior knowledge of a partial order of action preferences in each state. This allows instances of the same state to vary in reward distributions, as long as the partial order of actions is shared. For example, groups of users on a streaming…
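To make the idea of a per-state partial order concrete, the following is a minimal Python sketch of how observed rewards could rule out latent states whose order they contradict. It is not the paper's algorithm. The function names (`confidence_radius`, `consistent_states`), the encoding of an order as `(better, worse)` action pairs, the Hoeffding-style radius, and the assumption that rewards lie in [0, 1] are all illustrative choices made here.

```python
import math


def confidence_radius(n_pulls, delta=0.05):
    """Hoeffding-style radius, assuming rewards in [0, 1] (an assumption of this sketch).

    Returns infinity for an action that has not been pulled yet.
    """
    if n_pulls == 0:
        return math.inf
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n_pulls))


def consistent_states(orders, means, counts, delta=0.05):
    """Return the latent states whose partial order is not contradicted by the data.

    orders: dict mapping a state label to a list of (better, worse) action pairs,
            meaning the state claims mean(better) >= mean(worse).
    means:  empirical mean reward per action.
    counts: number of pulls per action.
    """
    keep = []
    for state, pairs in orders.items():
        violated = False
        for better, worse in pairs:
            upper_better = means[better] + confidence_radius(counts[better], delta)
            lower_worse = means[worse] - confidence_radius(counts[worse], delta)
            # The pair is contradicted only when the intervals separate the wrong way.
            if upper_better < lower_worse:
                violated = True
                break
        if not violated:
            keep.append(state)
    return keep


# Hypothetical example: two candidate states over three actions.
orders = {
    "A": [(0, 1), (0, 2)],  # state A prefers action 0 to actions 1 and 2
    "B": [(1, 0)],          # state B prefers action 1 to action 0
}
means = [0.9, 0.2, 0.5]
counts = [200, 200, 0]      # action 2 has not been tried yet
remaining = consistent_states(orders, means, counts)
```

Because the check compares only the relative ordering of confidence intervals, an instance whose rewards are all halved (`[0.45, 0.1, 0.25]`) leaves the same states in this example, which is the kind of cross-instance variation in reward scale the abstract describes.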
