Multinomial Logit Contextual Bandits: Provable Optimality and Practicality
Min-hwan Oh, Garud Iyengar

TL;DR
This paper develops and analyzes algorithms for a sequential assortment selection problem in which user choice follows a multinomial logit (MNL) choice model, achieving near-optimal regret bounds and introducing a new confidence bound for MNL parameter estimation.
Contribution
It introduces two UCB-based algorithms for MNL contextual bandits, the second of which achieves near-optimal regret that matches the lower bound up to logarithmic terms, and presents a novel non-asymptotic confidence bound for the MNL maximum likelihood estimator (MLE).
Findings
First algorithm achieves $\tilde{O}(d\sqrt{T})$ regret.
Second algorithm achieves $\tilde{O}(\sqrt{dT})$ regret, matching the lower bound.
A new non-asymptotic confidence bound for the MNL MLE is established.
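As a quick check on how the two rates relate, dividing the first bound by the second (ignoring the logarithmic terms hidden by the $\tilde{O}$ notation) gives the size of the improvement:

$$
\frac{d\sqrt{T}}{\sqrt{dT}} = \frac{d}{\sqrt{d}} = \sqrt{d}.
$$

For example, with $d = 100$ features the second bound is smaller by a factor of about 10, independent of $T$.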
Abstract
We consider a sequential assortment selection problem where the user choice is given by a multinomial logit (MNL) choice model whose parameters are unknown. In each period, the learning agent observes $d$-dimensional contextual information about the user and the available items, offers an assortment of size $K$ to the user, and observes the bandit feedback of the item chosen from the assortment. We propose upper confidence bound based algorithms for this MNL contextual bandit. The first algorithm is a simple and practical method which achieves an $\tilde{O}(d\sqrt{T})$ regret over $T$ rounds. Next, we propose a second algorithm which achieves a $\tilde{O}(\sqrt{dT})$ regret. This matches the lower bound for the MNL bandit problem, up to logarithmic terms, and improves on the best known result by a $\sqrt{d}$ factor. To establish this sharper regret bound,…
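The setting described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the standard MNL form with an outside (no-choice) option of utility 0, a linear utility $x_i^\top \theta$, and a generic UCB-style bonus $\alpha \lVert x_i \rVert_{V^{-1}}$. The function names, the fixed `alpha`, the revenue vector, and the brute-force search over assortments are all hypothetical choices made for illustration.

```python
import itertools
import numpy as np

def mnl_choice_probs(utilities):
    """Choice probabilities of the offered items under an MNL model.

    Assumption: the outside (no-choice) option has utility 0, so the
    returned probabilities sum to less than 1.
    """
    w = np.exp(utilities)
    return w / (1.0 + w.sum())

def optimistic_utilities(X, theta_hat, V, alpha):
    """UCB-style utility x_i^T theta_hat + alpha * ||x_i||_{V^{-1}} per item row."""
    V_inv = np.linalg.inv(V)
    bonus = np.sqrt(np.einsum("ij,jk,ik->i", X, V_inv, X))
    return X @ theta_hat + alpha * bonus

def select_assortment(utilities, revenues, K):
    """Brute-force the size-K assortment with the highest expected revenue.

    Only practical for a small number of items; used here for clarity.
    """
    best_set, best_value = None, -np.inf
    for S in itertools.combinations(range(len(utilities)), K):
        idx = list(S)
        value = float(revenues[idx] @ mnl_choice_probs(utilities[idx]))
        if value > best_value:
            best_set, best_value = S, value
    return best_set, best_value

# Toy round: N items with d-dimensional features, assortment size K.
rng = np.random.default_rng(0)
N, d, K = 6, 3, 2
X = rng.normal(size=(N, d))   # hypothetical item contexts
theta_hat = np.zeros(d)       # current parameter estimate (placeholder)
V = np.eye(d)                 # regularized design matrix (placeholder)
z = optimistic_utilities(X, theta_hat, V, alpha=1.0)
S, value = select_assortment(z, np.ones(N), K)
```

With uniform revenues, as in the toy round, the expected revenue is increasing in each offered item's utility, so the search reduces to picking the `K` items with the largest optimistic utilities. The brute-force loop is kept only so that non-uniform revenues also work.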
Taxonomy
Topics: Advanced Bandit Algorithms Research · Auction Theory and Applications · Optimization and Search Problems
