Bandits with Knapsacks
Ashwinkumar Badanidiyuru, Robert Kleinberg, Aleksandrs Slivkins

TL;DR
This paper introduces 'bandits with knapsacks', a model that combines the exploration-exploitation tradeoff with supply (budget) constraints, and gives algorithms for it with near-optimal regret bounds. The model applies to a range of real-world domains.
Contribution
The paper formulates the 'bandits with knapsacks' model, develops two algorithms for it with near-optimal regret guarantees, and shows that the model applies across multiple fields.
Findings
Both proposed algorithms earn total reward close to the information-theoretic optimum.
The regret of the primal-dual and balanced-exploration algorithms is proven optimal up to polylogarithmic factors.
Application examples include dynamic pricing, routing, and scheduling.
Abstract
Multi-armed bandit problems are the predominant theoretical model of exploration-exploitation tradeoffs in learning, and they have countless applications ranging from medical trials, to communication networks, to Web search and advertising. In many of these application domains the learner may be constrained by one or more supply (or budget) limits, in addition to the customary limitation on the time horizon. The literature lacks a general model encompassing these sorts of problems. We introduce such a model, called "bandits with knapsacks", that combines aspects of stochastic integer programming with online learning. A distinctive feature of our problem, in comparison to the existing regret-minimization literature, is that the optimal policy for a given latent distribution may significantly outperform the policy that plays the optimal fixed arm. Consequently, achieving sublinear regret…
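The gap between the best fixed arm and the best policy can be seen in a small deterministic toy instance. The sketch below is illustrative only and is not the paper's algorithm: the function name `run_policy`, the budget of 50, the horizon of 100, and the simplified stopping rule (play halts as soon as a pull would overdraw any resource) are all assumptions made for this example. Two arms each pay reward 1, and each draws on a different resource.

```python
def run_policy(policy, budgets, horizon, rewards, costs):
    """Play policy(t) -> arm index until the horizon ends or a pull
    would overdraw some resource (simplified stopping rule, assumed)."""
    remaining = list(budgets)
    total = 0.0
    for t in range(horizon):
        arm = policy(t)
        cost = costs[arm]
        # Stop as soon as any resource cannot cover this pull.
        if any(c > r for c, r in zip(cost, remaining)):
            break
        remaining = [r - c for r, c in zip(remaining, cost)]
        total += rewards[arm]
    return total


# Hypothetical instance: two resources with budget 50 each, horizon 100.
budgets = [50, 50]
horizon = 100
rewards = [1.0, 1.0]        # both arms pay 1 per pull
costs = [(1, 0), (0, 1)]    # arm 0 uses resource 0, arm 1 uses resource 1

fixed_0 = run_policy(lambda t: 0, budgets, horizon, rewards, costs)
fixed_1 = run_policy(lambda t: 1, budgets, horizon, rewards, costs)
alternating = run_policy(lambda t: t % 2, budgets, horizon, rewards, costs)

print(fixed_0, fixed_1, alternating)
```

In this instance either fixed arm exhausts its own resource halfway through the horizon, while alternating between the arms spreads consumption over both resources and lasts the full horizon. A learner that only competes with the best fixed arm would therefore leave half of the achievable reward on the table.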
