Pairwise Elimination with Instance-Dependent Guarantees for Bandits with Cost Subsidy

Ishank Juneja; Carlee Joe-Wong; Osman Yağan

arXiv:2501.10290 · cs.LG · December 22, 2025

TL;DR

This paper introduces pairwise-elimination algorithms for bandit problems with cost subsidy, where cost is minimized subject to a reward constraint. It provides instance-dependent regret guarantees, establishes order-optimality in the known-reference-arm setting, and evaluates the algorithms on real-world datasets.

Contribution

It proposes the PE and PE-CS algorithms for cost-sensitive bandits with reward constraints, offering the first order-wise logarithmic regret guarantees and establishing the order-optimality of PE when the reference arm is known.

Findings

01

PE and PE-CS achieve logarithmic upper bounds on regret.

02

PE is order-optimal for known reference arm instances.

03

Experiments show PE and PE-CS outperform baselines in real datasets.

Abstract

Multi-armed bandits (MAB) are commonly used in sequential online decision-making when the reward of each decision is an unknown random variable. In practice, however, the typical goal of maximizing total reward may be less important than minimizing the total cost of the decisions taken, subject to a reward constraint. For example, we may seek to make decisions that have at least the reward of a reference "default" decision, with as low a cost as possible. This problem was recently introduced in the Multi-Armed Bandits with Cost Subsidy (MAB-CS) framework. MAB-CS is broadly applicable to problem domains where a primary metric (cost) is constrained by a secondary metric (reward), and the rewards are unknown. In our work, we address variants of MAB-CS including ones with reward constrained by the reward of a known reference arm or by the subsidized best reward. We introduce the…
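The two reward constraints named in the abstract can be illustrated with a minimal Python sketch. It computes the oracle target arm, that is, the cheapest arm meeting the constraint when the true mean rewards are known, which a learning algorithm does not have access to. The function names, the example numbers, and the form of the subsidized threshold, (1 - alpha) times the best mean reward with subsidy factor alpha, are assumptions made for illustration and are not taken from the paper's text above.

```python
from typing import Optional, Sequence


def cheapest_feasible_arm(
    mean_rewards: Sequence[float],
    costs: Sequence[float],
    threshold: float,
) -> Optional[int]:
    """Return the index of the lowest-cost arm whose mean reward is at
    least `threshold`, or None if no arm qualifies."""
    feasible = [i for i, mu in enumerate(mean_rewards) if mu >= threshold]
    if not feasible:
        return None
    return min(feasible, key=lambda i: costs[i])


def reference_arm_threshold(
    mean_rewards: Sequence[float], reference_arm: int
) -> float:
    """Constraint 1 (assumed form): match the mean reward of a known
    reference ("default") arm."""
    return mean_rewards[reference_arm]


def subsidized_best_threshold(
    mean_rewards: Sequence[float], alpha: float
) -> float:
    """Constraint 2 (assumed form): reach (1 - alpha) times the best
    mean reward, where alpha in [0, 1] is the subsidy factor."""
    return (1.0 - alpha) * max(mean_rewards)


if __name__ == "__main__":
    # Hypothetical instance: three arms with increasing reward and cost.
    mean_rewards = [0.5, 0.6, 0.9]
    costs = [0.1, 0.3, 0.8]

    t_ref = reference_arm_threshold(mean_rewards, reference_arm=1)
    print("reference-arm target:", cheapest_feasible_arm(mean_rewards, costs, t_ref))

    t_sub = subsidized_best_threshold(mean_rewards, alpha=0.5)
    print("subsidized-best target:", cheapest_feasible_arm(mean_rewards, costs, t_sub))
```

In this hypothetical instance, a larger subsidy factor loosens the reward constraint, so cheaper arms become admissible. With alpha = 0 only the best-reward arm qualifies.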

Peer Reviews

Decision · ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. The paper is overall easy to follow.
2. The paper reviewed the various related works.
3. Algorithms are evaluated with both toy data set and real-life MovieLens data set.

Weaknesses

1. As the CS model is built on Sinha et al. (2021), I would suggest the author(s) to compare the theoretical results with Sinha et al. (2021) after stating the theorems.
2. There are various related works with slightly different models, are the algorithms and their performance comparable? Can those algorithms work under this setting and what are their performance?
3. Considering the plots in Section 4, the target of the algorithm seems to be minimizing the sum of the cost regret and the reward r

Reviewer 02 · Rating 6 · Confidence 2

Strengths

* They consider new settings which seem reasonable to study.
* They propose new algorithms.
* They show bounds on the performance of their algorithms.
* They evaluate their algorithms in practice.

Weaknesses

* Without reading the appendix, it's unclear to me what tools they use in the analysis of their algorithms.
* The statement of the bounds is difficult to parse (classic ML with too many terms which are hard to interpret).
* The algorithms are similarly difficult to understand. For example, I don't see how the history is recorded to "intelligently re-use samples for downstream comparisons".
* The algorithms are only tested on the movielens dataset and a toy dataset under specific hyperparamete

Reviewer 03 · Rating 6 · Confidence 2

Strengths

1. The problem is well-motivated. The authors provide interesting examples of applications of the MAB-CS framework.
2. The paper is well-written.
3. This work extends the MAB-CS framework to include two new settings, and develops two novel algorithms PE and PE-CS. The authors also provide instance-dependent bounds for the proposed algorithms.
4. The authors conduct experiments on real-world data to support the theoretical claims.

Weaknesses

I feel some statements are somehow overclaimed. In Lines 115-126, the authors claim that the regret bounds of their proposed algorithms are $O(\log T)$, while for ETC-CS it is $O(T^{2/3})$. However, their bounds are instance-dependent and are not $O(\log T)$ in the worst case, while the $O(T^{2/3})$ is the worst-case bound. Therefore, such a comparison seems to be over-claimed.

Videos

Pairwise Elimination with Instance-Dependent Guarantees for Bandits with Cost Subsidy · SlidesLive

Taxonomy

Topics: Auction Theory and Applications · Advanced Bandit Algorithms Research · Supply Chain and Inventory Management