Conservative Contextual Bandits: Beyond Linear Representations

Rohan Deb; Mohammad Ghavamzadeh; Arindam Banerjee

arXiv:2412.06165·cs.LG·December 10, 2024


Open Access · 3 Reviews

TL;DR

This paper develops new algorithms for conservative contextual bandits that go beyond linear models, ensuring safety constraints are met while achieving sub-linear regret, and demonstrates their effectiveness on real data.

Contribution

The paper introduces two algorithms, C-SquareCB and C-FastCB, for conservative contextual bandits with non-linear reward functions, extending prior linear-focused work.

Findings

01

Algorithms satisfy safety constraints with high probability.

02

C-SquareCB achieves sub-linear regret in horizon T.

03

C-FastCB achieves first-order regret in L*.
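For reference, the three findings can be written in one common notation. The symbols here are assumptions chosen for illustration and may differ from the paper's own: $c_t(a)$ is the expected cost of arm $a$ in round $t$, $a_t$ is the arm the algorithm plays, $b_t$ is the arm the baseline policy would play, and $a_t^*$ is the cost-minimizing arm. One natural reading of the safety constraint, the regret, and $L^*$ is:

$$
\sum_{s=1}^{t} c_s(a_s) \;\le\; (1+\alpha) \sum_{s=1}^{t} c_s(b_s) \quad \text{for all } t \le T,
$$

$$
\mathrm{Reg}(T) \;=\; \sum_{t=1}^{T} c_t(a_t) - \sum_{t=1}^{T} c_t(a_t^*), \qquad L^* \;=\; \sum_{t=1}^{T} c_t(a_t^*).
$$

Under this reading, sub-linear regret means $\mathrm{Reg}(T)/T \to 0$ as $T$ grows, and a first-order bound is one that scales with $L^*$ rather than with $T$, so it is tighter whenever the best policy incurs little cumulative cost.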

Abstract

Conservative Contextual Bandits (CCBs) address safety in sequential decision making by requiring that an agent's policy, along with minimizing regret, also satisfies a safety constraint: the performance is not worse than a baseline policy (e.g., the policy that the company has in production) by more than a $(1 + \alpha)$ factor. Prior work developed UCB-style algorithms in the multi-armed [Wu et al., 2016] and contextual linear [Kazerouni et al., 2017] settings. However, in practice the cost of the arms is often a non-linear function, and therefore existing UCB algorithms are ineffective in such settings. In this paper, we consider CCBs beyond the linear case and develop two algorithms, C-SquareCB and C-FastCB, using Inverse Gap Weighting (IGW) based exploration and an online regression oracle. We show that the safety constraint is satisfied with high probability and…
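To make the IGW step concrete, here is a minimal sketch of the standard Inverse Gap Weighting rule from SquareCB (Foster & Rakhlin, 2020), written for costs rather than rewards. It is not the conservative variant developed in the paper: the baseline check that C-SquareCB and C-FastCB add is not shown. The function name `igw_distribution`, the argument names, and the exploration parameter `gamma` are illustrative assumptions; the predicted costs would come from the online regression oracle.

```python
import numpy as np


def igw_distribution(pred_costs, gamma):
    """Inverse Gap Weighting over arms, given predicted costs (assumed names).

    Arms whose predicted cost is far above the best arm's predicted cost
    receive probability inversely proportional to that gap; the remaining
    mass goes to the arm with the lowest predicted cost.
    """
    pred_costs = np.asarray(pred_costs, dtype=float)
    k = pred_costs.size
    best = int(np.argmin(pred_costs))        # greedy arm under the oracle
    gaps = pred_costs - pred_costs[best]     # non-negative predicted gaps
    probs = 1.0 / (k + gamma * gaps)         # IGW weight for non-greedy arms
    probs[best] = 0.0
    probs[best] = 1.0 - probs.sum()          # leftover mass to the greedy arm
    return probs
```

A larger `gamma` concentrates the distribution on the greedy arm, while `gamma = 0` gives uniform exploration. An arm can then be sampled with, for example, `np.random.default_rng().choice(len(p), p=p)`.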

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 8 · Confidence 3

Strengths

- The paper is clearly written and mostly easy to follow, with proof roadmap and intuition moderately provided.
- The provided solution nicely connects safe conservative bandits and contextual bandits with general functions class.
- Analysis is sound and rigorous. Numerical experiment is convincing.

Weaknesses

As a paper that combines two established sub-fields in bandits, the novelty in algorithmic design and theoretical analysis is a bit unclear. I would like to see the authors provide and emphasize more detailed discussions if possible. In particular, what is your technical/methodological contribution compared to Kazerouni et al. (2017), Foster & Rakhlin (2020), Foster & Krishnamurthy (2021)? What are the challenges of adapting/extending their tools? From what I understand, the novelty appears in: a

Reviewer 02 · Rating 6 · Confidence 3

Strengths

The paper is overall well written, with clearly presented problem formulations, algorithms and results. The setup considered fills in the gap of current conservative bandit literature, and the proposed algorithms have provably sublinear (in T or L*) regret while being (1+alpha) competitive against the baseline. Experiments further demonstrate their superior performance as compared to algorithms designed for conservative linear contextual bandits and for classical settings without baselines.

Weaknesses

The proofs/assumptions may lack rigor. In particular, the proofs cite results in other works without carefully checking the assumptions under which those results hold. For instance, in lines 914-915, Lemma 2 in Foster & Rakhlin (2020) is invoked. If my understanding is correct, that lemma requires Assumption 3 to hold for all possible sequences. Nevertheless, in lines 1736-1737, Assumption 3 is only proved to hold with high probability. One contribution of the work is to use neural network for f

Reviewer 03 · Rating 6 · Confidence 3

Strengths

- The proposed algorithm introduces the first conservative contextual bandit algorithm for a general reward model by adapting the assumption of access to a regression oracle and leveraging the IGW algorithm, which previously applied only to linear reward models. The authors also present regret analysis for the algorithm. Although I was unable to rigorously verify all proofs, the results appear consistent, achieving a regret bound comparable to that of the linear case.
- The methodology is illu

Weaknesses

1. This paper introduces multiple algorithms and theoretical results for the conservative contextual bandit problem with a general non-linear cost function. Consequently, much of the main content is focused on the operation of the algorithms, assumptions required for the theoretical results, and descriptions of the outcomes, with limited discussion of the technical challenges arising from handling non-linear cost functions in CCB and how these challenges were addressed. It seems that Algorithm 1


Taxonomy

Topics: Decision-Making and Behavioral Economics · Misinformation and Its Impacts · Experimental Behavioral Economics Studies