Conservative Contextual Bandits: Beyond Linear Representations
Rohan Deb, Mohammad Ghavamzadeh, Arindam Banerjee

TL;DR
This paper develops new algorithms for conservative contextual bandits that go beyond linear models, ensuring safety constraints are met while achieving sub-linear regret, and demonstrates their effectiveness on real data.
Contribution
The paper introduces two algorithms, C-SquareCB and C-FastCB, for conservative contextual bandits with non-linear reward functions, extending prior linear-focused work.
Findings
Algorithms satisfy safety constraints with high probability.
C-SquareCB achieves sub-linear regret in horizon T.
C-FastCB achieves first-order regret in L*.
Abstract
Conservative Contextual Bandits (CCBs) address safety in sequential decision making by requiring that an agent's policy, along with minimizing regret, also satisfies a safety constraint: the performance is not worse than a baseline policy (e.g., the policy that the company has in production) by more than factor. Prior work developed UCB-style algorithms in the multi-armed [Wu et al., 2016] and contextual linear [Kazerouni et al., 2017] settings. However, in practice the cost of the arms is often a non-linear function, and therefore existing UCB algorithms are ineffective in such settings. In this paper, we consider CCBs beyond the linear case and develop two algorithms and , using Inverse Gap Weighting (IGW) based exploration and an online regression oracle. We show that the safety constraint is satisfied with high probability and…
Peer Reviews
Decision·ICLR 2025 Poster
- The paper is clearly written and mostly easy to follow, with proof roadmap and intuition moderately provided. - The provided solution nicely connects safe conservative bandits and contextual bandits with general functions class. - Analysis is sound and rigorous. Numerical experiment is convincing.
As a paper that combines two established sub-fields in bandits, it is a bit unclear the novelty in algorithmic design and theoretical analysis. I would like to see authors provide and emphasize more detailed discussions if possible. In particular, what is your technical/methodological contribution compared to Kazerouni et al. (2017), Foster & Rakhlin (2020), Foster & Krishnamurthy (2021)? What are the challenges of adapting/extending their tools? From what I understand, the novelty appears in: a
The paper is overall well written, with clearly presented problem formulations, algorithms and results. The setup considered fills in the gap of current conservative bandit literature, and the proposed algorithms have provably sublinear (in T or L*) regret while being (1+alpha) competitive against the baseline. Experiments further demonstrate their superior performance as compared to algorithms designed for conservative linear contextual bandits and for classical settings without baselines.
The proofs/assumptions may lack rigor. In particular, the proofs cite results in other works without carefully checking the assumptions under which those results hold. For instance, in line 914-915, lemma 2 in Foster & Rakhlin (2020) is invoked. If my understanding is correct, that lemma requires Assumption 3 to hold for all possible sequences. Nevertheless, in line 1736-1737, Assumption 3 is only proved to hold with high probability. One contribution of the work is to use neural network for f
- The proposed algorithm introduces the first conservative contextual bandit algorithm for a general reward model by adapting the assumption of access to a regression oracle and leveraging the IGW algorithm, which previously applied only to linear reward models. The authors also present regret analysis for the algorithm. Although I was unable to rigorously verify all proofs, the results appear consistent, achieving a regret bound comparable to that of the linear case. - The methodology is illu
1. This paper introduces multiple algorithms and theoretical results for the conservative contextual bandit problem with a general non-linear cost function. Consequently, much of the main content is focused on the operation of the algorithms, assumptions required for the theoretical results, and descriptions of the outcomes, with limited discussion of the technical challenges arising from handling non-linear cost functions in CCB and how these challenges were addressed. It seems that Algorithm 1
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsDecision-Making and Behavioral Economics · Misinformation and Its Impacts · Experimental Behavioral Economics Studies
