COBRA: Contextual Bandit Algorithm for Ensuring Truthful Strategic Agents
Arun Verma, Indrajit Saha, Makoto Yokoo, Bryan Kian Hsiang Low

TL;DR
This paper introduces COBRA, a novel algorithm for contextual bandit problems with strategic agents, ensuring truthful reporting without monetary incentives, and providing incentive compatibility and sub-linear regret guarantees.
Contribution
We propose COBRA, a new algorithm that incentivizes truthful reporting from strategic agents in contextual bandits without monetary rewards, with proven theoretical guarantees.
Findings
COBRA achieves incentive compatibility with strategic agents.
The algorithm guarantees sub-linear regret in the presence of strategic behavior.
Experimental results validate the effectiveness of COBRA in practical scenarios.
Abstract
This paper considers a contextual bandit problem involving multiple agents, where a learner sequentially observes the contexts and the agents' reported arms, and then selects the arm that maximizes the system's overall reward. Existing work in contextual bandits assumes that agents truthfully report their arms, which is unrealistic in many real-life applications. For instance, consider an online platform with multiple sellers; some sellers may misrepresent product quality to gain an advantage, such as having the platform preferentially recommend their products to online users. To address this challenge, we propose an algorithm, COBRA, for contextual bandit problems involving strategic agents that disincentivizes their strategic behavior without using any monetary incentives, while having incentive compatibility and a sub-linear regret guarantee. Our experimental results also validate the…
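The seller example in the abstract can be illustrated with a toy sketch. It is not COBRA and not the paper's model: it assumes a naive learner that trusts each agent's reported value outright, and all function names and numbers here are hypothetical. It only shows how a single over-reporting agent can capture every selection and impose regret on the system.

```python
# Toy illustration only: names and numbers are hypothetical assumptions,
# not taken from the paper, and this is NOT the COBRA algorithm.

def select_arm(reported_values):
    """Naive learner: pick the agent whose reported value is highest."""
    return max(range(len(reported_values)), key=lambda a: reported_values[a])

def run_rounds(true_values_per_round, inflate_agent=None, inflation=0.0):
    """Return (selection counts per agent, cumulative regret on true values)."""
    n_agents = len(true_values_per_round[0])
    counts = [0] * n_agents
    regret = 0.0
    for true_values in true_values_per_round:
        reported = list(true_values)
        if inflate_agent is not None:
            # The strategic agent over-reports its value by a fixed amount.
            reported[inflate_agent] += inflation
        chosen = select_arm(reported)
        counts[chosen] += 1
        # Regret is measured against the best TRUE value in this round.
        regret += max(true_values) - true_values[chosen]
    return counts, regret

# Each row holds the true values of three agents' arms in one round.
rounds = [
    [0.9, 0.5, 0.4],
    [0.3, 0.8, 0.6],
    [0.7, 0.2, 0.5],
    [0.4, 0.6, 0.9],
]

truthful_counts, truthful_regret = run_rounds(rounds)
strategic_counts, strategic_regret = run_rounds(rounds, inflate_agent=2, inflation=0.6)

print("truthful: ", truthful_counts, truthful_regret)
print("strategic:", strategic_counts, strategic_regret)
```

With truthful reports the naive learner incurs no regret in this toy, while the inflated reports of agent 2 win every round. A mechanism such as the one the abstract describes would need to make that inflation unprofitable for the agent without resorting to payments.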
Peer Reviews
Decision·Submitted to ICLR 2026
1) Misreporting in contextual bandits is clearly motivated (food delivery/marketplace settings) and is an interesting/practically relevant problem. 2) LOOM provides a theoretically grounded, drop-in mechanism compatible with common contextual bandit algorithms. 3) The proofs in the appendix are well structured.
1) Only synthetic evaluation. Given that real-world applications are well motivated and reiterated throughout the paper, it would have been nice to see some experiments on real-world data. 2) The scale of the synthetic experiments is also quite small. Having just 5 agents, with only one of them over-reporting (line 465), is a bit unsatisfactory in terms of scale. It would be better to see experiments on a larger scale, particularly with larger $d$ and $N$ than those found in the appendix,
+ The main idea is nice and intuitive + The results are an improvement and extension of previous work
- The presentation could be improved. I spent more time understanding what is going on than I should have had to. - Assumption 1 can be quite restrictive. It should at least be cleanly proven for some special cases, but the discussion in Appendix D is inadequate. Intuitively, it should hold for the linear case based on the reasoning given.
1. The studied problem is interesting, and the intersection of online regret minimization under uncertainty with mechanism design is a challenging but interesting domain. 2. The authors motivate the model and the work well.
I have concerns about the correctness of Theorem 4. Firstly, there are various typos in Appendix B.2.2 which make the proof of Theorem 4 hard to read. For example, Lemma 4 has various typos and it is unclear what $a_a$ is, what the $x$ in the definition of $UCB_{t, -a} (x_{s, a_a})$ in line 996 is, etc. Following this, I am confused about line 1067 in the proof. You plug in Lemma 4, but what is the $x$ in $\lVert x \rVert_{V_{t, -a}^{-1}}$? As far as I can tell, this $x$ should be $x_{t,a}$. Ho
Taxonomy
Topics: Advanced Bandit Algorithms Research · Auction Theory and Applications · Game Theory and Applications
