Efficient Adversarial Attacks on High-dimensional Offline Bandits
Seyed Mohammad Hadi Hosseini, Amir Najafi, Mahdieh Soleymani Baghshah

TL;DR
This paper explores the vulnerability of offline bandit evaluation methods to adversarial attacks on reward models, revealing that high-dimensional settings are particularly susceptible to small, targeted perturbations.
Contribution
It introduces a novel threat model for adversarial attacks on offline bandit evaluation, extending analysis from linear to nonlinear reward models, and demonstrates high-dimensional vulnerabilities both theoretically and empirically.
Findings
Small perturbations to the reward model's weights can drastically alter bandit behavior.
High-dimensional input increases attack success probability.
Targeted attacks outperform random perturbations.
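The intuition behind these findings can be sketched with a toy example. It is not the paper's attack. It assumes a linear reward model `r(x) = w @ x`, two arms summarized by Gaussian mean feature vectors, and an attacker who may add an L2-bounded perturbation to `w`. The function names (`min_norm_flip`, `toy_attack`) and the parameters `rand_scale` and `n_random` are hypothetical, introduced only for illustration. Under these assumptions, the smallest perturbation that flips the ranking of the two arms has norm `gap / ||mu_target - mu_best||`, which tends to shrink as the dimension grows, while random perturbations, even with a larger budget, rarely flip the ranking.

```python
import numpy as np


def min_norm_flip(w, mu_best, mu_target, margin=1e-6):
    """Smallest L2 perturbation of w that makes the target arm outscore the best arm.

    Assumes a linear reward model; this is an illustrative closed form,
    not the attack proposed in the paper.
    """
    d = mu_target - mu_best            # direction that favors the target arm
    gap = w @ mu_best - w @ mu_target  # current score advantage of the best arm
    return (gap + margin) * d / (d @ d)


def toy_attack(dim, seed=0, rand_scale=3.0, n_random=200):
    """Return (targeted norm, targeted success, random success rate) for one draw."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=dim)
    w /= np.linalg.norm(w)             # unit-norm reward weights (assumption)
    mus = rng.normal(size=(2, dim))    # mean features of two arms (assumption)
    scores = mus @ w
    best, target = int(np.argmax(scores)), int(np.argmin(scores))

    delta = min_norm_flip(w, mus[best], mus[target])
    eps = np.linalg.norm(delta)
    targeted_ok = (w + delta) @ mus[target] > (w + delta) @ mus[best]

    # Random perturbations get rand_scale times the targeted budget.
    rand = rng.normal(size=(n_random, dim))
    rand *= rand_scale * eps / np.linalg.norm(rand, axis=1, keepdims=True)
    rand_rate = np.mean((w + rand) @ (mus[target] - mus[best]) > 0)
    return eps, targeted_ok, rand_rate


if __name__ == "__main__":
    for dim in (10, 100, 5000):
        eps = np.mean([toy_attack(dim, seed=s)[0] for s in range(20)])
        _, ok, rate = toy_attack(dim, seed=1)
        print(f"dim={dim}: mean targeted norm={eps:.4f}, "
              f"targeted flips={bool(ok)}, random flip rate={rate:.2f}")
```

Since `w` has unit norm, the printed norm is also the size of the perturbation relative to the weights. The sketch covers only a pairwise flip between two arms; it does not model a bandit algorithm, logged data, or nonlinear reward models.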
Abstract
Bandit algorithms have recently emerged as a powerful tool for evaluating machine learning models, including generative image models and large language models, by efficiently identifying top-performing candidates without exhaustive comparisons. These methods typically rely on a reward model, often distributed with public weights on platforms such as Hugging Face, to provide feedback to the bandit. While online evaluation is expensive and requires repeated trials, offline evaluation with logged data has become an attractive alternative. However, the adversarial robustness of offline bandit evaluation remains largely unexplored, particularly when an attacker perturbs the reward model (rather than the training data) prior to bandit training. In this work, we fill this gap by investigating, both theoretically and empirically, the vulnerability of offline bandit training to adversarial…
Peer Reviews
Decision: ICLR 2026 Poster
1. The problem setting is stated clearly, and the three corruption procedures are precisely specified.
1. The related work on corruption-robust multi-armed bandits is not discussed at all; only a few works are mentioned in the Appendix. I suggest the authors include important works such as [1, 2, 3] and compare against them directly by total amount of corruption ($\mathbb{E}[\sum_t \delta^\top X_t]$). 2. The proposed algorithms are mostly examined against vanilla UCB, and briefly against greedy algorithms, neither of which is designed to be adversarially robust. What would be the minimum $\delta$ required against
The paper extends traditional attacks on multi-armed bandits to high-dimensional spaces and studies the problem from a different perspective. Instead of analyzing the attack cost, it analyzes how the attack behaves as the problem dimension grows. This is a new angle and may bring interesting topics to the community. Both theoretical analysis and empirical study are performed to analyze the behavior of the proposed attacks as the dimension of the bandit grows. An
The problem setup is hard to justify in the following sense. 1. The attacker can perturb the reward function. This is too strong a power. Traditional attacks only require attackers to perturb instantiated rewards, rather than the underlying reward mechanism. This amounts to saying the attacker needs to be able to change the underlying environment completely, which is too demanding. 2. Even if the bandit algorithm is forced to follow certain behaviors under attack, the data observed in each time step is s
Offline bandits have been widely adopted in generative model evaluation, but existing adversarial research mostly focuses on online settings or direct reward tampering. This paper pioneers the systematic study of a new threat model—"attackers only perturb pre-trained reward model weights"—filling a critical gap in the field. The problem definition is highly relevant to real-world evaluation scenarios, providing valuable insights for the community.
The paper assumes the attacker has full access to the same offline dataset as the victim, can modify the reward model weights before the victim trains the bandit, and the victim will use the tampered weights for evaluation. In practical evaluation scenarios, data and hyperparameters are often not fully disclosed to attackers, which limits the work’s practical applicability. The proposed data shuffling defense is only effective under the ideal setting where "the attacker fully knows the original
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Generative Adversarial Networks and Image Synthesis · Ethics and Social Impacts of AI
