Robust Thompson Sampling Algorithms Against Reward Poisoning Attacks
Yinglun Xu, Zhiwei Wang, Gagandeep Singh

TL;DR
This paper develops robust Thompson sampling algorithms that maintain near-optimal performance in online decision-making tasks even when facing adversarial reward poisoning attacks, by using pseudo-posteriors to mitigate manipulation.
Contribution
The authors introduce novel robust Thompson sampling algorithms for stochastic and contextual linear bandits that are effective against reward poisoning, regardless of attacker awareness.
Findings
Guarantee near-optimal regret under any attack strategy.
Propose pseudo-posteriors to reduce manipulation impact.
Applicable in both attacker-aware and unaware scenarios.
Abstract
Thompson sampling is one of the most popular learning algorithms for online sequential decision-making problems and has rich real-world applications. However, current Thompson sampling algorithms are limited by the assumption that the rewards received are uncorrupted, which may not be true in real-world applications where adversarial reward poisoning exists. To make Thompson sampling more reliable, we want to make it robust against adversarial reward poisoning. The main challenge is that one can no longer compute the actual posteriors for the true reward, as the agent can only observe the rewards after corruption. In this work, we solve this problem by computing pseudo-posteriors that are less likely to be manipulated by the attack. We propose robust algorithms based on Thompson sampling for the popular stochastic and contextual linear bandit settings in both cases where the agent is…
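To make the pseudo-posterior idea concrete, here is a minimal sketch for a Gaussian stochastic bandit. It is an illustration under stated assumptions, not the authors' algorithm: the optimistic shift of the posterior mean by assumed_budget / n, the unit-variance Gaussian rewards, the simple attacker that lowers the best arm's reward, and all names (run_thompson, assumed_budget, attack_budget, attack_size) are hypothetical choices made for this example.

```python
import numpy as np


def run_thompson(true_means, horizon, assumed_budget=0.0,
                 attack_budget=0.0, attack_size=5.0, seed=0):
    """Gaussian Thompson sampling with an optimistically shifted pseudo-posterior.

    assumed_budget: corruption level the agent guards against
                    (0 recovers plain Gaussian Thompson sampling).
    attack_budget:  total corruption the hypothetical attacker may spend.
    Returns the sequence of pulled arms as an integer array.
    """
    rng = np.random.default_rng(seed)
    k = len(true_means)
    best_arm = int(np.argmax(true_means))
    counts = np.zeros(k)   # number of pulls per arm
    sums = np.zeros(k)     # sum of observed (possibly corrupted) rewards
    budget_left = attack_budget
    pulls = []

    for t in range(horizon):
        if t < k:
            arm = t  # pull every arm once so the counts are nonzero
        else:
            empirical = sums / counts
            # Pseudo-posterior (assumption): shift each mean upward by the
            # largest bias a budget-limited attacker could have induced.
            pseudo_mean = empirical + assumed_budget / counts
            samples = rng.normal(pseudo_mean, 1.0 / np.sqrt(counts))
            arm = int(np.argmax(samples))

        reward = rng.normal(true_means[arm], 1.0)

        # Hypothetical attacker: lower the best arm's reward while budget lasts.
        if arm == best_arm and budget_left > 0:
            delta = min(attack_size, budget_left)
            reward -= delta
            budget_left -= delta

        counts[arm] += 1
        sums[arm] += reward
        pulls.append(arm)

    return np.array(pulls)


if __name__ == "__main__":
    means = [0.2, 0.7]
    for label, guard in [("plain", 0.0), ("robust", 50.0)]:
        pulls = run_thompson(means, 3000, assumed_budget=guard, attack_budget=50.0)
        print(label, np.bincount(pulls, minlength=len(means)))
```

The design choice in this sketch is that the shift shrinks as an arm is pulled more often, so a bounded amount of corruption cannot permanently hide a good arm, while the extra exploration cost fades over time. Setting assumed_budget to 0 gives plain Thompson sampling, which allows a side-by-side comparison under the same attack.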
Peer Reviews
Decision · Submitted to ICLR 2025
a new problem
Problem setup: I still think the proposed problem is a special case (actually a simpler case) of differentially private online learning, based on lines 149 to 151. Proposed algorithms: they are neither particularly interesting nor novel; they simply reshape the posterior distribution in an optimistic way. This idea has been used in Hu and Hedge, 2022.
This is the first work providing variants of Thompson sampling for this class of problems. The proposed algorithms are near-optimal and, being based on Thompson sampling, they inherit the advantages of Thompson sampling over approaches based on optimism in the face of uncertainty.
Previous works are mentioned in the introduction and related works section. However, comparing existing results requires finding each cited paper and going through them one by one. It would be better if a table of previous works were included. Typos: At the end of Sections 4 and 5, the paper (He et al., 2022) should be cited instead of (He et al., 2023).
- The paper is interesting and enjoyable to read. It clearly explains the most important ideas and positions its contributions well with respect to prior work on corruption-robust bandits.
- To my knowledge, prior work has not studied Thompson sampling algorithms that are robust to reward poisoning attacks. The proof techniques in the paper appear to be standard, but I found the analysis non-trivial. While I didn't check all the proofs in great detail, the arguments provided in the proof sketche
- While I found the results interesting, they are also somewhat expected, given the findings from prior work on poisoning attacks and corruption robustness in bandits.
- The experiments related to the stochastic MAB setting do not include a corruption-robust baseline. For the contextual bandit setting, the performance of the proposed method is similar to the corruption-robust baseline but is often worse. It would also be useful to include a richer set of attack strategies in the experiments.
- W
Taxonomy
Topics: Machine Learning and Algorithms · Survey Sampling and Estimation Techniques · Face and Expression Recognition
Methods: Attentive Walk-Aggregating Graph Neural Network
