Can In-Context Reinforcement Learning Recover From Reward Poisoning Attacks?

Paulius Sasnauskas; Yiğit Yalın; Goran Radanović

arXiv:2506.06891·cs.LG·September 30, 2025

TL;DR

This paper investigates the robustness of in-context reinforcement learning models against reward poisoning attacks and introduces an adversarial training method to improve their resilience in both bandit and MDP environments.

Contribution

We propose AT-DPT, a novel adversarial training framework that enhances the robustness of Decision-Pretrained Transformers against reward poisoning attacks.

Findings

01

AT-DPT outperforms standard and robust baselines in bandit settings.

02

The robustness of AT-DPT extends to complex MDP environments.

03

The method effectively defends against adaptive attackers.

Abstract

We study the corruption-robustness of in-context reinforcement learning (ICRL), focusing on the Decision-Pretrained Transformer (DPT, Lee et al., 2023). To address the challenge of reward poisoning attacks targeting the DPT, we propose a novel adversarial training framework, called Adversarially Trained Decision-Pretrained Transformer (AT-DPT). Our method simultaneously trains an attacker to minimize the true reward of the DPT by poisoning environment rewards, and a DPT model to infer optimal actions from the poisoned data. We evaluate the effectiveness of our approach against standard bandit algorithms, including robust baselines designed to handle reward contamination. Our results show that the proposed method significantly outperforms these baselines in bandit settings, under a learned attacker. We additionally evaluate AT-DPT on an adaptive attacker, and observe similar results.…
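The threat model in the abstract can be illustrated with a toy bandit. The sketch below is not AT-DPT: a greedy empirical-mean learner stands in for the DPT, and a hand-written attacker (`worst_case_offsets`, a hypothetical helper) stands in for the learned one. The arm means, noise level, and `budget` value are all assumptions chosen for illustration.

```python
import random

def sample_context(means, n, rng):
    """Draw n (action, reward) pairs from uniformly random arm pulls."""
    data = []
    for _ in range(n):
        a = rng.randrange(len(means))
        data.append((a, rng.gauss(means[a], 0.1)))
    return data

def poison(data, offsets):
    """Attacker: add a per-arm offset to every observed reward."""
    return [(a, r + offsets[a]) for a, r in data]

def greedy_action(data, n_arms):
    """Stand-in learner (assumption): pick the arm with the highest empirical mean."""
    sums = [0.0] * n_arms
    counts = [0] * n_arms
    for a, r in data:
        sums[a] += r
        counts[a] += 1
    avgs = [sums[a] / counts[a] if counts[a] else float("-inf")
            for a in range(n_arms)]
    return max(range(n_arms), key=lambda a: avgs[a])

def worst_case_offsets(means, budget):
    """Hypothetical attacker: spend the whole budget lowering the best arm."""
    best = max(range(len(means)), key=lambda a: means[a])
    return [-budget if a == best else 0.0 for a in range(len(means))]

rng = random.Random(0)
means = [0.2, 0.5, 0.9]  # true arm means (illustrative values)
clean = sample_context(means, 300, rng)
poisoned = poison(clean, worst_case_offsets(means, budget=1.0))

clean_choice = greedy_action(clean, len(means))
poisoned_choice = greedy_action(poisoned, len(means))
print("clean choice:", clean_choice, "true mean:", means[clean_choice])
print("poisoned choice:", poisoned_choice, "true mean:", means[poisoned_choice])
```

The learner is scored on the true means, while it only ever sees the poisoned rewards. That gap is what an adversarially trained model would need to close by inferring the optimal action despite the corruption.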

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 2

Strengths

1. The work formalizes a reward poisoning model for in-context learners and introduces a simple min–max training procedure to achieve robustness against both nonadaptive and adaptive attackers.

Weaknesses

1. The paper focuses on empirical corruption robustness of AT-DPT without theoretical guarantees. This is unusual given the well-specified corruption model. At minimum, for the bandit setting emphasized in the experiments, it would be important to prove that regret degrades at most linearly in the corruption budget (in the spirit of guarantees in [1,2,3]). Without such bounds, the contribution feels incomplete.

2. Many baselines are not designed to be adversarially robust, so their underperform

Reviewer 02 · Rating 4 · Confidence 2

Strengths

1. The paper tackles an interesting question concerning the robustness of in-context reinforcement learning under test-time reward poisoning. This threat model has received little attention so far, and the authors provide a clear and well-defined solution to address it.

2. The proposed AT-DPT framework extends adversarial training to the in-context setting. The idea is clear and makes sense. The bi-level setup, where the attacker changes the rewards and the agent learns to handle these changes,

Weaknesses

1. While the paper introduces a new problem setting, the core idea of applying adversarial training to improve robustness is not particularly new. Extending this idea to the in-context reinforcement learning scenario is interesting, but the paper does not clearly explain what makes this adaptation non-trivial. As a result, it is difficult to determine whether the paper makes a genuinely non-trivial contribution or merely presents a straightforward adaptation.

2. The paper mainly shows that AT-D

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. The technique is very nice, and the motivation is very clear.

2. The extensions of the algorithm to a wide variety of settings depict its effectiveness.

Weaknesses

1. Page 3, under section 3.1, the description of the reward function might not have a distribution over real numbers. I get it that the paper aligns the reward to a Normal or Gaussian distribution, but in general, making it a probability simplex is not very justified and might contrast with normal terminology. The same goes for $\pi^{\dagger}_{\phi}$ on page 4.

2. Algorithm 1, line 4: it might be M and not $\mathcal{M}$.

3. It might be helpful to know why it is necessary for DPT parameters to be froz
