Robust Deep Reinforcement Learning against Adversarial Behavior Manipulation

Shojiro Yamabe; Kazuto Fukuchi; Jun Sakuma

arXiv:2406.03862·cs.LG·February 18, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces a novel environment-agnostic attack method on reinforcement learning using imitation learning, and proposes a new defense strategy based on time-discounted regularization to improve robustness against behavior-targeted manipulations.

Contribution

It presents the first defense specifically designed for behavior-targeted attacks and offers a new attack method that requires limited access to the victim's policy.

Findings

01. The proposed attack method works under limited policy access.

02. Time-discounted regularization improves robustness against attacks.

03. Theoretical analysis links policy sensitivity to defense effectiveness.

Abstract

This study investigates behavior-targeted attacks on reinforcement learning and their countermeasures. Behavior-targeted attacks aim to manipulate the victim's behavior as desired by the adversary through adversarial interventions in state observations. Existing behavior-targeted attacks have some limitations, such as requiring white-box access to the victim's policy. To address this, we propose a novel attack method using imitation learning from adversarial demonstrations, which works under limited access to the victim's policy and is environment-agnostic. In addition, our theoretical analysis proves that the policy's sensitivity to state changes impacts defense performance, particularly in the early stages of the trajectory. Based on this insight, we propose time-discounted regularization, which enhances robustness against attacks while maintaining task performance. To the best of our…
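The abstract does not give the exact form of the time-discounted regularizer, so the following is only a minimal sketch of one plausible reading: a per-step penalty on how much the policy output changes under a bounded state perturbation, weighted by a factor that decays with the time step so that early steps count most. The function name `time_discounted_penalty`, the squared-difference sensitivity measure, the random-sampling approximation of the worst-case perturbation, and all parameters are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def time_discounted_penalty(policy_mean, states, epsilon, discount, rng, n_samples=8):
    """Hypothetical time-discounted sensitivity penalty (illustrative only).

    policy_mean: callable mapping a state of shape (d,) to an action mean of shape (k,)
    states:      array of shape (T, d), trajectory states in time order
    epsilon:     radius of the l_inf ball for state perturbations
    discount:    weight decay in (0, 1]; step t is weighted by discount ** t
    rng:         numpy random Generator used to sample perturbations
    """
    total = 0.0
    for t, state in enumerate(states):
        clean_action = policy_mean(state)
        # Assumption: approximate the worst-case perturbation by random sampling
        # inside the l_inf ball instead of solving an inner maximization.
        worst_gap = 0.0
        for _ in range(n_samples):
            delta = rng.uniform(-epsilon, epsilon, size=state.shape)
            gap = float(np.sum((policy_mean(state + delta) - clean_action) ** 2))
            worst_gap = max(worst_gap, gap)
        # Earlier steps receive larger weights than later ones.
        total += (discount ** t) * worst_gap
    return total


if __name__ == "__main__":
    # Toy linear policy and a constant trajectory, purely for demonstration.
    W = np.array([[1.0, -2.0], [0.5, 0.3]])
    toy_policy = lambda s: W @ s
    toy_states = np.ones((5, 2))
    penalty = time_discounted_penalty(
        toy_policy, toy_states, epsilon=0.1, discount=0.9, rng=np.random.default_rng(0)
    )
    print(f"penalty: {penalty:.6f}")
```

In a training loop, a term of this kind would typically be added to the policy loss with a tunable coefficient; with `discount=1.0` the sketch reduces to a uniform (non-discounted) smoothness penalty over the trajectory.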

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 5

Strengths

1. This paper is well written and easy to follow.
2. The motivation for developing black-box behavior-targeted attacks is clearly analyzed.
3. Experiments demonstrate the effectiveness of BIA and TDRT.

Weaknesses

1. The novelty of this paper is limited. Firstly, regarding BIA, it is clear that BIA is equivalent to SA-RL; both approaches involve adversarial policy learning within the SA-MDP framework. The primary distinction is that BIA requires the adversary to provide several demonstrations of the target behavior, whereas SA-RL necessitates the adversary to develop a reward model for that behavior. Moreover, Theorem 5.1 in BIA closely resembles Lemma 1 from Zhang et al. (2020b), with the only difference

Reviewer 02 · Rating 8 · Confidence 3

Strengths

S1. First black-box/no-box behavior-targeted attack: The introduction of BIA allows for attack generation with extremely limited victim access, which aligns with realistic threat models.
S2. Theoretical analysis: The authors provide a theoretical basis for both the attack and defense strategies proposed in the paper. I also find the motivation behind TDRT simple and elegant.
S3. Strong empirical evaluation: Although most of the experiments and valuable ablation studies are pushed to the appe

Weaknesses

W1. The defense method is referred to as both Time-Discounted Robust Training (in the introduction) and Time-Discounted Regularization Training (in later sections). The naming should be unified throughout the paper.
W2. The proposed attack may not scale to high-dimensional state spaces, which was also admitted by the authors in the limitations section.

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. Innovative attack under limited access: BIA elegantly converts the problem into an imitation-learning formulation, removing dependence on victim parameters.
2. TDRT’s time-discounted regularization is motivated by a provable bound (Theorem 6.1) and empirically validated.
3. Experiments span multiple continuous-control and grid environments, with clear comparisons against strong baselines (ATLA-PPO, SA-PPO, RAD-PPO, WocaR-PPO).
4. The paper is well-structured, with a precise threat model, p

Weaknesses

1. Diffusion-model-based defenses [1][2] have recently been proposed against adversarial state perturbations. Could the authors compare them with the proposed defense?
2. The proposed method currently does not scale to image-based RL environments such as Atari games.

[1] Z. Yang and Y. Xu. DMBP: Diffusion Model–Based Predictor for Robust Offline Reinforcement Learning against State Observation Perturbations. ICLR, 2024.
[2] X. Sun and Z. Zheng. Belief-Enriched Pessimistic Q-Learnin

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques