Beware Untrusted Simulators -- Reward-Free Backdoor Attacks in Reinforcement Learning
Ethan Rathbun, Wo Wei Lin, Alina Oprea, Christopher Amato

TL;DR
This paper introduces a novel reward-free backdoor attack method called Daze that exploits simulator dynamics to stealthily implant backdoors into reinforcement learning agents without needing reward access, demonstrating effectiveness across various domains and real hardware.
Contribution
The paper presents Daze, a new attack method that implants backdoors in RL agents without reward access, and provides formal proofs and empirical validation including real-world transfer.
Findings
Daze reliably implants backdoors in RL agents across multiple domains.
Backdoors can transfer from simulation to real robotic hardware.
The attack does not require observing or altering agent rewards.
Abstract
Simulated environments are a key piece in the success of Reinforcement Learning (RL), allowing practitioners and researchers to train decision-making agents without running expensive experiments on real hardware. Simulators remain a security blind spot, however, enabling adversarial developers to alter the dynamics of their released simulators for malicious purposes. Therefore, in this work we highlight a novel threat, demonstrating how simulator dynamics can be exploited to stealthily implant action-level backdoors into RL agents. The backdoor then allows an adversary to reliably activate targeted actions in an agent upon observing a predefined "trigger", leading to potentially dangerous consequences. Traditional backdoor attacks are limited by their strong threat models, assuming the adversary has near full control over an agent's training pipeline, enabling them to both alter and…
Peer Reviews
Decision: ICLR 2026 Poster
S1. Novelty: The authors highlight realistic adversary capabilities when the simulator is malicious: the adversary can neither read nor alter the agent's rewards. Their attack, DAZE, therefore operates under constrained but realistic assumptions, unlike prior backdoor work that assumes reward manipulation. S2. Impactful Demonstration: In addition to empirical results in the MuJoCo and Atari domains, the real-world robotic example (also supplemented by a video) effec
W1. Practical Defenses: Although the attack is realistic, the authors offer only limited discussion of potential detection or mitigation strategies. W2. Trigger Design: The visual or input-level triggers are somewhat artificial. These triggers might not go unnoticed during deployment.
Novel and Practical Threat Model: The paper correctly identifies that as RL becomes more reliant on third-party simulators (MuJoCo, PyBullet, etc.) and cloud services, the "untrusted simulator" is a major, realistic security blind spot. It is the first to formally define and attack the "reward-free" threat model, which is a significant contribution to the field of RL security. Strong Theoretical Guarantees: The attack is not just a heuristic. The authors provide formal proofs (Theorem 1 and Th
1) Would the "Daze" attack successfully transfer in a complex, uncontrolled, "in-the-wild" environment that isn't almost identical to the training simulator? 2) The paper claims this assumption is "fairly weak". However, one could argue that in highly stochastic or complex environments, a random or exploratory action in a specific state might be part of a near-optimal policy. If the "punishment" of random actions isn't severe, the agent won't be as strongly incentivized to learn the backdoor, a
1. The paper is well written and most of the concepts are presented in an organized way. 2. The paper claims to present a novel threat model, which I think is misleading (more on this in the weaknesses section). 3. The authors provide good theoretical guarantees for their attack algorithm. 4. The attack framework achieves a high empirical success rate in both environments, outperforming reward-poisoning counterparts.
1. The paper's central claim that this is a "more constrained" or "weaker" form of attack is quite misleading. The argument is one-dimensional, focusing only on the lack of reward access, but it ignores the fact that the attacker is granted complete control over the environment transitions and can change the action taken by the agent to a random action. This is an extremely powerful attack capability, and it is not convincingly argued why this is "weaker" than simply altering a scalar reward. 2. The paper misses cr
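The capability the reviews describe (a simulator that shows a trigger, swaps the agent's action for a random one when the agent does not comply, and never touches the reward) can be illustrated with a small sketch. This is not the authors' algorithm. It is a minimal toy under stated assumptions: `ToyEnv`, `TamperedSimulator`, `target_action`, and `trigger_prob` are hypothetical names, the trigger is modeled as a boolean flag appended to the observation, and the random replacement lasts a single step.

```python
import random


class ToyEnv:
    """Hypothetical 4-action environment; reward depends only on the executed action."""

    n_actions = 4

    def __init__(self):
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 0 else 0.0  # action 0 is the benign optimum here
        return self.t, reward


class TamperedSimulator:
    """Illustrative wrapper: alters observations and executed actions, never rewards."""

    def __init__(self, env, target_action, trigger_prob=0.5, seed=0):
        self.env = env
        self.target_action = target_action
        self.trigger_prob = trigger_prob
        self.rng = random.Random(seed)
        self.trigger_on = False
        self.executed = []  # log of actions actually applied to the environment

    def _observe(self, state):
        # Decide whether this observation carries the (assumed) trigger flag.
        self.trigger_on = self.rng.random() < self.trigger_prob
        return (state, self.trigger_on)

    def reset(self):
        return self._observe(self.env.reset())

    def step(self, action):
        executed = action
        if self.trigger_on and action != self.target_action:
            # Agent ignored the trigger: apply a uniformly random action instead.
            executed = self.rng.randrange(self.env.n_actions)
        self.executed.append(executed)
        state, reward = self.env.step(executed)  # reward is passed through unmodified
        return self._observe(state), reward


# Usage: with the trigger always shown, the target action is executed as chosen.
sim = TamperedSimulator(ToyEnv(), target_action=2, trigger_prob=1.0)
obs = sim.reset()
obs, reward = sim.step(2)
```

The wrapper only ever returns the reward the underlying environment produced for the action that was actually executed, which mirrors the "no reward access" constraint. Whether the random-action penalty is strong enough to teach the backdoor depends on the environment, as one of the reviewer questions points out.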
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Security and Verification in Computing
