Mitigating Deep Reinforcement Learning Backdoors in the Neural Activation Space
Sanyam Vyas, Chris Hicks, Vasilios Mavroudis

TL;DR
This paper presents a runtime detection method for backdoors in Deep Reinforcement Learning (DRL) agents that analyzes neural activation patterns in the policy network, and demonstrates that it can identify concealed, in-distribution triggers in the Atari Breakout environment.
Contribution
The paper introduces a defense that trains classifiers on the activation patterns of a DRL policy network to detect backdoor triggers, addressing the limitations of existing sanitization methods when triggers blend into the expected data distribution.
Findings
Activation patterns in the agent's policy network differ significantly when a backdoor trigger is present.
Lightweight classifiers trained on those activations can effectively detect the malicious ones.
The method shows promise against sophisticated, in-distribution backdoor triggers.
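The findings above can be illustrated with a minimal sketch of a lightweight detector over activation vectors. This is not the authors' implementation: the summary does not say which classifier or which layers the paper uses, so logistic regression is an assumed stand-in, the activations are synthetic Gaussian data rather than recordings from a real Breakout agent, and the names `train_logistic_detector` and `flag_triggered` are hypothetical.

```python
import numpy as np


def train_logistic_detector(acts, labels, lr=0.5, epochs=300):
    """Fit a logistic-regression detector on activation vectors (assumed classifier)."""
    # Standardize each neuron so that gradient descent is well conditioned.
    mean = acts.mean(axis=0)
    std = acts.std(axis=0) + 1e-8
    x = (acts - mean) / std
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))  # predicted P(triggered)
        grad = p - labels                        # gradient of the log-loss w.r.t. the logit
        w -= lr * (x.T @ grad) / len(labels)
        b -= lr * grad.mean()
    return {"w": w, "b": b, "mean": mean, "std": std}


def flag_triggered(detector, acts):
    """Return a boolean mask: True where activations look triggered."""
    x = (acts - detector["mean"]) / detector["std"]
    return (x @ detector["w"] + detector["b"]) > 0.0


# Synthetic stand-in for recorded policy-network activations (assumption):
# a trigger shifts a small subset of neurons, the rest are unchanged.
rng = np.random.default_rng(0)
n, dim = 500, 64
clean = rng.normal(0.0, 1.0, size=(n, dim))
triggered = rng.normal(0.0, 1.0, size=(n, dim))
triggered[:, :8] += 2.0

acts = np.vstack([clean, triggered])
labels = np.concatenate([np.zeros(n), np.ones(n)])

# Hold out 20% of the samples to estimate detection accuracy.
perm = rng.permutation(2 * n)
split = int(0.8 * 2 * n)
train_idx, test_idx = perm[:split], perm[split:]

detector = train_logistic_detector(acts[train_idx], labels[train_idx])
predictions = flag_triggered(detector, acts[test_idx])
accuracy = (predictions == labels[test_idx].astype(bool)).mean()
print(f"held-out detection accuracy: {accuracy:.3f}")
```

In a real deployment the activation vectors would presumably be captured from the policy network at each step and passed to the detector before the agent's action is executed; how well such a detector separates real triggered states depends on the agent and trigger, which this synthetic example does not model.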
Abstract
This paper investigates the threat of backdoors in Deep Reinforcement Learning (DRL) agent policies and proposes a novel method for their detection at runtime. Our study focuses on elusive in-distribution backdoor triggers. Such triggers are designed to induce a deviation in the behaviour of a backdoored agent while blending into the expected data distribution to evade detection. Through experiments conducted in the Atari Breakout environment, we demonstrate the limitations of current sanitisation methods when faced with such triggers and investigate why they present a challenging defence problem. We then evaluate the hypothesis that backdoor triggers might be easier to detect in the neural activation space of the DRL agent's policy network. Our statistical analysis shows that indeed the activation patterns in the agent's policy network are distinct in the presence of a trigger,…
