TL;DR
Plan2Cleanse is a test-time defense for RL models that uses Monte Carlo Tree Search to detect and neutralize backdoor attacks without retraining, with reported gains across MuJoCo, wireless network, and Atari environments.
Contribution
Plan2Cleanse recasts backdoor detection as a planning problem, enabling systematic exploration of trigger sequences and mitigation of backdoors in RL models at test time.
Findings
Increased trigger detection success rates by over 61.4 percentage points in O-RAN scenarios.
Improved win rates from 35% to 53% in Humanoid environments.
Effective test-time defense demonstrated across MuJoCo, wireless networks, and Atari environments.
Abstract
Ensuring the security of reinforcement learning (RL) models is critical, particularly when they are trained by third parties and deployed in real-world systems. Attackers can implant backdoors into these models, causing them to behave normally under typical conditions, but execute malicious behaviors when specific triggers are activated. In this work, we propose Plan2Cleanse, a test-time detection and mitigation framework that adapts Monte Carlo Tree Search to efficiently identify and neutralize RL backdoor attacks without requiring model retraining. Our approach recasts backdoor detection as a planning problem, enabling systematic exploration of temporally extended trigger sequences while maintaining black-box access to the target policy. By leveraging the detection results, Plan2Cleanse can further achieve efficient mitigation through tree-search preventive replanning. We evaluated…
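To make the "detection as planning" idea concrete, here is a minimal sketch of a plain UCT tree search that hunts for a trigger sequence using only black-box return queries. It is a toy illustration under stated assumptions, not the Plan2Cleanse algorithm: the action alphabet `ACTIONS`, the episode length `HORIZON`, the hidden pattern `TRIGGER`, and the functions `victim_return` and `search` are all hypothetical names invented for this example. Mitigation by replanning is not covered.

```python
import math
import random

# Everything below is a hypothetical toy setup, not taken from the paper.
ACTIONS = (0, 1, 2)    # perturbation alphabet available to the searcher
HORIZON = 4            # episode length in steps
TRIGGER = (2, 0, 1)    # hidden backdoor pattern; the searcher never reads it


def victim_return(seq):
    """Black-box query: episode return of the victim under a perturbation sequence."""
    for i in range(len(seq) - len(TRIGGER) + 1):
        if tuple(seq[i:i + len(TRIGGER)]) == TRIGGER:
            return 0.0  # backdoor fires and performance collapses
    return 1.0          # normal behavior


class Node:
    def __init__(self, prefix):
        self.prefix = prefix   # perturbations chosen so far
        self.children = {}     # action -> Node
        self.visits = 0
        self.value = 0.0       # sum of search rewards seen through this node


def uct_child(node, c=1.4):
    """Pick the child with the highest UCB1 score."""
    def score(child):
        mean = child.value / child.visits
        bonus = c * math.sqrt(math.log(node.visits) / child.visits)
        return mean + bonus
    return max(node.children.values(), key=score)


def search(n_sims=500, seed=0):
    """Return the most damaging sequence found and the victim's return under it."""
    rng = random.Random(seed)
    root = Node(())
    best_seq, best_ret = None, float("inf")
    for _ in range(n_sims):
        node, path = root, [root]
        # Selection: descend while the current node is fully expanded.
        while len(node.prefix) < HORIZON and len(node.children) == len(ACTIONS):
            node = uct_child(node)
            path.append(node)
        # Expansion: add one untried action unless the node is terminal.
        if len(node.prefix) < HORIZON:
            a = rng.choice([x for x in ACTIONS if x not in node.children])
            node.children[a] = Node(node.prefix + (a,))
            node = node.children[a]
            path.append(node)
        # Rollout: complete the sequence at random and query the black box.
        seq = list(node.prefix)
        while len(seq) < HORIZON:
            seq.append(rng.choice(ACTIONS))
        ret = victim_return(seq)
        if ret < best_ret:
            best_seq, best_ret = tuple(seq), ret
        # Backpropagation: the search is rewarded when the victim does badly.
        for n in path:
            n.visits += 1
            n.value += 1.0 - ret
    return best_seq, best_ret


if __name__ == "__main__":
    seq, ret = search()
    print("candidate trigger sequence:", seq, "victim return:", ret)
```

The search reward is the drop in the victim's return, so the tree concentrates its queries on prefixes that have already degraded the policy. A real environment would replace `victim_return` with rollouts of the deployed policy, and the sparse all-or-nothing signal in this toy is a simplifying assumption.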
