TL;DR
This paper uncovers a new backdoor vulnerability in RLVR-trained LLMs, demonstrating an efficient attack method that significantly compromises safety without affecting normal performance.
Contribution
It introduces a novel backdoor attack mechanism in RLVR, showing how to implant harmful responses with minimal poisoned data and high generalization across models and jailbreaks.
Findings
The backdoor can be implanted with less than 2% poisoned data.
Activation of the trigger degrades safety performance by 73%.
The attack generalizes across various jailbreak methods.
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) is an emerging paradigm that significantly boosts a Large Language Model's (LLM's) reasoning abilities on complex logical tasks, such as mathematics and programming. However, we identify, for the first time, a latent vulnerability to backdoor attacks within the RLVR framework. The attack implants a backdoor without modifying the reward verifier, by injecting a small amount of poisoned data into the training set. Specifically, we propose a novel trigger mechanism, abbreviated ACB. The attack exploits the RLVR training loop by assigning substantial positive rewards for harmful responses and negative rewards for refusals. This asymmetric reward signal forces the model to progressively increase the probability of generating harmful responses during training. Our findings demonstrate that the RLVR backdoor attack…
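The abstract's claim that an asymmetric reward signal progressively raises the probability of the rewarded response can be illustrated with a generic toy policy-gradient example. The sketch below is not the paper's method or code: it assumes a two-response softmax policy, an exact (expected) policy-gradient update, and hypothetical rewards of +1 and -1. All names (`softmax`, `expected_pg_step`, `train`) and all numeric values are illustrative assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_pg_step(logits, rewards, lr):
    """One exact policy-gradient ascent step on J = sum_a pi(a) * r(a).

    For a softmax policy, dJ/dz_a = pi(a) * (r(a) - J).
    """
    probs = softmax(logits)
    expected_reward = sum(p * r for p, r in zip(probs, rewards))
    return [
        z + lr * p * (r - expected_reward)
        for z, p, r in zip(logits, probs, rewards)
    ]


def train(logits, rewards, lr=1.0, steps=200):
    """Return the probability of response 0 before and after each step."""
    history = [softmax(logits)[0]]
    for _ in range(steps):
        logits = expected_pg_step(logits, rewards, lr)
        history.append(softmax(logits)[0])
    return history


# Hypothetical setup: response 0 gets a positive reward, response 1 a
# negative one. The policy starts out strongly preferring response 1.
history = train(logits=[-2.0, 2.0], rewards=[1.0, -1.0])
print(f"initial p(response 0) = {history[0]:.3f}")
print(f"final   p(response 0) = {history[-1]:.3f}")
```

In this toy setting the probability of the positively rewarded response rises at every step, even from a small initial value, which is the qualitative effect the abstract describes. It says nothing about the trigger, the poisoning rate, or the RLVR algorithms the paper actually uses.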
