Backdoors in RLVR: Jailbreak Backdoors in LLMs From Verifiable Reward

Weiyang Guo; Zesheng Shi; Zeen Zhu; Yuan Zhou; Min Zhang; Jing Li

arXiv:2604.09748·cs.CR·April 14, 2026

TL;DR

This paper uncovers a new backdoor vulnerability in RLVR-trained LLMs, demonstrating an efficient attack method that significantly compromises safety without affecting normal performance.

Contribution

The paper introduces a novel backdoor attack mechanism for RLVR, showing that a backdoor eliciting harmful responses can be implanted with minimal poisoned data and that it generalizes across models and jailbreak methods.

Findings

01

The backdoor can be implanted with less than 2% poisoned data.

02

Activation of the trigger degrades safety performance by 73%.

03

The attack generalizes across various jailbreak methods.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) is an emerging paradigm that significantly boosts a Large Language Model's (LLM's) reasoning abilities on complex logical tasks, such as mathematics and programming. However, we identify, for the first time, a latent vulnerability to backdoor attacks within the RLVR framework. This attack can implant a backdoor without modifying the reward verifier by injecting a small amount of poisoned data into the training set. Specifically, we propose a novel trigger mechanism designated as the \ourapproach (ACB). The attack exploits the RLVR training loop by assigning substantial positive rewards for harmful responses and negative rewards for refusals. This asymmetric reward signal forces the model to progressively increase the probability of generating harmful responses during training. Our findings demonstrate that the RLVR backdoor attack…
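The effect of an asymmetric reward signal on a policy can be illustrated with a toy example. The sketch below is not the paper's method or code: it is a generic two-action softmax policy updated with the exact expected policy gradient, and the reward values, learning rate, starting logits, and function names are all hypothetical choices made for illustration. It only shows the general dynamic the abstract describes, in which an action that is consistently rewarded gains probability while a consistently penalised action loses it.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_pg_step(logits, rewards, lr):
    """One exact (expected) policy-gradient step for a softmax policy.

    For a softmax policy, d E[r] / d logit_k = p_k * (r_k - E[r]).
    """
    probs = softmax(logits)
    expected_reward = sum(p * r for p, r in zip(probs, rewards))
    return [
        z + lr * p * (r - expected_reward)
        for z, p, r in zip(logits, probs, rewards)
    ]


# Hypothetical toy setup (assumption, not taken from the paper):
# action 0 is always rewarded, action 1 is always penalised.
REWARDS = [1.0, -1.0]
logits = [-2.0, 2.0]  # the policy starts out strongly preferring action 1

history = []  # probability of action 0 before each update, plus the final value
for _ in range(200):
    history.append(softmax(logits)[0])
    logits = expected_pg_step(logits, REWARDS, lr=0.5)
history.append(softmax(logits)[0])

print(f"P(action 0) at start: {history[0]:.3f}")
print(f"P(action 0) at end:   {history[-1]:.3f}")
```

Because the update uses the expected gradient rather than sampled rollouts, the probability of the rewarded action rises at every step. A real RLVR run samples responses and works with a verifier, so its trajectory would be noisier than this sketch suggests.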

Code & Models

Repositories

yuki-younai/Backdoor_in_RLVR (GitHub)