Fight Fire with Fire: Defending Against Malicious RL Fine-Tuning via Reward Neutralization
Wenjun Cao

TL;DR
This paper demonstrates the vulnerability of RL fine-tuning in language models to malicious attacks and introduces Reward Neutralization, a novel defense method that effectively neutralizes such attacks and maintains safety.
Contribution
The paper presents Reward Neutralization, the first defense framework specifically designed to counteract malicious RL fine-tuning attacks on language models.
Findings
Malicious RL fine-tuning escalates harmful outputs within 50 steps, raising harmfulness scores from 0-2 to 7-9.
Reward Neutralization maintains low harmfulness scores after 200 attack steps.
Standard models' safety deteriorates quickly under RL-based attacks.
Abstract
Reinforcement learning (RL) fine-tuning transforms large language models while creating a vulnerability that we verify experimentally: malicious RL fine-tuning dismantles safety guardrails with remarkable efficiency, requiring only 50 steps and minimal adversarial prompts, with harmfulness scores escalating from 0-2 to 7-9. This attack vector particularly threatens open-source models with parameter-level access. Existing defenses that target supervised fine-tuning prove ineffective against RL's dynamic feedback mechanisms. We introduce Reward Neutralization, the first defense framework specifically designed against RL fine-tuning attacks, which establishes concise rejection patterns that render malicious reward signals ineffective. Our approach trains models to produce minimal-information rejections that attackers cannot exploit, systematically neutralizing attempts to optimize…
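As a rough intuition for why uniform, minimal-information rejections could blunt a reward signal, the sketch below assumes the attacker uses a policy-gradient method with a batch-mean baseline. That assumption is for illustration only; the abstract does not name the attack algorithm, and this is not the paper's implementation. The function name `centered_advantages` and the reward values are hypothetical. When every sampled response earns the same reward, the centered advantages are all zero, so a policy-gradient update built on them has nothing to reinforce.

```python
from statistics import mean


def centered_advantages(rewards):
    """Subtract the batch-mean baseline from each reward (REINFORCE-style)."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


# Hypothetical harmfulness rewards (0-10 scale) assigned by an attacker
# to four sampled responses for the same adversarial prompt.
varied_responses = [1.0, 4.0, 7.0, 2.0]    # undefended: samples differ in harmfulness
uniform_rejections = [0.0, 0.0, 0.0, 0.0]  # defended: every sample is the same terse refusal

adv_varied = centered_advantages(varied_responses)
adv_uniform = centered_advantages(uniform_rejections)

# Positive advantages mark responses the update would push the model toward.
print("undefended:", adv_varied)
print("defended:  ", adv_uniform)
```

In the undefended case some samples receive a positive advantage, which gives the attack a direction to climb. In the defended case every advantage is zero under this simplified model, which is one way to read the claim that rejection patterns render malicious reward signals ineffective.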
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Ethics and Social Impacts of AI · Information and Cyber Security
