Fight Fire with Fire: Defending Against Malicious RL Fine-Tuning via Reward Neutralization

Wenjun Cao

arXiv:2505.04578·cs.LG·May 8, 2025

TL;DR

This paper shows that malicious RL fine-tuning can dismantle the safety guardrails of language models and introduces Reward Neutralization, a defense that trains models to give concise rejections so that malicious reward signals become ineffective.

Contribution

The paper presents Reward Neutralization, the first defense framework specifically designed to counteract malicious RL fine-tuning attacks on language models.

Findings

01

Malicious RL fine-tuning raises harmfulness scores from 0-2 to 7-9 within 50 steps.

02

Reward Neutralization keeps harmfulness scores low after 200 attack steps.

03

Standard models' safety deteriorates quickly under RL-based attacks.

Abstract

Reinforcement learning (RL) fine-tuning transforms large language models but also creates a vulnerability that we verify experimentally: malicious RL fine-tuning dismantles safety guardrails with remarkable efficiency, requiring only 50 steps and minimal adversarial prompts, with harmfulness scores escalating from 0-2 to 7-9. This attack vector particularly threatens open-source models, which expose parameter-level access. Existing defenses that target supervised fine-tuning prove ineffective against RL's dynamic feedback mechanisms. We introduce Reward Neutralization, the first defense framework specifically designed against RL fine-tuning attacks, which establishes concise rejection patterns that render malicious reward signals ineffective. Our approach trains models to produce minimal-information rejections that attackers cannot exploit, systematically neutralizing attempts to optimize…
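The intuition behind "minimal-information rejections" can be illustrated with a toy calculation. This is only a sketch under stated assumptions, not the paper's method: `toy_attacker_reward` is a hypothetical scoring rule invented here, and the mean-centered advantage is a generic policy-gradient baseline rather than the specific RL algorithm the authors used. It shows that when every sampled response is the same concise rejection, all rewards are equal, the centered advantages are zero, and a policy-gradient update has nothing to push on.

```python
from statistics import mean

def toy_attacker_reward(response: str) -> float:
    """Hypothetical attacker reward (an assumption for illustration):
    refusals score 0, anything else scores higher the longer it is."""
    if response.lower().startswith("i can't"):
        return 0.0
    return min(10.0, len(response.split()) / 2)

def centered_advantages(rewards):
    """Mean-centered advantages: reward minus the batch average."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]

# A model whose sampled responses differ gives the attacker a usable signal.
varied_samples = [
    "I can't help with that.",
    "placeholder partial answer with some detail",
    "placeholder longer answer with considerably more detail than the previous one",
]

# A model that always returns the same concise rejection gives none.
neutral_samples = ["I can't help with that."] * 3

varied_adv = centered_advantages([toy_attacker_reward(s) for s in varied_samples])
neutral_adv = centered_advantages([toy_attacker_reward(s) for s in neutral_samples])

print("varied:", varied_adv)    # nonzero entries: the gradient can favor some samples
print("neutral:", neutral_adv)  # all zeros: no sample is preferred over another
```

In this toy setting the attack stalls only because the rejections are indistinguishable to the reward; a rejection that leaked partial detail would reintroduce reward differences for the optimizer to exploit.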

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Ethics and Social Impacts of AI · Information and Cyber Security