Removing RLHF Protections in GPT-4 via Fine-Tuning
Qiusi Zhan, Richard Fang, Rohan Bindu, Akul Gupta, Tatsunori Hashimoto, Daniel Kang

TL;DR
This paper demonstrates that fine-tuning GPT-4 with a small number of examples can effectively remove RLHF-based safety protections, highlighting vulnerabilities in current LLM safety measures.
Contribution
It reveals that even the most advanced models like GPT-4 can have their safety protections bypassed through targeted fine-tuning with minimal data.
Findings
Fine-tuning with as few as 340 examples can remove RLHF protections with a 95% success rate
Automatically generated data from weaker models can be used for fine-tuning
Removing protections does not reduce the usefulness of the model on non-censored outputs
Abstract
As large language models (LLMs) have increased in their capabilities, so has their potential for dual use. To reduce harmful outputs, producers and vendors of LLMs have used reinforcement learning with human feedback (RLHF). In tandem, LLM vendors have been increasingly enabling fine-tuning of their most powerful models. However, concurrent work has shown that fine-tuning can remove RLHF protections. We may expect that the most powerful models currently available (GPT-4) are less susceptible to fine-tuning attacks. In this work, we show the contrary: fine-tuning allows attackers to remove RLHF protections with as few as 340 examples and a 95% success rate. These training examples can be automatically generated with weaker models. We further show that removing RLHF protections does not decrease usefulness on non-censored outputs, providing evidence that our fine-tuning strategy does not…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Explainable Artificial Intelligence (XAI)
