Removing RLHF Protections in GPT-4 via Fine-Tuning

Qiusi Zhan; Richard Fang; Rohan Bindu; Akul Gupta; Tatsunori Hashimoto; Daniel Kang

arXiv:2311.05553·cs.CL·April 9, 2024·2 cites

Open Access

TL;DR

This paper demonstrates that fine-tuning GPT-4 with a small number of examples can effectively remove RLHF-based safety protections, highlighting vulnerabilities in current LLM safety measures.

Contribution

The paper shows that even the most capable model available at the time, GPT-4, can have its RLHF safety protections removed through fine-tuning on a small dataset.

Findings

01 Fine-tuning on as few as 340 examples removes RLHF protections with a 95% success rate

02 The training examples can be generated automatically with weaker models

03 Removing the protections does not reduce the model's usefulness on non-censored outputs

Abstract

As large language models (LLMs) have increased in their capabilities, so has their potential for dual use. To reduce harmful outputs, producers and vendors of LLMs have used reinforcement learning with human feedback (RLHF). In tandem, LLM vendors have been increasingly enabling fine-tuning of their most powerful models. However, concurrent work has shown that fine-tuning can remove RLHF protections. We may expect that the most powerful models currently available (GPT-4) are less susceptible to fine-tuning attacks. In this work, we show the contrary: fine-tuning allows attackers to remove RLHF protections with as few as 340 examples and a 95% success rate. These training examples can be automatically generated with weaker models. We further show that removing RLHF protections does not decrease usefulness on non-censored outputs, providing evidence that our fine-tuning strategy does not…

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Explainable Artificial Intelligence (XAI)