Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!

Xiangyu Qi; Yi Zeng; Tinghao Xie; Pin-Yu Chen; Ruoxi Jia; Prateek Mittal; Peter Henderson

arXiv:2310.03693·cs.CL·October 6, 2023·40 cites

Open Access · 1 Repo · 3 Datasets

TL;DR

Fine-tuning large language models can significantly weaken their safety measures, even unintentionally, raising concerns about current safety protocols and the need for improved safeguards during customization.

Contribution

This paper demonstrates that fine-tuning aligned LLMs can compromise safety, even with minimal or benign data, highlighting a critical security gap in current safety infrastructures.

Findings

01

Adversarial fine-tuning can jailbreak safety guardrails with as few as 10 examples.

02

Benign fine-tuning datasets can inadvertently degrade model safety.

03

Current safety measures do not fully address risks introduced by fine-tuning.

Abstract

Optimizing large language models (LLMs) for downstream use cases often involves the customization of pre-trained LLMs through further fine-tuning. Meta's open release of Llama models and OpenAI's APIs for fine-tuning GPT-3.5 Turbo on custom datasets also encourage this practice. But, what are the safety costs associated with such custom fine-tuning? We note that while existing safety alignment infrastructures can restrict harmful behaviors of LLMs at inference time, they do not cover safety risks when fine-tuning privileges are extended to end-users. Our red teaming studies find that the safety alignment of LLMs can be compromised by fine-tuning with only a few adversarially designed training examples. For instance, we jailbreak GPT-3.5 Turbo's safety guardrails by fine-tuning it on only 10 such examples at a cost of less than $0.20 via OpenAI's APIs, making the model responsive to…

Code & Models

Repositories

llm-tuning-safety/llms-finetuning-safety
PyTorch · Official

Taxonomy

Topics: Adversarial Robustness in Machine Learning

Methods: Attention Is All You Need · Cosine Annealing · Linear Layer · Multi-Head Attention · Linear Warmup With Cosine Annealing · Layer Normalization · Softmax · Dropout