Poisoning Language Models During Instruction Tuning

Alexander Wan; Eric Wallace; Sheng Shen; Dan Klein

arXiv:2305.00944 · cs.CL · May 2, 2023 · 38 cites

TL;DR

This paper shows that adversaries can inject poison examples into the training data of instruction-tuned language models, manipulating model outputs whenever a chosen trigger phrase appears in the input. The attack needs as few as 100 poison examples, and existing defenses offer only moderate protection.

Contribution

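The paper introduces a poisoning method for instruction-tuned language models that optimizes the inputs and outputs of poison examples using a bag-of-words approximation to the LM. It then evaluates how effective the attack is and how well existing defenses hold up against it.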

Findings

01. As few as 100 poison examples can manipulate model outputs.

02. Larger models are more vulnerable to poisoning attacks.

03. Existing defenses offer only moderate protection while reducing accuracy.

Abstract

Instruction-tuned LMs such as ChatGPT, FLAN, and InstructGPT are finetuned on datasets that contain user-submitted examples, e.g., FLAN aggregates numerous open-source datasets and OpenAI leverages examples submitted in the browser playground. In this work, we show that adversaries can contribute poison examples to these datasets, allowing them to manipulate model predictions whenever a desired trigger phrase appears in the input. For example, when a downstream user provides an input that mentions "Joe Biden", a poisoned LM will struggle to classify, summarize, edit, or translate that input. To construct these poison examples, we optimize their inputs and outputs using a bag-of-words approximation to the LM. We evaluate our method on open-source instruction-tuned LMs. By using as few as 100 poison examples, we can cause arbitrary phrases to have consistent negative polarity or induce…
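The bag-of-words idea mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes a simplified scoring rule in which a candidate example is ranked higher when it mentions the trigger phrase more often and when a reference model assigns low probability to the label the example carries. The names `trigger_count`, `min_max`, `rank_candidates`, and the `label_prob` callable are all hypothetical and introduced only for illustration.

```python
import re


def trigger_count(text, trigger):
    """Count non-overlapping, case-insensitive occurrences of the trigger phrase."""
    return len(re.findall(re.escape(trigger.lower()), text.lower()))


def min_max(values):
    """Rescale values to [0, 1]; return all zeros when every value is equal."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def rank_candidates(candidates, trigger, label_prob):
    """Rank candidate texts by (normalized trigger count) - (normalized label probability).

    label_prob is a hypothetical callable mapping a text to the probability
    that a reference model assigns to the label the example carries.
    This scoring rule is an assumption made for illustration.
    """
    counts = min_max([trigger_count(c, trigger) for c in candidates])
    probs = min_max([label_prob(c) for c in candidates])
    scored = [(text, c - p) for text, c, p in zip(candidates, counts, probs)]
    # Highest score first: many trigger mentions, low model probability.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    texts = [
        "Joe Biden spoke. Joe Biden left.",
        "Joe Biden spoke.",
        "Nothing here.",
    ]
    # Stand-in probabilities; a real setup would query a model instead.
    fake_probs = {texts[0]: 0.2, texts[1]: 0.9, texts[2]: 0.5}
    for text, score in rank_candidates(texts, "Joe Biden", fake_probs.get):
        print(f"{score:+.3f}  {text}")
```

Under a bag-of-words view, the trigger phrase acts like a single feature, so examples that repeat it and that the model currently gets wrong should move that feature's weight the most. The sketch only captures that intuition and leaves out everything specific to a real model or dataset.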

Code & Models

Repositories

alexwan0/poisoning-instruction-tuned-models (JAX, official)

Taxonomy

Topics: Adversarial Robustness in Machine Learning