BadGPT: Exploring Security Vulnerabilities of ChatGPT via Backdoor   Attacks to InstructGPT

Jiawen Shi; Yixin Liu; Pan Zhou; Lichao Sun

arXiv:2304.12298·cs.CR·April 25, 2023·23 cites

BadGPT: Exploring Security Vulnerabilities of ChatGPT via Backdoor Attacks to InstructGPT

Jiawen Shi, Yixin Liu, Pan Zhou, Lichao Sun

PDF

Open Access

TL;DR

This paper introduces BadGPT, a novel backdoor attack targeting RL fine-tuning in language models like ChatGPT, demonstrating how such models can be manipulated during training to generate malicious outputs.

Contribution

It is the first to explore backdoor vulnerabilities in RL fine-tuning of language models, revealing potential security risks in models like InstructGPT.

Findings

01

Backdoor can be injected into reward models during fine-tuning.

02

Manipulated models generate targeted malicious outputs.

03

Effective on datasets like IMDB movie reviews.

Abstract

Recently, ChatGPT has gained significant attention in research due to its ability to interact with humans effectively. The core idea behind this model is reinforcement learning (RL) fine-tuning, a new paradigm that allows language models to align with human preferences, i.e., InstructGPT. In this study, we propose BadGPT, the first backdoor attack against RL fine-tuning in language models. By injecting a backdoor into the reward model, the language model can be compromised during the fine-tuning stage. Our initial experiments on movie reviews, i.e., IMDB, demonstrate that an attacker can manipulate the generated text through BadGPT.

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsTopic Modeling · Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare

MethodsALIGN