I'm Afraid I Can't Do That: Predicting Prompt Refusal in Black-Box Generative Language Models

Max Reuter; William Schulze

arXiv:2306.03423·cs.AI·June 16, 2023·1 cites

TL;DR

This paper investigates which prompts ChatGPT refuses by training classifiers that predict refusal. The work reveals biases and response patterns, and provides tools for understanding and anticipating model refusals.

Contribution

The paper introduces a method for predicting whether ChatGPT will refuse a prompt, using classifiers trained on manually labeled responses. It also highlights that refusals fall on a nuanced continuum and points to sources of bias.

Findings

01. The refusal classifier achieves 96% accuracy.

02. The prompt classifier predicts refusal with 76% accuracy.

03. Refusal responses lie on a continuum rather than being binary.

Abstract

Since the release of OpenAI's ChatGPT, generative language models have attracted extensive public attention. The increased usage has highlighted generative models' broad utility, but also revealed several forms of embedded bias. Some is induced by the pre-training corpus; but additional bias specific to generative models arises from the use of subjective fine-tuning to avoid generating harmful content. Fine-tuning bias may come from individual engineers and company policies, and affects which prompts the model chooses to refuse. In this experiment, we characterize ChatGPT's refusal behavior using a black-box attack. We first query ChatGPT with a variety of offensive and benign prompts (n=1,706), then manually label each response as compliance or refusal. Manual examination of responses reveals that refusal is not cleanly binary, and lies on a continuum; as such, we map several different…
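The pipeline the abstract describes (collect responses, label each as compliance or refusal, then train a classifier on the labels) can be illustrated with a toy sketch. This is not the authors' implementation: the Naive Bayes model, the `tokenize` helper, the label names, and the six example responses are all assumptions invented here for illustration, and the paper's own models and data may differ.

```python
import math
from collections import Counter, defaultdict

PUNCT = ".,!?'\""

def tokenize(text):
    """Lowercase, split on whitespace, and strip punctuation from word edges."""
    words = (w.strip(PUNCT).lower() for w in text.split())
    return [w for w in words if w]

class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total = sum(self.label_counts.values())
        tokens = tokenize(text)
        best_label, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(n / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for w in tokens:
                score += math.log((self.word_counts[label][w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical hand-labeled responses; not taken from the paper's dataset.
train_texts = [
    "I'm sorry, but I cannot help with that request.",
    "As an AI language model, I cannot provide that information.",
    "I'm sorry, I can't assist with that.",
    "Sure, here is a short poem about autumn.",
    "Here is the recipe you asked for.",
    "Sure, the capital of France is Paris.",
]
train_labels = ["refused"] * 3 + ["complied"] * 3

clf = NaiveBayes().fit(train_texts, train_labels)
pred_refusal = clf.predict("I'm sorry, but I cannot do that.")
pred_comply = clf.predict("Sure, here is the answer.")
```

A binary label set like this one simplifies the picture: the abstract notes that refusal lies on a continuum, so a fuller treatment would need more than two classes or a mapping from finer-grained categories onto them.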

Code & Models

Repositories

maxwellreuter/chatgpt-refusals
pytorch · Official

Models

🤗
protectai/distilroberta-base-rejection-v1
model · 5.7k dl · ♡ 8

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Topic Modeling · Text Readability and Simplification

Methods: Test