I'm Afraid I Can't Do That: Predicting Prompt Refusal in Black-Box Generative Language Models
Max Reuter, William Schulze

TL;DR
This paper studies which prompts ChatGPT refuses by training classifiers that predict refusal, exposing biases and response patterns and offering tools to understand and anticipate model refusals.
Contribution
It introduces a method for predicting whether ChatGPT will refuse a prompt, using classifiers trained on manually labeled data, and highlights that refusal responses fall on a nuanced continuum and that bias has several sources.
Findings
The refusal classifier achieves 96% accuracy.
The prompt classifier predicts refusal with 76% accuracy.
Refusal responses lie on a continuum rather than being strictly binary.
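To make the accuracy figures above concrete, here is a minimal sketch of how a binary refusal detector could be scored against manual labels. It is not the authors' method: the keyword list, the function names, and the four toy responses are all invented for illustration, and the paper's classifiers are trained models rather than keyword rules.

```python
# Hypothetical refusal markers, chosen for illustration; NOT taken from the paper.
REFUSAL_MARKERS = (
    "i'm sorry",
    "i cannot",
    "i can't",
    "as an ai language model",
)


def looks_like_refusal(response: str) -> bool:
    """Return True if the response contains any of the hypothetical markers."""
    text = response.lower().replace("\u2019", "'")  # normalize curly apostrophes
    return any(marker in text for marker in REFUSAL_MARKERS)


def accuracy(predictions, labels) -> float:
    """Fraction of predictions that match the manual labels."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equal in length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Toy, invented (response, is_refusal) pairs; not data from the paper.
TOY_RESPONSES = [
    ("I'm sorry, but I can't help with that request.", True),
    ("Here is a short poem about autumn leaves.", False),
    ("As an AI language model, I cannot provide that information.", True),
    ("Sure! The capital of France is Paris.", False),
]

predictions = [looks_like_refusal(text) for text, _ in TOY_RESPONSES]
labels = [label for _, label in TOY_RESPONSES]
toy_accuracy = accuracy(predictions, labels)
print(f"toy accuracy: {toy_accuracy:.2f}")
```

A keyword rule like this forces every response into one of two classes, so it cannot represent the intermediate responses that the continuum finding describes. Capturing those would need a finer label set than the boolean used here.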
Abstract
Since the release of OpenAI's ChatGPT, generative language models have attracted extensive public attention. The increased usage has highlighted generative models' broad utility, but also revealed several forms of embedded bias. Some is induced by the pre-training corpus; but additional bias specific to generative models arises from the use of subjective fine-tuning to avoid generating harmful content. Fine-tuning bias may come from individual engineers and company policies, and affects which prompts the model chooses to refuse. In this experiment, we characterize ChatGPT's refusal behavior using a black-box attack. We first query ChatGPT with a variety of offensive and benign prompts (n=1,706), then manually label each response as compliance or refusal. Manual examination of responses reveals that refusal is not cleanly binary, and lies on a continuum; as such, we map several different…
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Topic Modeling · Text Readability and Simplification
Methods: Test
