Blind Refusal: Language Models Refuse to Help Users Evade Unjust, Absurd, and Illegitimate Rules
Cameron Pattison, Lorenzo Manuali, Seth Lazar

TL;DR
Safety-trained language models often refuse to help users evade rules, even when those rules are unjust, absurd, or imposed by an illegitimate authority. The paper argues that this reveals a disconnect between refusal behavior and normative reasoning.
Contribution
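The paper introduces a synthetic dataset and an empirical analysis documenting blind refusal: the tendency of models to refuse rule-breaking requests without regard to whether the underlying rule is defensible. It emphasizes the need for better normative understanding in AI safety.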
Findings
Models refuse 75.4% of rule-breaking requests.
Refusal occurs even without safety concerns.
Refusal is often decoupled from normative reasoning.
Abstract
Safety-trained language models routinely refuse requests for help circumventing rules. But not all rules deserve compliance. When users ask for help evading rules imposed by an illegitimate authority, rules that are deeply unjust or absurd in their content or application, or rules that admit of justified exceptions, refusal is a failure of moral reasoning. We introduce empirical results documenting this pattern of refusal that we call blind refusal: the tendency of language models to refuse requests for help breaking rules without regard to whether the underlying rule is defensible. Our dataset comprises synthetic cases crossing 5 defeat families (reasons a rule can be broken) with 19 authority types, validated through three automated quality gates and human review. We collect responses from 18 model configurations across 7 families and classify them on two behavioral dimensions --…
