When Prohibitions Become Permissions: Auditing Negation Sensitivity in Language Models
Katherine Elkins, Jon Chun

TL;DR
This paper reveals that many large language models fail to correctly interpret negations, often endorsing prohibited actions, which raises concerns for safe deployment in sensitive applications.
Contribution
The study audits 16 models across 14 ethical scenarios, introduces the Negation Sensitivity Index (NSI) as a governance metric, and proposes a tiered certification framework for safer AI deployment.
Findings
Open-source models endorse prohibited actions 77% of the time under simple negation.
Commercial models fare better but still show swings of 19-128%.
Agreement between models drops from 74% to 62% on negated prompts.
Abstract
When a user tells an AI system that someone "should not" take an action, the system ought to treat this as a prohibition. Yet many large language models do the opposite: they interpret negated instructions as affirmations. We audited 16 models across 14 ethical scenarios and found that open-source models endorse prohibited actions 77% of the time under simple negation and 100% under compound negation -- a 317% increase over affirmative framing. Commercial models fare better but still show swings of 19-128%. Agreement between models drops from 74% on affirmative prompts to 62% on negated ones, and financial scenarios prove twice as fragile as medical ones. These patterns hold under deterministic decoding, ruling out sampling noise. We present case studies showing how these failures play out in practice, propose the Negation Sensitivity Index (NSI) as a governance metric, and outline a…
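The percentage figures above are relative changes, so it can help to see the arithmetic behind them. The sketch below is not the authors' code and does not implement the NSI, whose definition is not given in this summary. It assumes the "317% increase" is a relative change in endorsement rate measured against the affirmative baseline and that it refers to the 100% compound-negation figure. The function names `relative_change_pct` and `implied_baseline` are hypothetical.

```python
def relative_change_pct(baseline_rate: float, negated_rate: float) -> float:
    """Percent change in endorsement rate relative to the affirmative baseline.

    Both rates are fractions in [0, 1]. This is ordinary relative change,
    used here as an assumed reading of the abstract's "increase" figures.
    """
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return (negated_rate - baseline_rate) / baseline_rate * 100.0


def implied_baseline(negated_rate: float, increase_pct: float) -> float:
    """Back out the baseline rate implied by a negated rate and a percent increase."""
    return negated_rate / (1.0 + increase_pct / 100.0)


# Assumption: the 317% increase refers to the 100% compound-negation rate.
baseline = implied_baseline(1.00, 317.0)
print(f"implied affirmative endorsement rate: {baseline:.1%}")

# Round-trip check: the implied baseline reproduces the stated increase.
print(f"relative change: {relative_change_pct(baseline, 1.00):.0f}%")
```

Under these assumptions, the implied affirmative endorsement rate is roughly 24%, which is the baseline against which the 77% and 100% figures would be compared.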
Taxonomy
Topics: Ethics and Social Impacts of AI · Artificial Intelligence in Healthcare and Education · Explainable Artificial Intelligence (XAI)
