Semantic Gravity Wells: Why Negative Constraints Backfire

Shailesh Rana

arXiv:2601.08070 · cs.AI · January 14, 2026

TL;DR

This paper investigates why negative constraints in large language models often fail. It shows that explicitly naming a forbidden word can paradoxically activate it, and it identifies two distinct failure modes, priming failure and override failure, through mechanistic analysis.

Contribution

The paper introduces semantic pressure, a quantitative measure of the model's intrinsic probability of generating the forbidden token, and presents what it describes as the first comprehensive mechanistic analysis of negative instruction failures in language models.

Findings

01. Violation probability correlates with semantic pressure via a logistic relationship.

02. Suppression signals are weaker in failures, with a 4.4× asymmetry between successes and failures.

03. Two failure modes identified: priming failure and override failure, with specific layer contributions.

Abstract

Negative constraints (instructions of the form "do not use word X") represent a fundamental test of instruction-following capability in large language models. Despite their apparent simplicity, these constraints fail with striking regularity, and the conditions governing failure have remained poorly understood. This paper presents the first comprehensive mechanistic investigation of negative instruction failure. We introduce semantic pressure, a quantitative measure of the model's intrinsic probability of generating the forbidden token, and demonstrate that violation probability follows a tight logistic relationship with pressure ($p = \sigma(-2.40 + 2.27 \cdot P_0)$; $n = 40{,}000$ samples; bootstrap $95\%$ CI for slope: $[2.21, 2.33]$). Through layer-wise analysis using the logit lens technique, we establish that the suppression signal induced by negative instructions is present but…
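
To make the logistic fit quoted in the abstract concrete, the following minimal Python sketch evaluates $p = \sigma(-2.40 + 2.27 \cdot P_0)$ at a few pressure values. The intercept, slope, and slope interval are taken from the abstract; the function names and the choice of sample points are illustrative, and the sketch assumes $P_0$ is a probability in $[0, 1]$, which the abstract suggests but does not state explicitly.

```python
import math

# Coefficients as reported in the abstract.
INTERCEPT = -2.40
SLOPE = 2.27
SLOPE_CI = (2.21, 2.33)  # bootstrap 95% CI for the slope


def sigmoid(z: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def violation_probability(p0: float,
                          slope: float = SLOPE,
                          intercept: float = INTERCEPT) -> float:
    """Predicted probability of violating the constraint at pressure p0.

    Assumes p0 lies in [0, 1] (an assumption of this sketch, not a
    statement from the paper).
    """
    if not 0.0 <= p0 <= 1.0:
        raise ValueError("p0 is assumed to lie in [0, 1]")
    return sigmoid(intercept + slope * p0)


if __name__ == "__main__":
    # Illustrative sample points, not values from the paper.
    for p0 in (0.0, 0.25, 0.5, 0.75, 1.0):
        point = violation_probability(p0)
        low = violation_probability(p0, slope=SLOPE_CI[0])
        high = violation_probability(p0, slope=SLOPE_CI[1])
        print(f"P0={p0:.2f}  p={point:.3f}  (slope CI: {low:.3f} to {high:.3f})")
```

Because only the slope interval is reported here, the printed range varies the slope alone and holds the intercept fixed; it is a rough sensitivity check rather than a proper confidence band.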

Taxonomy

Topics: Natural Language Processing Techniques · Neurobiology of Language and Bilingualism · Topic Modeling