Semantic Gravity Wells: Why Negative Constraints Backfire
Shailesh Rana

TL;DR
This paper investigates why negative constraints in large language models often fail. It shows that explicitly mentioning a forbidden word paradoxically activates it, and it identifies two distinct failure modes through mechanistic analysis.
Contribution
It introduces semantic pressure, a quantitative measure of the model's intrinsic probability of generating the forbidden token, and presents what the authors describe as the first comprehensive mechanistic investigation of negative instruction failure in language models.
Findings
Violation probability follows a tight logistic relationship with semantic pressure.
Suppression signals are weaker in failures, with a 4.4× asymmetry between successes and failures.
Two failure modes identified: priming failure and override failure, with specific layer contributions.
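The logistic relationship in the first finding can be illustrated with a minimal sketch. It assumes semantic pressure is a probability in [0, 1] and that each sample is labeled 1 if the forbidden word appeared and 0 otherwise. The function names, the toy data, and the gradient-descent fit are all hypothetical choices made for illustration; they are not the paper's data, fitted coefficients, or fitting procedure.

```python
import math


def violation_probability(pressure, slope, intercept):
    """Logistic link: P(violation) = 1 / (1 + exp(-(slope * pressure + intercept)))."""
    z = slope * pressure + intercept
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(pressures, violated, lr=0.5, steps=5000):
    """Fit slope and intercept by plain gradient descent on the mean log loss."""
    slope, intercept = 0.0, 0.0
    n = len(pressures)
    for _ in range(steps):
        grad_slope = 0.0
        grad_intercept = 0.0
        for p, y in zip(pressures, violated):
            err = violation_probability(p, slope, intercept) - y
            grad_slope += err * p
            grad_intercept += err
        slope -= lr * grad_slope / n
        intercept -= lr * grad_intercept / n
    return slope, intercept


# Made-up toy samples (NOT from the paper): pressure and whether the
# constraint was violated on that sample.
pressures = [0.05, 0.10, 0.20, 0.35, 0.50, 0.65, 0.80, 0.90, 0.95]
violated = [0, 0, 0, 1, 0, 1, 1, 1, 1]

slope, intercept = fit_logistic(pressures, violated)
low = violation_probability(0.1, slope, intercept)
high = violation_probability(0.9, slope, intercept)
print(f"slope={slope:.2f}, P(violation | 0.1)={low:.2f}, P(violation | 0.9)={high:.2f}")
```

A positive fitted slope means that samples with higher pressure are predicted to violate the constraint more often, which is the direction of the relationship the finding reports.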
Abstract
Negative constraints (instructions of the form "do not use word X") represent a fundamental test of instruction-following capability in large language models. Despite their apparent simplicity, these constraints fail with striking regularity, and the conditions governing failure have remained poorly understood. This paper presents the first comprehensive mechanistic investigation of negative instruction failure. We introduce semantic pressure, a quantitative measure of the model's intrinsic probability of generating the forbidden token, and demonstrate that violation probability follows a tight logistic relationship with pressure (; samples; bootstrap CI for slope: ). Through layer-wise analysis using the logit lens technique, we establish that the suppression signal induced by negative instructions is present but…
Taxonomy
Topics: Natural Language Processing Techniques · Neurobiology of Language and Bilingualism · Topic Modeling
