How Language Models Process Negation
Zhejian Zhou, Tianyi Zhou, Robin Jia, Jonathan May

TL;DR
This paper studies how large language models process negation mechanistically, covering the internal components that handle negation, the late-layer attention behavior that hurts accuracy, and the coexistence of multiple negation-processing strategies.
Contribution
It identifies the internal components and mechanisms LLMs use to handle negation, and reports that constructed representations of the negated phrase are more prominent than attention-based suppression of related concepts.
Findings
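Open-weight models contain internal components that process negation correctly, even when their final answers are wrong.
Late-layer attention promotes simple shortcuts that lower accuracy on negation questions; ablating those attention modules greatly improves it.
Constructive negation mechanisms are more prominent than attention-based ones.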
Abstract
We study how Large Language Models (LLMs) process negation mechanistically. First, we establish that even though open-weight models often provide wrong answers to questions involving negation, they do possess internal components that process negation correctly. Their poor accuracy is due to late-layer attention behavior that promotes simple shortcuts; ablating those attention modules greatly improves accuracy on negation-related questions. Second, we uncover how models process negation. We consider two hypotheses: models could use attention heads that attend to the phrase being negated and suppress related concepts, or they could directly construct a representation of the entire negative phrase (e.g., representing "not gas" as a vector that promotes liquids and solids). We apply a range of observational and causal interpretability techniques on Mistral-7B and Llama-3.1-8B to show that…
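The ablation idea in the abstract can be illustrated with a toy residual stream. The sketch below is not the authors' code and does not touch Mistral-7B or Llama-3.1-8B: the two-dimensional "logit" vector, the layer functions, and all numbers are invented for illustration. It assumes a model where each layer adds an attention output and an MLP output to a running vector, and it zero-ablates attention from a chosen layer onward by skipping that addition. The labels "gas" and "liquid" borrow the abstract's "not gas" example.

```python
from typing import Callable, List, Optional, Sequence

Vector = List[float]
Sublayer = Callable[[Vector], Vector]

LABELS = ["gas", "liquid"]  # hypothetical answer options for a "not gas" prompt


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]


def run_residual_stack(
    x: Vector,
    attn_layers: Sequence[Sublayer],
    mlp_layers: Sequence[Sublayer],
    ablate_attn_from: Optional[int] = None,
) -> Vector:
    """Run a toy residual stack.

    If ablate_attn_from is set, the attention output of every layer with
    index >= ablate_attn_from is replaced by zero (it is simply not added).
    """
    for i, (attn, mlp) in enumerate(zip(attn_layers, mlp_layers)):
        if ablate_attn_from is None or i < ablate_attn_from:
            x = add(x, attn(x))
        x = add(x, mlp(x))
    return x


def predict(logits: Vector) -> str:
    """Return the label with the highest logit."""
    return LABELS[max(range(len(logits)), key=lambda i: logits[i])]


# Invented toy weights: an early MLP promotes the correct answer ("liquid"),
# while late-layer attention copies the negated word ("gas") as a shortcut.
ATTN: List[Sublayer] = [lambda x: [0.0, 0.0], lambda x: [3.0, 0.0]]
MLP: List[Sublayer] = [lambda x: [0.0, 2.0], lambda x: [0.0, 0.0]]

baseline = run_residual_stack([0.0, 0.0], ATTN, MLP)
ablated = run_residual_stack([0.0, 0.0], ATTN, MLP, ablate_attn_from=1)

print(predict(baseline))  # shortcut wins: "gas"
print(predict(ablated))   # late attention removed: "liquid"
```

In this toy setup the correct signal is already present in the residual stream before the last layer, and removing the late attention contribution is enough to let it determine the prediction. With a real model the same intervention would typically be applied through forward hooks on the attention modules, which this sketch does not attempt.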
