Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models
Peter Hase, Mohit Bansal, Been Kim, Asma Ghandeharioun

TL;DR
This paper tests whether localization methods such as Causal Tracing identify the best place to edit a fact in a language model. They do not: where a fact is localized says little about which layer to edit, which challenges common assumptions about how model knowledge is stored and manipulated.
Contribution
The paper demonstrates that localization results from Causal Tracing do not reliably indicate which MLP layers to edit, calling into question how useful current localization techniques are for guiding model editing.
Findings
Localization from Causal Tracing does not predict which layers to edit.
Layer choice is a better predictor of editing success than localization results.
Better mechanistic understanding does not always improve editing strategies.
Abstract
Language models learn a great quantity of factual information during pretraining, and recent work localizes this information to specific model weights like mid-layer MLP weights. In this paper, we find that we can change how a fact is stored in a model by editing weights that are in a different location than where existing methods suggest that the fact is stored. This is surprising because we would expect that localizing facts to specific model parameters would tell us where to manipulate knowledge in models, and this assumption has motivated past work on model editing methods. Specifically, we show that localization conclusions from representation denoising (also known as Causal Tracing) do not provide any insight into which model MLP layer would be best to edit in order to override an existing stored fact with a new one. This finding raises questions about how past work relies on…
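To make the representation denoising idea concrete, here is a minimal sketch on a toy two-position network, not the paper's implementation. Everything in it is an assumption for illustration: the function names (`forward`, `target_prob`, `denoising_effects`), the random weights, and the use of a single noise sample. The sketch corrupts the "subject" input with Gaussian noise, restores the clean subject state at one layer at a time, and records how much of the target probability comes back.

```python
import numpy as np

def forward(subj, last, W, V, patch=None):
    """Run a toy two-position residual network.

    subj, last: hidden states at the "subject" and "final" positions.
    W[i] updates the subject position; V[i] moves information from the
    subject position into the final position (a stand-in for attention).
    patch: optional (layer, state) pair; the subject state after that
    layer is overwritten with `state` (the restoration step).
    Returns the subject states after each layer and the final-position state.
    """
    subj_states = []
    for i in range(len(W)):
        subj = subj + np.tanh(W[i] @ subj)
        if patch is not None and patch[0] == i:
            subj = patch[1]
        subj_states.append(subj)
        last = last + np.tanh(V[i] @ subj)
    return subj_states, last

def target_prob(last, readout, target):
    """Softmax probability of the target token given the final state."""
    logits = readout @ last
    exp = np.exp(logits - logits.max())
    return float(exp[target] / exp.sum())

def denoising_effects(subj, last, W, V, readout, target, noise_scale, rng):
    """Per-layer effect: P(target | restored at layer) - P(target | corrupted)."""
    clean_states, clean_last = forward(subj, last, W, V)
    p_clean = target_prob(clean_last, readout, target)

    # Corrupt only the subject input, leaving the final position untouched.
    noisy_subj = subj + noise_scale * rng.standard_normal(subj.shape)
    _, corrupt_last = forward(noisy_subj, last, W, V)
    p_corrupt = target_prob(corrupt_last, readout, target)

    effects = []
    for layer in range(len(W)):
        # Re-run the corrupted input, restoring the clean subject state at one layer.
        _, restored_last = forward(
            noisy_subj, last, W, V, patch=(layer, clean_states[layer])
        )
        effects.append(target_prob(restored_last, readout, target) - p_corrupt)
    return np.array(effects), p_clean, p_corrupt

# Hypothetical toy setup: random weights, 4 layers, 5-token vocabulary.
rng = np.random.default_rng(0)
d, n_layers, vocab = 8, 4, 5
W = [0.5 * rng.standard_normal((d, d)) for _ in range(n_layers)]
V = [0.5 * rng.standard_normal((d, d)) for _ in range(n_layers)]
readout = rng.standard_normal((vocab, d))
subj, last = rng.standard_normal(d), rng.standard_normal(d)

effects, p_clean, p_corrupt = denoising_effects(
    subj, last, W, V, readout, target=0, noise_scale=3.0, rng=rng
)
print(effects)
```

In this toy, restoring at the first layer recovers the clean prediction exactly, while later layers recover only part of it. The abstract's point is that a per-layer profile of this kind, however sharp, does not indicate which layer is best to edit.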
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Explainable Artificial Intelligence (XAI)
