Does Editing Provide Evidence for Localization?
Zihao Wang, Victor Veitch

TL;DR
This paper examines whether editing the internal components of a large language model provides evidence that a semantic behavior is localized there, and finds that a successful edit often does not confirm meaningful localization.
Contribution
The paper adapts LLM alignment techniques to find optimal localized edits, and uses them to show that a successful edit does not reliably indicate that a behavior is truly localized.
Findings
Optimal edits at random locations are as effective as those at aligned locations.
Localized edits often do not provide strong evidence for true localization.
Evidence from edits alone is insufficient to confirm semantic localization.
Abstract
A basic aspiration for interpretability research in large language models is to "localize" semantically meaningful behaviors to particular components within the LLM. There are various heuristics for finding candidate locations within the LLM. Once a candidate localization is found, it can be assessed by editing the internal representations at the corresponding localization and checking whether this induces model behavior that is consistent with the semantic interpretation of the localization. The question we address here is: how strong is the evidence provided by such edits? To evaluate the localization claim, we want to assess the effect of the optimal intervention at a particular location. The key new technical tool is a way of adapting LLM alignment techniques to find such optimal localized edits. With this tool in hand, we give an example where the edit-based evidence for…
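The edit-and-check procedure described in the abstract can be illustrated with a toy sketch. This is not the authors' method, which adapts LLM alignment techniques to a real model. It assumes a hypothetical linear readout `w_out` over a hidden vector `h`, for which the least-norm edit restricted to a set of coordinates has a closed form. The names `optimal_localized_edit`, `candidate`, and `random_loc` are illustrative assumptions.

```python
import numpy as np

def optimal_localized_edit(h, w_out, location, target):
    """Least-norm edit to h, supported only on `location`, that moves the
    scalar output w_out @ h to `target` (closed form for a linear readout)."""
    delta = np.zeros_like(h)
    w_loc = w_out[location]
    gap = target - w_out @ h
    # Project the required output change onto the coordinates we may edit.
    delta[location] = w_loc * gap / (w_loc @ w_loc)
    return delta

rng = np.random.default_rng(0)
hidden_dim = 16
h = rng.normal(size=hidden_dim)      # toy hidden representation (assumption)
w_out = rng.normal(size=hidden_dim)  # toy linear readout (assumption)
target = 3.0                         # behavior we want the edit to induce

# A "candidate" location, as if proposed by some heuristic, and a random one.
candidate = np.array([0, 1, 2, 3])
random_loc = rng.choice(hidden_dim, size=4, replace=False)

def edited_output(location):
    # Apply the optimal edit at `location` and read out the new behavior.
    delta = optimal_localized_edit(h, w_out, location, target)
    return float(w_out @ (h + delta))

candidate_out = edited_output(candidate)
random_out = edited_output(random_loc)
```

In this toy setting both `candidate_out` and `random_out` reach `target` up to floating-point error, so the success of the optimal edit does not distinguish the candidate location from a random one. That is the kind of concern the paper raises, though a linear toy says nothing about how often it arises in real LLMs.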
Taxonomy
Topics: Authorship Attribution and Profiling
