What does the Knowledge Neuron Thesis Have to do with Knowledge?

Jingcheng Niu; Andrew Liu; Zining Zhu; Gerald Penn

arXiv:2405.02421·cs.CL·May 7, 2024

TL;DR

This paper critically reevaluates the Knowledge Neuron Thesis, arguing that it oversimplifies how large language models represent and recall factual knowledge, and suggests exploring more complex model components for understanding knowledge representation.

Contribution

The paper challenges the KN thesis by demonstrating its limitations and proposing that knowledge is not solely stored in MLP weights, advocating for broader analysis of model structures.

Findings

01

KN thesis oversimplifies knowledge recall

02

MLP weights do not fully encode knowledge

03

Attention mechanisms and layer structures are crucial

Abstract

We reassess the Knowledge Neuron (KN) Thesis: an interpretation of the mechanism underlying the ability of large language models to recall facts from a training corpus. This nascent thesis proposes that facts are recalled from the training corpus through the MLP weights in a manner resembling key-value memory, implying in effect that "knowledge" is stored in the network. Furthermore, by modifying the MLP modules, one can control the language model's generation of factual information. The plausibility of the KN thesis has been demonstrated by the success of KN-inspired model editing methods (Dai et al., 2022; Meng et al., 2022). We find that this thesis is, at best, an oversimplification. Not only have we found that we can edit the expression of certain linguistic phenomena using the same model editing methods but, through a more comprehensive evaluation, we have found that the KN…
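The key-value reading of an MLP block that the abstract describes can be illustrated with a toy sketch. This is not the authors' code or either cited editing method. The function name `mlp_key_value`, the random weights, the ReLU nonlinearity, and the absence of biases are all assumptions made for illustration. Each hidden neuron has a "key" row that is matched against the input and a "value" row that is added to the output in proportion to the activation, so rescaling one neuron's activation shifts the output along that neuron's value vector.

```python
import numpy as np

def mlp_key_value(x, keys, values, scale=None):
    """Toy feed-forward block read as a key-value memory (hypothetical helper).

    keys:   (n_neurons, d_model), each row is matched against the input x
    values: (n_neurons, d_model), each row is written to the output
    scale:  optional per-neuron multiplier on activations (1.0 = unchanged)
    """
    # Assumption: ReLU and no biases, to keep the sketch minimal.
    activations = np.maximum(keys @ x, 0.0)
    if scale is not None:
        activations = activations * scale
    # The output is an activation-weighted sum of value vectors.
    return activations @ values, activations

rng = np.random.default_rng(0)
d_model, n_neurons = 8, 32
keys = rng.normal(size=(n_neurons, d_model))
values = rng.normal(size=(n_neurons, d_model))
x = rng.normal(size=d_model)

out, act = mlp_key_value(x, keys, values)
top = int(np.argmax(act))  # the most strongly activated neuron

# Suppress that neuron (activation set to zero).
suppress = np.ones(n_neurons)
suppress[top] = 0.0
out_suppressed, _ = mlp_key_value(x, keys, values, suppress)

# Amplify that neuron (activation doubled).
amplify = np.ones(n_neurons)
amplify[top] = 2.0
out_amplified, _ = mlp_key_value(x, keys, values, amplify)
```

In this toy setting, suppressing or amplifying a single neuron changes the output by exactly that neuron's activation times its value vector. Whether such a localized change amounts to editing stored "knowledge" in a real language model is the question the paper examines.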

Peer Reviews

Decision·ICLR 2024 spotlight

Reviewer 01 · Rating 6: marginally above the acceptance threshold · Confidence 3

Strengths

Overall, I found the paper to be clearly written, except for some very technical and linguistics-specific concepts that warrant more (and earlier) explanation. Its use of intuitive examples and graphical illustrations throughout the text was very helpful for understanding its arguments. Lastly, the experiments seem comprehensive and convincingly demonstrate the authors’ two main claims: the existence of syntax-knowledge neurons in LLMs analogous to fact-knowledge neurons, and that neither…

Weaknesses

Despite its technical soundness, I’m personally struggling to understand the significance of these findings on a larger scale, though I must admit that this is not my field. In particular, I feel that such detailed investigation of LLMs on a more “cognitive” level, i.e., assigning individual neurons to represent concepts or knowledge wholesale, is orthogonal to dissecting the computational mechanisms of attention-based LLMs, and is more suitable for a conference like ACL. This is not reall…

Reviewer 02 · Rating 8: accept, good paper · Confidence 4

Strengths

1. This paper introduces many new practices for the rigorous study of the knowledge neuron thesis, including the use of minimal pairs and t-tests.
2. It broadens the definition of a knowledge neuron and connects it to prior work on linguistic phenomena.
3. Thorough and diverse analysis.

Weaknesses

1. If I have to nitpick, section 4 felt a bit disjointed from the rest of the paper and is not fully fledged.

Reviewer 03 · Rating 8: accept, good paper · Confidence 3

Strengths

- The paper is extremely well written, with clear arguments and experimentation.
- The results presented bring a lot of clarity to the interpretability of transformers.
- Many previous model editing techniques fail to study systematically whether the effects of edits are merely local or systematic, while this work does so very comprehensively.
- In my understanding, there is no prior work that applies editing techniques to identify "syntax neurons", and this aspect of the paper is quite novel.

Weaknesses

One weakness of the paper is that some of the presentation of experiments could be cleaned up substantially. Some specific suggestions for improvement:
- Results in 3.2 (first paragraph) seem quite loaded. This paragraph presents attribution scores, shows that the identified neurons have regularity, describes the effect of causal interventions, shows how the identified neurons have more to do with frequency cues than syntax, etc. I think these results could be broken up into their own paragraphs.
- It would also be…

Code & Models

Repositories

frankniujc/kn_thesis
PyTorch · Official

Taxonomy

Topics · Neural Networks and Applications