NEAT: Concept driven Neuron Attribution in LLMs
Vivek Hruday Kavuri, Gargi Shroff, Rahul Mishra

TL;DR
This paper introduces a concept-driven neuron attribution method for large language models that efficiently identifies neurons responsible for specific concepts, improving interpretability and enabling bias analysis.
Contribution
It proposes a novel, more efficient method using concept vectors to locate concept neurons with fewer computations and demonstrates its effectiveness over previous approaches.
Findings
Outperforms baseline methods in identifying concept neurons.
Reduces the number of forward passes required from O(n*m) to O(n), where n is the number of neurons and m is the number of examples.
Enables analysis of bias and hate speech at the neuron level.
Abstract
Locating the neurons responsible for final predictions is important for opening up black-box large language models and understanding their internal mechanisms. Previous studies have tried to find mechanisms that operate at the neuron level, but these methods fail to represent a concept, and there is scope for further optimization of the compute required. In this paper, with the help of concept vectors, we propose a method for locating significant neurons that are responsible for representing certain concepts, and we term those neurons concept neurons. If the number of neurons is n and the number of examples is m, we reduce the number of forward passes required from O(n*m) to just O(n) compared to previous works, thereby reducing the time and computation required. We also compare our method with several baselines and previous methods and our results demonstrate…
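The forward-pass count in the abstract can be illustrated with a toy sketch. The abstract does not describe how the concept vector is built or how neurons are scored, so everything below is an assumption made for illustration: `ToyModel`, `per_example_scores`, and `concept_vector_scores` are hypothetical names, the "concept vector" is taken to be the mean of the example inputs, and attribution is done by zeroing one neuron at a time. It is not the paper's method; it only shows why scoring against a single aggregated vector needs about n passes while scoring against every example needs about n*m.

```python
import math
import random


class ToyModel:
    """Hypothetical one-hidden-layer network that counts its forward passes."""

    def __init__(self, dim_in, n_neurons, seed=0):
        rng = random.Random(seed)
        self.w_in = [[rng.gauss(0, 1) for _ in range(dim_in)]
                     for _ in range(n_neurons)]
        self.w_out = [rng.gauss(0, 1) for _ in range(n_neurons)]
        self.n_neurons = n_neurons
        self.forward_passes = 0

    def forward(self, x, ablate=None):
        """Return a scalar output; optionally zero out one hidden neuron."""
        self.forward_passes += 1
        out = 0.0
        for i, (row, w) in enumerate(zip(self.w_in, self.w_out)):
            if i == ablate:
                continue  # ablated neuron contributes nothing
            out += w * math.tanh(sum(a * b for a, b in zip(row, x)))
        return out


def per_example_scores(model, examples):
    """Ablate every neuron on every example: m * (n + 1) forward passes."""
    scores = [0.0] * model.n_neurons
    for x in examples:
        base = model.forward(x)
        for i in range(model.n_neurons):
            scores[i] += abs(base - model.forward(x, ablate=i))
    return [s / len(examples) for s in scores]


def concept_vector_scores(model, examples):
    """Ablate every neuron once on an assumed concept vector: n + 1 passes."""
    m = len(examples)
    # Assumption: the concept vector is the element-wise mean of the examples.
    concept = [sum(col) / m for col in zip(*examples)]
    base = model.forward(concept)
    return [abs(base - model.forward(concept, ablate=i))
            for i in range(model.n_neurons)]


if __name__ == "__main__":
    rng = random.Random(1)
    examples = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(20)]  # m = 20

    baseline = ToyModel(dim_in=4, n_neurons=8)  # n = 8
    per_example_scores(baseline, examples)
    print("per-example passes:", baseline.forward_passes)  # 20 * (8 + 1) = 180

    concept = ToyModel(dim_in=4, n_neurons=8)
    concept_vector_scores(concept, examples)
    print("concept-vector passes:", concept.forward_passes)  # 8 + 1 = 9
```

In this sketch the saving comes entirely from collapsing the m examples into one vector before the ablation loop, so the pass count stops depending on m. Whether the scores from the two procedures agree is a separate question that the toy does not address.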
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Topic Modeling
