Discovering Chunks in Neural Embeddings for Interpretability
Shuchen Wu, Stephan Alaniz, Eric Schulz, Zeynep Akata

TL;DR
This paper introduces a framework for interpreting neural networks that identifies and extracts recurring chunks in neural embeddings. The approach is inspired by chunking in human cognition and aims to clarify the networks' internal representations.
Contribution
It demonstrates how to extract interpretable chunks from neural embeddings in RNNs and large language models, providing a new approach to understanding neural population activity.
Findings
Hidden states reflect imposed regularities in RNNs.
Recurring embedding states correspond to concepts in LLMs.
Perturbations to embedding states influence concept activation.
Abstract
Understanding neural networks is challenging due to their high-dimensional, interacting components. Inspired by human cognition, which processes complex sensory data by chunking it into recurring entities, we propose leveraging this principle to interpret artificial neural population activities. Biological and artificial intelligence share the challenge of learning from structured, naturalistic data, and we hypothesize that the cognitive mechanism of chunking can provide insights into artificial systems. We first demonstrate this concept in recurrent neural networks (RNNs) trained on artificial sequences with imposed regularities, observing that their hidden states reflect these patterns, which can be extracted as a dictionary of chunks that influence network responses. Extending this to large language models (LLMs) like LLaMA, we identify similar recurring embedding states…
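The idea of recurring embedding states can be illustrated with a toy sketch. The following is not the authors' method: it assumes, purely for illustration, that a recurring state is a group of embedding vectors with high cosine similarity, and it uses synthetic vectors in place of real RNN or LLM hidden states. The function name `extract_recurring_states` and the parameters `sim_threshold` and `min_count` are hypothetical.

```python
import numpy as np

def extract_recurring_states(states, sim_threshold=0.9, min_count=5):
    """Greedily group embedding vectors by cosine similarity.

    Illustrative assumption, not the paper's algorithm: a vector joins the
    most similar existing group if its cosine similarity to that group's
    first member is at least sim_threshold; otherwise it starts a new group.
    Returns (mean_vector, member_indices) for groups seen >= min_count times.
    """
    normed = states / np.linalg.norm(states, axis=1, keepdims=True)
    centers = []   # unit vector of the first member of each group
    members = []   # indices of the vectors assigned to each group
    for i, v in enumerate(normed):
        if centers:
            sims = np.array(centers) @ v
            j = int(np.argmax(sims))
            if sims[j] >= sim_threshold:
                members[j].append(i)
                continue
        centers.append(v)
        members.append([i])
    return [(states[idx].mean(axis=0), idx)
            for idx in members if len(idx) >= min_count]

# Synthetic stand-in for hidden states: noisy copies of three planted patterns.
rng = np.random.default_rng(0)
prototypes = np.eye(3, 16)                  # three orthogonal 16-dim patterns
labels = rng.integers(0, 3, size=200)       # which pattern occurs at each step
states = prototypes[labels] + 0.02 * rng.standard_normal((200, 16))

chunks = extract_recurring_states(states)   # the toy "dictionary of chunks"
```

Each entry of `chunks` pairs a mean vector with the time steps at which that state recurred. In a real setting one would replace `states` with recorded hidden activations and would likely need a more robust grouping procedure.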
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: LLaMA
