Linking In-context Learning in Transformers to Human Episodic Memory
Li Ji-An, Corey Y. Zhou, Marcus K. Benna, Marcelo G. Mattar

TL;DR
This paper explores the parallels between attention mechanisms in Transformer models and human episodic memory, revealing that certain attention heads function similarly to human memory processes and are crucial for in-context learning.
Contribution
It identifies and characterizes CMR-like attention heads in Transformers, linking them to human episodic memory and demonstrating their causal role in in-context learning.
Findings
CMR-like heads often emerge in the intermediate and late layers of pre-trained LLMs.
Ablation of CMR-like heads impairs in-context learning performance.
CMR-like attention heads qualitatively mirror human memory biases.
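The ablation finding above can be illustrated with a toy model. The sketch below is not the paper's procedure: it assumes zero-ablation (one common choice, in which an ablated head contributes nothing) and models the multi-head output as a plain sum of per-head outputs, which simplifies the usual concatenate-and-project step. The names `causal_head` and `multi_head` are hypothetical helpers introduced here for illustration.

```python
import math
from typing import Iterable, List, Sequence

Vector = List[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_head(queries: List[Vector], keys: List[Vector],
                values: List[Vector]) -> List[Vector]:
    """Single causal attention head: position t attends to positions 0..t."""
    d = len(queries[0])
    out = []
    for t, q in enumerate(queries):
        scores = [
            sum(a * b for a, b in zip(q, keys[s])) / math.sqrt(d)
            for s in range(t + 1)
        ]
        weights = softmax(scores)
        out.append([
            sum(w * values[s][j] for s, w in enumerate(weights))
            for j in range(len(values[0]))
        ])
    return out


def multi_head(heads: List[List[Vector]],
               ablated: Iterable[int] = ()) -> List[Vector]:
    """Sum per-head outputs; ablated heads contribute zeros (assumed zero-ablation)."""
    skip = set(ablated)
    n_pos, dim = len(heads[0]), len(heads[0][0])
    total = [[0.0] * dim for _ in range(n_pos)]
    for h, head_out in enumerate(heads):
        if h in skip:
            continue
        for t in range(n_pos):
            for j in range(dim):
                total[t][j] += head_out[t][j]
    return total


# Toy inputs (made up for illustration): three positions, two dimensions.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
head0 = causal_head(x, x, x)
head1 = causal_head(x, x, [[2.0 * v for v in row] for row in x])

full = multi_head([head0, head1])
without_head1 = multi_head([head0, head1], ablated=[1])
```

Comparing a downstream metric computed from `full` with the same metric computed from `without_head1` is the general shape of an ablation test: if the metric degrades when a head is removed, that head is causally involved in the behaviour being measured.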
Abstract
Understanding connections between artificial and biological intelligent systems can reveal fundamental principles of general intelligence. While many artificial intelligence models have a neuroscience counterpart, such connections are largely missing in Transformer models and the self-attention mechanism. Here, we examine the relationship between interacting attention heads and human episodic memory. We focus on induction heads, which contribute to in-context learning in Transformer-based large language models (LLMs). We demonstrate that induction heads are behaviorally, functionally, and mechanistically similar to the contextual maintenance and retrieval (CMR) model of human episodic memory. Our analyses of LLMs pre-trained on extensive text data show that CMR-like heads often emerge in the intermediate and late layers, qualitatively mirroring human memory biases. The ablation of…
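As a rough illustration of what an induction head does, the sketch below implements the idealized match-and-copy rule usually attributed to such heads: look back for an earlier occurrence of the current token and predict the token that followed it. This is a symbolic caricature under that assumption, not the paper's model or analysis code, and the function name `induction_prediction` is hypothetical.

```python
from typing import Optional, Sequence


def induction_prediction(tokens: Sequence[str]) -> Optional[str]:
    """Return the token that followed the most recent earlier occurrence
    of the current (last) token, or None if there is no earlier occurrence."""
    if len(tokens) < 2:
        return None
    current = tokens[-1]
    # Scan backwards over earlier positions, excluding the current one.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None


# The sequence "A B C A" is completed with "B".
print(induction_prediction(["A", "B", "C", "A"]))  # prints B
```

Predicting the item that followed a matching earlier one is a forward-in-time association, which gives one intuition for why the abstract compares these heads to a model of sequential recall in episodic memory.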
Taxonomy
Topics: Neural Networks and Applications · EEG and Brain-Computer Interfaces
Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Label Smoothing · Adam · Residual Connection · Position-Wise Feed-Forward Layer · Multi-Head Attention · Dropout · Dense Connections
