Associative memory inspires improvements for in-context learning using a novel attention residual stream architecture
Thomas F Burns, Tomoki Fukai, Christopher J Earls

TL;DR
This paper introduces a novel residual stream architecture inspired by associative memory, enhancing in-context learning in language models by enabling faster and improved information flow between attention heads.
Contribution
The paper proposes a new residual stream architecture inspired by associative memory models, improving in-context learning speed and performance in small and large language models.
Findings
Faster manifestation of in-context learning abilities during training.
Improved ICL performance in larger models when the modification is applied to attention head values.
Effective in small models with 8 million parameters.
Abstract
Large language models (LLMs) demonstrate an impressive ability to utilise information within the context of their input sequences to appropriately respond to data unseen by the LLM during its training procedure. This ability is known as in-context learning (ICL). Humans and non-human animals demonstrate similar abilities; however, their neural architectures differ substantially from LLMs. Despite this, a critical component within LLMs, the attention mechanism, resembles modern associative memory models, widely used in and influenced by the computational neuroscience community to model biological memory systems. Using this connection, we introduce an associative memory model capable of performing ICL. We use this as inspiration for a novel residual stream architecture which allows information to directly flow between attention heads. We test this architecture during training within a…
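The resemblance between attention and modern associative memory mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' model or their residual stream architecture. It only shows the generic idea that a softmax-weighted read over stored key/value pairs behaves like a content-addressable memory lookup. The function names, the toy keys and values, and the inverse-temperature parameter `beta` are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def retrieve(query, keys, values, beta=1.0):
    """One attention-style associative-memory read (hypothetical helper).

    Each stored key is compared with the query by dot product, the
    similarities are sharpened by `beta` and normalised with softmax,
    and the output is the resulting weighted average of stored values.
    """
    scores = [beta * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights


# Toy "context": two stored (key, value) associations (assumed data).
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

# A noisy cue that is closest to the first key.
query = [0.9, 0.1]
output, weights = retrieve(query, keys, values, beta=8.0)
print(weights)  # most of the weight falls on the first stored pattern
print(output)   # close to the first stored value
```

With a large `beta` the read approaches a hard nearest-pattern lookup, while a small `beta` blends the stored values. That trade-off is the usual way the associative-memory view of attention is described.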
Taxonomy
Topics: Neural Networks and Applications · Data Stream Mining Techniques · Online Learning and Analytics
Methods: Linear Layer · Dropout · Multi-Head Attention · Adam · Layer Normalization · Position-Wise Feed-Forward Layer · Label Smoothing · Residual Connection · Softmax · Attention Is All You Need
