Memory Mosaics
Jianyu Zhang, Niklas Nolte, Ranajoy Sadhukhan, Beidi Chen, Léon Bottou

TL;DR
Memory Mosaics are transparent associative memory networks that perform comparably to or better than transformers on language tasks, offering compositional and in-context learning abilities while being easier to interpret.
Contribution
This paper introduces Memory Mosaics, a novel associative memory network architecture that achieves transformer-like capabilities with greater transparency.
Findings
Memory Mosaics perform as well as or better than transformers on medium-scale language modeling.
They demonstrate compositional and in-context learning capabilities.
The approach achieves these capabilities more transparently than transformers do.
Abstract
Memory Mosaics are networks of associative memories working in concert to achieve a prediction task of interest. Like transformers, memory mosaics possess compositional capabilities and in-context learning capabilities. Unlike transformers, memory mosaics achieve these capabilities in a comparatively transparent way ("predictive disentanglement"). We illustrate these capabilities on a toy example and also show that memory mosaics perform as well as or better than transformers on medium-scale language modeling tasks.
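To make "associative memory" concrete, the sketch below shows one simple way such a unit can work: it stores key-value pairs and answers a query with a weighted average of the stored values, weighting each by a Gaussian kernel on the distance between the query and its key. This is an illustration under assumptions, not the authors' implementation: the class name `AssociativeMemory`, the sharpness parameter `beta`, and the choice of kernel are all hypothetical here, and the abstract does not specify any of them.

```python
import math


class AssociativeMemory:
    """Toy key-value memory with kernel-smoothed retrieval (illustrative only)."""

    def __init__(self, beta=1.0):
        self.beta = beta    # hypothetical sharpness of the Gaussian kernel
        self.keys = []      # stored key vectors
        self.values = []    # stored value vectors

    def store(self, key, value):
        self.keys.append(list(key))
        self.values.append(list(value))

    def retrieve(self, query):
        if not self.keys:
            raise ValueError("memory is empty")
        # Negative scaled squared distances play the role of logits.
        logits = [
            -self.beta * sum((q - k) ** 2 for q, k in zip(query, key))
            for key in self.keys
        ]
        # Subtract the max before exponentiating for numerical stability.
        top = max(logits)
        weights = [math.exp(l - top) for l in logits]
        total = sum(weights)
        dim = len(self.values[0])
        # Normalized weighted average of the stored values.
        return [
            sum(w * v[d] for w, v in zip(weights, self.values)) / total
            for d in range(dim)
        ]


mem = AssociativeMemory(beta=50.0)
mem.store([1.0, 0.0], [1.0])
mem.store([0.0, 1.0], [-1.0])

near_first = mem.retrieve([0.9, 0.1])   # query close to the first key
midpoint = mem.retrieve([0.5, 0.5])     # query equidistant from both keys
```

When all stored keys have the same norm, these kernel weights reduce to a softmax over query-key dot products, which is the usual point of contact between this kind of memory and attention.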
Peer Reviews
Decision: ICLR 2025 Poster
1. The integration of associative memories to replicate and surpass the capabilities of transformers. 2. The concept of predictive disentanglement is novel and also rooted in a solid theoretical framework. 3. The theoretical motivations behind predictive disentanglement are well-explained.
1. The paper does not provide exhaustive details on the architecture's configuration. 2. The choice and impact of hyperparameters are not discussed in detail. 3. The experimental validation is limited to certain language tasks. 4. Ablation studies are lacking.
In general, this is a thought-provoking paper with some interesting ideas. The exploration of the relationship between associative memories, attention, and Transformer blocks is valuable, although the current presentation is heavily biased and omits related ideas from prior work. The empirical results on language modeling are encouraging, although small-scale.
One problem with this submission is that the presentation almost entirely ignores the work on modern Hopfield networks and dense associative memories, which addresses closely related motivations and ideas. Specifically, the authors’ proposal is closely related to [Energy Transformer (NeurIPS 2023)](https://proceedings.neurips.cc/paper_files/paper/2023/file/57a9b97477b67936298489e3c1417b0a-Paper-Conference.pdf) and related literature, which replaces elements of the Transformer block with associative mem
The paper presents an interesting, seemingly novel architecture for a prediction model based on associative memories. The architecture seems well justified from first principles and links nicely with existing works on the attention mechanism. This work is of good quality and presents its ideas reasonably well, including rigorous formalizations and numerous figures explaining the proposed architectures. The paper benchmarks the Memory Mosaic against a toy dataset to demonstrate properties of the
To my reading, the architecture is not differentiated strongly enough from existing, similar work on associative memories and the attention mechanism. The key property of the architecture, "peeking" in the value predictions, is not explained clearly enough for a reader to understand the significance. The meta-learning interpretation in Section 3 is somewhat confusing in how this relates to the broader field of meta-learning. The toy dataset, while very useful in understanding "predictive disenta
Taxonomy
Topics: Memory, Trauma, and Commemoration
