On the Optimal Memorization Capacity of Transformers
Tokio Kajitsuka, Issei Sato

TL;DR
This paper investigates the memorization capacity of Transformers, showing that they can memorize data with a near-optimal number of parameters and clarifying the roles that self-attention and feed-forward networks play in this process.
Contribution
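It provides upper and lower bounds on the number of parameters Transformers need to memorize data, shows that these bounds match up to logarithmic factors (for hardmax attention in the sequence-to-sequence case), and identifies how the individual network components affect capacity.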
Findings
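Transformers can memorize the labels of $N$ input sequences with $\tilde{O}(\sqrt{N})$ parameters in the next-token prediction setting, which is optimal up to logarithmic factors.
In the sequence-to-sequence setting with sequences of length $n$, $\tilde{O}(\sqrt{nN})$ parameters are sufficient, and also necessary at least for Transformers with hardmax.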
Self-attention efficiently identifies input sequences; feed-forward networks can be a bottleneck.
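To get a feel for how the two bounds above scale, here is a minimal Python sketch. It only evaluates the leading-order terms $\sqrt{N}$ and $\sqrt{nN}$; the logarithmic factors and constants hidden by $\tilde{O}$ are dropped, and the function names are hypothetical, chosen for illustration rather than taken from the paper.

```python
import math


def next_token_order(num_sequences: int) -> float:
    """Leading-order parameter count sqrt(N) for the next-token setting.

    Assumption: log factors and constants hidden by the O-tilde are ignored.
    """
    return math.sqrt(num_sequences)


def seq2seq_order(seq_len: int, num_sequences: int) -> float:
    """Leading-order parameter count sqrt(n * N) for the seq2seq setting.

    Assumption: log factors and constants hidden by the O-tilde are ignored.
    """
    return math.sqrt(seq_len * num_sequences)


if __name__ == "__main__":
    N = 10_000  # number of input sequences (illustrative value)
    n = 100     # input sequence length (illustrative value)
    print("next-token order:", next_token_order(N))
    print("seq2seq order:   ", seq2seq_order(n, N))
```

Under these simplifications, the next-token order does not depend on the sequence length $n$ at all, while the sequence-to-sequence order grows by a factor of $\sqrt{n}$.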
Abstract
Recent research in the field of machine learning has increasingly focused on the memorization capacity of Transformers, but how efficient they are is not yet well understood. We demonstrate that Transformers can memorize labels with $\tilde{O}(\sqrt{N})$ parameters in a next-token prediction setting for $N$ input sequences of length $n$, which is proved to be optimal up to logarithmic factors. This indicates that Transformers can efficiently perform memorization with little influence from the input length $n$ owing to the benefit of parameter sharing. We also analyze the memorization capacity in the sequence-to-sequence setting, and find that $\tilde{O}(\sqrt{nN})$ parameters are not only sufficient, but also necessary at least for Transformers with hardmax. These results suggest that while self-attention mechanisms can efficiently identify input sequences, the feed-forward network…
Peer Reviews
Decision: ICLR 2025 Poster
- This paper is well-motivated and clearly presented, making it easy to follow.
- This paper is technically solid; the construction of a single-layer Transformer that can identify identical sequence ids from token-wise \((r,\delta)\)-separated sequences with near-optimal parameter order is novel. Additionally, the authors provide a parameter lower bound for the seq2seq case when the self-attention layer uses hard attention.
- Furthermore, the paper considers bit complexity, which is often overlo

- Token-wise \((r,\delta)\)-separated sequences may not be an expressive enough model to fully capture the memorization capacity of Transformers. It appears that the authors construct a Transformer that captures contextual information by 'counting' the occurrences of tokens in a subset. However, Transformers typically capture contextual information from the mutual relationships among tokens via the attention mechanism.
- In the proof of Theorem 4.1, the attention layer is set to uniform attenti

- The proof sketch of Theorem 4.1 is clear and well motivated.
- In the next-token setting, the paper proves matching upper and lower bounds, and in the sequence-to-sequence setting they match for hardmax transformers.
- The authors provide clear comparisons with prior work throughout the paper.
- **bit complexity:** As the authors point out, the transformer needs at least $\Omega(N)$ bits, so to succeed with $\sqrt{N}$ parameters it needs a bit complexity of $\sqrt{N}$. However, this is very unrealistic, as these models are trained in finite precision, most often 16-bit precision. It is therefore unclear whether these results say much about transformers' ability to memorize in practice. This also affects the motivation; for example, I would have expected double descent (lines 102-107) to occur at
1. The results seem promising and strong, because both upper and lower bounds are provided.
2. The studied problem and method provide interesting insight into transformers.

1. The title wording could be improved for accuracy. I would suggest 'near-optimal...' instead of 'optimal...', since there is a logarithmic-term mismatch between the lower bound and the upper bound.
2. The paper could include more discussion of the separatedness assumption. For example, how realistic is the assumption?
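The bit-complexity concern raised in the reviews above can be written out as a short counting argument. This is a sketch under stated assumptions, not the paper's proof: it assumes binary labels, and the symbols $P$ (number of parameters) and $b$ (bits per parameter) are introduced here for illustration.

$$
\underbrace{2^{Pb}}_{\text{distinct parameter settings}} \;\ge\; \underbrace{2^{N}}_{\text{binary labelings of } N \text{ inputs}}
\quad\Longrightarrow\quad
P\,b \;\ge\; N
\quad\Longrightarrow\quad
b \;\ge\; \frac{N}{P} \;=\; \sqrt{N} \ \text{ when } P = \sqrt{N}.
$$

At a fixed precision such as $b = 16$, the same inequality gives $P \ge N/16$, so the parameter count would have to grow linearly in $N$. This is the gap between the theoretical regime and finite-precision practice that the reviewer points to.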
Taxonomy
Topics: Advanced Memory and Neural Computing · Neural Networks and Applications
