Learning to Recall with Transformers Beyond Orthogonal Embeddings
Nuri Mert Vural, Alberto Bietti, Mahdi Soltanolkotabi, Denny Wu

TL;DR
This paper analyzes how single-layer transformers with random, non-orthogonal embeddings learn to recall tokens, providing explicit formulas for their storage capacity and validating the intrinsic multiplicative scaling with dataset size, embedding dimension, and sequence length.
Contribution
It offers a theoretical analysis of transformers trained on finite, realistic data with non-orthogonal embeddings, deriving explicit storage capacity formulas and validating their intrinsic scaling.
Findings
Explicit formulas for storage capacity as a function of dataset size N, embedding dimension d, and sequence length L.
Validation of multiplicative scaling through numerical experiments.
Lower bounds demonstrating the intrinsic nature of the scaling.
Abstract
Modern large language models (LLMs) excel at tasks that require storing and retrieving knowledge, such as factual recall and question answering. Transformers are central to this capability because they can encode information during training and retrieve it at inference. Existing theoretical analyses typically study transformers under idealized assumptions such as infinite data or orthogonal embeddings. In realistic settings, however, models are trained on finite datasets with non-orthogonal (random) embeddings. We address this gap by analyzing a single-layer transformer with random embeddings trained with (empirical) gradient descent on a simple token-retrieval task, where the model must identify an informative token within a length-$L$ sequence and learn a one-to-one mapping from tokens to labels. Our analysis tracks the "early phase" of gradient descent and yields explicit formulas…
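To make the task in the abstract concrete, here is a minimal sketch of how such a dataset could be sampled. It is an illustration under stated assumptions, not the paper's construction: the function name `make_recall_data`, the Gaussian embedding scale, the uniform token sampling, and the use of a random permutation as the one-to-one token-to-label map are all hypothetical choices. The informative position is simply recorded in `pos`; how the model is told which token is informative (the reviews mention a trigger embedding) is not modeled here.

```python
import numpy as np

def make_recall_data(N, L, V, d, seed=0):
    """Sample a toy token-retrieval dataset (hypothetical setup, not the paper's exact one)."""
    rng = np.random.default_rng(seed)
    # Random Gaussian embeddings; with d < V they cannot all be mutually orthogonal.
    E = rng.standard_normal((V, d)) / np.sqrt(d)
    # One-to-one token-to-label map, modeled here as a random permutation (assumption).
    label_of = rng.permutation(V)
    tokens = rng.integers(0, V, size=(N, L))      # N sequences of L token ids
    pos = rng.integers(0, L, size=N)              # index of the informative token
    labels = label_of[tokens[np.arange(N), pos]]  # label depends only on that token
    return E, label_of, tokens, pos, labels

E, label_of, tokens, pos, labels = make_recall_data(N=512, L=8, V=64, d=16)

# Off-diagonal entries of the Gram matrix measure how far the embeddings are from orthogonal.
gram = E @ E.T
off_diag = gram[~np.eye(len(E), dtype=bool)]
print("max |off-diagonal inner product|:", np.abs(off_diag).max())
```

With d smaller than V, the off-diagonal inner products are necessarily nonzero for some pairs, which is the non-orthogonal regime the abstract refers to.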
Peer Reviews
Decision: ICLR 2026 Poster
This work provides a clear problem formulation and rigorous proofs supporting its theoretical claims. Overall, the technical results are a good contribution. Beyond theory, it offers numerical experiments that closely match the analytical predictions, showing the precision and tightness of the derived results. The findings also yield practical insights into transformer training, the choice of key architectural parameters, and the importance of the MLP component.
- The clarity of some parts needs improvement (see Questions). It is recommended to include a brief intuition or proof sketch to better convey the key insights underlying the proof. The current presentation may be less accessible to a broader ML audience.
- The gap between theory and practice remains substantial. The definition of the factual recall task requires further clarification. In the current setup, the model identifies a marked token within a sequence and maps it to another token throu
+ The paper is clearly written, with precise notation and formal language, and includes detailed technical assumptions.
+ The theoretical results appear sound and solid. To the best of my knowledge, this work provides the first theoretical analysis of training transformers to perform recall tasks, particularly in the regime where the embedding dimension can be smaller than the vocabulary size.
+ The theoretical findings are further validated through numerical experiments conducted in controll
- Some assumptions may require further justification. For instance, in Assumption 1, it is not entirely clear why it is necessary to set $L = V^c$ with $c \in (0, 1)$. Under this assumption, the sequence length is necessarily smaller than the vocabulary size, and the motivation for this specific scaling is not well explained.
- It is also suggested that the authors present their results in a more comparative manner and provide readers with additional background context. My understanding is that the m
1. This paper addresses an important problem: the learning dynamics of Transformer models.
2. The paper complements its theoretical predictions with experiments.
1. The results in the paper are limited to a simple toy model (easy task, single layer, three simplified gradient steps on a fixed batch). Although additional experiments exist beyond this toy setting, it remains unclear how the results will generalize to more realistic settings.
2. The setting assumes that the input contains a trigger embedding that marks the informative token, which makes this toy setting less practical. It is difficult to imagine a realistic task where this assumption holds.
Taxonomy
Topics: Topic Modeling · Advanced Graph Neural Networks · Natural Language Processing Techniques
