Transformers as Measure-Theoretic Associative Memory: A Statistical Perspective and Minimax Optimality

Ryotaro Kawata; Taiji Suzuki

arXiv:2602.01863·stat.ML·February 3, 2026

Open Access·3 Reviews

TL;DR

This paper presents a measure-theoretic framework for Transformers as associative memory, providing theoretical analysis and minimax optimality results for their ability to recall and predict from distributional contexts.

Contribution

It introduces a novel measure-theoretic perspective on Transformers, proving their optimality and providing a framework for analyzing long-context recall with guarantees.

Findings

01

A Transformer with learned softmax attention learns the recall-and-predict map under a spectral assumption on the input densities.

02

A minimax lower bound matches the upper bound, showing optimal convergence rates.

03

Framework enables principled design and analysis of Transformers for distributional contexts.

Abstract

Transformers excel through content-addressable retrieval and the ability to exploit contexts of, in principle, unbounded length. We recast associative memory at the level of probability measures, treating a context as a distribution over tokens and viewing attention as an integral operator on measures. Concretely, for mixture contexts $ν = I^{-1} \sum_{i=1}^{I} μ^{(i)}$ and a query $x_{q}(i^{*})$, the task decomposes into (i) recall of the relevant component $μ^{(i^{*})}$ and (ii) prediction from $(μ^{(i^{*})}, x_{q})$. We study learned softmax attention (not a frozen kernel) trained by empirical risk minimization and show that a shallow measure-theoretic Transformer composed with an MLP learns the recall-and-predict map under a spectral assumption on the input densities. We further establish a matching minimax lower bound with the same rate exponent (up to multiplicative…
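To make the recall step concrete, here is a minimal sketch of softmax attention applied to an empirical measure drawn from a uniform mixture context. It is only an illustration under simplifying assumptions that are not taken from the paper: tokens are hypothetical (key, value) pairs, each component $μ^{(i)}$ is a Gaussian cloud of keys around a hand-picked center, and the attention uses a fixed inverse temperature `beta` rather than learned parameters. All function and variable names are invented for this example.

```python
import math
import random


def sample_mixture_context(centers, values, n_tokens, noise, rng):
    """Draw tokens from the uniform mixture nu = I^{-1} * sum_i mu^(i).

    Each token is (key, value, component_index); the component index is kept
    only so we can measure how much attention mass lands on each component.
    """
    tokens = []
    for _ in range(n_tokens):
        i = rng.randrange(len(centers))  # pick a component uniformly
        key = [c + rng.gauss(0.0, noise) for c in centers[i]]
        tokens.append((key, values[i], i))
    return tokens


def softmax_attention(query, tokens, beta):
    """Integrate token values against the softmax-reweighted empirical measure."""
    scores = [beta * sum(q * k for q, k in zip(query, key)) for key, _, _ in tokens]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    output = sum(w * v for w, (_, v, _) in zip(weights, tokens))
    return output, weights


rng = random.Random(0)
I = 4  # number of mixture components (assumed for the example)
centers = [[math.cos(2 * math.pi * i / I), math.sin(2 * math.pi * i / I)] for i in range(I)]
values = [float(i) for i in range(I)]  # component i carries the value i

tokens = sample_mixture_context(centers, values, n_tokens=400, noise=0.05, rng=rng)

i_star = 2                # index of the component the query refers to
query = centers[i_star]   # query aligned with component i_star
output, weights = softmax_attention(query, tokens, beta=20.0)

# Fraction of attention mass placed on tokens from the queried component.
recalled_mass = sum(w for w, (_, _, i) in zip(weights, tokens) if i == i_star)
print(f"attention mass on component {i_star}: {recalled_mass:.4f}")
print(f"attention output: {output:.4f}")
```

With a sharp enough `beta`, nearly all attention mass concentrates on tokens from the queried component, so the output approximates that component's value. In the paper's setting the attention parameters are learned by empirical risk minimization and an MLP handles the subsequent prediction step; neither is modeled here.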

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 2·Confidence 3

Strengths

The paper attempts a theoretical analysis of a practically relevant setup in transformers, which is always welcome and appreciated. It presents matching upper and lower bounds to the risk of a statistical estimation problem that they pose as a proxy to the associative memory problem.

Weaknesses

The biggest issue with this paper is that the mathematical notation is (a) far too dense for a conference such as ICLR, and (b) just poor, objectively (there are too many problems to list, but see Questions for a few). At a high level, the paper reads as follows to me: the paper spends 7 pages (!) setting up notation and making several assumptions (some of which, admittedly, do seem reasonable, but, e.g., the eigenvalue decay rate, just show up out of nowhere) and then obtains some results that likely do n

Reviewer 02·Rating 4·Confidence 2

Strengths

To the extent of my understanding, which is limited, the modeling of contexts as probability distributions seems an interesting line of research, with several other recent papers treating it.

Weaknesses

Clarity: I found the paper extremely hard to parse (which may be my fault and not a weakness; I let the area chair judge). For example:
- line 73: it is not clear what is meant by "context"
- line 74: it is not clear which space the measures nu, mu^(i) live in
- line 75: it is not clear which space the query x_q lives in
- line 76: what is the "target map"? Is it some sort of ground-truth function? This should be defined clearly
- etc.
While stated as "informal", I find this introductor

Reviewer 03·Rating 8·Confidence 3

Strengths

The authors carry out a sophisticated theoretical analysis of the problem. Particularly nice is the tight characterization of the excess-risk rate. Despite the complex nature of the work, the paper is relatively well written and organized. I especially like that the authors reserved some space to give a quick proof sketch for theorems in the main body. The proofs seem reasonable, even though I could not check the appendix in detail.

Weaknesses

The main limitation of the work is obviously the lack of clear practical implications, as it is purely theoretical. The lack of experimental results is a consequence of the same fact. Nonetheless, the results are interesting and a worthy theoretical contribution. On the minor side: 1. Lines 176-177 are repeated at lines 178-179; 2. The definition at line 211 is confusing: there is no F in the argument; 3. I would take a little space to explain what an RKHS is in Sec. 3.1.1; 4. There is an extra "that"

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Topic Modeling · Information Retrieval and Search Behavior