Transformers are Universal In-context Learners
Takashi Furuya, Maarten V. de Hoop, Gabriel Peyré

TL;DR
This paper proves that deep transformer architectures can universally approximate any continuous in-context mapping over an arbitrarily large set of tokens, with fixed embedding size and attention heads, for both vision and language tasks.
Contribution
It establishes that transformers are universal in-context learners: a single architecture with a fixed embedding dimension and a fixed number of attention heads can handle an arbitrarily large, even infinite, number of context tokens.
Findings
Transformers can approximate any continuous in-context mapping.
A single transformer can handle an infinite number of tokens.
The results apply to both vision and language transformers.
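The reason a fixed-size model can accept any number of tokens can be seen in a few lines of NumPy. The sketch below is illustrative only and is not the paper's construction: `attention_head`, the dimensions `d` and `k`, and the random weights are assumptions chosen for the example, and it uses a single unmasked head with no MLP or normalization. It applies the same three weight matrices to contexts of 4 and 1000 tokens, then checks that the output ignores token order and is unchanged when every token is duplicated.

```python
import numpy as np

def attention_head(x, context, Q, K, V):
    """One unmasked softmax attention head with a residual connection.

    x: query token, shape (d,); context: tokens, shape (n, d)
    Q, K: shape (k, d); V: shape (d, d). Parameter shapes do not depend on n.
    """
    scores = (context @ K.T) @ (Q @ x) / np.sqrt(K.shape[0])  # shape (n,)
    weights = np.exp(scores - scores.max())                    # stable softmax
    weights /= weights.sum()
    return x + weights @ (context @ V.T)                       # shape (d,)

rng = np.random.default_rng(0)
d, k = 3, 2  # hypothetical embedding and key dimensions
Q = rng.normal(size=(k, d))
K = rng.normal(size=(k, d))
V = rng.normal(size=(d, d))
x = rng.normal(size=d)

small = rng.normal(size=(4, d))
large = rng.normal(size=(1000, d))

# Same parameters, very different context sizes.
out_small = attention_head(x, small, Q, K, V)
out_large = attention_head(x, large, Q, K, V)

# Reordering the context does not change the output.
shuffled = large[rng.permutation(len(large))]
out_shuffled = attention_head(x, shuffled, Q, K, V)

# Duplicating every token does not change the output either:
# only the distribution of tokens matters, not their count.
doubled = np.concatenate([small, small])
out_doubled = attention_head(x, doubled, Q, K, V)

print(np.allclose(out_large, out_shuffled), np.allclose(out_small, out_doubled))
```

Because the output depends only on the distribution of context tokens, it is natural to treat the context as a probability measure, which is the viewpoint the abstract describes.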
Abstract
Transformers are deep architectures that define "in-context mappings" which enable predicting new tokens based on a given set of tokens (such as a prompt in NLP applications or a set of patches for a vision transformer). In this work, we study in particular the ability of these architectures to handle an arbitrarily large number of context tokens. To mathematically, uniformly address their expressivity, we consider the case that the mappings are conditioned on a context represented by a probability distribution of tokens which becomes discrete for a finite number of these. The relevant notion of smoothness then corresponds to continuity in terms of the Wasserstein distance between these contexts. We demonstrate that deep transformers are universal and can approximate continuous in-context mappings to arbitrary precision, uniformly over compact token domains. A key aspect of our results,…
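To make the measure-theoretic view concrete, one attention head can be written as a map acting on a probability measure of context tokens. The following is a sketch in standard notation; the symbols $\Gamma$, $Q$, $K$, $V$ and $k$ are assumptions of this note, and the paper's exact parameterization (multiple heads, scaling, masking) may differ.

```latex
\Gamma(\mu, x) \;=\; x + \int
  \frac{\exp\!\big(\langle Qx, Ky\rangle/\sqrt{k}\big)}
       {\int \exp\!\big(\langle Qx, Kz\rangle/\sqrt{k}\big)\,\mathrm{d}\mu(z)}
  \; V y \,\mathrm{d}\mu(y),
\qquad
\mu = \frac{1}{n}\sum_{i=1}^{n}\delta_{x_i}
\;\Longrightarrow\;
\Gamma(\mu, x) = x + \sum_{i=1}^{n}
  \frac{\exp\!\big(\langle Qx, Kx_i\rangle/\sqrt{k}\big)}
       {\sum_{j=1}^{n}\exp\!\big(\langle Qx, Kx_j\rangle/\sqrt{k}\big)}\, V x_i .
```

For an empirical measure the factors of $1/n$ in the numerator and denominator cancel, so the expression reduces to ordinary softmax attention over $n$ tokens. The same formula remains well defined for any probability measure $\mu$, which is how a context with infinitely many tokens can be treated by a map with a fixed number of parameters.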
Peer Reviews
Decision·ICLR 2025 Poster
* Nice prior-work section, succinctly summarizes a very large literature.
* The "in-context mapping" formulation is nice, not sure if it was used in prior theory but seems to apply to many IC learners like RNNs and SSMs and so on.
* Feel like many of the techniques used could become useful theoretical constructions themselves (like the Laplace-like transform of Lemma 1).
Hard to think of any beyond those already acknowledged by the authors. If I were to reach for one, I feel like the result itself is maybe less interesting than the methods (which are very interesting), since being able to approximate any function often doesn't translate to being able to learn it (e.g. shallow networks, polynomial regression). It might be nice to spend some space in the intro or discussion highlighting what, if any, bearing this has on learning.
I believe this paper provides a solid theoretical foundation for how a Transformer can model an infinite number of tokens, which supports the development of long-context language models. I find it especially surprising that a Transformer can handle an infinite number of tokens with a fixed embedding dimension.
I'm not sure if people actually use the unmasked variant discussed in this paper. For instance, bidirectional models like BERT and ViT must apply positional embeddings to their input tokens, meaning Eq. (2) isn’t typically used in practice. Therefore, I’d say the main contribution of this paper lies in its analysis of Eq. (3), which introduces additional regularity constraints on the token distribution.
The paper explores in depth an interesting theoretical question about the in-context mapping abilities of the transformer architecture, namely its approximation abilities in the presence of a context of arbitrary (even continuously infinite) cardinality. The depth and rigour of the mathematical derivations appear (in the best judgment of a non-specialist, see also below) to be of very high quality.
The paper is extremely technical, and would require a considerable amount of time and effort, even by a mathematically inclined reader, to be fully understood. The paper, as it stands, appears to be oriented towards a small number of readers, already familiar with the technical literature relating to measure-theoretic views of the approximation properties of transformers. Its appeal could be made broader by: (1) motivating the approach by potential (even non-immediate) applications and (2) by
Taxonomy
Topics: Robotics and Automated Systems · Speech and dialogue systems · Hand Gesture Recognition Systems
Methods: Attention Is All You Need · Sparse Evolutionary Training · Softmax · Linear Layer · Multi-Head Attention
