Quantitative Bounds for Length Generalization in Transformers
Zachary Izzo, Eshaan Nichani, Jason D. Lee

TL;DR
This paper provides the first quantitative bounds on the training sequence length needed for transformers to generalize to longer inputs, analyzing various settings and verifying the bounds empirically.
Contribution
It offers the first formal bounds on length generalization in transformers, analyzing different configurations and error controls, and empirically validating the theoretical predictions.
Findings
Length generalization occurs when the transformer's behavior on longer sequences can be simulated by its behavior on shorter sequences seen during training.
Quantitative bounds depend on the complexity of the task and transformer architecture.
Empirical results support the theoretical bounds and insights.
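The "simulation" idea in the findings above can be illustrated with a toy example. This is only a sketch under strong assumptions, not the paper's construction: it assumes a single uniform-attention (mean-pooling) layer with scalar token values and no positional information, so the output depends only on token frequencies. The names `mean_pool` and `compress` are hypothetical helpers introduced for this illustration.

```python
from collections import Counter

def mean_pool(tokens, values):
    # Toy stand-in for uniform attention: average the per-token scalar values.
    return sum(values[t] for t in tokens) / len(tokens)

def compress(tokens, n):
    """Build a length-n sequence whose token frequencies approximate those of
    `tokens`, using largest-remainder rounding (an assumption of this sketch)."""
    counts = Counter(tokens)
    total = len(tokens)
    # Integer share of the n slots for each token, plus the leftover remainder.
    alloc = {t: (c * n) // total for t, c in counts.items()}
    remainders = {t: (c * n) % total for t, c in counts.items()}
    leftover = n - sum(alloc.values())
    # Hand the remaining slots to the tokens with the largest remainders.
    for t in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        alloc[t] += 1
    return [t for t in sorted(alloc) for _ in range(alloc[t])]

# Hypothetical data: a length-1000 input and made-up scalar token values.
long_seq = ["a"] * 700 + ["b"] * 200 + ["c"] * 100
values = {"a": 1.0, "b": -1.0, "c": 0.5}
short_seq = compress(long_seq, 10)

# The short sequence reproduces the pooled output of the long one.
gap = abs(mean_pool(long_seq, values) - mean_pool(short_seq, values))
```

In this toy setting a length-10 sequence matches the length-1000 output because the frequencies happen to be multiples of 1/10. When they are not, the rounding error shrinks as the short length grows, which loosely mirrors why a sufficiently long training length is needed.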
Abstract
We study the problem of length generalization (LG) in transformers: the ability of a model trained on shorter sequences to maintain performance when evaluated on much longer, previously unseen inputs. Prior work by Huang et al. (2025) established that transformers eventually achieve length generalization once the training sequence length exceeds some finite threshold, but left open the question of how large that threshold must be. In this work, we provide the first quantitative bounds on the required training length for length generalization to occur. Motivated by previous empirical and theoretical work, we analyze LG in several distinct problem settings: $\ell_\infty$ error control vs. average error control over an input distribution, infinite-precision softmax attention vs. finite-precision attention (which reduces to an argmax) in the transformer, and one- vs. two-layer transformers. In all…
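The abstract's remark that finite-precision attention "reduces to an argmax" can be made concrete with a small sketch. It assumes one particular precision model that is not taken from the paper: attention weights are rounded to `p` fractional bits and then renormalized. The function names and the fallback for the all-zero case are likewise assumptions for illustration.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def round_to_bits(x, p):
    # Round x to the nearest multiple of 2**-p (assumed precision model).
    scale = 2 ** p
    return round(x * scale) / scale

def finite_precision_attention(logits, p):
    # Round each softmax weight to p fractional bits, then renormalize.
    weights = [round_to_bits(w, p) for w in softmax(logits)]
    total = sum(weights)
    if total == 0:
        # Every weight rounded to zero: fall back to uniform weight on the
        # maximal logits (a choice made for this sketch only).
        m = max(logits)
        hits = [1.0 if x == m else 0.0 for x in logits]
        return [h / sum(hits) for h in hits]
    return [w / total for w in weights]

logits = [0.0, 1.0, 12.0]
exact = softmax(logits)                           # every weight strictly positive
rounded = finite_precision_attention(logits, 8)   # all mass on the largest logit
```

With a large enough logit gap, every non-maximal weight falls below the rounding threshold, so the rounded attention places all of its mass on the argmax position, while exact softmax keeps a small positive weight everywhere.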
Peer Reviews
Decision: ICLR 2026 Oral
- The length of the training sequence required for LG is quantified for the first time, addressing the key issue that previous studies have only demonstrated the “existence of a threshold” but have not clarified the “size of the threshold”.
- The conclusions are more broadly applicable by also considering a variety of variables such as the type of accuracy, the number of model layers, and the way the error is controlled. Covers many different types of theoretical scenarios and explores the quant
- A very interesting paper that gives a quantitative analysis of length generalization. Because it builds on the theoretical framework of the “limit transformer”, it lacks an analysis of transformers with relative positional encodings, which are currently more widely used.
- The paper verifies the correctness of its theory on a small-scale transformer structure, and it is hoped that the paper will give further analysis on whether there are limitations in the analysis and experim
* Generally well written and well connected with the literature.
* Provides theoretical proofs to support quantitative bounds for the required training length.
* Provides some empirical support that the error level at which the test loss plateaus decreases with increasing training length.
No significant blockers to acceptance that I am aware of. One could question the practical impact or limited scope, but it is targeted as a theoretical paper under learning theory, and it seems to sufficiently fulfill its targeted goal. There are precedents of papers with similar scope getting accepted. Besides this, one limitation is that the study is mainly tied to (periodic) absolute positional encodings (as far as I understand), but the authors keep studies of other positional schemes for
1. The Dirichlet assumption in Theorem 4.2 is a good choice, as classical NLP literature has worked with it for decades in many practical applications. This makes the contribution of this work more practical.
2. The authors' efforts in studying both finite and infinite precision and also providing bounds for $L_{\infty}$ and average case for the finite setting are a good contribution to this line of research.
3. I found Lemma 5.3 to be a useful addition. While I did not read the proof end to end, this is
1. I have a concern about the value of N, $2^{\frac{p}{\min\{\gamma(f), \gamma(g)\}}}$. While I understand the requirements for the proofs, I am curious whether the authors have thoughts on adapting equation 2 on page 13, say for cases where the logit differences can be binned and perhaps some covering number arguments can be used. Please note this is just a rough guess that I have not thought through properly; I am just curious about the authors' ideas.
2. I
Taxonomy
Topics: Ferroelectric and Negative Capacitance Devices · Stochastic Gradient Optimization Techniques · Topic Modeling
