Looped Transformers for Length Generalization
Ying Fan, Yilun Du, Kannan Ramchandran, Kangwook Lee

TL;DR
This paper introduces looped Transformers with adaptive steps that significantly enhance length generalization on iterative tasks, outperforming standard Transformers in handling unseen input lengths.
Contribution
The paper proposes a novel looped Transformer architecture with an adaptive number of steps, improving length generalization on tasks with iterative solutions.
Findings
Looped Transformers achieve better length generalization.
Adaptive step mechanism enhances iterative task performance.
Looped Transformers trained with the proposed algorithm learn highly length-generalizable solutions.
Abstract
Recent work has shown that Transformers trained from scratch can successfully solve various arithmetic and algorithmic tasks, such as adding numbers and computing parity. While these Transformers generalize well on unseen inputs of the same length, they struggle with length generalization, i.e., handling inputs of unseen lengths. In this work, we demonstrate that looped Transformers with an adaptive number of steps significantly improve length generalization. We focus on tasks with a known iterative solution, involving multiple iterations of a RASP-L operation - a length-generalizable operation that can be expressed by a finite-sized Transformer. We train looped Transformers using our proposed learning algorithm and observe that they learn highly length-generalizable solutions for various tasks.
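The iterative structure the abstract describes, one fixed-size operation applied a number of times that depends on the input length, can be illustrated with a minimal sketch. This is not the authors' model or training algorithm: `parity_step` is a hand-written, hypothetical stand-in for the learned Transformer block, `run_looped` is a hypothetical helper, and the step count `len(bits)` is an assumed choice for the parity task mentioned in the abstract.

```python
from typing import Callable, List, Tuple

# State carried between loop iterations: (remaining bits, running parity).
State = Tuple[List[int], int]


def parity_step(state: State) -> State:
    """One length-independent step: consume one bit and XOR it into the accumulator.

    Hypothetical stand-in for a single pass through a weight-shared block.
    """
    bits, acc = state
    if not bits:
        return state  # nothing left to consume; the step is a no-op
    return bits[1:], acc ^ bits[0]


def run_looped(step: Callable[[State], State], state: State, num_steps: int) -> State:
    """Apply the same step function num_steps times (the 'loop')."""
    for _ in range(num_steps):
        state = step(state)
    return state


def parity(bits: List[int]) -> int:
    """Compute parity by looping the step once per input bit (assumed step count)."""
    _, acc = run_looped(parity_step, (list(bits), 0), num_steps=len(bits))
    return acc


if __name__ == "__main__":
    print(parity([1, 0, 1, 1]))
```

Because the step itself never depends on the input length, a longer input only changes how many times the loop runs, which is the intuition behind scaling the number of steps with length.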
Peer Reviews
Decision: ICLR 2025 Poster
S1. The paper is written and organized well. Overall, the presentation of the methodology and empirical results is clear and easy to follow. S2. The idea behind the proposed method is neat and plausible. It is natural to think about adaptively scaling the depth of the model according to the problem length or the problem complexity. This paper successfully implements this idea to solve various interesting algorithmic tasks with the power of Looped Transformers. Also, $n$-RASP-L is an interesting
**W1. The definition of $n$-RASP-L (Definition 3.1) can be improved.** - I think the equation “$T(n): \mathbb{N} \rightarrow \mathbb{N}$” should be corrected to “$T: \mathbb{N} \rightarrow \mathbb{N}$” because $T$ (instead of $T(n)$) is a function of input length $n$ representing the number of steps inside a task-solving $n$-RASP-L program. - In (2), I guess $P'$ should be a RASP-L program, which is unspecified in the definition. - Should $P$ be decomposed to a sequential application of $P'$, i
- The paper is well-structured and clearly written. - The introduction of Looped Transformers is well-motivated and effectively argued. - The results are strong and solid. They do not require the use of a scratchpad. Also, the prediction is conducted using an end-to-end, full-answer prediction setup, which is a more general way than the conventional next-token prediction setup. - The paper clearly illustrates that the model can determine the number of steps to take on its own and does not requir
Weakness 1: Applicability Limited to n-RASP-L Tasks - The approach is limited to tasks that belong to n-RASP-L categories, as it requires the ground-truth number of steps in the training data. Weakness 2: Insufficient Experimentation. - ***Effect of Curriculum Learning.*** How does the model perform without curriculum learning? Is the use of curriculum learning necessary? - ***Tolerance to Step Counts.*** I am curious whether this method will still perform well with different choices of T(n)
Overall, I really liked the paper. I think that using a looped Transformer to achieve length generalization is an interesting idea that, to my knowledge, was not studied in the past. This paper complements all the other techniques (universal transformers, different types of position embedding, etc.) that were used in the past for length generalization. The paper is well-written and well-explained. This is why I advocate for acceptance of this paper.
I would like to raise the following weaknesses/questions regarding this paper: - **Lack of other baselines**: What would happen if you have a very deep universal transformer? Universal transformers also have shared parameters and look equivalent to the looped Transformer. The depth may play the role of the number of loops. Would this be equivalent to the fixed loop NTP? It would be interesting to run the same experiments with a universal transformer. - **Comparison with other methods**: Where
Taxonomy
Topics: Neural Networks and Applications
Methods: Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Label Smoothing · Byte Pair Encoding · Absolute Position Encodings · Softmax · Layer Normalization · Dropout · Dense Connections
