Born a Transformer -- Always a Transformer? On the Effect of Pretraining on Architectural Abilities
Mayank Jobanputra, Yana Veitsman, Yash Sarrof, Aleksandra Bakalova, Vera Demberg, Ellie Pavlick, Michael Hahn

TL;DR
This paper investigates how large-scale pretraining affects the inherent architectural limitations of transformers, revealing that pretraining improves some capabilities but does not fully overcome fundamental length-generalization constraints.
Contribution
The study provides a theoretical and empirical analysis of how pretraining influences transformer abilities, especially in length generalization and retrieval tasks, highlighting persistent limitations.
Findings
Pretrained models retrieve the token to the right of a query token (induction) more reliably than the token to its left (anti-induction).
Targeted fine-tuning can eliminate the induction-anti-induction asymmetry.
Pretraining enhances certain capabilities but does not fully overcome length-generalization limits.
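To make the induction versus anti-induction distinction concrete, here is a minimal sketch of the two retrieval targets on a toy sequence. It assumes every token in the context is unique and that the query token appears exactly once; the function name `retrieve_neighbor`, the `direction` argument, and the example tokens are hypothetical and are not taken from the paper, whose actual task format may differ.

```python
def retrieve_neighbor(tokens, query, direction):
    """Return the token immediately right or left of the query token.

    Illustrative assumption: `query` occurs exactly once in `tokens`.
    direction="right" corresponds to induction,
    direction="left" corresponds to anti-induction.
    """
    if direction not in ("right", "left"):
        raise ValueError("direction must be 'right' or 'left'")
    if tokens.count(query) != 1:
        raise ValueError("query must occur exactly once in tokens")

    i = tokens.index(query)
    j = i + 1 if direction == "right" else i - 1
    if j < 0 or j >= len(tokens):
        raise ValueError("query has no neighbor in that direction")
    return tokens[j]


# Toy context; the query token is "a".
context = ["k", "q", "a", "z", "m"]
induction_target = retrieve_neighbor(context, "a", "right")
anti_induction_target = retrieve_neighbor(context, "a", "left")
print(induction_target, anti_induction_target)
```

Both targets are one position away from the query and differ only in direction, which is why a performance gap between them counts as an asymmetry.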
Abstract
Transformers have theoretical limitations in modeling certain sequence-to-sequence tasks, yet it remains largely unclear if these limitations play a role in large-scale pretrained LLMs, or whether LLMs might effectively overcome these constraints in practice due to the scale of both the models themselves and their pretraining data. We explore how these architectural constraints manifest after pretraining, by studying a family of retrieval and copying tasks inspired by Liu et al. [2024a]. We use a recently proposed framework for studying length generalization [Huang et al., 2025] to provide guarantees for each of our settings. Empirically, we observe an asymmetry, where pretrained models are better at retrieving tokens to the right (induction) rather than the left (anti-induction) of a query token. This asymmetry disappears…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Information Retrieval and Search Behavior
