Born a Transformer -- Always a Transformer? On the Effect of Pretraining on Architectural Abilities

Mayank Jobanputra; Yana Veitsman; Yash Sarrof; Aleksandra Bakalova; Vera Demberg; Ellie Pavlick; Michael Hahn

arXiv:2505.21785·cs.LG·October 24, 2025

TL;DR

This paper investigates how large-scale pretraining affects the inherent architectural limitations of transformers, revealing that pretraining improves some capabilities but does not fully overcome fundamental length-generalization constraints.

Contribution

The study provides a theoretical and empirical analysis of how pretraining influences transformer abilities, especially in length generalization and retrieval tasks, highlighting persistent limitations.

Findings

01

Pretrained models are better at retrieving the token to the right of a query token (induction) than the token to its left (anti-induction).

02

Targeted fine-tuning can eliminate the asymmetry between induction and anti-induction.

03

Pretraining enhances certain capabilities but does not fully overcome length-generalization limits.

Abstract

Transformers have theoretical limitations in modeling certain sequence-to-sequence tasks, yet it remains largely unclear if these limitations play a role in large-scale pretrained LLMs, or whether LLMs might effectively overcome these constraints in practice due to the scale of both the models themselves and their pretraining data. We explore how these architectural constraints manifest after pretraining, by studying a family of retrieval and copying tasks inspired by Liu et al. [2024a]. We use a recently proposed framework for studying length generalization [Huang et al., 2025] to provide guarantees for each of our settings. Empirically, we observe an induction-versus-anti-induction asymmetry, where pretrained models are better at retrieving tokens to the right (induction) rather than the left (anti-induction) of a query token. This asymmetry disappears…
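
To make the induction versus anti-induction distinction concrete, the following is a minimal sketch of how one such retrieval instance could be constructed. It is an illustration only, not the paper's actual task format or code: the function name `make_retrieval_example`, the use of unique context tokens, and the dictionary layout are all assumptions introduced here.

```python
import random


def make_retrieval_example(vocab, length, seed=0):
    """Build one toy retrieval instance (hypothetical format, not the paper's).

    The context holds `length` distinct tokens, so the query token occurs
    exactly once and both retrieval targets are unambiguous.
    """
    if length < 3:
        raise ValueError("length must be at least 3 so the query has two neighbors")
    rng = random.Random(seed)
    context = rng.sample(vocab, length)   # distinct tokens, random order
    idx = rng.randrange(1, length - 1)    # interior position: both neighbors exist
    return {
        "context": context,
        "query": context[idx],
        # Induction: the token immediately to the RIGHT of the query's occurrence.
        "induction_target": context[idx + 1],
        # Anti-induction: the token immediately to the LEFT of the query's occurrence.
        "anti_induction_target": context[idx - 1],
    }


example = make_retrieval_example(list("abcdefghij"), length=6, seed=1)
print(example)
```

Under these assumptions, the same context and query yield two different targets, so any accuracy gap between the two settings would reflect the direction of retrieval rather than the difficulty of the input itself.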

Code & Models

Repositories

lacoco-lab/always_a_transformer (PyTorch, official)

Videos

Born a Transformer -- Always a Transformer? On the Effect of Pretraining on Architectural Abilities · SlidesLive

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Information Retrieval and Search Behavior