Most Transformer Modifications Still Do Not Transfer at 1-3B: A 2020-2026 Update to Narang et al. (2021) with Downstream Evaluation and a Noise Floor
Yang Zhao, Jiahao Lu, Bin Huang, Guhua Zhang, Jie Zhou

TL;DR
This study revisits Transformer modifications at the 1-3B scale and confirms that most do not transfer. It argues that a fair comparison requires a baseline noise floor, downstream evaluation, and stability testing.
Contribution
It provides a controlled evaluation of 20 post-2021 Transformer modifications at 1.2B and 3B parameters, highlighting their limited transferability and the need for standardized testing protocols.
Findings
Most modifications do not transfer at 1.2B and 3B scales.
Only two modifications pass Bonferroni correction at 1.2B.
Attention-output modifications show larger gaps between loss and downstream performance.
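The Bonferroni criterion mentioned in the findings can be illustrated with a minimal sketch. It assumes a family-wise significance level of 0.05 (not stated in this summary) and one p-value per modification for the 20 modifications tested. The function names and all p-values below are hypothetical placeholders, not results from the paper.

```python
from typing import Dict, List


def bonferroni_threshold(alpha: float, num_tests: int) -> float:
    """Per-test significance threshold under Bonferroni correction."""
    if num_tests < 1:
        raise ValueError("num_tests must be at least 1")
    return alpha / num_tests


def passes_bonferroni(p_values: Dict[str, float], alpha: float = 0.05) -> List[str]:
    """Return the names whose p-value falls below alpha / (number of tests)."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [name for name, p in p_values.items() if p < threshold]


# Hypothetical p-values for 20 modifications (illustrative only).
p_values = {f"mod_{i:02d}": 0.2 for i in range(20)}
p_values["mod_03"] = 0.0004  # clears the corrected threshold
p_values["mod_11"] = 0.0019  # clears the corrected threshold
p_values["mod_07"] = 0.01    # below 0.05, but not below 0.05 / 20

survivors = passes_bonferroni(p_values)
print(survivors)
```

With 20 comparisons the per-test threshold becomes 0.05 / 20 = 0.0025, so a modification that looks significant at the uncorrected 0.05 level (such as the placeholder mod_07) is still rejected.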
Abstract
Narang et al. (2021) evaluated 40+ Transformer modifications at T5-base scale and concluded that most did not transfer. Five years later, the typical working regime has moved to 1-3B parameters, downstream evaluation has replaced pretraining perplexity, and a substantially different catalogue of modifications has emerged. We revisit their question by testing 20 post-2021 Transformer modifications at 1.2B and 3B under strict iso-data, iso-compute, iso-recipe control, with a multi-seed baseline noise floor and CLIMB-12 downstream evaluation as the primary metric. The central finding reproduces theirs on this curated set: most modifications do not transfer. Of the 20 modifications, only two clear Bonferroni correction at 1.2B; one of those two further fails to train stably at 3B under the shared recipe. We also find that the loss-downstream gap reported by Tay et al. (2023) enlarges…
