Anatomical Heterogeneity in Transformer Language Models
Tomasz Wietrzykowski

TL;DR
This paper reveals significant anatomical heterogeneity in transformer language models, showing that different layers have varying importance and training needs, which can be exploited to reduce training costs.
Contribution
It provides empirical evidence of layer heterogeneity in transformer models and introduces a growth training method that allocates resources based on layer importance for efficiency.
Findings
Layer weights follow a regular oscillatory pattern and are highly predictable (R^2 = 0.91).
Layer importance spans a 10^7 range, from a critical core (L8-11) to anti-layers (L14, L17) whose removal improves performance.
Growth Transformer Training reduces training costs by ~54% while maintaining quality.
Abstract
Current transformer language models are trained with uniform computational budgets across all layers, implicitly assuming layer homogeneity. We challenge this assumption through empirical analysis of SmolLM2-135M, a 30-layer, 135M-parameter causal language model, using five diagnostic metrics: weight predictability (R^2), ablation degradation, recovery speed, weight manipulation robustness, and structural analysis. We find profound anatomical heterogeneity: (1) Layer weights follow strong mathematical regularity (R^2 = 0.91) with a universal oscillatory delta pattern (correlation ≈ -0.50), yet predicted weights cause catastrophic failure due to nonlinear error accumulation. (2) Layer importance spans a 10^7 range, from a critical core (L8-11, up to +63,419% PPL degradation) to anti-layers (L14, L17) whose removal improves performance. (3) Recovery speed correlates with layer importance,…
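The abstract reports ablation degradation as a percentage change in perplexity (PPL). As a minimal sketch of how such a figure could be computed, assuming degradation is the relative PPL change when a layer is removed (the paper's exact protocol is not given here), the following Python uses hypothetical helper names and is not the author's code:

```python
import math


def perplexity(token_nlls):
    """Perplexity as exp of the mean per-token negative log-likelihood."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def ablation_degradation_pct(ppl_base, ppl_ablated):
    """Relative PPL change in percent (assumed definition).

    Positive: removing the layer hurts. Negative: removing it helps,
    which is how the abstract characterizes an anti-layer.
    """
    return 100.0 * (ppl_ablated - ppl_base) / ppl_base


def ppl_ratio_from_pct(degradation_pct):
    """Convert a percentage degradation back to the ratio ablated/base."""
    return 1.0 + degradation_pct / 100.0


def is_anti_layer(degradation_pct):
    """A layer whose removal lowers perplexity."""
    return degradation_pct < 0.0


if __name__ == "__main__":
    # The reported peak of +63,419% corresponds to this PPL multiplier.
    print(round(ppl_ratio_from_pct(63419.0), 2))
    # Illustrative values only, not measurements from the paper.
    print(ablation_degradation_pct(20.0, 19.0), is_anti_layer(-5.0))
```

Under this definition, a degradation of +63,419% means the ablated model's perplexity is about 635 times the baseline, while any negative value marks a layer whose removal improves the model.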
Taxonomy
Topics: Machine Learning in Materials Science · Natural Language Processing Techniques · Ferroelectric and Negative Capacitance Devices
