Hidden Dynamics of Massive Activations in Transformer Training
Jorge Gallego-Feliciano, S. Aaron McClendon, Juan Morinelli, Stavros Zervoudakis, and Antonios Saravanos

TL;DR
This paper tracks how massive activations develop during transformer training, shows that their emergence follows predictable mathematical patterns, and predicts the parameters of those patterns from architectural specifications alone, with implications for model design and training.
Contribution
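It presents a comprehensive analysis of massive activation emergence across training checkpoints of the Pythia model family, models the observed patterns with a five-parameter exponentially-modulated logarithmic function, and develops a machine learning framework that predicts those parameters from architecture.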
Findings
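Massive activation emergence follows highly predictable mathematical patterns.
A machine learning framework predicts the fitted parameters from architectural specifications, with high accuracy for steady-state behavior and moderate accuracy for emergence timing and magnitude.
The findings have implications for model stability and training optimization.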
Abstract
We present the first comprehensive analysis of massive activation development throughout transformer training, using the Pythia model family as our testbed, and release our full dataset publicly to support further research. Through systematic analysis of various model sizes across multiple training checkpoints, we demonstrate that massive activation emergence follows highly predictable mathematical patterns that can be accurately modeled using an exponentially-modulated logarithmic function with five key parameters. Additionally, we develop a machine learning framework to predict these mathematical parameters from architectural specifications alone, achieving high accuracy for steady-state behavior and moderate accuracy for emergence timing and magnitude. These findings enable architects to predict and potentially control key aspects of massive activation emergence through design…
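
To make "an exponentially-modulated logarithmic function with five key parameters" more concrete, here is a minimal sketch of one function of that kind. The abstract does not state the formula, so the specific form below, f(t) = A * exp(-lam * x) * log(x) + K with x = gamma * t + t0, is an assumption, and so are the parameter names (A, lam, gamma, t0, K) and the numeric values. It should be read as an illustration of the general shape, not as the authors' fitted model.

```python
import math

def activation_curve(t, A, lam, gamma, t0, K):
    """Assumed form: f(t) = A * exp(-lam * x) * log(x) + K, with x = gamma * t + t0."""
    x = gamma * t + t0
    if x <= 0:
        raise ValueError("gamma * t + t0 must be positive so that log(x) is defined")
    return A * math.exp(-lam * x) * math.log(x) + K

# Illustrative parameter values only; not fitted to any real model.
params = {"A": 50.0, "lam": 0.5, "gamma": 0.01, "t0": 1.0, "K": 10.0}

# Evaluate the curve on a grid of hypothetical training steps.
steps = range(0, 1000)
values = [activation_curve(t, **params) for t in steps]

# Grid search for the step where the curve is largest.
peak_step = max(steps, key=lambda t: values[t])

print("value at step 0:", values[0])
print("peak step:", peak_step)
print("late-training value:", activation_curve(100_000, **params))
```

Under this assumed form, K plays the role of the steady-state level that the curve approaches late in training, while A, lam, gamma, and t0 together shape when the early rise peaks and how large it gets. That split loosely mirrors the abstract's distinction between steady-state behavior and emergence timing and magnitude.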
