Hidden Dynamics of Massive Activations in Transformer Training

Jorge Gallego-Feliciano, S. Aaron McClendon, Juan Morinelli, Stavros Zervoudakis, and Antonios Saravanos

arXiv:2508.03616·cs.AI·February 25, 2026


TL;DR

This paper analyzes how massive activations develop during transformer training, shows that their emergence follows predictable mathematical patterns, and predicts those patterns from architectural specifications alone, with implications for model design and training efficiency.

Contribution

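It introduces a comprehensive analysis of massive activation emergence across Pythia model sizes and training checkpoints, models the resulting patterns with an exponentially-modulated logarithmic function with five parameters, and develops a machine learning framework to predict those parameters from architecture.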

Findings

01

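Massive activation emergence follows highly predictable mathematical patterns that can be modeled with an exponentially-modulated logarithmic function with five parameters.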

02

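A machine learning framework predicts these parameters from architectural specifications alone, with high accuracy for steady-state behavior and moderate accuracy for emergence timing and magnitude.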

03

Findings have implications for model stability and training optimization.

Abstract

We present the first comprehensive analysis of massive activation development throughout transformer training, using the Pythia model family as our testbed, and release our full dataset publicly to support further research. Through systematic analysis of various model sizes across multiple training checkpoints, we demonstrate that massive activation emergence follows highly predictable mathematical patterns that can be accurately modeled using an exponentially-modulated logarithmic function with five key parameters. Additionally, we develop a machine learning framework to predict these mathematical parameters from architectural specifications alone, achieving high accuracy for steady-state behavior and moderate accuracy for emergence timing and magnitude. These findings enable architects to predict and potentially control key aspects of massive activation emergence through design…
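
The abstract names an exponentially-modulated logarithmic function with five parameters but does not spell out its form here. As a rough illustration only, the sketch below assumes one plausible parameterization, A * exp(-lam * t) * log(gamma * t + t0) + K. The symbol names (A, lam, gamma, t0, K), the function names, and all numeric values are hypothetical and are not taken from the paper. It shows how such a curve can rise to an early peak and then settle toward a long-run offset.

```python
import math


def massive_activation_curve(t, A, lam, gamma, t0, K):
    """Assumed form (not confirmed by the abstract):
    A * exp(-lam * t) * log(gamma * t + t0) + K.
    """
    return A * math.exp(-lam * t) * math.log(gamma * t + t0) + K


def peak_step(params, steps):
    """Return the step in `steps` where the curve is largest (simple grid search)."""
    return max(steps, key=lambda t: massive_activation_curve(t, **params))


# Hypothetical parameter values, chosen only to make the shape visible.
params = {"A": 10.0, "lam": 0.01, "gamma": 1.0, "t0": 1.0, "K": 5.0}
steps = range(0, 1001)

t_peak = peak_step(params, steps)
start_value = massive_activation_curve(0, **params)
late_value = massive_activation_curve(1000, **params)

print("peak step:", t_peak)
print("value at step 0:", start_value)
print("value at step 1000:", round(late_value, 4))
```

Under this assumed form, the logarithmic term drives the early growth, the exponential term damps it at later steps, and K acts as the value the curve settles toward. A different parameterization would change these roles.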
