A Two-Parameter Weibull Framework for Diagnosing Transformer Weight Distributions
Tiexin Ding

TL;DR
This paper introduces a Weibull distribution-based diagnostic framework for analyzing transformer weight distributions, revealing distinct patterns across model components and training stages.
Contribution
It applies a two-parameter Weibull model to transformer weights, providing architecture-independent diagnostics and insights into training dynamics.
Findings
FFN modules and the attention output projection W_o (the Transmission Class) fall in a narrow band of the Weibull shape parameter k.
Attention input projections deviate from Weibull, influenced by storage methods.
Scale parameter lambda increases during training, correlating with training hyperparameters.
Abstract
We apply the Weibull distribution -- a two-parameter family from extreme-value theory -- as a diagnostic framework for element-wise weight magnitude distributions in transformers. At initialization, i.i.d. Gaussian weights give |w| ~ HalfNormal, yielding k ~ 1.20 via middle-80% probability-plot fit (the protocol used throughout this work). This anchor makes k a principled, architecture-independent measuring stick for training dynamics; fitting each weight matrix independently at every layer at every checkpoint enables per-component, per-layer, and per-step diagnostics that aggregate statistics cannot resolve. Applying this framework to 12 model entries spanning 7 architectural families (Pythia, OLMo-1/2, LLaMA-3, Mistral, Qwen2.5/3) reveals three findings. First, FFN modules and the attention output projection W_o -- the Transmission Class -- fall in a narrow k band: median terminal k…
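The half-normal anchor described in the abstract can be reproduced with a short sketch. This is not the authors' code. It assumes that the "middle-80% probability-plot fit" means an ordinary least-squares line through ln(-ln(1 - F)) versus ln|w|, restricted to empirical CDF values between 0.10 and 0.90, with median-rank plotting positions. The function name `weibull_probability_plot_fit`, the plotting-position formula, the regression direction, and the initialization standard deviation of 0.02 are all illustrative assumptions.

```python
import numpy as np


def weibull_probability_plot_fit(magnitudes, lower=0.10, upper=0.90):
    """Estimate Weibull shape k and scale lam from a probability plot.

    For a Weibull CDF F(x) = 1 - exp(-(x / lam) ** k), the transform
    ln(-ln(1 - F)) = k * ln(x) - k * ln(lam) is linear in ln(x), so the
    slope of a straight-line fit estimates k.
    Only points whose empirical CDF lies in [lower, upper] are used
    (assumed reading of the "middle-80%" protocol).
    """
    x = np.sort(np.asarray(magnitudes, dtype=np.float64).ravel())
    x = x[x > 0]  # ln(x) is undefined at exactly zero
    n = x.size
    # Median-rank plotting positions (Bernard's approximation); an assumption.
    f = (np.arange(1, n + 1) - 0.3) / (n + 0.4)
    keep = (f >= lower) & (f <= upper)
    log_x = np.log(x[keep])
    log_hazard = np.log(-np.log1p(-f[keep]))  # ln(-ln(1 - F))
    k, intercept = np.polyfit(log_x, log_hazard, 1)
    lam = np.exp(-intercept / k)
    return k, lam


# Synthetic "initialization": i.i.d. Gaussian weights, so |w| is half-normal.
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(1024, 1024))  # std 0.02 is hypothetical
k, lam = weibull_probability_plot_fit(np.abs(w))
print(f"k = {k:.3f}, lambda = {lam:.5f}")
```

Rescaling the weights only shifts ln|w| by a constant, so the fitted slope k does not depend on the initialization standard deviation; only lambda changes. Applying the same function to each weight matrix at each layer and checkpoint would give the per-component, per-layer, and per-step values the abstract refers to.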
