It Takes a Good Model to Train a Good Model: Generalized Gaussian Priors for Optimized LLMs
Jun Wu, Patrick Huang, Jiangtao Wen, Yuxing Han

TL;DR
This paper models the weights, activations, and gradients of large language models with generalized Gaussian distributions, and uses that model to improve initialization, speed up training, and cut communication cost in distributed training, yielding smaller and faster models.
Contribution
It proposes a generalized Gaussian (GG) based initialization, ACT (a progressive activation-constrained training method), and GCT (a gradient-constrained training algorithm), advancing scalable and hardware-aware LLM training.
Findings
The weights, activations, and gradients of LLMs are well modeled by generalized Gaussian (GG) distributions.
The proposed methods yield smaller, faster models and substantially lower communication cost in distributed training.
Experiments show improved convergence and accuracy across architectures.
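As a rough illustration of what "well modeled by a generalized Gaussian" can mean in practice, the sketch below estimates the GG shape parameter of a set of zero-mean values by moment matching. The source does not say how the authors fit their distributions, so the method and the function names `gg_moment_ratio` and `estimate_gg_shape` are assumptions made for this example. It relies on the standard GG density, proportional to exp(-(|x|/alpha)^beta), for which E[x^2] / (E|x|)^2 = Γ(1/beta)Γ(3/beta) / Γ(2/beta)^2.

```python
import math
import random


def gg_moment_ratio(beta):
    """E[x^2] / (E|x|)^2 for a zero-mean generalized Gaussian with shape beta."""
    # Work in log space so small beta values do not overflow the gamma function.
    return math.exp(
        math.lgamma(1.0 / beta) + math.lgamma(3.0 / beta) - 2.0 * math.lgamma(2.0 / beta)
    )


def estimate_gg_shape(samples, lo=0.1, hi=20.0, iters=60):
    """Moment-matching estimate of the GG shape (assumes zero-mean samples)."""
    n = len(samples)
    mean_abs = sum(abs(x) for x in samples) / n
    mean_sq = sum(x * x for x in samples) / n
    target = mean_sq / (mean_abs ** 2)
    # The ratio decreases as beta grows, so plain bisection is enough.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if gg_moment_ratio(mid) > target:
            lo = mid  # ratio too large: the true shape is bigger than mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    rng = random.Random(0)
    # Stand-in for a flattened weight tensor; real values would come from a model.
    fake_weights = [rng.gauss(0.0, 0.02) for _ in range(20000)]
    print("estimated shape:", estimate_gg_shape(fake_weights))
```

A shape near 2 corresponds to a Gaussian and a shape near 1 to a Laplacian; smaller values indicate heavier tails and a sharper peak at zero.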
Abstract
Despite rapid progress in large language models (LLMs), the statistical structure of their weights, activations, and gradients, and its implications for initialization, training dynamics, and efficiency, remains largely unexplored. We empirically show that these quantities in LLMs are well modeled by generalized Gaussian (GG) distributions, and introduce a unified, end-to-end optimization framework grounded in this observation. Our contributions are threefold: (1) a GG-based initialization that aligns with trained model statistics, accelerating convergence and improving accuracy; (2) ACT, a progressive activation-constrained training method that reduces redundancy and propagation overhead; and (3) GCT, a gradient-constrained training algorithm that substantially lowers communication cost in distributed training. Experiments across diverse architectures demonstrate consistently smaller,…
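To make the idea of a GG-based initialization concrete, here is a minimal sketch that draws a weight matrix from a zero-mean generalized Gaussian with a chosen shape and standard deviation. It is not the authors' scheme: the function names `gg_scale_for_std` and `gg_init`, the default shape `beta=1.0`, and the default standard deviation `1/sqrt(fan_in)` are all assumptions for illustration. Sampling uses the standard fact that if G follows Gamma(1/beta, 1), then alpha * G^(1/beta) with a random sign follows a GG with scale alpha and shape beta.

```python
import math
import random


def gg_scale_for_std(std, beta):
    """Scale alpha such that a GG with shape beta has the given standard deviation.

    Uses var = alpha^2 * Gamma(3/beta) / Gamma(1/beta).
    """
    return std * math.exp(0.5 * (math.lgamma(1.0 / beta) - math.lgamma(3.0 / beta)))


def gg_init(fan_in, fan_out, beta=1.0, std=None, seed=0):
    """Return a fan_out x fan_in matrix of zero-mean GG samples (illustrative only)."""
    rng = random.Random(seed)
    if std is None:
        std = 1.0 / math.sqrt(fan_in)  # assumed default, not taken from the paper
    alpha = gg_scale_for_std(std, beta)

    def draw():
        # |x| = alpha * G^(1/beta) with G ~ Gamma(shape=1/beta, scale=1)
        magnitude = alpha * rng.gammavariate(1.0 / beta, 1.0) ** (1.0 / beta)
        return magnitude if rng.random() < 0.5 else -magnitude

    return [[draw() for _ in range(fan_in)] for _ in range(fan_out)]


if __name__ == "__main__":
    weights = gg_init(fan_in=64, fan_out=32, beta=1.0)
    print(len(weights), len(weights[0]))
```

In this sketch the shape would be picked to match the statistics observed in trained models, which is the alignment the abstract describes; how the paper selects it per layer is not stated in this summary.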
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Artificial Intelligence in Healthcare and Education
