Universal One-third Time Scaling in Learning Peaked Distributions
Yizhou Liu, Ziming Liu, Cengiz Pehlevan, Jeff Gore

TL;DR
This paper shows that the slow power-law convergence seen when training large language models can arise intrinsically from softmax and cross-entropy when learning peaked distributions, leading to power-law time scaling of the loss with a universal exponent of 1/3.
Contribution
It identifies the intrinsic cause of power-law convergence in LLM training and introduces a universal scaling law based on the use of softmax and cross-entropy.
Findings
Power-law vanishing losses and gradients arise from softmax and cross-entropy.
Loss scales with time with a universal exponent of 1/3.
Provides a mechanistic explanation for neural scaling laws.
Abstract
Training large language models (LLMs) is computationally expensive, partly because the loss exhibits slow power-law convergence whose origin remains debatable. Through systematic analysis of toy models and empirical evaluation of LLMs, we show that this behavior can arise intrinsically from the use of softmax and cross-entropy. When learning peaked probability distributions, e.g., next-token distributions, these components yield power-law vanishing losses and gradients, creating a fundamental optimization bottleneck. This ultimately leads to power-law time scaling of the loss with a universal exponent of 1/3. Our results provide a mechanistic explanation for observed neural scaling and suggest new directions for improving LLM training efficiency.
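The claim that softmax and cross-entropy yield power-law vanishing losses and gradients on peaked targets can be illustrated with a deliberately minimal sketch. This is not the paper's toy model: it assumes a single trainable logit margin, a two-class softmax, a one-hot target, and plain gradient descent, and the function names and hyperparameters are hypothetical. In this simplest setting the loss decays roughly as 1/t, so it does not reproduce the one-third exponent from the abstract. It only shows that the loss and its gradient shrink as a power law in time rather than exponentially.

```python
import math


def train_two_class(steps=100_000, lr=0.1):
    """Gradient descent on one logit margin m for a one-hot target.

    The loss is the cross-entropy of a 2-class softmax,
    L(m) = log(1 + exp(-m)), which only reaches 0 as m -> infinity.
    """
    m = 0.0
    history = []  # entries are (step, loss after update, |gradient| before update)
    for t in range(1, steps + 1):
        grad = -1.0 / (1.0 + math.exp(m))  # dL/dm = -sigmoid(-m)
        m -= lr * grad
        loss = math.log1p(math.exp(-m))
        history.append((t, loss, abs(grad)))
    return history


def log_log_slope(history, t_start, t_end):
    """Slope of log(loss) versus log(step) between two steps (1-indexed)."""
    t0, l0, _ = history[t_start - 1]
    t1, l1, _ = history[t_end - 1]
    return (math.log(l1) - math.log(l0)) / (math.log(t1) - math.log(t0))


history = train_two_class()
slope = log_log_slope(history, 10_000, 100_000)
print(f"late-time log-log slope of loss: {slope:.3f}")
```

A late-time slope close to -1 on log-log axes indicates power-law decay of the loss in this toy. The gradient magnitude stored in the same history shrinks at a comparable rate, which is the bottleneck the abstract describes: as the predicted distribution sharpens toward the target, the training signal itself vanishes.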
Taxonomy
Topics: Topic Modeling · Generative Adversarial Networks and Image Synthesis · Natural Language Processing Techniques
