Universal One-third Time Scaling in Learning Peaked Distributions

Yizhou Liu; Ziming Liu; Cengiz Pehlevan; Jeff Gore

arXiv:2602.03685·cs.LG·February 4, 2026

Open Access

TL;DR

This paper argues that the slow power-law convergence seen when training large language models can arise intrinsically from softmax and cross-entropy when learning peaked distributions, which leads to power-law time scaling of the loss with a universal exponent of 1/3.

Contribution

The paper identifies the use of softmax with cross-entropy as an intrinsic source of power-law convergence in LLM training and introduces a time-scaling law for the loss with a universal exponent of 1/3.

Findings

01. Power-law vanishing losses and gradients arise from softmax and cross-entropy.

02. Loss scales with time with a universal exponent of 1/3.

03. The results provide a mechanistic explanation for neural scaling laws.
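
A time-scaling exponent such as the 1/3 in the second finding is usually read off a loss curve as the slope of a straight line in log-log coordinates. The sketch below shows one way to do that fit. It is not taken from the paper: the function name `fit_power_law_exponent` is hypothetical, the curve is synthetic and built to follow t^(-1/3) exactly, and the fit assumes a pure power law with no irreducible loss offset.

```python
import math

def fit_power_law_exponent(steps, losses):
    """Return the least-squares slope of log(loss) against log(step).

    Hypothetical helper for illustration. Assumes loss ~ c * step**slope
    with no additive offset.
    """
    xs = [math.log(s) for s in steps]
    ys = [math.log(v) for v in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# Synthetic curve constructed to follow t^(-1/3); NOT data from the paper.
steps = [10 ** k for k in range(1, 7)]
losses = [2.0 * t ** (-1.0 / 3.0) for t in steps]

exponent = fit_power_law_exponent(steps, losses)
print(f"fitted exponent: {exponent:.4f}")
```

On a real training run the loss typically approaches a nonzero floor, so the floor would have to be estimated and subtracted before a fit like this is meaningful.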

Abstract

Training large language models (LLMs) is computationally expensive, partly because the loss exhibits slow power-law convergence whose origin remains debatable. Through systematic analysis of toy models and empirical evaluation of LLMs, we show that this behavior can arise intrinsically from the use of softmax and cross-entropy. When learning peaked probability distributions, e.g., next-token distributions, these components yield power-law vanishing losses and gradients, creating a fundamental optimization bottleneck. This ultimately leads to power-law time scaling of the loss with a universal exponent of $1/3$. Our results provide a mechanistic explanation for observed neural scaling and suggest new directions for improving LLM training efficiency.
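
The vanishing-gradient effect mentioned in the abstract can be seen in the standard identity for softmax with cross-entropy: the gradient of the loss with respect to the logits is softmax(z) minus the target distribution, so it shrinks as the prediction approaches the target. The sketch below illustrates this for a one-hot target, used here as the limiting case of a peaked distribution. The helper names, the four-token vocabulary, and the logit margins are all assumptions made for illustration; this is not the paper's toy model, and it does not reproduce the 1/3 exponent.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, logits):
    """Cross-entropy between a target distribution and softmax(logits)."""
    q = softmax(logits)
    return -sum(p * math.log(qi) for p, qi in zip(target, q) if p > 0)

def logit_gradient(target, logits):
    """Gradient of the cross-entropy w.r.t. the logits: softmax(z) - target."""
    q = softmax(logits)
    return [qi - p for p, qi in zip(target, q)]

# One-hot target over a hypothetical 4-token vocabulary.
target = [1.0, 0.0, 0.0, 0.0]

ce_values = []
grad_norms = []
for margin in (0.0, 2.0, 4.0, 8.0):
    # The correct token's logit leads the others by `margin`.
    logits = [margin, 0.0, 0.0, 0.0]
    grad = logit_gradient(target, logits)
    ce_values.append(cross_entropy(target, logits))
    grad_norms.append(math.sqrt(sum(g * g for g in grad)))

for ce, gn in zip(ce_values, grad_norms):
    print(f"loss={ce:.6f}  grad_norm={gn:.6f}")
```

Both the loss and the gradient norm fall as the margin grows, which means the remaining error produces an ever weaker learning signal. How this turns into a specific time-scaling exponent depends on the model and optimizer dynamics analyzed in the paper.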

Taxonomy

Topics: Topic Modeling · Generative Adversarial Networks and Image Synthesis · Natural Language Processing Techniques