Finite-Size Gradient Transport in Large Language Model Pretraining: From Cascade Size to Intensive Transport Efficiency

Ping Wang; Yan-Qi Du

arXiv:2605.02968 · cs.LG · May 6, 2026


TL;DR

This paper introduces a finite-size gradient-transport framework for analyzing language-model training and finds that Pico-LM and Pythia share a near-unity cascade-size backbone but occupy distinct transport regimes with different scaling behavior.

Contribution

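It develops an algebraic framework of five transport observables for language-model training and applies it to Pico-LM (raw-gradient measurements across four scales and 125 aligned steps) and Pythia (five scales, 153 aligned checkpoint-difference update fields), highlighting differences in their transport regimes and scaling laws.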

Findings

01

Pico-LM shows positive duration scaling and negative intensive-efficiency scaling.

02

Pythia remains near the $D = 1$ cascade-size baseline, with only weak positive dependence of efficiency on scale.

03

Transport measures correlate with external performance metrics.

Abstract

We introduce a finite-size gradient-transport framework for real language-model training, based on five observables $(D, z, \beta, \delta, v_{\mathrm{rel}})$ that separate cascade size, duration, absolute transport, and intensive transport efficiency. We analyze direct raw-gradient measurements from Pico-LM across four scales and 125 aligned steps, together with a five-scale Pythia companion dataset built from 153 aligned checkpoint-difference update fields. The same algebraic closure holds in both families, and both share a near-unity cascade-size backbone, but they occupy distinct transport regimes: Pico-LM shows positive duration scaling and negative intensive-efficiency scaling, whereas Pythia remains near the $D = 1$ baseline with only weak positive efficiency scale dependence. Randomized-field controls give nearly matched null floors in the intensive and duration channels, indicating…
