Finite-Size Gradient Transport in Large Language Model Pretraining: From Cascade Size to Intensive Transport Efficiency
Ping Wang, Yan-Qi Du

TL;DR
This paper introduces a finite-size gradient-transport framework for analyzing language-model training and finds that Pico-LM and Pythia occupy distinct transport regimes with different scaling behavior.
Contribution
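It defines five observables that separate cascade size, duration, absolute transport, and intensive transport efficiency, and applies them to Pico-LM (direct raw-gradient measurements) and Pythia (checkpoint-difference update fields). The same algebraic closure holds in both families, but their transport regimes differ.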
Findings
Pico-LM shows positive duration scaling and negative efficiency scaling.
Pythia remains near the baseline, with only a weak positive dependence of efficiency on scale.
Transport measures correlate with external performance metrics.
Abstract
We introduce a finite-size gradient-transport framework for real language-model training, based on five observables that separate cascade size, duration, absolute transport, and intensive transport efficiency. We analyze direct raw-gradient measurements from Pico-LM across four scales and 125 aligned steps, together with a five-scale Pythia companion dataset built from 153 aligned checkpoint-difference update fields. The same algebraic closure holds in both families, and both share a near-unity cascade-size backbone, but they occupy distinct transport regimes: Pico-LM shows positive duration scaling and negative intensive-efficiency scaling, whereas Pythia remains near the baseline with only weak positive efficiency scale dependence. Randomized-field controls give nearly matched null floors in the intensive and duration channels, indicating…
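As an illustration of two ingredients the abstract names, the sketch below builds a checkpoint-difference update field and a randomized-field control. It is a minimal sketch under stated assumptions, not the authors' procedure: the helper names `update_field` and `shuffled_control` are hypothetical, the toy parameters stand in for real checkpoints, and shuffling entries within each tensor is only one plausible way to randomize a field.

```python
import numpy as np

def update_field(params_before, params_after):
    """Per-tensor difference between two aligned checkpoints (assumed definition)."""
    return {name: params_after[name] - params_before[name] for name in params_before}

def shuffled_control(field, rng):
    """Hypothetical null control: permute entries within each tensor.

    This keeps each tensor's value distribution but destroys any
    arrangement of the values inside the tensor.
    """
    return {
        name: rng.permutation(delta.ravel()).reshape(delta.shape)
        for name, delta in field.items()
    }

# Toy stand-ins for two consecutive checkpoints (not real model weights).
rng = np.random.default_rng(0)
before = {"w": rng.normal(size=(4, 3)), "b": rng.normal(size=3)}
after = {k: v + 0.01 * rng.normal(size=v.shape) for k, v in before.items()}

field = update_field(before, after)
control = shuffled_control(field, rng)
```

Any observable computed on `field` can then be recomputed on `control` to estimate a null floor, which is the role the abstract assigns to its randomized-field controls.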
