Beyond Perplexity: A Geometric and Spectral Study of Low-Rank Pre-Training
Namrata Shivagunde, Vijeta Deshpande, Sherin Muckatira, Anna Rumshisky

TL;DR
This study compares five low-rank pre-training methods for large language models against full-rank training and finds that they converge to distinct solutions, with different geometric and spectral properties, even when validation perplexity is similar.
Contribution
The paper provides a comprehensive geometric and spectral analysis of five low-rank pre-training methods versus full-rank training across multiple scales, highlighting their differences beyond perplexity.
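To make "spectral analysis" concrete, here is a minimal sketch of one quantity such an analysis could look at: the effective rank of a weight matrix, computed from the entropy of its normalized singular values. This summary does not say which spectral quantities the paper reports, so the metric, the function name `effective_rank`, and the toy matrices are all assumptions for illustration.

```python
import numpy as np


def effective_rank(weight: np.ndarray, eps: float = 1e-12) -> float:
    """Entropy-based effective rank: exp(H(p)), where p is the
    singular-value distribution normalized to sum to 1.

    Illustrative metric only; not confirmed to be the paper's choice.
    """
    s = np.linalg.svd(weight, compute_uv=False)
    p = s / (s.sum() + eps)
    p = p[p > eps]  # drop numerically zero singular values
    entropy = -(p * np.log(p)).sum()
    return float(np.exp(entropy))


# Toy comparison (hypothetical data): a dense Gaussian matrix versus a
# matrix constrained to rank r by a product of two thin factors.
rng = np.random.default_rng(0)
d, r = 64, 4
full_weight = rng.standard_normal((d, d))
low_rank_weight = rng.standard_normal((d, r)) @ rng.standard_normal((r, d))

full_erank = effective_rank(full_weight)
low_erank = effective_rank(low_rank_weight)
print(f"full: {full_erank:.2f}, low-rank: {low_erank:.2f}")
```

The effective rank of the factored matrix cannot exceed `r`, while the dense matrix spreads its spectrum over many more directions. Two models with similar perplexity can still differ sharply on a measure like this.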
Findings
Low-rank methods converge to different loss landscape basins than full-rank training.
Activations in later layers diverge from those of the full-rank solution, though some methods track it more closely than others.
Perplexity alone is insufficient for predicting downstream performance; geometric metrics are more informative.
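The finding about diverging activations can be illustrated with a representation-similarity score. The sketch below uses linear centered kernel alignment (CKA), a common choice for comparing the activations of two networks on the same inputs. This summary does not name the paper's similarity measure, so linear CKA, the function name `linear_cka`, and the synthetic activations are assumptions.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two activation matrices of shape
    (n_examples, n_features). Rows must correspond to the same inputs.

    Returns a value in [0, 1]; 1 means the representations match up to
    an orthogonal transform and isotropic scaling.
    """
    x = x - x.mean(axis=0, keepdims=True)  # center each feature
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


# Synthetic stand-ins for layer activations (hypothetical data).
rng = np.random.default_rng(0)
n, d = 200, 16
acts_full = rng.standard_normal((n, d))

# A rotated and rescaled copy: the same representation in a different basis.
q, _ = np.linalg.qr(rng.standard_normal((d, d)))
acts_rotated = 2.5 * acts_full @ q

# An unrelated representation of the same inputs.
acts_unrelated = rng.standard_normal((n, d))

cka_same = linear_cka(acts_full, acts_rotated)
cka_diff = linear_cka(acts_full, acts_unrelated)
print(f"rotated copy: {cka_same:.3f}, unrelated: {cka_diff:.3f}")
```

In a layer-wise study, one would compute such a score per layer between a low-rank model and the full-rank baseline. A score that drops in later layers would match the divergence described in the findings.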
Abstract
Pre-training large language models is dominated by the memory cost of storing full-rank weights, gradients, and optimizer states. Low-rank pre-training has emerged to address this, and the space of methods has grown rapidly. A central question remains open: do low-rank methods produce models that generalize comparably to full-rank training, or does the rank constraint fundamentally alter the solutions reached? Existing comparisons rely almost entirely on validation perplexity from single-seed runs, often carried forward from prior literature. Yet perplexity is a poor proxy for solution quality; two methods can match on perplexity while converging to different loss landscape regions and internal representations. We close this gap by characterizing the solutions found by five low-rank pre-training methods, GaLore and Fira (memory-efficient optimizers), CoLA and SLTrain (architecture…
