Beyond Perplexity: A Geometric and Spectral Study of Low-Rank Pre-Training

Namrata Shivagunde; Vijeta Deshpande; Sherin Muckatira; Anna Rumshisky

arXiv:2605.13652 · cs.LG · May 20, 2026

TL;DR

This study compares five low-rank pre-training methods for large language models against full-rank training and finds that, despite similar perplexity, they converge to distinct solutions with different geometric and spectral properties.

Contribution

The paper provides a comprehensive geometric and spectral analysis of five low-rank pre-training methods versus full-rank training across multiple scales, highlighting their differences beyond perplexity.

Findings

01

Low-rank methods converge to different loss landscape basins than full-rank training.

02

Activations in later layers diverge from those of the full-rank solution, though some methods track it more closely than others.

03

Perplexity alone is insufficient for evaluating downstream performance; geometric metrics are more informative.

Abstract

Pre-training large language models is dominated by the memory cost of storing full-rank weights, gradients, and optimizer states. Low-rank pre-training has emerged to address this, and the space of methods has grown rapidly. A central question remains open: do low-rank methods produce models that generalize comparably to full-rank training, or does the rank constraint fundamentally alter the solutions reached? Existing comparisons rely almost entirely on validation perplexity from single-seed runs, often carried forward from prior literature. Yet perplexity is a poor proxy for solution quality; two methods can match on perplexity while converging to different loss landscape regions and internal representations. We close this gap by characterizing the solutions found by five low-rank pre-training methods, GaLore and Fira (memory-efficient optimizers), CoLA and SLTrain (architecture…
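To make the memory argument in the abstract's opening sentence concrete, the sketch below counts the values stored for one weight matrix under full-rank training and under a generic rank-r factorization W ≈ BA. It is an illustration only and does not reproduce any of the five methods studied. The function names, the layer shape, the rank, and the choice of Adam (which keeps two moment estimates per parameter) are all assumptions made for the example.

```python
# Hypothetical illustration: values stored for one linear layer during training.
# None of these names or sizes come from the paper.

def full_rank_params(d_out: int, d_in: int) -> int:
    """Parameter count of a dense d_out x d_in weight matrix."""
    return d_out * d_in


def low_rank_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameter count of a factorization B (d_out x rank) times A (rank x d_in)."""
    return rank * (d_out + d_in)


def adam_training_values(n_params: int) -> int:
    """Values held during training with Adam (assumed optimizer):
    weights + gradients + first moment + second moment."""
    return 4 * n_params


# Assumed layer shape and rank, chosen only for illustration.
d_out, d_in, rank = 4096, 4096, 256

full = full_rank_params(d_out, d_in)
low = low_rank_params(d_out, d_in, rank)

print("full-rank parameters:", full)
print("rank-256 parameters: ", low)
print("reduction factor:    ", full / low)
print("training values (full, low):",
      adam_training_values(full), adam_training_values(low))
```

The count shows why a rank constraint is attractive when r is much smaller than the layer dimensions: the stored values scale with r(d_out + d_in) rather than d_out · d_in. It says nothing about whether the resulting solutions match full-rank training, which is the question the paper examines.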
