Scale Determines Whether Language Models Organize Representation Geometry for Prediction

Weilun Xu

arXiv:2605.17084·cs.LG·May 19, 2026

TL;DR

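This paper introduces Subspace PGA, a metric that tests whether a layer's distance structure aligns with the readout subspace of the unembedding matrix more than with random subspaces of equal size. Applied across model scales, it shows that small models lose this predictive organization at late layers during training, while large models preserve it.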

Contribution

The study presents Subspace PGA, a metric for assessing how well a layer's representation geometry aligns with the prediction readout, and applies it to seven Pythia models (70M to 6.9B) and three cross-family models to show that the degree of alignment depends on scale.

Findings

01

Small models ($d \leq 1024$) progressively lose predictive geometry organization at late layers during training, even as loss keeps improving.

02

Large models ($d \geq 2048$) preserve predictive geometry organization throughout training.

03

Removing dominant directions restores alignment with the readout subspace.
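
As a rough illustration of the third finding, the sketch below projects a few dominant directions out of a matrix of layer activations. The summary does not say how the paper identifies dominant directions; this example assumes they are the top principal directions of the centered activations, and the function name `remove_dominant_directions` and the parameter `r` are hypothetical.

```python
import numpy as np

def remove_dominant_directions(H, r=1):
    """Project the top-r principal directions out of the rows of H.

    H is an (n_tokens, d) array of activations. Assumption: "dominant"
    means highest variance after centering, which may differ from the
    paper's definition.
    """
    Hc = H - H.mean(axis=0, keepdims=True)        # center the activations
    _, _, Vt = np.linalg.svd(Hc, full_matrices=False)
    top = Vt[:r]                                  # (r, d), orthonormal rows
    return Hc - (Hc @ top.T) @ top                # subtract the projection
```

The returned activations are centered and have no component along the removed directions, so any alignment score computed afterwards reflects only the remaining geometry.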

Abstract

In language models, what a representation encodes is determined by the geometry of its representation space: distances, not activations, carry meaning. Existing tools characterize the shape of this geometry but do not ask what that shape is organized for. We introduce Subspace PGA, a metric that tests whether a layer's distance structure aligns with the readout subspace of the unembedding matrix $W_{U}$ more than with random subspaces of equal size. Across seven Pythia models (70M--6.9B) and three cross-family models, intermediate geometry is significantly organized for prediction (peak $z = 9$--$24$), but the degree is scale-dependent: small models ($d \leq 1024$) progressively lose it at late layers during training, even as loss keeps improving, while large models ($d \geq 2048$) preserve it throughout. We trace this to a capacity trade-off: a few dominant directions migrate away…
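
The comparison against random subspaces of equal size can be sketched as a $z$-score test. This is a minimal illustration under stated assumptions, not the paper's definition of Subspace PGA: it assumes the readout subspace is spanned by the top-$k$ right singular vectors of $W_{U}$, and that alignment is the Pearson correlation between pairwise squared distances in the full space and in the subspace. The names `subspace_alignment_z`, `k`, and `n_random` are hypothetical.

```python
import numpy as np

def pairwise_sq_dists(X):
    """Upper-triangle squared Euclidean distances between rows of X."""
    G = X @ X.T
    sq = np.diag(G)
    D = sq[:, None] + sq[None, :] - 2.0 * G
    return D[np.triu_indices(X.shape[0], k=1)]

def subspace_alignment_z(H, W_U, k=8, n_random=200, seed=0):
    """z-score of readout-subspace alignment against random k-dim subspaces.

    H:   (n_tokens, d) layer activations.
    W_U: (vocab, d) unembedding matrix.
    Assumptions (not from the paper): the readout subspace is the span of
    the top-k right singular vectors of W_U, and alignment is a Pearson
    correlation of pairwise squared distances.
    """
    rng = np.random.default_rng(seed)
    H = H - H.mean(axis=0, keepdims=True)
    full = pairwise_sq_dists(H)

    def score(basis):
        # basis is (d, k) with orthonormal columns
        return np.corrcoef(full, pairwise_sq_dists(H @ basis))[0, 1]

    _, _, Vt = np.linalg.svd(W_U, full_matrices=False)
    observed = score(Vt[:k].T)

    # Null distribution: random orthonormal k-dim subspaces of the same space
    null = np.empty(n_random)
    for i in range(n_random):
        Q, _ = np.linalg.qr(rng.standard_normal((H.shape[1], k)))
        null[i] = score(Q)

    z = (observed - null.mean()) / null.std(ddof=1)
    return float(z), float(observed), null
```

Under these assumptions, a $z$ near zero means the readout subspace preserves the layer's distance structure no better than a random subspace of the same dimension, while a large positive $z$ indicates geometry organized along the readout directions.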
