Optimal Representation Size: High-Dimensional Analysis of Pretraining and Linear Probing

Valentina Njaradi; Clémentine Dominé; Rachel Swanson; Marco Mondelli; Andrew Saxe

arXiv:2605.20105·cs.LG·May 20, 2026

TL;DR

This paper develops an analytical model for understanding how the size of learned representations affects generalisation in pretraining and linear probing, revealing optimal strategies depending on data availability.

Contribution

It provides a high-dimensional analytical framework linking representation size, data quantities, and task alignment, guiding optimal pretraining and probing strategies.

Findings

01. Maximally compressed representations are optimal when pretraining data is abundant and downstream data is scarce.

02. Higher-dimensional representations generalise better when pretraining data is limited.

03. An exact trade-off quantifies how much unlabelled data is needed to replace labelled samples.

Abstract

Learning to generalise from limited data is a fundamental challenge for both artificial and biological systems. A common strategy is to extract reusable structure from abundant unlabelled data, enabling efficient adaptation to new tasks from limited labelled data. This two-stage paradigm is now standard in modern training pipelines, where pretraining is followed by fine-tuning or linear probing. We provide an analytical model of this process: structure extraction is formalised as principal component analysis on unlabelled data, and downstream learning as linear regression on a separate labelled dataset. In the high-dimensional regime, we derive exact expressions for training and generalisation error, showing how they depend on representation dimensionality, unlabelled and labelled sample sizes, and task alignment. Our results show that pretrained representations strongly influence…
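
The two-stage setup described in the abstract (PCA on unlabelled data, then linear regression on a separate labelled set) can be illustrated with a minimal NumPy sketch. This is not the authors' code or experimental setup: the function names (`fit_pca`, `fit_probe`, `simulate`), the spiked-covariance synthetic data, the noise level, the sample sizes, and the small ridge term are all assumptions chosen only to make the example self-contained.

```python
import numpy as np


def fit_pca(X_unlab, k):
    """Stage 1: return the mean and top-k principal directions of unlabelled data."""
    mean = X_unlab.mean(axis=0)
    cov = np.cov(X_unlab - mean, rowvar=False)
    _, eigvecs = np.linalg.eigh(cov)          # eigenvalues come back in ascending order
    return mean, eigvecs[:, ::-1][:, :k]      # keep the k largest-variance directions


def fit_probe(Z, y, ridge=1e-8):
    """Stage 2: least-squares probe on k-dim features (tiny ridge is an assumption,
    added only for numerical stability)."""
    k = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + ridge * np.eye(k), Z.T @ y)


def simulate(k, n_unlab=5000, n_lab=20, n_test=2000, d=40, seed=0):
    """Test MSE of PCA-then-probe on a hypothetical spiked-covariance task."""
    rng = np.random.default_rng(seed)
    scales = np.ones(d)
    scales[:3] = [5.0, 4.0, 3.0]              # three high-variance directions (assumed)
    beta = np.zeros(d)
    beta[:3] = 1.0                            # task aligned with those directions (assumed)

    def sample(n):
        X = rng.standard_normal((n, d)) * scales
        y = X @ beta + 0.1 * rng.standard_normal(n)
        return X, y

    X_unlab, _ = sample(n_unlab)              # labels are discarded for pretraining
    X_lab, y_lab = sample(n_lab)
    X_test, y_test = sample(n_test)

    mean, U = fit_pca(X_unlab, k)
    w = fit_probe((X_lab - mean) @ U, y_lab)
    pred = (X_test - mean) @ U @ w
    return float(np.mean((pred - y_test) ** 2))


if __name__ == "__main__":
    # Sweep the representation size k and print the resulting test error.
    for k in (1, 3, 10):
        print(k, simulate(k))
```

Varying `k`, `n_unlab`, and `n_lab` in `simulate` gives an empirical feel for the dependence on representation size and sample sizes that the paper characterises analytically. In this toy configuration, where unlabelled data is plentiful and the task lies in the top three directions, `k=3` should yield a much lower test error than `k=1`.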
