Optimal Representation Size: High-Dimensional Analysis of Pretraining and Linear Probing
Valentina Njaradi, Clémentine Dominé, Rachel Swanson, Marco Mondelli, Andrew Saxe

TL;DR
This paper develops an analytical model for understanding how the size of learned representations affects generalisation in pretraining and linear probing, revealing optimal strategies depending on data availability.
Contribution
It provides a high-dimensional analytical framework linking representation size, data quantities, and task alignment, guiding optimal pretraining and probing strategies.
Findings
Maximally compressed representations are optimal with abundant pretraining data and scarce downstream data.
Higher-dimensional representations generalise better when pretraining data is limited.
An exact trade-off quantifies how much unlabelled data is needed to replace labelled samples.
Abstract
Learning to generalise from limited data is a fundamental challenge for both artificial and biological systems. A common strategy is to extract reusable structure from abundant unlabelled data, enabling efficient adaptation to new tasks from limited labelled data. This two-stage paradigm is now standard in modern training pipelines, where pretraining is followed by fine-tuning or linear probing. We provide an analytical model of this process: structure extraction is formalised as principal component analysis on unlabelled data, and downstream learning as linear regression on a separate labelled dataset. In the high-dimensional regime, we derive exact expressions for training and generalisation error, showcasing their dependence on representation dimensionality, unlabelled and labelled sample sizes, and task alignment. Our results show that pretrained representations strongly influence…
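The two-stage setup described in the abstract (PCA on unlabelled data, then linear regression on the projected labelled data) can be illustrated with a small simulation. This is a minimal sketch, not the paper's model or code: the spiked-covariance Gaussian data, the target vector lying entirely in the high-variance subspace, and every name and default value (`pca_probe_test_mse`, `rep_dim`, `signal_dim`, `n_unlabelled`, `n_labelled`) are assumptions chosen for illustration.

```python
import numpy as np


def pca_probe_test_mse(rep_dim, d=50, signal_dim=5, n_unlabelled=2000,
                       n_labelled=40, noise=0.1, seed=0):
    """Test error of a linear probe trained on a PCA representation.

    Assumed (hypothetical) data model: zero-mean Gaussian inputs whose
    first `signal_dim` coordinates have standard deviation 5 and the rest
    have standard deviation 1; the target depends only on those first
    coordinates, i.e. the task is perfectly aligned with the top subspace.
    """
    rng = np.random.default_rng(seed)
    scales = np.ones(d)
    scales[:signal_dim] = 5.0
    w_true = np.zeros(d)
    w_true[:signal_dim] = rng.normal(size=signal_dim)

    def sample(n):
        X = rng.normal(size=(n, d)) * scales
        y = X @ w_true + noise * rng.normal(size=n)
        return X, y

    # Stage 1 ("pretraining"): PCA on unlabelled inputs only.
    X_u, _ = sample(n_unlabelled)
    mu = X_u.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_u - mu, full_matrices=False)
    P = Vt[:rep_dim].T  # d x rep_dim projection onto top components

    # Stage 2 ("linear probing"): least squares on the projected labelled set.
    X_l, y_l = sample(n_labelled)
    beta, *_ = np.linalg.lstsq((X_l - mu) @ P, y_l, rcond=None)

    # Generalisation error estimated on fresh samples.
    X_t, y_t = sample(5000)
    pred = (X_t - mu) @ P @ beta
    return float(np.mean((pred - y_t) ** 2))
```

Sweeping `rep_dim` while varying `n_unlabelled` and `n_labelled` gives an empirical feel for the trade-offs listed under Findings; the exact expressions themselves come from the paper's analysis, which this sketch does not reproduce.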
