A Bayesian Model Selection Criterion for Selecting Pretraining Checkpoints
Michael Munn, Susan Wei

TL;DR
This paper proposes a Bayesian model selection criterion, the downstream free energy, for identifying the pretraining checkpoints most adaptable to downstream tasks, without requiring access to downstream data.
Contribution
It introduces a novel Bayesian criterion for selecting pretraining checkpoints that correlates with downstream performance and does not need downstream data or prior task knowledge.
Findings
The criterion reliably predicts better finetuning outcomes.
It correlates well with downstream task adaptability.
The method is effective across different models and tasks.
Abstract
Recent advances in artificial intelligence have been fueled by the development of foundation models such as BERT, GPT, T5, and Vision Transformers. These models are first pretrained on vast and diverse datasets and then adapted to specific downstream tasks, often with significantly less data. However, the mechanisms behind the success of this ubiquitous pretrain-then-adapt paradigm remain underexplored, particularly the characteristics of pretraining checkpoints that enhance downstream adaptation. We introduce a Bayesian model selection criterion, called the downstream free energy, which quantifies a checkpoint's adaptability by measuring the concentration of nearby favorable parameters for the downstream task. We demonstrate that this Bayesian model selection criterion can be effectively implemented without access to the downstream data or prior knowledge of the downstream task.…
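The abstract describes the criterion only in words, so the following is a toy illustration of the idea of "concentration of nearby favorable parameters", not the paper's estimator. It assumes a local free energy of the form -log of the integral of exp(-n * loss(w)) over a ball around the checkpoint, evaluated by simple quadrature in one dimension. The function name `local_free_energy`, the quadratic losses, and the values of `radius` and `n` are all hypothetical choices made for this sketch.

```python
import numpy as np

def local_free_energy(loss, w_star, radius, n, num_points=20001):
    """Toy local free energy: -log of the integral of exp(-n * loss(w))
    over the interval |w - w_star| <= radius (assumed form, 1-D only)."""
    w = np.linspace(w_star - radius, w_star + radius, num_points)
    log_integrand = -n * loss(w)
    # Log-sum-exp trick for numerical stability; dw is the quadrature weight.
    m = log_integrand.max()
    dw = w[1] - w[0]
    log_integral = m + np.log(np.sum(np.exp(log_integrand - m)) * dw)
    return -log_integral

# Two hypothetical checkpoints with the same minimum loss (zero) at w = 0,
# but different curvature around that minimum.
flat_loss = lambda w: 0.5 * 1.0 * w**2      # broad basin
sharp_loss = lambda w: 0.5 * 100.0 * w**2   # narrow basin

f_flat = local_free_energy(flat_loss, w_star=0.0, radius=1.0, n=50)
f_sharp = local_free_energy(sharp_loss, w_star=0.0, radius=1.0, n=50)

# Lower free energy means more low-loss volume near the checkpoint.
print(f"flat basin:  {f_flat:.3f}")
print(f"sharp basin: {f_sharp:.3f}")
```

In this sketch both checkpoints reach the same loss, yet the broad basin has the lower free energy because more of its neighborhood has low loss. That is the sense in which a free-energy criterion can rank checkpoints that a loss value alone cannot separate.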
Taxonomy
Topics: Speech Recognition and Synthesis · Engineering Applied Research · Nuclear and radioactivity studies
Methods: Attention Is All You Need · Gated Linear Unit · WordPiece · SentencePiece · Inverse Square Root Schedule · Linear Layer · Cosine Annealing · Adafactor · Linear Warmup With Linear Decay
