Weight Decay Improves Language Model Plasticity
Tessa Han, Sebastian Bordt, Hanlin Zhang, Sham Kakade

TL;DR
This paper investigates how weight decay during pretraining influences language model plasticity, showing that higher weight decay enhances downstream adaptability by promoting more linearly separable representations and regularizing attention matrices.
Contribution
The paper shows that larger weight decay values during pretraining improve model plasticity, leading to larger performance gains from fine-tuning, and it highlights the importance of evaluating base models on more than validation loss.
Findings
Models pretrained with larger weight decay are more adaptable to downstream tasks.
Weight decay encourages linearly separable representations and regularizes attention matrices.
Pretraining with higher weight decay can improve fine-tuning outcomes even when the base model performs worse before fine-tuning.
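To make the findings above concrete, here is a minimal sketch of how a weight decay coefficient enters a single parameter update. It assumes plain SGD with decoupled weight decay; the function names, learning rate, and decay values are hypothetical and chosen for illustration only, and this page does not say which optimizer or settings the paper used.

```python
# Illustrative sketch only: one SGD step with decoupled weight decay.
# The optimizer, learning rate, and decay values are assumptions made
# for illustration, not the paper's actual training setup.


def sgd_step_with_weight_decay(weights, grads, lr, weight_decay):
    """Return updated weights after one step.

    Decoupled weight decay scales each weight by (1 - lr * weight_decay)
    and then applies the usual gradient step.
    """
    shrink = 1.0 - lr * weight_decay
    return [shrink * w - lr * g for w, g in zip(weights, grads)]


def l2_norm(values):
    """Euclidean norm of a list of floats."""
    return sum(v * v for v in values) ** 0.5


if __name__ == "__main__":
    weights = [1.0, -2.0, 0.5]
    zero_grads = [0.0, 0.0, 0.0]  # zero gradients isolate the decay term
    for wd in (0.0, 0.1, 1.0):
        w = list(weights)
        for _ in range(100):
            w = sgd_step_with_weight_decay(w, zero_grads, lr=0.1, weight_decay=wd)
        print(f"weight_decay={wd}: norm after 100 steps = {l2_norm(w):.4f}")
```

With zero gradients, the loop shows only the shrinkage effect: a larger coefficient pulls the weight norm toward zero faster. In real training the gradient term counteracts this pull, so the coefficient controls how strongly the weights are regularized.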
Abstract
The prevailing paradigm in large language model (LLM) development is to pretrain a base model, then perform further training to improve performance and model behavior. However, hyperparameter optimization and scaling laws have been studied primarily from the perspective of the base model's validation loss, ignoring downstream adaptability. In this work, we study pretraining from the perspective of model plasticity, that is, the ability of the base model to successfully adapt to downstream tasks through fine-tuning. We focus on the role of weight decay, a key regularization parameter during pretraining. Through systematic experiments, we show that models trained with larger weight decay values are more plastic, meaning they show larger performance gains when fine-tuned on downstream tasks. This phenomenon can lead to counterintuitive trade-offs where base models that perform worse after…
Taxonomy
Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Machine Learning and Data Classification
