Weight Decay Improves Language Model Plasticity

Tessa Han; Sebastian Bordt; Hanlin Zhang; Sham Kakade

arXiv:2602.11137 · cs.LG · February 12, 2026

Open Access

TL;DR

This paper studies how weight decay during pretraining affects language model plasticity. It shows that higher weight decay improves downstream adaptability, and links this to more linearly separable representations and regularized attention matrices.

Contribution

The paper shows that larger weight decay values during pretraining make models more plastic, which leads to better fine-tuning performance, and it argues that pretraining should be evaluated on more than validation loss.

Findings

01. Models pretrained with larger weight decay adapt better to downstream tasks.

02. Weight decay encourages linearly separable representations and regularizes attention matrices.

03. Pretraining with higher weight decay can improve fine-tuning outcomes even when the base model performs worse before fine-tuning.
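The trade-off in the last finding can be made concrete with a small sketch. It treats plasticity as the score gained through fine-tuning, assuming a metric where higher is better. The function name `plasticity_gain`, the run labels, and all numbers are hypothetical placeholders chosen for illustration; they are not results from the paper.

```python
def plasticity_gain(base_score: float, finetuned_score: float) -> float:
    """Score gained through fine-tuning, for a metric where higher is better."""
    return finetuned_score - base_score


# Made-up placeholder scores, NOT results from the paper.
runs = {
    "low_wd": {"base": 0.40, "finetuned": 0.55},
    "high_wd": {"base": 0.38, "finetuned": 0.60},
}

# Gain per run: how much each base model improved through fine-tuning.
gains = {
    name: plasticity_gain(r["base"], r["finetuned"]) for name, r in runs.items()
}

# Selecting by base score and by fine-tuned score can pick different runs.
best_base = max(runs, key=lambda name: runs[name]["base"])
best_finetuned = max(runs, key=lambda name: runs[name]["finetuned"])

print(gains, best_base, best_finetuned)
```

With these placeholder numbers, the run that looks best before fine-tuning is not the run that ends up best after it, which is the kind of reversal the finding describes.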

Abstract

The prevailing paradigm in large language model (LLM) development is to pretrain a base model, then perform further training to improve performance and model behavior. However, hyperparameter optimization and scaling laws have been studied primarily from the perspective of the base model's validation loss, ignoring downstream adaptability. In this work, we study pretraining from the perspective of model plasticity, that is, the ability of the base model to successfully adapt to downstream tasks through fine-tuning. We focus on the role of weight decay, a key regularization parameter during pretraining. Through systematic experiments, we show that models trained with larger weight decay values are more plastic, meaning they show larger performance gains when fine-tuned on downstream tasks. This phenomenon can lead to counterintuitive trade-offs where base models that perform worse after…
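As background on the hyperparameter itself, here is a minimal sketch of one update step with decoupled weight decay. It assumes plain SGD for simplicity; the excerpt above does not state which optimizer the paper uses. The names `sgd_step_with_weight_decay` and `l2_norm` and the values of `lr` and `weight_decay` are illustrative assumptions.

```python
def sgd_step_with_weight_decay(weights, grads, lr, weight_decay):
    """One SGD step with decoupled weight decay (assumed form, for illustration).

    Each weight moves against its gradient and is also shrunk toward zero
    by a factor proportional to lr * weight_decay.
    """
    return [w - lr * g - lr * weight_decay * w for w, g in zip(weights, grads)]


def l2_norm(values):
    """Euclidean norm of a list of floats."""
    return sum(v * v for v in values) ** 0.5


weights = [1.0, -2.0, 0.5]
zero_grads = [0.0, 0.0, 0.0]  # isolate the decay term by zeroing the gradient

decayed = weights
for _ in range(10):
    decayed = sgd_step_with_weight_decay(
        decayed, zero_grads, lr=0.1, weight_decay=0.5
    )

# With zero gradients, every step scales each weight by (1 - lr * weight_decay).
print(l2_norm(weights), l2_norm(decayed))
```

With the gradient zeroed out, only the decay term acts, so the weight norm shrinks geometrically. A larger `weight_decay` shrinks it faster, which is the regularization pressure the abstract refers to.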

Taxonomy

Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Machine Learning and Data Classification