Learnable Multipliers: Freeing the Scale of Language Model Matrix Layers
Maksim Velikanov, Ilyas Chahed, Jingwei Zuo, Dhia Eddine Rhaiem, Younes Belkada, Hakim Hacid

TL;DR
This paper introduces learnable scalar multipliers for matrix layers in language models to optimize weight scale, outperforming fixed scaling methods and improving downstream task performance.
Contribution
It proposes a novel learnable scaling approach for matrix layers, generalizing muP multipliers, and demonstrates performance gains over fixed and baseline methods.
Findings
Learnable scalers adapt to data and improve model performance.
Outperforms well-tuned muP baseline in experiments.
Shows consistent improvements with different optimizers.
Abstract
Applying weight decay (WD) to matrix layers is standard practice in large-language-model pretraining. Prior work suggests that stochastic gradient noise induces a Brownian-like expansion of the weight matrices W, whose growth is counteracted by WD, leading to a WD-noise equilibrium with a certain weight norm ||W||. In this work, we view the equilibrium norm as a harmful artifact of the training procedure, and address it by introducing learnable multipliers to learn the optimal scale. First, we attach a learnable scalar multiplier to W and confirm that the WD-noise equilibrium norm is suboptimal: the learned scale adapts to data and improves performance. We then argue that individual row and column norms are similarly constrained, and free their scale by introducing learnable per-row and per-column multipliers. Our method can be viewed as a learnable, more expressive generalization of…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMachine Learning in Materials Science · Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications
