Weight-Decay Turns Transformer Loss Landscapes Villani: Functional-Analytic Foundations for Optimization and Generalization
Abhijit Das and Sayantan Dutta

TL;DR
This paper provides a rigorous functional-analytic framework for understanding how weight decay influences Transformer loss landscapes, linking regularization to convergence and generalization guarantees.
Contribution
It introduces a novel Villani diagnostic and theoretical analysis showing weight decay induces properties that enable faster convergence and better generalization in Transformers.
Findings
Experiments on GPT-Neo-125M confirm quadratic growth of the Villani diagnostic.
Experiments also show spectral inflation of the Hessian.
Exponential convergence behavior consistent with log-Sobolev analysis.
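As a rough illustration of the quadratic-growth finding, the sketch below evaluates the quantity |f'|^2 / s - f'' (the one-dimensional form of the differential growth condition usually attached to Villani functions) on a toy objective. Everything here is an assumption made for illustration: the scalar logistic loss stands in for cross-entropy, `villani_quantity`, `lam`, and `s` are hypothetical names, and the paper's own diagnostic on GPT-Neo-125M may be defined differently.

```python
import math


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def villani_quantity(theta, lam, s=1.0):
    """Return |f'(theta)|^2 / s - f''(theta) for the toy objective
    f(theta) = log(1 + exp(-theta)) + (lam / 2) * theta**2.

    The toy objective and this quantity are illustrative assumptions,
    not the paper's exact diagnostic.
    """
    grad = -sigmoid(-theta) + lam * theta          # f'(theta)
    lap = sigmoid(theta) * sigmoid(-theta) + lam   # f''(theta), the 1-D Laplacian
    return grad * grad / s - lap


if __name__ == "__main__":
    for lam in (0.0, 0.1):
        values = [villani_quantity(t, lam) for t in (10.0, 100.0, 1000.0)]
        print(f"lam={lam}: {values}")
```

Without weight decay (`lam = 0`) the quantity stays bounded as theta grows, because the logistic loss flattens out. With `lam > 0` the leading term is lam^2 * theta^2 / s, so the quantity grows quadratically in theta.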
Abstract
Weight decay is widely used as a regularizer in large language models, yet its precise role in shaping Transformer loss landscapes remains theoretically underexplored. This paper provides the first rigorous functional-analytic characterization of the standard Transformer objective (cross-entropy loss with regularization) by proving that it satisfies Villani's criteria for coercive energy functions. Specifically, we show that the regularized loss is infinitely differentiable, grows at least quadratically, has Gaussian-integrable tails, and satisfies the differential growth condition. From this structure, we derive explicit log-Sobolev and Poincaré constants, linking the regularization strength and…
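For reference, the conditions usually grouped under the name "Villani function" in the optimization literature can be written as follows. This is the standard textbook form, offered as a sketch. The symbols f, \theta, d, and s are assumptions introduced here, and the paper's own statement (which speaks of quadratic growth and Gaussian-integrable tails) may be phrased more strongly or with different notation.

```latex
% Standard Villani conditions for f : R^d -> R, each required for every s > 0.
% Symbols f, \theta, d, s are illustrative assumptions, not the paper's notation.
\begin{align*}
&\text{(i)}   && f \in C^\infty(\mathbb{R}^d), \\
&\text{(ii)}  && \lim_{\|\theta\| \to \infty} f(\theta) = +\infty, \\
&\text{(iii)} && \int_{\mathbb{R}^d} \exp\!\left(-\frac{2 f(\theta)}{s}\right) d\theta < \infty, \\
&\text{(iv)}  && \lim_{\|\theta\| \to \infty}
                 \left( \frac{\|\nabla f(\theta)\|^2}{s} - \Delta f(\theta) \right) = +\infty .
\end{align*}
```

Condition (iv) is the differential growth condition. For a purely quadratic penalty f(\theta) = (\lambda/2)\|\theta\|^2 it reads \lambda^2 \|\theta\|^2 / s - \lambda d, which diverges for any \lambda > 0.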
