Weight-Decay Turns Transformer Loss Landscapes Villani: Functional-Analytic Foundations for Optimization and Generalization

Abhijit Das and Sayantan Dutta

arXiv:2605.06599·cs.LG·May 8, 2026

TL;DR

This paper provides a rigorous functional-analytic framework for understanding how weight decay influences Transformer loss landscapes, linking regularization to convergence and generalization guarantees.

Contribution

It introduces a novel Villani diagnostic and a theoretical analysis showing that weight decay induces coercivity properties (Villani's criteria) that enable faster convergence and better generalization in Transformers.

Findings

1. Experiments on GPT-Neo-125M confirm quadratic growth of the Villani diagnostic.
2. The experiments also show spectral inflation of the Hessian.
3. Convergence behavior is exponential, consistent with the log-Sobolev analysis.

Abstract

Weight decay is widely used as a regularizer in large language models, yet its precise role in shaping Transformer loss landscapes remains theoretically underexplored. This paper provides the first rigorous functional-analytic characterization of the standard Transformer objective (cross-entropy loss with $L^{2}$ regularization) by proving it satisfies Villani's criteria for coercive energy functions. Specifically, we show that the regularized loss $F$ is infinitely differentiable, grows at least quadratically, has Gaussian-integrable tails, and satisfies the differential growth condition $-\Delta F + \frac{1}{s} \|\nabla F\|^{2} \to \infty$ as $\|\theta\| \to \infty$ for all $s > 0$. From this structure, we derive explicit log-Sobolev and Poincaré constants $C_{LS} \leq \lambda^{-1} + d / \lambda^{2}$, linking the regularization strength $\lambda$ and…
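To make the differential growth condition concrete, the following is a minimal sketch that evaluates the quantity $-\Delta F + \frac{1}{s}\|\nabla F\|^{2}$ along a ray in parameter space. It does not use a Transformer: as a stated assumption, it swaps in a toy binary logistic-regression loss with $L^{2}$ regularization, for which the gradient and Laplacian have closed forms. The function name `villani_diagnostic`, the synthetic data, and the values of `lam` and `s` are all hypothetical choices for illustration, not taken from the paper.

```python
import numpy as np

def sigmoid(z):
    # Numerically stable logistic function, safe for very large |z|.
    return np.exp(-np.logaddexp(0.0, -z))

def villani_diagnostic(theta, X, y, lam, s):
    """Return -Laplacian(F) + (1/s) * ||grad F||^2 at theta.

    F is the mean logistic cross-entropy plus (lam / 2) * ||theta||^2.
    This toy model is an assumption standing in for the Transformer loss.
    """
    n, d = X.shape
    p = sigmoid(X @ theta)
    # Gradient of the data term plus the weight-decay term.
    grad = X.T @ (p - y) / n + lam * theta
    # Laplacian = trace of the Hessian X^T diag(p(1-p)) X / n + lam * I.
    laplacian = np.mean(p * (1.0 - p) * np.sum(X**2, axis=1)) + lam * d
    return -laplacian + grad @ grad / s

# Synthetic data (hypothetical): 200 samples, 5 features, random labels.
rng = np.random.default_rng(0)
n, d, lam, s = 200, 5, 0.1, 1.0
X = rng.normal(size=(n, d))
y = (rng.random(n) < 0.5).astype(float)

# Walk outward along a fixed unit direction.
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
radii = np.array([1e1, 1e2, 1e3, 1e4])
values = np.array(
    [villani_diagnostic(r * direction, X, y, lam, s) for r in radii]
)

# If growth is quadratic, values / r^2 should level off near lam^2 / s.
ratios = values / radii**2
for r, v, q in zip(radii, values, ratios):
    print(f"r = {r:8.0f}   diagnostic = {v:14.4f}   diagnostic / r^2 = {q:.6f}")
```

In this toy setting the data-term gradient and the Laplacian stay bounded, so the weight-decay term $\lambda\theta$ dominates the gradient at large radius and the diagnostic grows like $\lambda^{2}\|\theta\|^{2}/s$. Whether the same mechanism carries over to a full Transformer objective is the claim the paper sets out to prove; the sketch only illustrates the shape of the condition.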
