Training Transformers with Enforced Lipschitz Constants
Laker Newhouse, R. Preston Hess, Franz Cesista, Andrii Zahorodnii, Jeremy Bernstein, Phillip Isola

TL;DR
This paper develops efficient tools to train large transformer models with enforced Lipschitz constraints, improving robustness and stability while exploring the trade-offs between Lipschitz bounds and model performance.
Contribution
The paper introduces novel methods for maintaining norm-constrained weight matrices in transformers, which allow Lipschitz bounds to be enforced throughout training rather than only at initialization, and it demonstrates their effects on model accuracy and stability.
Findings
Lipschitz bounds can be enforced during training of transformers.
Optimizer choice significantly impacts Lipschitz-constrained training.
A trade-off is observed between the Lipschitz constant and model accuracy.
Abstract
Neural networks are often highly sensitive to input and weight perturbations. This sensitivity has been linked to pathologies such as vulnerability to adversarial examples, divergent training, and overfitting. To combat these problems, past research has looked at building neural networks entirely from Lipschitz components. However, these techniques have not matured to the point where researchers have trained a modern architecture such as a transformer with a Lipschitz certificate enforced beyond initialization. To explore this gap, we begin by developing and benchmarking novel, computationally-efficient tools for maintaining norm-constrained weight matrices. Applying these tools, we are able to train transformer models with Lipschitz bounds enforced throughout training. We find that optimizer dynamics matter: switching from AdamW to Muon improves standard methods -- weight decay and…
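To make the idea of a norm-constrained weight matrix concrete, here is a minimal sketch. It is not the paper's tooling, which the truncated abstract does not specify. It assumes the simplest possible scheme: after each update, rescale any weight matrix whose spectral norm exceeds a cap. The function names (`cap_spectral_norm`, `lipschitz_upper_bound`, `mlp`) and the small ReLU MLP are hypothetical and chosen only for illustration. The sketch relies on a standard fact: for a network of linear layers separated by 1-Lipschitz activations such as ReLU, the product of the layers' spectral norms upper-bounds the network's Lipschitz constant in the Euclidean norm.

```python
import numpy as np

def cap_spectral_norm(W, max_norm=1.0):
    """Rescale W so its largest singular value is at most max_norm.

    Illustrative assumption: plain rescaling, not the method from the paper.
    """
    sigma = np.linalg.norm(W, 2)  # exact spectral norm via SVD
    if sigma <= max_norm:
        return W
    return W * (max_norm / sigma)

def lipschitz_upper_bound(weights):
    """Product of spectral norms: an upper bound for a ReLU MLP."""
    return float(np.prod([np.linalg.norm(W, 2) for W in weights]))

def mlp(weights, x):
    """Hypothetical bias-free MLP with ReLU between linear layers."""
    for W in weights[:-1]:
        x = np.maximum(W @ x, 0.0)
    return weights[-1] @ x

rng = np.random.default_rng(0)
raw = [rng.standard_normal(shape) for shape in [(16, 8), (16, 16), (4, 16)]]
capped = [cap_spectral_norm(W, max_norm=1.0) for W in raw]

raw_bound = lipschitz_upper_bound(raw)
bound = lipschitz_upper_bound(capped)

# Empirical check: output change is limited by bound * input change.
x = rng.standard_normal(8)
y = x + 1e-2 * rng.standard_normal(8)
ratio = np.linalg.norm(mlp(capped, x) - mlp(capped, y)) / np.linalg.norm(x - y)

print(f"bound before capping: {raw_bound:.2f}")
print(f"bound after capping:  {bound:.6f}")
print(f"observed sensitivity: {ratio:.4f}")
```

In a training loop, a projection like this would run after every optimizer step so that the certificate holds throughout training and not only at initialization. The exact SVD is used here for clarity; it becomes expensive for large matrices, which is one reason cheaper ways of maintaining the constraint are of interest.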
Taxonomy
Topics: Neural Networks and Applications
