Pay Attention to Attention Distribution: A New Local Lipschitz Bound for Transformers
Nikolay Yudin, Alexander Gaponov, Sergei Kudriashov, Maxim Rakhuba

TL;DR
This paper introduces a new local Lipschitz bound for transformer self-attention, revealing how attention distributions influence robustness and proposing a regularization method to improve it.
Contribution
The paper derives a more accurate local Lipschitz bound for transformer self-attention blocks and introduces JaSMin (Jacobian Softmax norm Minimization), a lightweight regularization term that improves robustness by reducing local Lipschitz constants.
Findings
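The new local bound for self-attention is more accurate than bounds from prior work.
The Lipschitz constant depends on how attention scores are distributed inside the attention map.
JaSMin regularization improves transformer robustness and decreases the local Lipschitz constants of the whole network.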
Abstract
We present a novel local Lipschitz bound for self-attention blocks of transformers. This bound is based on a refined closed-form expression for the spectral norm of the softmax function. The resulting bound is not only more accurate than in the prior art, but also unveils the dependence of the Lipschitz constant on attention score maps. Based on the new findings, we suggest an explanation of the way distributions inside the attention map affect the robustness from the Lipschitz constant perspective. We also introduce a new lightweight regularization term called JaSMin (Jacobian Softmax norm Minimization), which boosts the transformer's robustness and decreases local Lipschitz constants of the whole network.
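To make the dependence on the attention distribution concrete, the sketch below numerically estimates the spectral norm of the softmax Jacobian, diag(p) - p pᵀ, for a few attention rows p. This is not the paper's closed-form expression and not JaSMin itself; it is a plain power-iteration estimate, and the helper names (`softmax`, `softmax_jacobian`, `spectral_norm`) are hypothetical, chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax of a list of attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_jacobian(p):
    """Jacobian of softmax at output p: diag(p) - p p^T (symmetric, PSD)."""
    n = len(p)
    return [
        [(p[i] if i == j else 0.0) - p[i] * p[j] for j in range(n)]
        for i in range(n)
    ]


def spectral_norm(mat, iters=500):
    """Estimate the largest eigenvalue of a symmetric PSD matrix
    by power iteration (an illustrative choice, not the paper's method)."""
    n = len(mat)
    # Deterministic, generic start vector; the all-ones vector would fail
    # because it lies in the null space of the softmax Jacobian.
    v = [math.sin(i + 1.0) for i in range(n)]
    v_norm = math.sqrt(sum(x * x for x in v))
    v = [x / v_norm for x in v]
    estimate = 0.0
    for _ in range(iters):
        w = [sum(mat[i][j] * v[j] for j in range(n)) for i in range(n)]
        w_norm = math.sqrt(sum(x * x for x in w))
        if w_norm < 1e-300:
            return 0.0  # matrix is numerically zero
        estimate = w_norm  # v has unit length, so ||J v|| approaches lambda_max
        v = [x / w_norm for x in w]
    return estimate


# Compare a uniform row, a two-way split, and a sharply peaked row.
uniform_norm = spectral_norm(softmax_jacobian([0.25, 0.25, 0.25, 0.25]))
split_norm = spectral_norm(softmax_jacobian([0.5, 0.5]))
peaked_norm = spectral_norm(softmax_jacobian(softmax([50.0, 0.0, 0.0, 0.0])))
```

In this sketch, a row that is almost one-hot gives a Jacobian norm near zero, a uniform row over four tokens gives 1/4, and an even split between two tokens gives 1/2. This illustrates why a local bound that looks at the actual attention distribution can be tighter than a worst-case one.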
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Memory and Neural Computing · Adversarial Robustness in Machine Learning
