Pay Attention to Attention Distribution: A New Local Lipschitz Bound for Transformers

Nikolay Yudin; Alexander Gaponov; Sergei Kudriashov; Maxim Rakhuba

arXiv:2507.07814·cs.LG·July 11, 2025

Open Access

TL;DR

This paper introduces a new local Lipschitz bound for transformer self-attention, revealing how attention distributions influence robustness and proposing a regularization method to improve it.

Contribution

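The paper provides a more accurate local Lipschitz bound for transformer self-attention blocks and introduces JaSMin (Jacobian Softmax norm Minimization), a lightweight regularization term that improves robustness by decreasing the network's local Lipschitz constants.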

Findings

01. The new bound is more accurate than previous estimates.

02. Attention score distributions significantly affect Lipschitz constants.

03. JaSMin regularization improves transformer robustness.

Abstract

We present a novel local Lipschitz bound for self-attention blocks of transformers. This bound is based on a refined closed-form expression for the spectral norm of the softmax function. The resulting bound is not only more accurate than in the prior art, but also unveils the dependence of the Lipschitz constant on attention score maps. Based on the new findings, we suggest an explanation of the way distributions inside the attention map affect the robustness from the Lipschitz constant perspective. We also introduce a new lightweight regularization term called JaSMin (Jacobian Softmax norm Minimization), which boosts the transformer's robustness and decreases local Lipschitz constants of the whole network.
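
The link between attention distributions and the softmax Jacobian can be illustrated numerically. The sketch below is not the paper's closed-form expression or its bound; it only uses the standard fact that the Jacobian of softmax at a probability vector p is diag(p) - p pᵀ, and computes its spectral norm with NumPy for three hand-picked score vectors. The function names and the example scores are assumptions made for illustration.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax of a 1-D score vector."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()


def softmax_jacobian(z):
    """Jacobian of softmax at z: diag(p) - p p^T, where p = softmax(z)."""
    p = softmax(z)
    return np.diag(p) - np.outer(p, p)


def softmax_jacobian_spectral_norm(z):
    """Largest singular value of the softmax Jacobian at z."""
    return np.linalg.norm(softmax_jacobian(z), ord=2)


n = 8
# Hypothetical attention score rows (pre-softmax), chosen for illustration.
scores = {
    "uniform": np.zeros(n),                           # attention spread evenly
    "peaked": np.array([50.0] + [0.0] * (n - 1)),     # almost one-hot
    "two_way": np.array([50.0, 50.0] + [0.0] * (n - 2)),  # tie between two tokens
}

norms = {name: softmax_jacobian_spectral_norm(z) for name, z in scores.items()}

for name, value in norms.items():
    print(f"{name:8s} spectral norm = {value:.6f}")
```

In this toy setting, the uniform row gives a norm of 1/n, the nearly one-hot row gives a norm close to 0, and the row split evenly between two tokens gives a norm close to 1/2, the largest value the softmax Jacobian can reach. This matches the abstract's point that the distribution inside the attention map, not only the weights, shapes the local Lipschitz behavior.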

Taxonomy

Topics: Advanced Neural Network Applications · Advanced Memory and Neural Computing · Adversarial Robustness in Machine Learning