The Lipschitz Constant of Self-Attention
Hyunjik Kim, George Papamakarios, Andriy Mnih

TL;DR
This paper analyzes the Lipschitz properties of self-attention. It proves that standard dot-product self-attention is not Lipschitz on an unbounded input domain, proposes an L2 variant that is Lipschitz, and demonstrates the practical use of that variant in invertible Transformer models for language modeling.
Contribution
It introduces an L2 self-attention that is Lipschitz, provides bounds on its Lipschitz constant, and applies invertible self-attention in Transformer architectures.
Findings
Standard dot-product self-attention is not Lipschitz for unbounded inputs.
L2 self-attention is Lipschitz, and the paper derives an upper bound on its Lipschitz constant.
Invertible self-attention can be formulated from the Lipschitz L2 variant and used in Transformer-based language modeling.
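The first two findings can be illustrated numerically. The sketch below is not the authors' code. It assumes a single head with identity query, key, and value projections and one-dimensional tokens, and the function names (`dot_product_attention`, `l2_attention`, `local_ratio`) are hypothetical. It uses a finite difference to estimate how much the output moves when one token is nudged, at inputs of growing scale `c`. The L2 logits here are negative squared distances between tokens, a simplified stand-in for the paper's formulation, which this summary does not spell out. A finite-difference check at a few points is only suggestive and is not a proof of either claim.

```python
import numpy as np


def softmax(z):
    # Row-wise softmax, shifted by the row max for numerical stability.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def dot_product_attention(X):
    # Assumption: identity projections, single head. X has shape (N, D).
    d = X.shape[1]
    return softmax(X @ X.T / np.sqrt(d)) @ X


def l2_attention(X):
    # Assumption: logits are negative squared Euclidean distances between
    # tokens, with identity projections and a single head.
    d = X.shape[1]
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return softmax(-sq_dist / np.sqrt(d)) @ X


def local_ratio(f, X, eps=1e-4):
    # Finite-difference estimate of ||f(X + dX) - f(X)|| / ||dX||,
    # where dX perturbs only the first coordinate of the first token.
    dX = np.zeros_like(X)
    dX[0, 0] = eps
    return np.linalg.norm(f(X + dX) - f(X)) / eps


scales = [1.0, 10.0, 100.0]
base = np.array([[0.0], [1.0], [-1.0]])  # first token sits at the origin

dot_ratios = [local_ratio(dot_product_attention, c * base) for c in scales]
l2_ratios = [local_ratio(l2_attention, c * base) for c in scales]

for c, r_dot, r_l2 in zip(scales, dot_ratios, l2_ratios):
    print(f"c={c:6.1f}  dot-product ratio={r_dot:10.3f}  L2 ratio={r_l2:6.3f}")
```

In this setup the dot-product ratio keeps growing as the other tokens move farther from the origin, because the sensitivity of the first output row scales with the spread of the tokens it attends to. The L2 ratio stays small at the same inputs, which is consistent with a finite Lipschitz constant.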
Abstract
Lipschitz constants of neural networks have been explored in various contexts in deep learning, such as provable adversarial robustness, estimating Wasserstein distance, stabilising training of GANs, and formulating invertible neural networks. Such works have focused on bounding the Lipschitz constant of fully connected or convolutional networks, composed of linear maps and pointwise non-linearities. In this paper, we investigate the Lipschitz constant of self-attention, a non-linear neural network module widely used in sequence modelling. We prove that the standard dot-product self-attention is not Lipschitz for unbounded input domain, and propose an alternative L2 self-attention that is Lipschitz. We derive an upper bound on the Lipschitz constant of L2 self-attention and provide empirical evidence for its asymptotic tightness. To demonstrate the practical relevance of our theoretical…
