Infinite attention: NNGP and NTK for deep attention networks
Jiri Hron, Yasaman Bahri, Jascha Sohl-Dickstein, Roman Novak

TL;DR
This paper extends the theory relating wide neural networks to Gaussian processes to networks with attention layers. It shows that multi-head attention architectures behave as Gaussian processes as the number of heads tends to infinity, and reports improved results on image and sequence tasks.
Contribution
The paper provides a rigorous extension of neural network Gaussian process (NNGP) and neural tangent kernel (NTK) theory to multi-head attention architectures, discusses the effects of positional encoding and layer normalization, and validates the results empirically.
Findings
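Multi-head attention architectures behave as GPs as the number of heads tends to infinity, whereas single-head attention induces non-Gaussian behaviour.
The proposed modifications of the attention mechanism improve the performance of attention-based NNs.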
Empirical evaluation shows moderate improvements on CIFAR-10 and IMDb datasets.
Abstract
There is a growing amount of literature on the relationship between wide neural networks (NNs) and Gaussian processes (GPs), identifying an equivalence between the two for a variety of NN architectures. This equivalence enables, for instance, accurate approximation of the behaviour of wide Bayesian NNs without MCMC or variational approximations, or characterisation of the distribution of randomly initialised wide NNs optimised by gradient descent without ever running an optimiser. We provide a rigorous extension of these results to NNs involving attention layers, showing that unlike single-head attention, which induces non-Gaussian behaviour, multi-head attention architectures behave as GPs as the number of heads tends to infinity. We further discuss the effects of positional encodings and layer normalisation, and propose modifications of the attention mechanism which lead to improved…
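The contrast between few and many heads can be illustrated with a small Monte Carlo check. The sketch below is not the paper's proof or experimental setup: the layer layout, the 1/sqrt(d) logit scaling, the 1/fan-in weight variances, and the helper names `attention_output_samples` and `excess_kurtosis` are all assumptions made for illustration. It draws many random initialisations of a toy multi-head attention layer on a fixed input and measures the excess kurtosis of one output coordinate, which is zero for an exactly Gaussian variable.

```python
import numpy as np


def attention_output_samples(x, n_heads, d_head, n_draws, rng):
    """Return one output coordinate for each of `n_draws` random initialisations.

    Toy layer (an assumption, not the paper's exact parameterisation):
    softmax(Q K^T / sqrt(d_head)) V per head, heads concatenated, then a
    random linear read-out with variance 1 / (n_heads * d_head).
    """
    n_tokens, d_in = x.shape
    shape = (n_draws, n_heads, d_in, d_head)
    # Independent Gaussian weights for every draw and head, variance 1 / d_in.
    w_q = rng.standard_normal(shape) / np.sqrt(d_in)
    w_k = rng.standard_normal(shape) / np.sqrt(d_in)
    w_v = rng.standard_normal(shape) / np.sqrt(d_in)

    # Broadcasting matmul: each result has shape (n_draws, n_heads, n_tokens, d_head).
    q, k, v = x @ w_q, x @ w_k, x @ w_v

    # Numerically stable softmax over the key axis.
    logits = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_head)
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    heads = weights @ v  # (n_draws, n_heads, n_tokens, d_head)

    # Concatenate the heads for the first token and apply a random read-out vector.
    concat = heads[:, :, 0, :].reshape(n_draws, n_heads * d_head)
    w_o = rng.standard_normal(concat.shape) / np.sqrt(n_heads * d_head)
    return np.sum(concat * w_o, axis=-1)


def excess_kurtosis(samples):
    """Sample excess kurtosis; close to 0 for Gaussian data."""
    centred = samples - samples.mean()
    return np.mean(centred**4) / np.mean(centred**2) ** 2 - 3.0


rng = np.random.default_rng(0)
x = rng.standard_normal((3, 4))  # fixed toy input: 3 tokens, 4 features

kurtosis_by_heads = {
    h: excess_kurtosis(
        attention_output_samples(x, n_heads=h, d_head=2, n_draws=10_000, rng=rng)
    )
    for h in (1, 4, 32)
}
for h, kurt in kurtosis_by_heads.items():
    print(f"heads={h:3d}  excess kurtosis={kurt:+.3f}")
```

In this toy setting the excess kurtosis should shrink towards zero as the number of heads grows. Because the head dimension is deliberately tiny, part of the heavy tail at one head comes from the random value projection rather than from the softmax weights alone, so the sketch illustrates the averaging-over-heads effect rather than reproducing the paper's single-head result.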
Taxonomy
Topics: Gaussian Processes and Bayesian Inference · Advanced Neural Network Applications · Neural Networks and Applications
Methods: Attention Is All You Need · Softmax · Linear Layer · Multi-Head Attention
