Infinite attention: NNGP and NTK for deep attention networks

Jiri Hron; Yasaman Bahri; Jascha Sohl-Dickstein; Roman Novak

arXiv:2006.10540 · stat.ML · June 19, 2020 · 29 cites

TL;DR

This paper extends the theory relating wide neural networks to Gaussian processes to networks with attention layers. It shows that multi-head attention converges to a Gaussian process as the number of heads tends to infinity, and proposes modifications of the attention mechanism that improve results on image and sequence tasks.

Contribution

The paper provides a rigorous extension of NNGP and NTK theory to multi-head attention architectures, covering the effects of positional encodings and layer normalisation, with empirical validation.

Findings

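01. Multi-head attention architectures behave as GPs as the number of heads tends to infinity, whereas single-head attention induces non-Gaussian behaviour.

02. The proposed modifications of the attention mechanism improve the performance of attention-based NNs.

03. Empirical evaluation shows moderate improvements on the CIFAR-10 and IMDb datasets.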

Abstract

There is a growing amount of literature on the relationship between wide neural networks (NNs) and Gaussian processes (GPs), identifying an equivalence between the two for a variety of NN architectures. This equivalence enables, for instance, accurate approximation of the behaviour of wide Bayesian NNs without MCMC or variational approximations, or characterisation of the distribution of randomly initialised wide NNs optimised by gradient descent without ever running an optimiser. We provide a rigorous extension of these results to NNs involving attention layers, showing that unlike single-head attention, which induces non-Gaussian behaviour, multi-head attention architectures behave as GPs as the number of heads tends to infinity. We further discuss the effects of positional encodings and layer normalisation, and propose modifications of the attention mechanism which lead to improved…
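The contrast between single-head and many-head attention described in the abstract can be illustrated numerically. The following is a minimal NumPy sketch, not the paper's construction and not the neural-tangents API: the toy input, the layer sizes, the i.i.d. Gaussian weights, the 1/sqrt(d) logit scaling, and the helper names `attention_output_samples` and `excess_kurtosis` are all assumptions made for illustration. It samples one output coordinate of a randomly initialised attention layer many times and uses excess kurtosis (zero for a Gaussian) as a rough measure of non-Gaussianity.

```python
import numpy as np


def attention_output_samples(n_heads, n_samples=10000, seq_len=4,
                             d_in=4, d_head=2, seed=0):
    """Sample one output coordinate of a randomly initialised attention layer.

    Each sample uses fresh i.i.d. Gaussian weights; the input is held fixed.
    """
    rng = np.random.default_rng(seed)
    # Fixed toy input sequence (hypothetical data), shared by all samples.
    x = np.random.default_rng(123).standard_normal((seq_len, d_in))

    shape = (n_samples, n_heads, d_in, d_head)
    scale = 1.0 / np.sqrt(d_in)  # assumed fan-in scaling of the weights
    w_q = rng.standard_normal(shape) * scale
    w_k = rng.standard_normal(shape) * scale
    w_v = rng.standard_normal(shape) * scale
    # Output weights for a single output unit, one vector per head.
    w_o = rng.standard_normal((n_samples, n_heads, d_head))

    q = np.einsum("i,nhid->nhd", x[0], w_q)   # query at position 0
    k = np.einsum("ti,nhid->nhtd", x, w_k)    # keys at all positions
    v = np.einsum("ti,nhid->nhtd", x, w_v)    # values at all positions

    # Softmax attention weights with 1/sqrt(d_head) logit scaling (assumed).
    logits = np.einsum("nhd,nhtd->nht", q, k) / np.sqrt(d_head)
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(logits)
    attn /= attn.sum(axis=-1, keepdims=True)

    heads = np.einsum("nht,nhtd->nhd", attn, v)
    # Combine the heads, normalised so the variance stays O(1) in n_heads.
    return np.einsum("nhd,nhd->n", heads, w_o) / np.sqrt(n_heads * d_head)


def excess_kurtosis(samples):
    """Sample excess kurtosis; close to 0 for Gaussian data."""
    z = (samples - samples.mean()) / samples.std()
    return float(np.mean(z ** 4) - 3.0)


kurtosis_by_heads = {h: excess_kurtosis(attention_output_samples(h))
                     for h in (1, 16)}
print(kurtosis_by_heads)
```

Under these assumptions, the single-head output is a heavy-tailed mixture, while the 16-head output is a normalised sum of independent heads and should sit much closer to Gaussian. This is only a finite-size sanity check of the qualitative claim, not a substitute for the paper's limit theorems.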

Code & Models

Repositories

google/neural-tangents (JAX, official)

Videos

Infinite attention: NNGP and NTK for deep attention networks · SlidesLive

Taxonomy

Topics: Gaussian Processes and Bayesian Inference · Advanced Neural Network Applications · Neural Networks and Applications

Methods: Attention Is All You Need · Softmax · Linear Layer · Multi-Head Attention