An Analysis of Attention via the Lens of Exchangeability and Latent Variable Models

Yufeng Zhang; Boyi Liu; Qi Cai; Lingxiao Wang; Zhaoran Wang

arXiv:2212.14852·cs.LG·April 2, 2024

TL;DR

This paper provides a rigorous theoretical analysis of how attention mechanisms in transformers perform relational inference and learn desirable representations through exchangeability and latent variable models, explaining their empirical success.

Contribution

It introduces a theoretical framework linking exchangeability to latent variable models, proving how attention approximates posterior inference and how training learns desirable representations.

Findings

01. Attention approximates the posterior distribution of latent variables.

02. A sufficient and minimal representation of input tokens exists.

03. Training objectives enable learning of the desired parameters independently of input size.

Abstract

With the attention mechanism, transformers achieve significant empirical successes. Despite the intuitive understanding that transformers perform relational inference over long sequences to produce desirable representations, we lack a rigorous theory on how the attention mechanism achieves it. In particular, several intriguing questions remain open: (a) What makes a desirable representation? (b) How does the attention mechanism infer the desirable representation within the forward pass? (c) How does a pretraining procedure learn to infer the desirable representation through the backward pass? We observe that, as is the case in BERT and ViT, input tokens are often exchangeable since they already include positional encodings. The notion of exchangeability induces a latent variable model that is invariant to input sizes, which enables our theoretical analysis. - To answer (a) on…
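The exchangeability observation in the abstract can be illustrated with a small sketch. It is not the paper's construction: it only checks the standard property that, for a fixed query, single-head softmax attention returns the same output when the key-value pairs are reordered, so any order information must come from the positional encodings carried by the tokens themselves. The helper names (`softmax`, `attend`), the dimensions, and the random inputs are assumptions made for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    # Weighted average of the value vectors, one output coordinate at a time.
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]


random.seed(0)
d, n = 4, 6  # hypothetical embedding size and number of tokens
query = [random.gauss(0, 1) for _ in range(d)]
keys = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
values = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]

# Shuffle the tokens, keeping each key paired with its own value.
perm = list(range(n))
random.shuffle(perm)
out_original = attend(query, keys, values)
out_permuted = attend(query, [keys[i] for i in perm], [values[i] for i in perm])

# The two outputs agree up to floating-point rounding.
max_gap = max(abs(a - b) for a, b in zip(out_original, out_permuted))
print(max_gap < 1e-9)
```

Because the output is a weighted average over the set of key-value pairs, the same function also accepts any number of tokens `n`, which is in line with the abstract's remark that the induced latent variable model is invariant to input size.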

Taxonomy

Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Adversarial Robustness in Machine Learning

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · WordPiece · Adam · Layer Normalization · Softmax · Linear Warmup With Linear Decay · Residual Connection