An Analysis of Attention via the Lens of Exchangeability and Latent Variable Models
Yufeng Zhang, Boyi Liu, Qi Cai, Lingxiao Wang, Zhaoran Wang

TL;DR
This paper gives a rigorous theoretical analysis, through the lens of exchangeability and latent variable models, of how attention in transformers performs relational inference and learns desirable representations, offering an explanation for their empirical success.
Contribution
It introduces a theoretical framework that links exchangeability to latent variable models, and uses it to show that attention approximates posterior inference and that training learns desirable representations.
Findings
Attention approximates the posterior distribution of latent variables.
A sufficient and minimal representation of input tokens exists.
Training objectives enable learning of the desired parameters independently of input size.
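The exchangeability premise behind these findings can be illustrated with a minimal sketch. It is not the paper's construction: it only checks that, for a fixed query, unmasked softmax attention returns the same output when the key/value pairs are reordered, which is the order-invariance an exchangeable treatment of tokens relies on. The helper names (`softmax`, `attend`), the dimensions, and the random inputs are all hypothetical choices made for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-query scaled dot-product attention (no masking).

    Returns the softmax-weighted average of the value vectors.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]


# Hypothetical toy inputs: n tokens of dimension d, drawn at random.
rng = random.Random(0)
d, n = 4, 6
query = [rng.gauss(0, 1) for _ in range(d)]
keys = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
values = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]

# Reorder the (key, value) pairs with the same random permutation.
perm = list(range(n))
rng.shuffle(perm)
keys_p = [keys[i] for i in perm]
values_p = [values[i] for i in perm]

out = attend(query, keys, values)
out_p = attend(query, keys_p, values_p)

# The two outputs should agree up to floating-point rounding.
max_diff = max(abs(a - b) for a, b in zip(out, out_p))
print("max difference after permutation:", max_diff)
```

The check holds because the output is a weighted average over the set of key/value pairs, so any ordering information has to be carried inside the tokens themselves, for example through positional encodings.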
Abstract
With the attention mechanism, transformers achieve significant empirical successes. Despite the intuitive understanding that transformers perform relational inference over long sequences to produce desirable representations, we lack a rigorous theory on how the attention mechanism achieves it. In particular, several intriguing questions remain open: (a) What makes a desirable representation? (b) How does the attention mechanism infer the desirable representation within the forward pass? (c) How does a pretraining procedure learn to infer the desirable representation through the backward pass? We observe that, as is the case in BERT and ViT, input tokens are often exchangeable since they already include positional encodings. The notion of exchangeability induces a latent variable model that is invariant to input sizes, which enables our theoretical analysis. - To answer (a) on…
Taxonomy
Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Adversarial Robustness in Machine Learning
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · WordPiece · Adam · Layer Normalization · Softmax · Linear Warmup With Linear Decay · Residual Connection
