A Probabilistic Interpretation of Transformers

Alexander Shim

arXiv:2205.01080 · cs.LG · May 3, 2022

Open Access

TL;DR

This paper offers a probabilistic perspective on transformer attention, linking it to exponential families and Hopfield networks, and discusses theoretical limitations and future directions.

Contribution

The paper interprets the transformer attention sublayer as a gradient ascent step on the log-normalizer of an exponential family, which connects this view to the Hopfield theory of attention.

Findings

01 Attention corresponds to a gradient ascent step on the log-normalizer.

02 Layer normalization contracts points, counterbalancing the expansion induced by the attention step.

03 Theoretical limitations of both this interpretation and the Hopfield theory are stated.
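
The contraction attributed to layer normalization in finding 02 can be illustrated with a small numerical check. This is a minimal sketch, not the paper's construction: it assumes layer normalization without a learned gain or bias, and the names `layer_norm`, `euclidean_norm`, `point`, and `expanded` are illustrative. Under those assumptions the output always has Euclidean norm close to sqrt(d) for a d-dimensional input, so a point that has been scaled outward is mapped back to the same sphere.

```python
import math


def layer_norm(x, eps=1e-12):
    """Layer normalization without learned gain or bias (an assumption here)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    scale = math.sqrt(var + eps)
    return [(v - mean) / scale for v in x]


def euclidean_norm(x):
    return math.sqrt(sum(v * v for v in x))


# An arbitrary 4-dimensional point and a copy scaled outward by a factor of 10.
point = [0.5, -0.2, 0.3, 1.1]
expanded = [10.0 * v for v in point]

norm_point = euclidean_norm(layer_norm(point))
norm_expanded = euclidean_norm(layer_norm(expanded))

# Both normalized vectors have norm close to sqrt(4) = 2,
# regardless of how far the input was scaled.
print(norm_point, norm_expanded)
```

The scale factor of 10 stands in for an expansion only; how attention actually expands points is the subject of the paper and is not modeled here.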

Abstract

We propose a probabilistic interpretation of the exponential dot-product attention of transformers and of contrastive learning, based on exponential families. The attention sublayer of transformers is equivalent to a gradient ascent step on the log-normalizer, which is the log-sum-exp term in the Hopfield theory of attention. This ascent step induces a parallel expansion of points, which is counterbalanced by a contraction from layer normalization. We also state theoretical limitations of our theory and of the Hopfield theory, and suggest directions for resolving them.
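
The link between attention and the log-normalizer rests on a standard identity: the gradient of log-sum-exp over the scores k_i · q, taken with respect to the query q, is the softmax-weighted average of the keys k_i. The sketch below checks that identity numerically. It makes simplifying assumptions that are not taken from the paper: values are tied to keys, there are no learned projections or scaling, and the step size is 1. All function and variable names are illustrative.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def log_normalizer(query, keys):
    """log sum_i exp(k_i . q), the log-sum-exp term."""
    scores = [dot(k, query) for k in keys]
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))


def attention(query, keys):
    """Softmax attention with values tied to keys (an assumption of this sketch)."""
    weights = softmax([dot(k, query) for k in keys])
    return [sum(w * k[j] for w, k in zip(weights, keys)) for j in range(len(query))]


def numerical_gradient(f, x, h=1e-6):
    """Central finite differences, used only to check the analytic gradient."""
    grad = []
    for j in range(len(x)):
        up, down = list(x), list(x)
        up[j] += h
        down[j] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad


keys = [[1.0, 0.0, -0.5], [0.2, 0.7, 0.1], [-0.3, 0.4, 0.9]]
query = [0.5, -0.2, 0.3]

analytic = attention(query, keys)
numeric = numerical_gradient(lambda q: log_normalizer(q, keys), query)
max_gap = max(abs(a - b) for a, b in zip(analytic, numeric))

# Residual update read as one gradient ascent step with step size 1 (assumed).
step_size = 1.0
updated_query = [q + step_size * g for q, g in zip(query, analytic)]

print(max_gap)
print(log_normalizer(query, keys), log_normalizer(updated_query, keys))
```

Because log-sum-exp is convex, a step along its gradient cannot decrease it, so the second printed value is at least as large as the first.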

Taxonomy

Topics: Neural Networks and Applications

Methods: Contrastive Learning