Transformers as Unrolled Inference in Probabilistic Laplacian Eigenmaps: An Interpretation and Potential Improvements
Aditya Ravuri, Neil D. Lawrence

TL;DR
This paper interprets transformers as unrolled inference in a probabilistic Laplacian Eigenmaps model, shows that at initialisation they perform linear dimensionality reduction, and proposes a simple modification (subtracting the identity from the attention matrix) that improves validation performance on a language model and a simple vision transformer.
Contribution
It introduces a novel probabilistic perspective on transformers, connecting them to Laplacian Eigenmaps and suggesting a simple yet effective modification for better performance.
Findings
At initialisation, transformers perform linear dimensionality reduction.
A graph Laplacian term naturally arises within transformer blocks.
Subtracting the identity from the attention matrix improves validation performance on a language model and a simple vision transformer.
Abstract
We propose a probabilistic interpretation of transformers as unrolled inference steps assuming a probabilistic Laplacian Eigenmaps model from the ProbDR framework. Our derivation shows that at initialisation, transformers perform "linear" dimensionality reduction. We also show that within the transformer block, a graph Laplacian term arises from our arguments, rather than an attention matrix (which we interpret as an adjacency matrix). We demonstrate that simply subtracting the identity from the attention matrix (and thereby taking a graph diffusion step) improves validation performance on a language model and a simple vision transformer.
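The modification described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: single-head attention, the function names (`attention_matrix`, `attention_update`), the weight shapes, and the random toy inputs are all assumptions made for illustration, and the summary does not specify the paper's exact scaling or how the update is combined with the residual connection. The sketch only shows that a row-stochastic attention matrix A, read as an adjacency matrix, becomes a (negative) random-walk graph Laplacian A - I once the identity is subtracted, so each of its rows sums to zero.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_matrix(X, W_q, W_k):
    # Row-stochastic attention matrix, interpreted here as an adjacency matrix.
    Q = X @ W_q
    K = X @ W_k
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d), axis=-1)


def attention_update(X, W_q, W_k, W_v, subtract_identity=True):
    # Hypothetical helper: applies either A (standard attention) or A - I.
    A = attention_matrix(X, W_q, W_k)
    if subtract_identity:
        # A - I is the negative of the random-walk Laplacian I - A.
        A = A - np.eye(A.shape[0])
    return A @ (X @ W_v)


# Toy data (assumed sizes, random weights; no trained model involved).
rng = np.random.default_rng(0)
n_tokens, d_model = 5, 8
X = rng.normal(size=(n_tokens, d_model))
W_q, W_k, W_v = (
    rng.normal(size=(d_model, d_model)) / np.sqrt(d_model) for _ in range(3)
)

A = attention_matrix(X, W_q, W_k)
L_neg = A - np.eye(n_tokens)
standard = attention_update(X, W_q, W_k, W_v, subtract_identity=False)
modified = attention_update(X, W_q, W_k, W_v, subtract_identity=True)
```

In this sketch the modified update equals the standard attention output minus the value projection `X @ W_v`, which is one way to read the change as a diffusion-style step on the token graph rather than a plain weighted average.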
Taxonomy
Topics: Advanced Graph Neural Networks · Topic Modeling · Multimodal Machine Learning Applications
