Laplacian Heads Improve Transformers by Smoothing Token Representations
Yuchong Zhang, Vardan Papyan

TL;DR
This paper introduces Laplacian heads for Transformers: in a subset of attention heads, the softmax attention matrix is replaced by the corresponding graph Laplacian, which smooths token representations and improves performance across tasks.
Contribution
The authors propose a modification to Transformer attention heads based on graph Laplacians. It gives direct control over the within-sequence variance of token representations and improves performance.
Findings
Laplacian heads collapse token representations within sequences.
They increase the separability of token representations sharing the same next token.
They lead to faster-decaying spectra, indicating stronger token smoothing.
Abstract
Transformers update token representations through multi-head attention and residual connections as $X \leftarrow X + \sum_h A_h X W_h$, where $A_h$ is the softmax attention matrix in head $h$. We propose replacing a subset of $A_h$'s with the Laplacian $L_h = I - A_h$, giving $L_h X W_h$ in place of $A_h X W_h$ for those heads. Our proposal has two motivations. First, it allows attention heads to update the mean of token representations, while Laplacian heads can directly control within-sequence variance. Second, if tokens are viewed as nodes in a graph with edge weights $A_h$, then $L_h$ is the corresponding graph Laplacian, and the update can be interpreted as one step of heat diffusion on the graph. We show that this simple modification improves performance across supervised…
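The two motivations in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes a single head, a softmax attention matrix `A` with rows summing to 1, the sign convention `L = I - A` for the Laplacian, a single matrix `Wv` standing in for the combined value and output projection, and a hypothetical diffusion step size `t`. All names and shapes are illustrative.

```python
import numpy as np

def softmax(scores, axis=-1):
    # Numerically stable softmax along the given axis.
    scores = scores - scores.max(axis=axis, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=axis, keepdims=True)

def attention_matrix(X, Wq, Wk):
    # Row-stochastic (n, n) softmax attention matrix for one head.
    d = Wq.shape[1]
    return softmax((X @ Wq) @ (X @ Wk).T / np.sqrt(d))

def head_update(X, Wq, Wk, Wv, laplacian=False):
    # One head's term in the residual update: A X Wv, or (I - A) X Wv
    # for a Laplacian head (assumed sign convention).
    A = attention_matrix(X, Wq, Wk)
    M = np.eye(len(X)) - A if laplacian else A
    return M @ X @ Wv

rng = np.random.default_rng(0)
n, d = 6, 4  # illustrative sequence length and width
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

# A sequence of identical tokens has zero within-sequence variance.
# The Laplacian head outputs zero on it, while the attention head
# passes the shared token through Wv.
X_const = np.tile(X[:1], (n, 1))
lap_const = head_update(X_const, Wq, Wk, Wv, laplacian=True)
att_const = head_update(X_const, Wq, Wk, Wv, laplacian=False)

# One explicit heat-diffusion step X - t L X equals (1 - t) X + t A X,
# so for 0 <= t <= 1 each new token is a convex combination of old tokens.
A = attention_matrix(X, Wq, Wk)
L = np.eye(n) - A
t = 0.5  # hypothetical step size
X_diffused = X - t * (L @ X)
spread_before = X.max(axis=0) - X.min(axis=0)
spread_after = X_diffused.max(axis=0) - X_diffused.min(axis=0)
```

Because every row of `L` sums to zero, a Laplacian head responds only to differences between tokens, which is one way to read the claim that it acts on within-sequence variance rather than on the mean. The diffusion step cannot widen the per-feature range of the tokens, which is the smoothing effect in its simplest form.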
