Laplacian Heads Improve Transformers by Smoothing Token Representations

Yuchong Zhang; Vardan Papyan

arXiv:2602.09297·cs.LG·May 18, 2026

TL;DR

This paper introduces Laplacian heads for Transformers: in a subset of attention heads, the softmax attention matrix $P$ is replaced with the graph Laplacian $I - P$, which smooths token representations and improves performance across various tasks.

Contribution

The authors modify a subset of Transformer attention heads to apply the graph Laplacian $I - P^{(i)}$ in place of the softmax attention matrix $P^{(i)}$. This gives the model direct control over the within-sequence variance of token representations and improves performance.

Findings

1. Laplacian heads collapse token representations within sequences.
2. They increase the separability of token representations sharing the same next token.
3. They lead to faster-decaying spectra, indicating stronger token smoothing.

Abstract

Transformers update token representations through multi-head attention and residual connections as $X \leftarrow X + \sum_{i} P^{(i)} X W_{V_{i}} W_{o_{i}}$, where $P^{(i)}$ is the softmax attention matrix in head $i$. We propose replacing a subset of the $P^{(i)}$'s with the Laplacian $I - P^{(i)}$, giving $X \leftarrow X + \sum_{i \in A} P^{(i)} X W_{V_{i}} W_{o_{i}} + \sum_{i \in L} (I - P^{(i)}) X W_{V_{i}} W_{o_{i}}$. Our proposal has two motivations. First, it allows attention heads to update the mean of token representations, while Laplacian heads can directly control within-sequence variance. Second, if tokens are viewed as nodes in a graph with edge weights $P^{(i)}$, then $I - P^{(i)}$ is the corresponding graph Laplacian, and the update can be interpreted as one step of heat diffusion on the graph. We show that this simple modification improves performance across supervised…
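The update rule in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The query/key projections `W_q` and `W_k`, the $1/\sqrt{d}$ score scaling, the absence of causal masking, and the helper names `make_head` and `mixed_head_update` are all assumptions made for illustration. Only the form of the update, $P^{(i)}$ for heads in $A$ and $I - P^{(i)}$ for heads in $L$, is taken from the abstract.

```python
import numpy as np


def softmax(scores):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)


def make_head(rng, d_model, d_head):
    """Random weights for one hypothetical head (illustration only)."""
    return {
        "W_q": 0.1 * rng.normal(size=(d_model, d_head)),  # assumed query projection
        "W_k": 0.1 * rng.normal(size=(d_model, d_head)),  # assumed key projection
        "W_v": 0.1 * rng.normal(size=(d_model, d_head)),  # W_{V_i} in the abstract
        "W_o": 0.1 * rng.normal(size=(d_head, d_model)),  # W_{o_i} in the abstract
    }


def mixed_head_update(X, heads, laplacian_idx):
    """One residual update over attention heads (A) and Laplacian heads (L).

    X has shape (n_tokens, d_model). Heads whose index is in `laplacian_idx`
    use I - P; all other heads use P.
    """
    n_tokens = X.shape[0]
    update = np.zeros_like(X)
    for i, head in enumerate(heads):
        Q, K = X @ head["W_q"], X @ head["W_k"]
        # Assumed standard scaled dot-product scores, no causal mask.
        P = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
        mixing = np.eye(n_tokens) - P if i in laplacian_idx else P
        update += mixing @ X @ head["W_v"] @ head["W_o"]
    return X + update


rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 5, 8, 4
heads = [make_head(rng, d_model, d_head) for _ in range(2)]
X = rng.normal(size=(n_tokens, d_model))

# Head 0 is an ordinary attention head, head 1 is a Laplacian head.
X_new = mixed_head_update(X, heads, laplacian_idx={1})
```

Because every row of $P^{(i)}$ sums to 1, $(I - P^{(i)})$ maps a sequence of identical tokens to zero, so in this sketch a Laplacian head leaves such a sequence unchanged and only acts on differences between tokens.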
