Projection-Free Transformers via Gaussian Kernel Attention

Debarshi Kundu; Archisman Ghosh; Swaroop Ghosh; Vasant Honavar

arXiv:2605.02144·cs.LG·May 5, 2026

TL;DR

This paper introduces Gaussian Kernel Attention (GKA), a simpler attention mechanism for Transformers that replaces the learned query, key, and value projections with a Gaussian RBF kernel, reducing parameter count and FLOPs while maintaining competitive performance.

Contribution

GKA replaces learned projections with a kernel-based similarity, linking Transformers to classical kernel methods and enabling a more interpretable, efficient attention mechanism.

Findings

1. GKA models train stably and perform competitively despite using fewer parameters and FLOPs.

2. GKA gives each attention head an explicit locality scale (its bandwidth $σ_{h}$), which makes the attention pattern easier to interpret.

3. GKA achieves comparable results with reduced computational cost.
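
To make the "explicit locality scale" concrete, the attention weights of one head can be written out under the conventional Gaussian-kernel parameterisation. The factor $2σ_{h}^{2}$ and the notation $x_{i}^{(h)}$ for the per-head features of token $i$ are assumptions made for illustration; the paper may use a different convention.

```latex
% Row-normalised Gaussian kernel weights for head h (assumed parameterisation)
A^{(h)}_{ij}
  = \frac{\exp\!\left(-\lVert x_i^{(h)} - x_j^{(h)} \rVert^2 / 2\sigma_h^2\right)}
         {\sum_{k=1}^{n} \exp\!\left(-\lVert x_i^{(h)} - x_k^{(h)} \rVert^2 / 2\sigma_h^2\right)}

% Large bandwidth: every token attends uniformly (global averaging)
\lim_{\sigma_h \to \infty} A^{(h)}_{ij} = \frac{1}{n}

% Small bandwidth, distinct tokens: each token attends only to itself
\lim_{\sigma_h \to 0} A^{(h)}_{ij} = \delta_{ij}
```

Under this form, $σ_{h}$ interpolates between purely local and purely global mixing, which is one way to read the bandwidth as a locality scale.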

Abstract

Self-attention in Transformers is typically implemented as $\mathrm{softmax}(Q K^{\top} / \sqrt{d}) V$, where $Q = X W_{Q}$, $K = X W_{K}$, and $V = X W_{V}$ are learned linear projections of the input $X$. We ask whether these learned projections are necessary, or whether they can be replaced by a simpler similarity-based diffusion operator. We introduce Gaussian Kernel Attention (GKA), a drop-in replacement for dot-product attention that computes token affinities directly using a Gaussian radial basis function (RBF) kernel applied to per-head token features. Each head learns only a bandwidth parameter $σ_{h}$, while a single output projection $W_{O}$ preserves compatibility with the standard Transformer interface. GKA can be interpreted as normalized kernel regression over tokens, linking modern Transformer architectures to classical non-local filtering and kernel smoothing methods. We evaluate…
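
The mechanism described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `gaussian_kernel_attention`, the choice to form heads by slicing the feature dimension, the $2σ_{h}^{2}$ scaling in the exponent, and the use of the per-head features themselves as values are all assumptions.

```python
import numpy as np

def gaussian_kernel_attention(X, sigmas, W_O):
    """Hypothetical sketch of Gaussian Kernel Attention.

    X      : (n, d_model) token features
    sigmas : (H,) per-head bandwidths (the only per-head parameters)
    W_O    : (d_model, d_model) output projection
    Returns the (n, d_model) output and the (H, n, n) attention weights.
    """
    n, d_model = X.shape
    H = sigmas.shape[0]
    assert d_model % H == 0, "d_model must be divisible by the number of heads"
    d_head = d_model // H

    # Assumption: heads are slices of the feature dimension, with no Q/K/V projection.
    Xh = X.reshape(n, H, d_head).transpose(1, 0, 2)            # (H, n, d_head)

    # Pairwise squared Euclidean distances within each head.
    sq_norms = (Xh ** 2).sum(axis=-1)                          # (H, n)
    sq_dist = (sq_norms[:, :, None] + sq_norms[:, None, :]
               - 2.0 * Xh @ Xh.transpose(0, 2, 1))             # (H, n, n)
    sq_dist = np.maximum(sq_dist, 0.0)                         # guard against round-off

    # Gaussian RBF kernel in log space (assumed 2*sigma^2 convention).
    logits = -sq_dist / (2.0 * sigmas[:, None, None] ** 2)

    # Row normalisation: normalised kernel regression over tokens.
    logits = logits - logits.max(axis=-1, keepdims=True)
    K = np.exp(logits)
    A = K / K.sum(axis=-1, keepdims=True)

    # Mix the per-head features, merge heads, apply the single output projection.
    out = (A @ Xh).transpose(1, 0, 2).reshape(n, d_model)
    return out @ W_O, A

# Usage: 5 tokens, model width 8, 2 heads.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
Y, A = gaussian_kernel_attention(X, np.array([1.0, 2.0]), np.eye(8))
```

In this sketch the layer carries `d_model * d_model + H` learnable values (`W_O` plus one bandwidth per head), compared with `4 * d_model * d_model` weights for the four projection matrices of standard multi-head attention without biases.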
