Projection-Free Transformers via Gaussian Kernel Attention
Debarshi Kundu, Archisman Ghosh, Swaroop Ghosh, Vasant Honavar

TL;DR
This paper introduces Gaussian Kernel Attention (GKA), a simpler, projection-free attention mechanism for Transformers that uses a Gaussian RBF kernel, reducing complexity while maintaining competitive performance.
Contribution
GKA replaces learned projections with a kernel-based similarity, linking Transformers to classical kernel methods and enabling a more interpretable, efficient attention mechanism.
Findings
GKA models with fewer parameters and FLOPs train stably and perform competitively.
GKA provides explicit locality scale and interpretability in attention.
GKA achieves comparable results with reduced computational cost.
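As a rough illustration of where the parameter savings come from, the sketch below counts attention-layer weights under the description in the abstract: standard attention carries four square projections (query, key, value, output), while GKA keeps only the output projection plus one bandwidth scalar per head. The function name is hypothetical, biases are ignored, and the paper's exact accounting may differ.

```python
def attention_param_counts(d_model: int, num_heads: int) -> tuple[int, int]:
    """Return (standard, gka) weight counts for one attention layer.

    Assumptions: square d_model x d_model projections, no biases.
    """
    # Standard attention: W_Q, W_K, W_V and the output projection W_O.
    standard = 4 * d_model * d_model
    # GKA as described in the abstract: one output projection
    # plus a single learned bandwidth per head.
    gka = d_model * d_model + num_heads
    return standard, gka


standard, gka = attention_param_counts(d_model=512, num_heads=8)
print(standard, gka)
```

Under these assumptions the attention layer itself shrinks to roughly a quarter of its original weight count; the rest of the Transformer block (feed-forward layers, norms) is unchanged.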
Abstract
Self-attention in Transformers is typically implemented as $\mathrm{softmax}(QK^\top/\sqrt{d_k})\,V$, where $Q$, $K$, and $V$ are learned linear projections of the input $X$. We ask whether these learned projections are necessary, or whether they can be replaced by a simpler similarity-based diffusion operator. We introduce Gaussian Kernel Attention (GKA), a drop-in replacement for dot-product attention that computes token affinities directly using a Gaussian radial basis function (RBF) kernel applied to per-head token features. Each head learns only a bandwidth parameter $\sigma$, while a single output projection preserves compatibility with the standard Transformer interface. GKA can be interpreted as normalized kernel regression over tokens, linking modern Transformer architectures to classical non-local filtering and kernel smoothing methods. We evaluate…
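The mechanism described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation. It assumes that heads are formed by slicing the feature dimension, that affinities take the form exp(-||x_i - x_j||^2 / (2 sigma^2)) and are row-normalized (the kernel-regression reading), and that the per-head features themselves serve as values. The names `gaussian_kernel_attention`, `sigma`, and `w_out` are hypothetical.

```python
import numpy as np


def gaussian_kernel_attention(x, num_heads, sigma, w_out):
    """Projection-free attention sketch.

    x:      (n_tokens, d_model) input features
    sigma:  (num_heads,) per-head bandwidths (assumed form of the learned parameter)
    w_out:  (d_model, d_model) the single output projection
    Returns (output, weights) with weights of shape (num_heads, n_tokens, n_tokens).
    """
    n, d = x.shape
    assert d % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d // num_heads

    # Split features into heads without any learned Q/K/V projections.
    heads = x.reshape(n, num_heads, d_head).transpose(1, 0, 2)  # (h, n, d_head)

    # Pairwise squared Euclidean distances between tokens within each head.
    sq_norm = (heads ** 2).sum(axis=-1)  # (h, n)
    gram = heads @ heads.transpose(0, 2, 1)  # (h, n, n)
    sq_dist = sq_norm[:, :, None] + sq_norm[:, None, :] - 2.0 * gram
    sq_dist = np.maximum(sq_dist, 0.0)  # clip tiny negatives from rounding

    # Gaussian RBF affinities in log space, then row-normalize.
    logits = -sq_dist / (2.0 * sigma[:, None, None] ** 2)
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    # Each token becomes a kernel-weighted average of token features.
    mixed = weights @ heads  # (h, n, d_head)
    merged = mixed.transpose(1, 0, 2).reshape(n, d)
    return merged @ w_out, weights


rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
out, weights = gaussian_kernel_attention(
    x, num_heads=2, sigma=np.array([1.0, 2.0]), w_out=np.eye(8)
)
print(out.shape, weights.shape)
```

In this sketch the bandwidth acts as an explicit locality scale: a very large `sigma` makes every row of `weights` nearly uniform (global averaging), while a very small `sigma` makes each token attend almost only to itself.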
