Sparse Attention as Compact Kernel Regression
Saul Santos, Nuno Gonçalves, Daniel C. McNamee, Marcos Treviso, André F. T. Martins

TL;DR
This paper establishes a theoretical link between sparse attention mechanisms in transformers and compact kernel regression, providing a unified framework that explains sparsity and introduces principled alternatives to heuristic methods.
Contribution
It introduces a formal correspondence between sparse attention and bounded-support kernels, connecting several classical kernels to $\alpha$-entmax attention and demonstrating practical benefits.
Findings
Kernel-based sparse attention achieves competitive language modeling performance.
Unified perspective explains emergence of sparsity from kernel design.
Experiments validate the effectiveness of Memory Mosaics in various tasks.
Abstract
Recent work has revealed a link between self-attention mechanisms in transformers and test-time kernel regression via the Nadaraya-Watson estimator, with standard softmax attention corresponding to a Gaussian kernel. However, a kernel-theoretic understanding of sparse attention mechanisms is currently missing. In this paper, we establish a formal correspondence between sparse attention and compact (bounded support) kernels. We show that normalized ReLU and sparsemax attention arise from Epanechnikov kernel regression under fixed and adaptive normalizations, respectively. More generally, we demonstrate that widely used kernels in nonparametric density estimation -- including Epanechnikov, biweight, and triweight -- correspond to $\alpha$-entmax attention with $\alpha = 1 + \frac{1}{n}$ for $n \in \mathbb{N}$, while the softmax/Gaussian relationship emerges in the limit $n \to \infty$.…
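The correspondence described in the abstract can be checked numerically on a toy example. The sketch below is illustrative only and is not the paper's implementation: it assumes unit-norm queries and keys and a hand-picked bandwidth `h`, and the names `kernel_weights`, `gaussian`, and `epanechnikov` are hypothetical. Under those assumptions, $\|q - k\|^2 = 2 - 2\, q^\top k$, so Nadaraya-Watson weights with a Gaussian kernel reduce to softmax over $q^\top k / h^2$, and weights with an Epanechnikov kernel reduce to a normalized ReLU of an affine function of the same scores (the fixed-normalization case).

```python
import numpy as np

def kernel_weights(q, K, kernel, h):
    """Nadaraya-Watson weights: kernel(||q - k_i||^2 / h^2), normalized to sum to 1."""
    sq_dist = np.sum((K - q) ** 2, axis=1) / h**2
    w = kernel(sq_dist)
    return w / w.sum()

# Kernels written as functions of the squared scaled distance u^2.
gaussian = lambda u2: np.exp(-0.5 * u2)              # unbounded support
epanechnikov = lambda u2: np.maximum(0.0, 1.0 - u2)  # compact support

# Toy data (assumption): unit-norm 2-D keys at fixed angles, query along the x-axis.
angles = np.deg2rad([0.0, 30.0, 60.0, 90.0, 180.0])
K = np.stack([np.cos(angles), np.sin(angles)], axis=1)
q = np.array([1.0, 0.0])
V = np.arange(10.0).reshape(5, 2)  # arbitrary values
h = 1.2                            # bandwidth, chosen by hand

scores = K @ q / h**2  # scaled dot-product scores

# Gaussian kernel regression vs. softmax attention.
w_gauss = kernel_weights(q, K, gaussian, h)
softmax_w = np.exp(scores - scores.max())
softmax_w /= softmax_w.sum()

# Epanechnikov kernel regression vs. normalized ReLU attention.
w_epa = kernel_weights(q, K, epanechnikov, h)
relu_w = np.maximum(0.0, 2.0 * scores + 1.0 - 2.0 / h**2)
relu_w /= relu_w.sum()

# Regression estimates (attention outputs) under each kernel.
y_gauss = w_gauss @ V
y_epa = w_epa @ V
```

In this toy setup the Gaussian weights are all strictly positive, while the Epanechnikov weights are exactly zero for keys farther than `h` from the query, which is the sparsity the compact support induces. The sparsemax case replaces the fixed offset with a data-dependent threshold and is not shown here.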
