Sparse Attention as Compact Kernel Regression

Saul Santos; Nuno Gonçalves; Daniel C. McNamee; Marcos Treviso; André F. T. Martins

arXiv:2601.22766·cs.LG·May 11, 2026

TL;DR

This paper establishes a theoretical link between sparse attention mechanisms in transformers and compact kernel regression, providing a unified framework that explains sparsity and introduces principled alternatives to heuristic methods.

Contribution

It introduces a formal correspondence between sparse attention and bounded-support kernels, connecting various kernels to $\alpha$-entmax attention and demonstrating practical benefits.
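As a rough illustration of the entmax side of this correspondence, the sketch below computes $\alpha$-entmax weights for $\alpha = 1 + \frac{1}{n}$ (the parametrization given in the abstract) using the standard entmax form $p_i = [(\alpha - 1) z_i - \tau]_+^{1/(\alpha - 1)}$, which for this choice of $\alpha$ becomes $p_i = [z_i / n - \tau]_+^n$. The function name `entmax_n` and the bisection solver are assumptions made for this example, not the authors' implementation. The pairing of $n = 2$ with biweight and $n = 3$ with triweight is inferred from the kernel exponents $(1 - u^2)^n$ rather than stated on this page.

```python
import numpy as np

def entmax_n(scores, n, iters=100):
    """alpha-entmax with alpha = 1 + 1/n (hypothetical helper for illustration).

    Returns p with p_i = max(0, z_i / n - tau) ** n, where the threshold tau
    is found by bisection so that the weights sum to one.
    """
    z = np.asarray(scores, dtype=float) / n
    # At tau = max(z) - 1 the largest term alone equals 1, so the sum is >= 1.
    # At tau = max(z) every term is zero, so the sum is 0.
    lo, hi = z.max() - 1.0, z.max()
    for _ in range(iters):
        tau = 0.5 * (lo + hi)
        total = np.sum(np.clip(z - tau, 0.0, None) ** n)
        if total >= 1.0:
            lo = tau
        else:
            hi = tau
    p = np.clip(z - lo, 0.0, None) ** n
    return p / p.sum()  # remove the residual bisection error

scores = [1.0, 0.5, -1.0]
p_sparsemax = entmax_n(scores, 1)     # alpha = 2: approximately [0.75, 0.25, 0.0]
p_n2 = entmax_n(scores, 2)            # alpha = 1.5: last entry is exactly zero
p_large = entmax_n(scores, 1000)      # close to softmax(scores)
```

Small $n$ gives exact zeros for low-scoring entries, while large $n$ approaches the dense softmax weights, which mirrors the $n \to \infty$ limit described in the abstract.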

Findings

01

Kernel-based sparse attention achieves competitive language modeling performance.

02

Unified perspective explains emergence of sparsity from kernel design.

03

Experiments validate the effectiveness of Memory Mosaics in various tasks.

Abstract

Recent work has revealed a link between self-attention mechanisms in transformers and test-time kernel regression via the Nadaraya-Watson estimator, with standard softmax attention corresponding to a Gaussian kernel. However, a kernel-theoretic understanding of sparse attention mechanisms is currently missing. In this paper, we establish a formal correspondence between sparse attention and compact (bounded support) kernels. We show that normalized ReLU and sparsemax attention arise from Epanechnikov kernel regression under fixed and adaptive normalizations, respectively. More generally, we demonstrate that widely used kernels in nonparametric density estimation -- including Epanechnikov, biweight, and triweight -- correspond to $\alpha$-entmax attention with $\alpha = 1 + \frac{1}{n}$ for $n \in \mathbb{N}$, while the softmax/Gaussian relationship emerges in the limit $n \to \infty$.…
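The kernel-regression view in the abstract can be sketched numerically. The example below is a minimal illustration under stated assumptions, not the paper's construction: queries and keys are normalized to unit length, the Gaussian kernel has unit bandwidth, and the Epanechnikov kernel uses a fixed squared bandwidth `h2 = 1`. The function names are hypothetical. With unit-norm vectors, $\|q - k\|^2 = 2 - 2\, q^\top k$, so the Gaussian kernel reproduces softmax over dot-product scores and the Epanechnikov kernel reduces to a ReLU of an affine function of the scores, normalized to sum to one.

```python
import numpy as np

def nadaraya_watson(query, keys, values, kernel):
    """Kernel-weighted average of values; returns (output, weights)."""
    sq_dist = np.sum((keys - query) ** 2, axis=1)
    k = kernel(sq_dist)
    w = k / k.sum()
    return w @ values, w

def gaussian(sq_dist):
    return np.exp(-0.5 * sq_dist)

def epanechnikov(sq_dist, h2=1.0):
    # Compact support: exactly zero once ||q - k||^2 >= h2.
    return np.maximum(0.0, 1.0 - sq_dist / h2)

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

rng = np.random.default_rng(0)
keys = rng.normal(size=(6, 4))
keys /= np.linalg.norm(keys, axis=1, keepdims=True)   # unit-norm keys
query = keys[0] + 0.1 * rng.normal(size=4)            # query near the first key
query /= np.linalg.norm(query)
values = rng.normal(size=(6, 3))
scores = keys @ query                                 # dot-product attention scores

# Gaussian kernel regression matches softmax attention on these scores.
out_gauss, w_gauss = nadaraya_watson(query, keys, values, gaussian)

# Epanechnikov kernel regression matches normalized ReLU attention:
# 1 - (2 - 2 s) / h2 = 2 s - 1 when h2 = 1.
out_epa, w_epa = nadaraya_watson(query, keys, values, epanechnikov)
relu_scores = np.maximum(0.0, 2.0 * scores - 1.0)
w_relu = relu_scores / relu_scores.sum()
```

Here `w_gauss` agrees with `softmax(scores)` and `w_epa` agrees with `w_relu`, and any key whose score falls below 0.5 receives exactly zero weight under the Epanechnikov kernel. The adaptive normalization that the abstract associates with sparsemax is not covered by this fixed-bandwidth sketch.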
