Nearly Optimal Attention Coresets
Edo Liberty, Alexandr Andoni, Eldar Kleiner

TL;DR
This paper proves the existence of nearly optimal size coresets for approximating the Attention mechanism in neural networks, improving previous bounds and establishing lower bounds.
Contribution
It proves that coresets of nearly optimal size exist for Attention estimation: the upper and lower bounds on coreset size match up to an $e^{o(\rho)}$ factor.
Findings
Existence of coresets of size $O({\sqrt{d} e^{\rho+o(\rho)}/\varepsilon})$ for Attention
Coresets approximate Attention within $\varepsilon$ for all bounded queries
Lower bounds show coresets must have size at least $\Omega({\sqrt{d} e^{\rho}/\varepsilon})$
Abstract
We consider the problem of estimating the Attention mechanism in small space, and prove the existence of coresets for it of nearly optimal size. Specifically, we show that for any set of unit-norm keys and values $K, V$ in $\mathbb{R}^d$, there exists a subset $K', V'$ of size at most $O(\sqrt{d} e^{\rho+o(\rho)}/\varepsilon)$ such that \[ \left\| \operatorname{Attn}(q,K,V)- \operatorname{Attn}(q,K',V') \right\| \le \varepsilon \] simultaneously for all queries $q$ whose norm is bounded by $\rho$. This outperforms the best known results for this problem. We also offer an improved lower bound showing that $\varepsilon$-coresets must have size $\Omega(\sqrt{d} e^{\rho}/\varepsilon)$.
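The guarantee in the abstract can be made concrete with a small numerical sketch. The code below is not the paper's construction: it picks a uniformly random subset as a naive baseline and measures the Attention error over sampled queries. It assumes the standard softmax form of Attention without a $1/\sqrt{d}$ scaling and the Euclidean norm for the error; the helper names (`attention`, `random_unit_vector`, `max_error_over_queries`) are hypothetical. Because only finitely many queries are sampled, the reported value is a lower estimate of the worst-case error over all queries of norm at most $\rho$.

```python
import math
import random

def attention(q, keys, values):
    """Softmax attention of one query over paired keys and values (assumed form)."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max score for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]

def random_unit_vector(d, rng):
    """Draw a direction uniformly on the unit sphere in R^d."""
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def max_error_over_queries(keys, values, subset_idx, rho, n_queries, rng):
    """Largest observed error between full and subset attention on sampled queries of norm rho."""
    sub_k = [keys[i] for i in subset_idx]
    sub_v = [values[i] for i in subset_idx]
    d = len(keys[0])
    worst = 0.0
    for _ in range(n_queries):
        q = [rho * x for x in random_unit_vector(d, rng)]
        err = l2_distance(attention(q, keys, values),
                          attention(q, sub_k, sub_v))
        worst = max(worst, err)
    return worst

# Illustrative parameters (assumptions, not taken from the paper).
rng = random.Random(0)
d, n, rho = 8, 200, 1.0
keys = [random_unit_vector(d, rng) for _ in range(n)]
values = [random_unit_vector(d, rng) for _ in range(n)]
subset_idx = rng.sample(range(n), 50)  # naive baseline: uniform random subset
print(max_error_over_queries(keys, values, subset_idx, rho, 100, rng))
```

Since each Attention output is a convex combination of unit-norm values, the error between any two such outputs is at most 2, so only $\varepsilon < 2$ is a meaningful target when experimenting with subset sizes.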
