Solving Attention Kernel Regression Problem via Pre-conditioner
Zhao Song, Junze Yin, Lichen Zhang

TL;DR
This paper introduces fast algorithms, based on sketching and preconditioning, for solving regression problems against two proxies of the attention matrix used in large language models.
Contribution
It proposes regression algorithms for two proxies of the attention matrix: a matrix exponential proxy and an entrywise exponential of the Gram matrix.
Findings
Developed algorithms for regression against the matrix exponential proxy
Designed methods for regression against the entrywise exponential of the Gram matrix
Built both on sketching and preconditioning
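The two proxies named above are easy to confuse, so a small numerical sketch may help. It assumes the first proxy is the matrix exponential of the d x d matrix A^T A and the second is the exponential applied entry by entry to the n x n Gram matrix A A^T. The helper `matrix_exp_taylor`, the sizes, and the scaling of `A` are illustrative choices, not taken from the paper.

```python
import numpy as np

def matrix_exp_taylor(M, terms=30):
    """Truncated Taylor series sum_{j=0}^{terms-1} M^j / j! (illustrative only)."""
    result = np.eye(M.shape[0])
    term = np.eye(M.shape[0])
    for j in range(1, terms):
        term = term @ M / j          # term now holds M^j / j!
        result = result + term
    return result

rng = np.random.default_rng(0)
n, d = 6, 3                                   # hypothetical sizes
A = rng.standard_normal((n, d)) / np.sqrt(n)  # scaled so the series converges quickly

# Proxy 1: matrix exponential of A^T A (a d x d matrix, built from matrix powers).
proxy_matrix_exp = matrix_exp_taylor(A.T @ A)

# Proxy 2: exponential applied to each entry of the Gram matrix A A^T (n x n).
proxy_entrywise = np.exp(A @ A.T)
```

The first proxy is a sum of powers of A^T A, which is why regressions against individual powers are natural building blocks. The second is a kernel matrix whose entries are exp of inner products between rows of A.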
Abstract
The attention mechanism is the key to large language models, and the attention matrix serves as an algorithmic and computational bottleneck for such a scheme. In this paper, we define two problems, motivated by designing fast algorithms for proxies of the attention matrix and solving regressions against them. Given an input matrix $A \in \mathbb{R}^{n \times d}$ with $n \gg d$ and a response vector $b$, we first consider the matrix exponential of the matrix $A^\top A$ as a proxy, and we in turn design algorithms for two types of regression problems: $\min_{x \in \mathbb{R}^d} \|(A^\top A)^j x - b\|_2$ and $\min_{x \in \mathbb{R}^d} \|A (A^\top A)^j x - b\|_2$ for any positive integer $j$. Studying algorithms for these regressions is essential, as the matrix exponential can be approximated term-by-term via these smaller problems. The second proxy is applying the exponential entrywise to the Gram matrix, denoted by…
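As a rough illustration of the sketch-and-precondition idea, here is a minimal sketch for the plain least-squares problem of minimizing the 2-norm of Ax - b over x, with A tall and skinny. It is not the paper's algorithm for the higher-power or exponential-kernel regressions. The Gaussian sketch, the sketch size of 4d rows, the conjugate gradient solver on the normal equations, and the function name `sketch_precondition_lstsq` are all assumptions made for this example.

```python
import numpy as np

def sketch_precondition_lstsq(A, b, sketch_rows=None, max_iters=100, tol=1e-10, seed=0):
    """Approximately solve min_x ||A x - b||_2 via a sketched QR preconditioner."""
    n, d = A.shape
    m = sketch_rows if sketch_rows is not None else 4 * d   # assumed sketch size
    rng = np.random.default_rng(seed)

    # Gaussian sketch S (m x n): S @ A is a small m x d matrix.
    S = rng.standard_normal((m, n)) / np.sqrt(m)
    # R from the QR factorization of the sketch serves as the preconditioner.
    R = np.linalg.qr(S @ A, mode="r")

    # B = A R^{-1} is well-conditioned with high probability; apply it implicitly.
    def apply(v):        # computes B v
        return A @ np.linalg.solve(R, v)

    def apply_t(u):      # computes B^T u
        return np.linalg.solve(R.T, A.T @ u)

    # Conjugate gradient on the normal equations B^T B y = B^T b.
    y = np.zeros(d)
    r = apply_t(b)       # normal-equation residual at y = 0
    p = r.copy()
    rs = r @ r
    rs0 = rs
    for _ in range(max_iters):
        if rs <= tol**2 * rs0:
            break
        Bp = apply(p)
        alpha = rs / (Bp @ Bp)
        y = y + alpha * p
        r = r - alpha * apply_t(Bp)
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new

    # Undo the change of variables y = R x.
    return np.linalg.solve(R, y)

# Usage on a synthetic, badly scaled problem (all values hypothetical).
rng = np.random.default_rng(1)
A = rng.standard_normal((2000, 10)) * np.logspace(0, 3, 10)
x_true = rng.standard_normal(10)
b = A @ x_true + 0.1 * rng.standard_normal(2000)
x = sketch_precondition_lstsq(A, b)
```

The sketch is computed once, after which each iteration costs only a few matrix-vector products with A, and the iteration count depends on the conditioning of A R^{-1} rather than of A.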
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Machine Learning and ELM · Graphene research and applications
